On 30/05/10 09:08, Jesper Krogh wrote:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> On 29/05/10 12:34, Jesper Krogh wrote:
>>> I can "fairly easy" try out patches or do other kind of testing.
>>>      
>> I'll try to come up with a patch for you to try and fiddle with these
>> values before Monday.

Here's a patch against recent git, but it should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will now be using a much
bigger bucket size and I haven't done any performance impact testing on
it. I updated the initial comment block in compute_tsvector_stats, but
the prose could probably be improved.
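
For a rough sense of how much bigger the bucket gets, here's a tiny
standalone calculation (hypothetical, not part of the patch; it assumes
the default statistics target of 100 and a made-up total lexeme count):

#include <stdio.h>

int
main(void)
{
    int     attstattarget = 100;        /* assumed default statistics target */
    int     old_bucket_width = attstattarget * 10;  /* pre-patch: statistics_target * 10 */
    int     num_mcelem = attstattarget;     /* patched: target size of the MCELEM array */
    int     new_bucket_width = (num_mcelem + 10) * 1000 / 7;   /* (K + 10)/0.007 */
    int     lexeme_no = 1000000;        /* made-up number of lexemes processed */
    int     cutoff_freq = 9 * lexeme_no / new_bucket_width;    /* (s - epsilon) * N */

    printf("old bucket width: %d\n", old_bucket_width);    /* 1000 */
    printf("new bucket width: %d\n", new_bucket_width);    /* 15714 */
    printf("final cutoff frequency: %d\n", cutoff_freq);   /* 572 */
    return 0;
}

So with the default target the pruning now happens roughly 16 times
less often, and the hashtable can grow correspondingly larger between
prunes, which is why memory and timing numbers would be interesting.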

> testdb=# explain select id from testdb.reference where document_tsvector
> @@ plainto_tsquery('where') order by id limit 50;
> NOTICE:  text-search query contains only stop words or doesn't contain
> lexemes, ignored

That's orthogonal to the issue with the statistics collection; you just
need to modify your stopwords list (for instance, make it empty).

Cheers,
Jan
*** src/backend/tsearch/ts_typanalyze.c
--- src/backend/tsearch/ts_typanalyze.c	2010-05-30 19:52:28.000000000 +0200
***************
*** 92,112 ****
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *	Let D be a set of triples (e, f, d), where e is an element value, f is
!  *	that element's frequency (occurrence count) and d is the maximum error in
!  *	f.	We start with D empty and process the elements in batches of size
!  *	w. (The batch size is also known as "bucket size".) Let the current batch
!  *	number be b_current, starting with 1. For each element e we either
!  *	increment its f count, if it's already in D, or insert a new triple into D
!  *	with values (e, 1, b_current - 1). After processing each batch we prune D,
!  *	by removing from it all elements with f + d <= b_current. Finally, we
!  *	gather elements with largest f.  The LC paper proves error bounds on f
!  *	dependent on the batch size w, and shows that the required table size
!  *	is no more than a few times w.
   *
!  *	We use a hashtable for the D structure and a bucket width of
!  *	statistics_target * 10, where 10 is an arbitrarily chosen constant,
!  *	meant to approximate the number of lexemes in a single tsvector.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
--- 92,126 ----
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *
!  *  Let s be a threshold frequency for an item and epsilon the error margin for
!  *  the frequency. Let D be a set of triples (e, f, d), where e is an element
!  *  value, f is that element's frequency (occurrence count) and d is the maximum
!  *  error in f. We start with D empty and process the elements in batches of
!  *  size w. (The batch size is also known as "bucket size" and is equal to
!  *  1/epsilon.) Let the current batch number be b_current, starting with 1. For
!  *  each element e we either increment its f count, if it's already in D, or
!  *  insert a new triple into D with values (e, 1, b_current - 1). After
!  *  processing each batch we prune D by removing from it all elements with
!  *  f + d <= b_current.  After the algorithm finishes we suppress all elements
!  *  from D that do not satisfy f >= (s - epsilon) * N, where N is the total
!  *  number of lexemes in the input.  The LC paper proves error bounds on f
!  *  dependent on the batch size w, and shows that the required table size is
!  *  no more than a few times w.
!  *
!  *  We set s to be the estimated frequency of the K'th word in a natural
!  *  language's frequency table, where K is the target number of entries in the
!  *  MCELEM array plus an arbitrary constant, meant to reflect the fact that the
!  *  top words in any language would usually be stopwords and we will not ever
!  *  see them in the input.  We assume that the distribution of word frequencies
!  *  follows Zipf's law with an exponent of 1.
   *
!  *  Assuming a Zipfian distribution, the frequency of the K'th element is
!  *  equal to 1/(K * H(W)), where H(n) is 1 + 1/2 + 1/3 + ... + 1/n and W is
!  *  the number of words in the language.  Putting W at one million we get
!  *  roughly 0.07/K; assuming the top 10 words are stopwords gives
!  *  s = 0.07/(K + 10).  We set epsilon to s/10 = 0.007/(K + 10), which gives a
!  *  bucket width of (K + 10)/0.007, and use a hashtable for the D structure.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
***************
*** 114,120 ****
  					   int samplerows,
  					   double totalrows)
  {
! 	int			num_mcelem;
  	int			null_cnt = 0;
  	double		total_width = 0;
  
--- 128,134 ----
  					   int samplerows,
  					   double totalrows)
  {
! 	int			num_mcelem = stats->attr->attstattarget;
  	int			null_cnt = 0;
  	double		total_width = 0;
  
***************
*** 133,147 ****
  	LexemeHashKey hash_key;
  	TrackItem  *item;
  
- 	/* We want statistics_target * 10 lexemes in the MCELEM array */
- 	num_mcelem = stats->attr->attstattarget * 10;
- 
  	/*
! 	 * We set bucket width equal to the target number of result lexemes. This
! 	 * is probably about right but perhaps might need to be scaled up or down
! 	 * a bit?
  	 */
! 	bucket_width = num_mcelem;
  
  	/*
  	 * Create the hashtable. It will be in local memory, so we don't need to
--- 147,157 ----
  	LexemeHashKey hash_key;
  	TrackItem  *item;
  
  	/*
! 	 * We set bucket width equal to (stats target + 10) / 0.007 as per the
! 	 * comment above.
  	 */
! 	bucket_width = (num_mcelem + 10) * 1000 / 7;
  
  	/*
  	 * Create the hashtable. It will be in local memory, so we don't need to
***************
*** 252,257 ****
--- 262,268 ----
  		int			i;
  		TrackItem **sort_table;
  		int			track_len;
+ 		int			cutoff_freq;
  		int			minfreq,
  					maxfreq;
  
***************
*** 282,291 ****
  		qsort(sort_table, track_len, sizeof(TrackItem *),
  			  trackitem_compare_frequencies_desc);
  
! 		/* Suppress any single-occurrence items */
  		while (track_len > 0)
  		{
! 			if (sort_table[track_len - 1]->frequency > 1)
  				break;
  			track_len--;
  		}
--- 293,307 ----
  		qsort(sort_table, track_len, sizeof(TrackItem *),
  			  trackitem_compare_frequencies_desc);
  
! 		/*
! 		 * Suppress any items with fewer occurrences than (s - epsilon) * N.
! 		 * Since epsilon = s/10 and bucket_width = 1/epsilon, the cutoff
! 		 * frequency is 9 * N / bucket_width.
! 		 */
! 		cutoff_freq = 9 * lexeme_no / bucket_width;
  		while (track_len > 0)
  		{
! 			if (sort_table[track_len - 1]->frequency > cutoff_freq)
  				break;
  			track_len--;
  		}
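
To make the updated comment block easier to follow, here's a tiny
standalone sketch of the Lossy Counting pass it describes (hypothetical,
not part of the patch; the names ToyTrackItem and MAX_TRACKED and the
input stream are made up). It counts integers from a toy stream with a
very small bucket width so the pruning is easy to follow; the real code
uses a hashtable keyed by lexeme instead of a flat array.

#include <stdio.h>

#define MAX_TRACKED 64

typedef struct
{
    int     value;          /* e: the element */
    int     frequency;      /* f: occurrence count */
    int     delta;          /* d: maximum error in f */
} ToyTrackItem;

int
main(void)
{
    int     stream[] = {1, 1, 2, 3, 1, 2, 4, 1, 2, 1, 5, 1};
    int     stream_len = sizeof(stream) / sizeof(stream[0]);
    int     bucket_width = 4;   /* w = 1/epsilon */
    ToyTrackItem D[MAX_TRACKED];
    int     track_len = 0;
    int     b_current = 1;
    int     i, j;

    for (i = 0; i < stream_len; i++)
    {
        int     e = stream[i];
        int     found = 0;

        /* if e is already in D bump its f, else insert (e, 1, b_current - 1) */
        for (j = 0; j < track_len; j++)
        {
            if (D[j].value == e)
            {
                D[j].frequency++;
                found = 1;
                break;
            }
        }
        if (!found && track_len < MAX_TRACKED)
        {
            D[track_len].value = e;
            D[track_len].frequency = 1;
            D[track_len].delta = b_current - 1;
            track_len++;
        }

        /* end of a bucket: prune items with f + d <= b_current */
        if ((i + 1) % bucket_width == 0)
        {
            int     kept = 0;

            for (j = 0; j < track_len; j++)
            {
                if (D[j].frequency + D[j].delta > b_current)
                    D[kept++] = D[j];
            }
            track_len = kept;
            b_current++;
        }
    }

    /*
     * The patch additionally keeps only items with f >= (s - epsilon) * N,
     * i.e. 9 * N / bucket_width; here we just print whatever survived the
     * per-bucket pruning.
     */
    for (j = 0; j < track_len; j++)
        printf("value %d: f = %d, d = %d\n",
               D[j].value, D[j].frequency, D[j].delta);
    return 0;
}

With this toy input only the heavy hitter (value 1, true count 6)
survives; the low-frequency values get dropped at the end of each
bucket, which is what keeps the tracking structure small.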