Corrections:

On Thu, Dec 29, 2011 at 11:35:00AM -0500, Noah Misch wrote:
> On Wed, Nov 09, 2011 at 08:49:35PM +0400, Alexander Korotkov wrote:
> > + * We set s to be the estimated frequency of the K'th element in a natural
> > + * language's frequency table, where K is the target number of entries in
> > + * the MCELEM array.  We assume that the distribution of element frequencies
> > + * follows Zipf's law with an exponent of 1.
> > + *
> > + * Assuming Zipfian distribution, the frequency of the K'th element is equal
> > + * to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
> > + * elements in the language.  Putting W as one million, we get roughly
> > + * 0.07/K.  This gives s = 0.07/K.  We set epsilon = s/10, which gives bucket
> > + * width w = K/0.007 and maximum expected hashtable size of about 1000 * K.
>
> These last two paragraphs, adapted from ts_typanalyze.c, assume natural
> language documents.  To what extent do these parameter choices remain sensible
> for arbitrary data such as users may place in arrays?  In any event, we need a
> different justification, even if it's just a hand-wavy justification.
>
> If I'm following this correctly, this choice of "s" makes the algorithm
> guaranteed to find only elements constituting >= 7% of the input elements.
> Incidentally, isn't that far too high for natural language documents?  If the
> English word "the" only constitutes 7% of typical documents, then this "s"
> value would permit us to discard practically every word; we'd be left with
> words read while filling the last bucket and words that happened to repeat
> exceedingly often in the column.  I haven't tried to make a test case to
> observe this problem; am I missing something?  (This question is largely
> orthogonal to your patch.)
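For reference, a rough numerical check of where that 0.07 comes from.  This is
an illustrative sketch only, not code from the patch, and K = 1000 below is
just an example target MCELEM count:

/*
 * Under Zipf's law with exponent 1 over W = 1,000,000 distinct elements,
 * the K'th most common element has frequency 1/(K * H(W)), and 1/H(1000000)
 * comes out to roughly 0.07, hence s = 0.07/K.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double	W = 1000000.0;
	double	H = log(W) + 0.5772;	/* approximation of the harmonic sum H(W) */
	int		K = 1000;				/* example target MCELEM count */

	printf("1/H(W) = %.4f, s = 1/(K*H) = %g, vs. 0.07/K = %g\n",
		   1.0 / H, 1.0 / (K * H), 0.07 / K);
	return 0;
}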
No, we'll find elements of frequency at least 0.07/(default_statistics_target
* 10) -- in the default configuration, 0.007%.  Also, ts_typanalyze() counts
the number of documents that contain one or more instances of each lexeme,
ignoring the number of appearances within each document.  The word "the" may
constitute 7% of a typical document, but it will appear at least once in
nearly 100% of documents.  Therefore, this "s" value is adequate even for the
pathological case of each "document" containing just one lexeme.

> > + *
> > + * Note: in the above discussion, s, epsilon, and f/N are in terms of a
> > + * element's frequency as a fraction of all elements seen in the input.
> > + * However, what we actually want to store in the finished pg_statistic
> > + * entry is each element's frequency as a fraction of all rows that it occurs
> > + * in.  Elements might be repeated in the same array.  Since operators
> > + * <@, &&, @> takes care only about element occurence itself and not about
> > + * occurence count, function takes additional care about uniqueness of
> > + * counting.  Also we need to change the divisor from N to nonnull_cnt to get
> > + * the number we want.
>
> On the same tangent, why does ts_typanalyze() not deduplicate the same way?
> The @@ operator has the same property.

Answer: to_tsvector() will have already done so.

> > +	/*
> > +	 * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the
> > +	 * comment above.
> > +	 */
> > +	bucket_width = num_mcelem * 1000 / 7;
>
> The addend mentioned is not present in the code or discussed in "the comment
> above".  (I see the comment is copied verbatim from ts_typanalyze(), where the
> addend *is* present, though again the preceding comment says nothing of it.)

The addend rationale in fact does appear in the ts_typanalyze() comment.

Thanks,
nm
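P.S. For anyone checking the arithmetic, here is a rough sketch of the numbers
under the default configuration.  The 100 below is the stock
default_statistics_target, and none of this is the patch's code:

#include <stdio.h>

int
main(void)
{
	int		default_statistics_target = 100;	/* stock default */
	int		num_mcelem = default_statistics_target * 10;
	double	s = 0.07 / num_mcelem;				/* minimum trackable frequency */
	int		bucket_width = num_mcelem * 1000 / 7;			/* as in the quoted code */
	int		ts_bucket_width = (num_mcelem + 10) * 1000 / 7;	/* with ts_typanalyze()'s addend */

	/* prints s = 0.00700%, bucket_width = 142857, ts_bucket_width = 144285 */
	printf("s = %.5f%%, bucket_width = %d, ts_bucket_width = %d\n",
		   s * 100.0, bucket_width, ts_bucket_width);
	return 0;
}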