Alexander Korotkov <aekorot...@gmail.com> writes: > On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> I looked at this patch a bit. It seems like this: >> + * BLANK_COLOR_SIZE - How much blank character is more frequent than >> + * other character in average >> + #define BLANK_COLOR_SIZE 32 >> is a drastic overestimate.
> OK, I would like to make more reasoning for penalty. > Let's consider word of length n. > It contains n+1 trigrams including: > 1 __x trigram > 1 _xx trigram > 1 xx_ trigram > n - 2 xxx trigrams > Assume alphabet of size m those trigrams have following average frequencies: > __x 1/((n+1)*m) > _xx 1/((n+1)*m^2) > xx_ 1/((n+1)*m^2) > xxx (n-2)/((n+1)*m^3) Hmm, but you're assuming that the m letters are all equally frequent, which is surely not true in normal text. I didn't have a machine-readable copy of Hemingway or Tolstoy at hand, but I do have a text file with the collected works of H.P. Lovecraft, so I tried analyzing the trigrams in that. I find that * The average word length is 4.78 characters. * There are 5652 distinct trigrams (discounting some containing digits), the most common one (' t') occurring 81222 times; the average occurrence count is 500.0. * Considering only trigrams not containing any blanks, there are 4937 of them, the most common one ('the') occurring 45832 times, with an average count of 266.9. * There are (unsurprisingly) exactly 26 trigrams of the form ' x', with an average count of 19598.5. * There are 689 trigrams of the form ' xx' or 'xx ', the most common one (' th') occurring 58200 times, with an average count of 1450.1. So, we've rediscovered the fact that 'the' is the most common word in English text ;-). But I think the significant conclusion for our purposes here is that single-space trigrams are on average about 1450.1/266.9 = 5.4 times more common than the average space-free trigram, and two-space trigrams about 19598.5/266.9 = 73.4 times more common. So this leads me to the conclusion that we should be using a BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than a square-law rule for two-space trigrams). Even using your numbers, it shouldn't be 32. regards, tom lane -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers