On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorot...@gmail.com> writes: > > On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > >> I looked at this patch a bit. It seems like this: > >> + * BLANK_COLOR_SIZE - How much blank character is more frequent > than > >> + * other character in average > >> + #define BLANK_COLOR_SIZE 32 > >> is a drastic overestimate. > > > OK, I would like to make more reasoning for penalty. > > Let's consider word of length n. > > It contains n+1 trigrams including: > > 1 __x trigram > > 1 _xx trigram > > 1 xx_ trigram > > n - 2 xxx trigrams > > > Assume alphabet of size m those trigrams have following average > frequencies: > > __x 1/((n+1)*m) > > _xx 1/((n+1)*m^2) > > xx_ 1/((n+1)*m^2) > > xxx (n-2)/((n+1)*m^3) > > Hmm, but you're assuming that the m letters are all equally frequent, > which is surely not true in normal text. > > I didn't have a machine-readable copy of Hemingway or Tolstoy at hand, > but I do have a text file with the collected works of H.P. Lovecraft, > so I tried analyzing the trigrams in that. I find that > * The average word length is 4.78 characters. > * There are 5652 distinct trigrams (discounting some containing digits), > the most common one (' t') occurring 81222 times; the average > occurrence count is 500.0. > * Considering only trigrams not containing any blanks, there are > 4937 of them, the most common one ('the') occurring 45832 times, > with an average count of 266.9. > * There are (unsurprisingly) exactly 26 trigrams of the form ' x', > with an average count of 19598.5. > * There are 689 trigrams of the form ' xx' or 'xx ', the most common one > (' th') occurring 58200 times, with an average count of 1450.1. > > So, we've rediscovered the fact that 'the' is the most common word in > English text ;-). But I think the significant conclusion for our purposes > here is that single-space trigrams are on average about 1450.1/266.9 = 5.4 > times more common than the average space-free trigram, and two-space > trigrams about 19598.5/266.9 = 73.4 times more common. > > So this leads me to the conclusion that we should be using a > BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than > a square-law rule for two-space trigrams). Even using your numbers, > it shouldn't be 32. > Next revision of patch is attached. Changes are so: 1) Notion "penalty" is used instead of "size". 2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is MAX_TRGM_COUNT total trigrams count. 3) Penalties are assigned to particular color trigram classes. I.e. separate penalties for __a, _aa, _a_, aa_. It's based on analysis of trigram frequencies in Oscar Wilde writings. We can end up with different numbers, but I don't think they will be dramatically different. ------ With best regards, Alexander Korotkov.
trgm-regex-optimize.3.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers