On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alexander Korotkov <aekorot...@gmail.com> writes:
> > Revised version of patch with necessary comments.
> I looked at this patch a bit.  It seems like this:
> +  *    BLANK_COLOR_SIZE - How much blank character is more frequent than
> +  *                       other character in average
> + #define BLANK_COLOR_SIZE      32
> is a drastic overestimate.  Using the program Erik provided to generate
> some sample data, I find that blanks in trigrams are maybe about three
> times more common than other characters.  Possibly Erik's data isn't
> very representative of real text, but if that's the case we'd better
> get a more representative sample case before we start optimizing ...
> Now maybe the above number is okay for this purpose anyway, but if so
> it needs some different justification; maybe call it a "penalty"?
> But nonetheless it seems like a darn large penalty, particularly for " x"
> type trigrams where the value would effectively be squared.  Compared to
> the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
> you might as well forget the "sizing" approach and just flat out reject
> trigrams containing COLOR_BLANK, because they'll never get through the
> size filter.

It seems to be right decision to me when we have other trigrams can reject
them. But there are cases when we can't.

>  I'm inclined to think you need a larger MAX_TRGM_SIZE;
> and WISH_TRGM_SIZE seems mighty small as well, compared to what the
> code would have done before.

OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams

Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)

The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)

In order to estimate n I've analyzed some classics:
Ernest Hemingway "A farewell to arms" - 3.8
Leo Tolstoy "War and Peace" - 5.0

In english with alphabet size = 26 we have

__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4

In russian with alphabet size = 33 we have

__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11

Assuming these calculations is approximate, we can safely use same values
for both languages.
Does such reasoning make sense?

Also, surely this bit:
> !       trgmNFA->totalTrgmCount = (int) totalTrgmSize;
> is flat out wrong?  expandColorTrigrams() expects that
> trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
> abstraction.  The fact that the patch hasn't failed on you completely
> demonstrates that you've not tested any cases where blank-containing
> trigrams got through the filter, reinforcing my feeling that it's
> probably too strict now.

Oh, I wonder how can I leave such weird bug in patch :-(

> I wonder if it would be more appropriate to continue to measure the limit
> MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
> penalized "size" as the sort key while determining which color trigrams
> to discard first.

Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.

Another thought is that it's not clear that you should apply the same
> penalty to blanks in all three positions.  Because of the padding
> settings, any one word will normally lead to padded strings "  a",
> " ab", "yz ", so that spaces in the first position are about twice
> as common as spaces in the other positions.  (It's a little less
> than that because single-letter words produce only "  a" and " a ";
> but I'd think that such words aren't very common in text that trigrams
> are effective for.)  So I'm thinking the penalty ought to take that
> into account.
> I'm also inclined to think that it might be worth adding a size
> field to ColorTrgmInfo rather than repeatedly calculating the
> size estimate.  We only allow a thousand and some ColorTrgmInfos
> at most, so the extra space wouldn't be that large, and I'm concerned
> by the expense the current patch adds to the sort comparator.


> Another point is that this comment:
>          * Note: per-color-trigram counts cannot overflow an int so long as
>          * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie
> about
>          * 1290.  However, the grand total totalTrgmCount might conceivably
>          * overflow an int, so we use int64 for that within this routine.
> is no longer valid, or at least it fails to demonstrate that the result
> of getColorTrigramSize() can't overflow an int; indeed I rather fear it
> can.
> The patch failed to even update the variable name in this comment, let
> alone address the substantive question.
> There are some other cosmetic things I didn't like, for instance
> colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
> rename it.  That's about it for substantive comments though.

Thanks, will be fixed.

With best regards,
Alexander Korotkov.

Reply via email to