And by "very highly attested word pairs", I mean "any word pair with a count ≥
15" (!).
I am changing this to return
1 + Math.log(annotation.count())
and will commit this after testing.
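For reference, here is a quick sketch of the difference (the class and method names below are mine for illustration, not Thrax code): the old exp-based formula rounds to 0.00000 at five decimal places once the count reaches 15, while the log-based replacement keeps growing slowly and stays printable.

```java
// Sketch comparing the old and proposed rarity penalty formulas.
// Not the committed Thrax code; names here are illustrative only.
public class RarityCheck {

    // Old formula: exp(1 - count) underflows to 0.00000 at five
    // decimal places once count >= 15.
    static float oldPenalty(int count) {
        return (float) Math.exp(1 - count);
    }

    // Proposed replacement: 1 + log(count) grows slowly and never
    // rounds to 0 for any positive count.
    static float newPenalty(int count) {
        return (float) (1 + Math.log(count));
    }

    public static void main(String[] args) {
        for (int count : new int[] {1, 5, 15, 1000}) {
            System.out.printf("count=%-5d old=%.5f new=%.5f%n",
                    count, oldPenalty(count), newPenalty(count));
        }
    }
}
```

At count = 15 the old penalty is exp(-14) ≈ 8.3e-7, which prints as 0.00000, whereas the new value is 1 + ln(15) ≈ 3.70805.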
matt
> On Oct 14, 2016, at 12:25 PM, Matt Post <[email protected]> wrote:
>
> Hi folks,
>
> There is a bug in Thrax related to floating point underflow and the
> computation of the rarity penalty. I'm training large models over Europarl
> and other datasets for the Spanish–English language pack, and in an attempt
> to filter the models down to the hundred most frequent candidates, am finding
> that the rarity penalty is often 0. For example:
>
> [X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459
> PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1
>
> "australia" occurs many times in the training corpus, so there is no reason
> that RarityPenalty should be 0.
>
> Note that the rarity penalty is not a raw count, but is computed as
>
> @Override
> public Writable score(RuleWritable r, Annotation annotation) {
>     return new FloatWritable((float) Math.exp(1 - annotation.count()));
> }
>
> https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java
>
> So the problem seems to be that, for very highly attested word pairs, the
> counts are so high that the argument to Math.exp() is a large negative
> number, so the result is small enough to be truncated to 0 when only five
> decimal places are printed.
>
> I wonder, why the Math.exp(1-x) dance on this value? Why not just have the
> rarity penalty return the log count?
>
> matt