Hi folks,

There is a bug in Thrax in the computation of the rarity penalty, related to 
floating point underflow. I'm training large models over Europarl and other 
datasets for the Spanish–English language pack. While trying to filter the 
models down to the hundred most frequent candidates, I'm finding that the 
rarity penalty is often 0. For example:

[X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459 
PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1

"australia" occurs many times in the training corpus, so there is no reason 
that RarityPenalty should be 0.

Note that the rarity penalty is not a raw count, but is computed as

  public Writable score(RuleWritable r, Annotation annotation) {
    return new FloatWritable((float) Math.exp(1 - annotation.count()));
  }

So the problem seems to be that, for very highly attested word pairs, the 
counts are so high that the exponent (1 - count) is a large negative number. 
Math.exp() then returns a tiny positive value, which gets rounded to 0 when 
only five decimal places are requested.
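
To make the effect concrete, here is a minimal standalone sketch. It reuses the 
expression from score() above; the class name, the helper names, and the 
"%.5f" formatting are my own assumptions for illustration, not Thrax code.

```java
import java.util.Locale;

class RarityPenaltyDemo {
    // Same expression as in the score() method quoted above.
    static float rarityPenalty(int count) {
        return (float) Math.exp(1 - count);
    }

    // Assumption: the feature value is printed with five decimal places.
    static String fiveDecimals(float value) {
        return String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        // Print count, raw float value, and the five-decimal rendering.
        for (int count : new int[] {1, 2, 5, 13, 14, 1000}) {
            float p = rarityPenalty(count);
            System.out.println(count + "\t" + p + "\t" + fiveDecimals(p));
        }
    }
}
```

With this formatting, a count of 14 already prints as 0.00000, and for a count 
of 1000 the float itself is exactly 0.0.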

I wonder, why the Math.exp(1-x) dance on this value? Why not just have the 
rarity penalty return the log count?
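
For comparison, a sketch of what I mean by the log count. The class and method 
names are hypothetical, and this is not a patch against Thrax, just the 
arithmetic:

```java
class LogCountDemo {
    // Hypothetical alternative feature: natural log of the rule count.
    static float logCount(int count) {
        return (float) Math.log(count);
    }

    public static void main(String[] args) {
        // The value grows slowly with the count instead of collapsing to 0.
        for (int count : new int[] {1, 14, 1000, 1000000}) {
            System.out.println(count + "\t" + logCount(count));
        }
    }
}
```

Note that this value grows with frequency, whereas the current feature shrinks, 
so any tuned weight for it would presumably change sign. A count of 1 maps to 
0 here.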

