Hi folks,
There is a bug in Thrax related to floating-point underflow in the computation
of the rarity penalty. I'm training large models over Europarl and other
datasets for the Spanish–English language pack, and in an attempt to filter the
models down to the hundred most frequent candidates, I'm finding that the
rarity penalty is often 0. For example:
[X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459
PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1
"australia" occurs many times in the training corpus, so there is no reason
that RarityPenalty should be 0.
Note that the rarity penalty is not a raw count, but is computed as
    @Override
    public Writable score(RuleWritable r, Annotation annotation) {
        return new FloatWritable((float) Math.exp(1 - annotation.count()));
    }
https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java
So the problem seems to be that, for very highly attested word pairs, the
counts are so high that the exponent (1 - count) is a large negative number:
Math.exp() then underflows toward zero, and the result gets truncated to 0
when only five decimal places are printed.
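The underflow is easy to reproduce outside of Hadoop. A standalone sketch in
plain Java (no Thrax types, counts chosen just for illustration):

```java
public class RarityUnderflowDemo {
    public static void main(String[] args) {
        // Modest count: exp(1 - 5) = exp(-4) is still representable.
        float small = (float) Math.exp(1 - 5);
        System.out.printf("%.5f%n", small);   // prints 0.01832

        // Moderate count: exp(1 - 15) = exp(-14) ~ 8.3e-7 is nonzero,
        // but rounds to 0 at five decimal places.
        float medium = (float) Math.exp(1 - 15);
        System.out.printf("%.5f%n", medium);  // prints 0.00000

        // High count (a frequent pair like "australia"): exp(-999)
        // underflows to exactly 0.0 before any formatting happens.
        float large = (float) Math.exp(1 - 1000);
        System.out.printf("%.5f%n", large);   // prints 0.00000
    }
}
```

So there are really two effects: for moderately frequent rules the value is
nonzero but invisible at five decimal places, and for very frequent rules the
float itself underflows to exactly 0.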
I wonder: why the Math.exp(1 - x) dance on this value? Why not just have the
rarity penalty return the log count?
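For comparison, a log-count scorer would stay comfortably representable even
for very frequent rules. This is just a hypothetical sketch of the idea in
plain Java (the real feature would keep the RuleWritable/Annotation/
FloatWritable signature shown above):

```java
public class LogCountScore {
    // Hypothetical alternative to exp(1 - count): return log(count).
    // The log grows slowly, so even rules seen millions of times yield
    // small, distinguishable values at five decimal places.
    static float score(long count) {
        return (float) Math.log(count);
    }

    public static void main(String[] args) {
        System.out.printf("%.5f%n", score(1));     // prints 0.00000
        System.out.printf("%.5f%n", score(1000));  // prints 6.90776
    }
}
```

With this, a count of 1 maps to 0 and higher counts to modest positive values,
so the feature weight can still penalize rarity without any underflow.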
matt