And by "very highly attested word pairs", I mean "any word pair with a count ≥ 
15" (!).

I am changing this to return

        1 + Math.log(annotation.count())

and will commit this after testing.
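For the record, here is a small standalone sketch (assumed values, not actual Thrax output) of why the old exp(1 - count) feature prints as 0 at five decimal places while the proposed 1 + log(count) form stays in a printable range:

```java
public class RarityDemo {
    public static void main(String[] args) {
        int[] counts = {1, 5, 15, 50};
        for (int count : counts) {
            // Old feature: exp(1 - count) underflows toward zero as the count
            // grows, and %.5f formatting rounds it to 0.00000.
            float old = (float) Math.exp(1 - count);
            // Proposed feature: 1 + log(count) grows slowly with the count and
            // never vanishes under five-decimal formatting.
            float fixed = (float) (1 + Math.log(count));
            System.out.printf("count=%d old=%.5f new=%.5f%n", count, old, fixed);
        }
    }
}
```

At count = 15 the old value is exp(-14) ≈ 8.3e-7, which already rounds to 0.00000, while the new value is about 3.7.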

matt


> On Oct 14, 2016, at 12:25 PM, Matt Post <p...@cs.jhu.edu> wrote:
> 
> Hi folks,
> 
> There is a bug in Thrax related to floating point underflow and the 
> computation of the rarity penalty. I'm training large models over Europarl 
> and other datasets for the Spanish–English language pack, and in an attempt 
> to filter the models down to the hundred most frequent candidates, am finding 
> that often the rarity penalty is 0. For example:
> 
> [X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459 
> PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1
> 
> "australia" occurs many times in the training corpus, so there is no reason 
> that RarityPenalty should be 0.
> 
> Note that the rarity penalty is not a raw count, but is computed as
> 
>  @Override
>  public Writable score(RuleWritable r, Annotation annotation) {
>    return new FloatWritable((float) Math.exp(1 - annotation.count()));
>  }
> 
> https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java
> 
> So the problem seems to be that, for very highly-attested word pairs, the 
> counts are so high that the exponent here is a large negative number, so 
> Math.exp() underflows toward zero and gets truncated to 0 when only five 
> decimal places are printed.
> 
> I wonder, why the Math.exp(1-x) dance on this value? Why not just have the 
> rarity penalty return the log count?
> 
> matt