On second thought, this isn't a bug. The penalty only penalizes low-count 
pairs, as designed.

The problem is that I need rule counts, but I think the solution is to follow 
the Moses route and add those counts as a subsequent field.

matt


> On Oct 14, 2016, at 2:27 PM, Felix Hieber <felix.hie...@gmail.com> wrote:
> 
> Hi Matt,
> Good catch! If you go for 1 + log(count) [any reason for the '1 +'?] it
> probably shouldn't be called RarityPenalty anymore :)
> 
> Cheers,
> Felix
> 
> On Fri, 14 Oct 2016 at 18:34, Matt Post <p...@cs.jhu.edu> wrote:
> 
> And by "very highly attested word pairs", I mean "any word pair with a
> count ≥ 15" (!).
> 
> I am changing this to return
> 
>        1 + Math.log(annotation.count())
> 
> and will commit this after testing.
> 
> matt
> 
> 
>> On Oct 14, 2016, at 12:25 PM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>> Hi folks,
>> 
>> There is a bug in Thrax related to floating point underflow and the
> computation of the rarity penalty. I'm training large models over Europarl
> and other datasets for the Spanish–English language pack, and in an attempt
> to filter the models down to the hundred most frequent candidates, am
> finding that often the rarity penalty is 0. For example:
>> 
>> [X] ||| australia . ||| australia . ||| Lex(e|f)=0.49798 Lex(f|e)=0.45459
> PhrasePenalty=1 RarityPenalty=0 p(e|f)=0.05919 p(f|e)=0.09309 ||| 0-0 1-1
>> 
>> "australia" occurs many times in the training corpus, so there is no
> reason that RarityPenalty should be 0.
>> 
>> Note that the rarity penalty is not a raw count, but is computed as
>> 
>> @Override
>> public Writable score(RuleWritable r, Annotation annotation) {
>>   return new FloatWritable((float) Math.exp(1 - annotation.count()));
>> }
>> 
>> 
> https://github.com/joshua-decoder/thrax/blob/master/src/edu/jhu/thrax/hadoop/features/annotation/RarityPenaltyFeature.java
>> 
>> So the problem seems to be that, for very highly-attested word pairs, the
> counts are so high that the argument to Math.exp() here is a large negative
> number, and the tiny positive result gets truncated to 0 when only five
> decimal places are requested.
>> 
>> I wonder, why the Math.exp(1-x) dance on this value? Why not just have
> the rarity penalty return the log count?
>> 
>> matt
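
For anyone who wants to reproduce the effect discussed in this thread, here is a minimal sketch. It is not Thrax code: the class name RarityPenaltyDemo and the helper fiveDecimals are hypothetical, and the "%.5f" formatting is an assumption standing in for however the grammar writer limits features to five decimal places. It compares the existing exp(1 - count) value with the proposed 1 + log(count) for a few counts.

```java
import java.util.Locale;

public class RarityPenaltyDemo {
    // Formula quoted in the thread: exp(1 - count), narrowed to float.
    static float rarityPenalty(int count) {
        return (float) Math.exp(1 - count);
    }

    // Alternative proposed in the thread: 1 + log(count).
    static double logCount(int count) {
        return 1 + Math.log(count);
    }

    // Assumption: features are written with five decimal places.
    static String fiveDecimals(double value) {
        return String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        System.out.println("count\texp(1-count)\t1+log(count)");
        for (int count : new int[] {1, 2, 5, 10, 15, 100}) {
            System.out.println(count + "\t"
                + fiveDecimals(rarityPenalty(count)) + "\t"
                + fiveDecimals(logCount(count)));
        }
    }
}
```

With a count of 15 the float value is still strictly positive (about 8.3e-7), but it prints as 0.00000, which matches the RarityPenalty=0 seen in the grammar. The value is only lost at formatting time, which fits the later conclusion in the thread that the feature behaves as designed and penalizes only low-count pairs.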
