Hi,

        There are a few differences, most of which I'd expect you're fine with.

- The discounts are different, but you're using --discount_fallback, so
you already know that.

- Unknown word handling is different.  If you want SRI's (IMHO broken)
behavior, pass --interpolate_unigrams 0 (though if your vocabulary is
small enough, it might actually behave like the default
--interpolate_unigrams 1, since SRI switches between the two based on
p(<unk>)).

- SRI's default prunes singletons of order 3 and above based on adjusted
count.  lmplz doesn't prune by default, but if you turn pruning on, it is
based on count (not adjusted count); see the sketch after this list.
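
As a rough sketch of what turning pruning on looks like (assuming lmplz's
--prune option, where a threshold of 1 at a given order drops singletons
at that order and 0 disables pruning), something loosely analogous to
SRI's default would be:

lmplz -o 5 --prune 0 0 1 1 1 <text >text.arpa

but since those thresholds apply to counts rather than adjusted counts,
the result still won't match ngram-count's default pruning.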

To summarize, the following two commands are equivalent up to
floating-point rounding (SRI is less precise):

lmplz -o 5 --interpolate_unigrams 0 <text >text.arpa

ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 \
  -gt5min 1 -text text -lm text.arpa
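
If you want to check how close they actually come out, one way (a sketch,
using hypothetical file names lmplz.arpa and srilm.arpa for the two
outputs and a held-out file heldout.txt) is to score the same text with
both models using the same tool, e.g. SRILM's ngram:

ngram -order 5 -unk -lm lmplz.arpa -ppl heldout.txt
ngram -order 5 -unk -lm srilm.arpa -ppl heldout.txt

The reported perplexities should agree up to the rounding mentioned above.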

I do not think they can be made equivalent in a --discount_fallback
scenario.
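
For reference, the fallback setup Jeremy describes would look something
like this (a sketch; --discount_fallback substitutes fixed discounts when
the estimated ones come out invalid):

lmplz -o 5 --discount_fallback <text >text.arpa

Since SRILM estimates its discounts differently, I don't think any
ngram-count invocation will reproduce those substituted values exactly.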

Kenneth

On 02/18/2016 02:50 PM, jeremy wrote:
> Hi,
> 
> Does anyone know if there are any gotchas using lmplz instead of 
> ngram-count during transliteration model training? I'm trying it out 
> using the --discount_fallback option and lmplz's default behavior should 
> match the -interpolate option for ngram-count (I think?)
> 
> Thanks!
> -Jeremy