Hi,
There are a few differences, most of which I'd expect you're fine with.
- The discounts are different, but you're using --discount_fallback, so
you already know that.
- Unknown word handling is different. If you want SRI's (IMHO broken)
behavior, pass --interpolate_unigrams 0 (though if your vocabulary is
small enough it might actually behave like the default
--interpolate_unigrams 1, since SRI switches between the two based on
p(<unk>)).
- SRI's default prunes singletons of order 3 and above based on adjusted
counts. lmplz does not prune by default; if you turn pruning on, it is
based on raw counts (not adjusted counts). The closest analogous
invocations are sketched below.
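For pruning, the nearest pair would be roughly the following (a sketch,
assuming I have the option syntax right; they will not produce identical
models because of the raw vs. adjusted count difference):
lmplz -o 5 --prune 0 0 1 <text >text.arpa
ngram-count -order 5 -interpolate -kndiscount -gt3min 2 -gt4min 2
-gt5min 2 -text text -lm text.arpa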
To summarize, the following two are equivalent up to floating-point
rounding (SRI is less precise):
lmplz -o 5 --interpolate_unigrams 0 <text >text.arpa
ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1
-gt5min 1 -text text -lm text.arpa
I do not think they can be made equivalent in a --discount_fallback
scenario.
Kenneth
On 02/18/2016 02:50 PM, jeremy wrote:
> Hi,
>
> Does anyone know if there are any gotchas using lmplz instead of
> ngram-count during transliteration model training? I'm trying it out
> using the --discount_fallback option and lmplz's default behavior should
> match the -interpolate option for ngram-count (I think?)
>
> Thanks!
> -Jeremy
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support