Dear Moses,
Trunk revision 4247 incorporates KenLM changes from MT Marathon
(team: Hieu Hoang, Tetsuo Kiso, Marcello Federico, and myself) to
minimize left language model state for chart decoding. This resulted in
a binary file format change.
Previously, if you used e.g. a 5-gram language model, the chart
entries would be separated by their first 4 words (in addition to other
constraints). This change relaxes this to only as many words as
required for correct scoring, leading to more recombination (so
theoretically, you could lower the pop limit). Further, the left state
keeps pointers instead of word indices, which makes language model
scoring faster. This change only impacts KenLM; other language models
will still keep 4 words (IRSTLM is invited to read kenlm/lm/left.hh and
implement the same interface). As a result, you should expect better
model scores on average when using KenLM (in theory, the extra
recombination can keep a hypothesis alive that later kicks out what
would have become the single-best derivation, so an individual
sentence's score can drop). Chart decoding also runs about 5% faster
with the same pruning settings.
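To illustrate the idea of minimizing left state: a word at the left edge of a chart entry only needs to stay in the state while some longer n-gram in the model could still extend it to the left and change its score. Here is a toy sketch in Python; this is not KenLM's actual implementation or API (KenLM stores pointers in C++, see kenlm/lm/left.hh), just the recombination criterion in miniature, with the model represented as a plain set of n-gram tuples:

```python
def minimal_left_state(words, ngrams, order):
    """Toy left-state minimization.

    Keep w_1..w_i in the state only while the model contains a longer
    n-gram ending in w_1..w_i, i.e. while additional left context could
    still change the score. Hypotheses are then keyed on this (often
    shorter) state, so more of them recombine.

    words:  tuple of words at the left edge of a chart entry
    ngrams: set of n-gram tuples present in the model
    order:  the model's order (e.g. 5 for a 5-gram model)
    """
    state = []
    for i in range(min(len(words), order - 1)):
        prefix = tuple(words[:i + 1])
        # Does any longer n-gram in the model end with this prefix?
        # If so, left context can still affect its probability.
        if any(len(g) > len(prefix) and g[-len(prefix):] == prefix
               for g in ngrams):
            state.append(words[i])
        else:
            break  # score is final; drop this word and everything after
    return tuple(state)
```

With ngrams = {("x", "a"), ("a", "b"), ("x", "a", "b")}, the edge ("a", "b", "c") reduces to the state ("a", "b") rather than the fixed order-minus-one prefix, and an edge starting with a word no n-gram extends gets an empty state, so such entries recombine freely.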
When SRILM's default pruning keeps n-gram A B C D E, but removes B C
D E, this leads to several nasty corner cases. Previously, I
re-inserted B C D E with a blank probability. To avoid the corner
cases, KenLM now fully restores these entries: p(B C D E) = p(C D E) +
backoff(B C D) in log space, where p(C D E) may itself be restored.
This led to major
changes in the trie builder, but it's passing tests. Since the blank
probability no longer needs to be encoded, quantization now gives you
the full 2^b probability values instead of 2^b - 1 (but backoff still
reserves two values for +/- 0).
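The restoration rule is just the standard backoff identity applied recursively. A minimal sketch (my own illustration, not the trie builder's code; the dictionaries of log10 probabilities and backoffs stand in for the model):

```python
def restored_logprob(ngram, probs, backoffs):
    """Restore a pruned n-gram's log10 probability via backoff:

        p(w | context) = p(w | shorter context) + backoff(context)

    e.g. p(B C D E) = p(C D E) + backoff(B C D), where p(C D E) may
    itself need restoring. Unigram probabilities are assumed present.
    """
    if ngram in probs:
        return probs[ngram]
    context = ngram[:-1]       # B C D
    shorter = ngram[1:]        # C D E
    # Missing backoffs are 0 in log space (backoff weight 1).
    return restored_logprob(shorter, probs, backoffs) + backoffs.get(context, 0.0)
```

For example, if the model kept p(C D E) = -1.5 and backoff(B C D) = -0.3 but SRILM pruned B C D E, the entry is restored as -1.5 + -0.3 = -1.8.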
We've tested that LM scores come out correctly and that the average
model score goes up; I'm running more experiments. Tom Hoar, here's
your cue.
Kenneth
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support