I've checked in an updated kenlm as revision 3847.  This involves a
binary format change, so you'll need to rebuild from your ARPA files,
sorry.

- There's an important correctness fix.  Some models contain n-grams
like "foo bar baz quux" without their suffix n-grams, e.g. "bar baz
quux" and "baz quux", because these were pruned.  In these cases, old
kenlm probing returns an incorrect probability for "foo bar baz quux"
because it searches for n-grams of increasing order and stops at the
first one that is missing.  In a stock SRI model with default settings,
1.2% of n-grams were impacted and, since pruning was involved, these
were unlikely n-grams anyway.  The trie data structure would silently
break in this case.  I've fixed both data structures to correctly handle
these n-grams.  Trie memory consumption will increase slightly (~1.2%)
as a result.

- There are now more opportunities for recombination.  Suppose "foo bar"
appears in the model with zero backoff but no trigram begins with "foo
bar".  Then a hypothesis ending with "foo bar" can recombine with other
hypotheses conditioned solely on "bar" (or on even less, if the same
holds for "bar").  SRI already does this when it is marked in the model
(this is the difference between 0.0 and blank in SRI ARPA files).  But
SRI will miss cases where filtering has removed the n-grams that blocked
recombination, or where IRST did not leave the backoff blank.  KenLM
will catch these and allow recombination.  More recombination means you
may see different results using KenLM, but they should have better model
scores.

- Models that contain <unk> inside n-grams are now supported.  These can
be made using -vocab in SRI.

- Trie building uses less memory.  It also takes longer, but that's to
support the above features.  Making trie building faster is planned for
a future release.

- The parser will no longer throw exceptions when your words contain
form-feed, carriage return, or vertical tab.  Previously, these were
interpreted as spaces.  So if you have unclean training data, the model
will still load.

- The sorted data structure is disabled.  It was slower and larger than
trie and would have been a pain to fix.
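
To illustrate the correctness fix in the first item, here is a toy
Python sketch.  It is not KenLM code: the model contents and the
function names are made up, and the second function only shows the
result the lookup should give, not how the fix is implemented.

```python
# Toy model: n-gram tuple -> log10 probability.  Values are invented.
# "bar baz quux" and "baz quux" are deliberately absent, as if pruned.
TOY_MODEL = {
    ("foo",): -1.0,
    ("bar",): -1.1,
    ("baz",): -1.2,
    ("quux",): -2.0,
    ("foo", "bar", "baz", "quux"): -0.3,
}


def longest_match_increasing(model, words):
    """Grow the n-gram one word at a time and stop at the first miss."""
    best = None
    for order in range(1, len(words) + 1):
        ngram = tuple(words[-order:])
        if ngram not in model:
            break  # gives up here and never tries higher orders
        best = ngram
    return best


def longest_match_exhaustive(model, words):
    """Try every order, longest first, so a pruned suffix cannot hide
    a longer n-gram that is still in the model."""
    for order in range(len(words), 0, -1):
        ngram = tuple(words[-order:])
        if ngram in model:
            return ngram
    return None


query = ["foo", "bar", "baz", "quux"]
print(longest_match_increasing(TOY_MODEL, query))
print(longest_match_exhaustive(TOY_MODEL, query))
```

The increasing-order search stops at "quux" because "baz quux" is
missing, while the exhaustive search still finds the full 4-gram.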
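
The recombination rule in the second item can be sketched the same way.
This is a toy under stated assumptions, not KenLM's implementation: the
model is a plain dict from n-gram to backoff weight, every n-gram in the
model is a key, and the names minimal_context and is_extended are
hypothetical.

```python
# Toy model: n-gram tuple -> backoff weight (log10).  Values are invented.
# Every n-gram in the model is a key; 0.0 means zero backoff.
TOY_BACKOFF = {
    ("foo",): -0.5,
    ("bar",): -0.4,
    ("baz",): 0.0,
    ("foo", "bar"): 0.0,
    ("bar", "baz"): -0.2,
}


def is_extended(backoffs, context):
    """True if some longer n-gram in the model begins with `context`."""
    n = len(context)
    return any(len(g) > n and g[:n] == context for g in backoffs)


def minimal_context(backoffs, words):
    """Drop leading words while the context has zero backoff and no
    longer n-gram begins with it; what remains is all that matters
    for scoring the next word."""
    context = tuple(words)
    while context:
        has_backoff = backoffs.get(context, 0.0) != 0.0
        if has_backoff or is_extended(backoffs, context):
            break
        context = context[1:]
    return context


print(minimal_context(TOY_BACKOFF, ["foo", "bar"]))
print(minimal_context(TOY_BACKOFF, ["bar", "baz"]))
print(minimal_context(TOY_BACKOFF, ["foo", "baz"]))
```

Here "foo bar" shrinks to "bar", "bar baz" is kept whole because it has
a non-zero backoff, and "foo baz" shrinks to the empty context because
"baz" also has zero backoff and nothing begins with it.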
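
For the parser item, the sketch below shows why treating form-feed as a
space breaks a line.  It assumes, for illustration only, that fields are
separated by space, tab and newline; the sample line and the function
names are made up and this is not the KenLM parser.

```python
import re

# Assumption for this sketch: only space, tab and newline separate fields.
DELIMITERS = re.compile(r"[ \t\n]+")


def split_all_whitespace(line):
    """Split on any whitespace; str.split() with no argument also
    splits on form-feed, carriage return and vertical tab."""
    return line.split()


def split_delimiters_only(line):
    """Split only on the assumed delimiters, so other control
    characters stay inside the word."""
    return [tok for tok in DELIMITERS.split(line) if tok]


# An ARPA-style line whose first word contains a form-feed (\x0c).
line = "-1.5\tfoo\x0cbar baz\t-0.3\n"
print(len(split_all_whitespace(line)))
print(len(split_delimiters_only(line)))
```

Splitting on all whitespace yields one field too many for a bigram
line, whereas splitting on the assumed delimiters keeps "foo\x0cbar" as
a single word.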

Lastly, a question to Moses developers about the "unsigned int* len"
parameter to language models: should that be the length of the n-gram
matched, or the length of context that should be kept for purposes of
recombination?

Kenneth
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
