I've checked in an updated kenlm as revision 3847. This involves a binary format change, so you'll need to rebuild from your ARPA files, sorry.
- There's an important correctness fix. Some models contain n-grams like "foo bar baz quux" without their suffix n-grams, e.g. "bar baz quux" and "baz quux", because these were pruned. In these cases, old kenlm probing returns an incorrect probability for "foo bar baz quux" because it searches for n-grams of increasing order. In a stock SRI model with default settings, 1.2% of n-grams were affected, and these were unlikely n-grams anyway, given that pruning was involved. The trie data structure would silently break in this case. I've fixed both data structures to correctly handle these n-grams. Trie memory consumption will increase slightly (~1.2%) as a result.

- There are now more opportunities for recombination. Suppose "foo bar" appears in the model with zero backoff but no trigram begins with "foo bar". Then a hypothesis ending with "foo bar" can recombine with other hypotheses conditioned solely on "bar" (or on even less context, if "bar" is in the same situation). SRI already does this when it is marked in the model (this is the difference between 0.0 and blank in SRI ARPA files). But SRI will miss cases where filtering removed the n-grams that blocked recombination, or where IRST did not leave the backoff blank. KenLM will catch these and allow recombination. More recombination means you may see different results using KenLM, but they should have better model scores.

- Models that contain <unk> inside n-grams are now supported. These can be made using -vocab in SRI.

- Trie building uses less memory. It also takes longer, but that's to support the above features. Making trie building faster is planned for a future release.

- The parser will no longer throw exceptions when your words contain form-feed, carriage return, or vertical tab. Previously, these were interpreted as spaces. So if you have unclean training data, the model will still load.

- The sorted data structure is disabled. It was slower and larger than trie and would have been a pain to fix.
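
To make the first two items concrete, here is a toy Python sketch. It is not KenLM code: the dictionaries, the numbers, and the function names (score_reference, score_increasing, recombination_context) are all made up for illustration, and I'm assuming plain backoff scoring with log10 values as in an ARPA file. It shows how an increasing-order search goes wrong when suffixes are pruned, and how a context with zero backoff and no extensions can be shortened for recombination.

```python
# Toy models: n-gram tuple -> (log10 probability, log10 backoff).
# All entries and numbers are hypothetical.
PRUNED = {
    ("quux",): (-3.0, -0.5),
    ("baz",): (-2.5, -0.4),
    ("bar", "baz"): (-1.1, -0.2),
    ("foo", "bar", "baz"): (-0.7, -0.1),
    # "baz quux" and "bar baz quux" were pruned, but the 4-gram survived.
    ("foo", "bar", "baz", "quux"): (-0.4, 0.0),
}
RECOMBINE = {
    ("foo",): (-2.0, -0.3),
    ("bar",): (-2.1, -0.4),
    ("baz",): (-2.5, 0.0),
    ("foo", "bar"): (-1.0, 0.0),   # zero backoff, no trigram starts with "foo bar"
    ("bar", "baz"): (-1.2, -0.2),  # a longer n-gram starts with "bar"
}

def score_reference(model, context, word):
    """Textbook backoff: try the longest n-gram first, then back off."""
    context = tuple(context)
    ngram = context + (word,)
    if ngram in model:
        return model[ngram][0]
    if not context:
        raise KeyError(word)  # this toy has no <unk> handling
    backoff = model.get(context, (0.0, 0.0))[1]
    return backoff + score_reference(model, context[1:], word)

def score_increasing(model, context, word):
    """Look up orders 1, 2, 3, ... and stop at the first miss."""
    context = tuple(context)
    matched = 0
    for order in range(1, len(context) + 2):
        ngram = context[len(context) - order + 1:] + (word,)
        if ngram not in model:
            break  # a pruned suffix hides every longer n-gram
        matched = order
    if matched == 0:
        raise KeyError(word)
    total = model[context[len(context) - matched + 1:] + (word,)][0]
    # Charge backoff for each context longer than the one that matched.
    for k in range(matched, len(context) + 1):
        total += model.get(context[-k:], (0.0, 0.0))[1]
    return total

def recombination_context(model, context):
    """Drop leading words that cannot change any future score."""
    context = tuple(context)
    while context:
        backoff = model.get(context, (0.0, 0.0))[1]
        extended = any(len(k) > len(context) and k[:len(context)] == context
                       for k in model)
        if backoff != 0.0 or extended:
            break
        context = context[1:]
    return context

if __name__ == "__main__":
    ctx = ("foo", "bar", "baz")
    print(score_reference(PRUNED, ctx, "quux"))
    print(score_increasing(PRUNED, ctx, "quux"))
    print(recombination_context(RECOMBINE, ("foo", "bar")))
```

With these toy numbers the textbook scorer finds the 4-gram directly, while the increasing-order search stops at the missing "baz quux" and falls back to the unigram plus backoffs, which gives a much lower score. The last call reduces the context "foo bar" to just "bar", and stops there because "bar baz" extends "bar".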
Lastly, a question for Moses developers: should the "unsigned int* len" parameter to language models be the length of the n-gram matched, or the length of context that should be kept for purposes of recombination?

Kenneth

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
