Excellent, sounds good, Kenneth. Thanks for the work; I can't wait to try it out.
Kind regards,

Lee Ball
Infrastructure Manager
Applied Language Solutions
lee.b...@appliedlanguage.com
www.appliedlanguage.com

On 25 January 2011 19:43, Kenneth Heafield <mo...@kheafield.com> wrote:

> I've checked in an updated kenlm as revision 3847. This involves a
> binary format change, so you'll need to rebuild from your ARPA files,
> sorry.
>
> - There's an important correctness fix. Some models contain n-grams
> like "foo bar baz quux" without their suffix n-grams, e.g. "bar baz
> quux" and "baz quux", because these were pruned. In these cases, old
> kenlm probing returns an incorrect probability for "foo bar baz quux"
> because it searches for n-grams of increasing order. In a stock SRI
> model with default settings, 1.2% of n-grams were affected; moreover,
> these were unlikely n-grams because pruning happened. The trie data
> structure would silently break in this case. I've fixed both data
> structures to handle these n-grams correctly. Trie memory consumption
> will increase slightly (~1.2%) as a result.
>
> - There are now more opportunities for recombination. Suppose "foo bar"
> appears in the model with zero backoff but no trigram begins with "foo
> bar". Then a hypothesis ending with "foo bar" can recombine with other
> hypotheses conditioned solely on "bar" (or even less if "bar" is
> similarly situated). SRI already does this when marked in the model
> (this is the difference between 0.0 and blank in SRI ARPA files). But
> SRI will miss cases where filtering removes the n-grams blocking
> recombination, or where IRST does not leave the backoff blank. KenLM
> will catch these and allow recombination. More recombination means you
> may see different results using KenLM, but they should have better
> model scores.
>
> - Models that contain <unk> inside n-grams are now supported. These can
> be made using -vocab in SRI.
>
> - Trie building uses less memory. It also takes longer, but that's to
> support the above features. Making trie building faster is planned for
> a future release.
>
> - The parser will no longer throw exceptions when your words contain
> form-feed, carriage return, or vertical tab. Previously, these were
> interpreted as spaces. So if you have unclean training data, the model
> will still load.
>
> - The sorted data structure is disabled. It was slower and larger than
> trie and would have been a pain to fix.
>
> Lastly, a question to Moses developers: the "unsigned int* len"
> parameter to language models: should that be the length of the n-gram
> matched or the length of context that should be kept for purposes of
> recombination?
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
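For anyone following the correctness fix in the quoted message, here is a toy sketch of why a search in increasing n-gram order can go wrong when suffix n-grams have been pruned. It is only an illustration under my own assumptions: the dictionary `MODEL`, the helper names and all the numbers are invented, and nothing here is KenLM's actual code or its actual fix. The second function simply uses the usual longest-match-first backoff rule as a reference answer.

```python
# Toy ARPA-style model: n-gram tuple -> (log10 probability, log10 backoff).
# Every entry and number is made up for illustration.
MODEL = {
    ("foo",): (-1.0, -0.5),
    ("bar",): (-1.0, -0.4),
    ("baz",): (-1.0, -0.3),
    ("quux",): (-2.0, 0.0),
    ("foo", "bar"): (-0.7, -0.2),
    ("bar", "baz"): (-0.6, -0.2),
    ("foo", "bar", "baz"): (-0.5, -0.1),
    # The 4-gram is present, but its suffixes ("bar", "baz", "quux") and
    # ("baz", "quux") are absent, as if they had been pruned.
    ("foo", "bar", "baz", "quux"): (-0.3, 0.0),
}


def backoff_weight(context):
    """Return the backoff of a context, or 0.0 (log10 of 1) if it is absent."""
    return MODEL[context][1] if context in MODEL else 0.0


def score_increasing(context, word):
    """Try orders 1, 2, 3, ... and stop at the first n-gram that is missing."""
    logprob, matched = None, 0
    for order in range(1, len(context) + 2):
        ngram = context[len(context) - order + 1:] + (word,)
        if ngram not in MODEL:
            break  # a pruned suffix hides every longer n-gram
        logprob, matched = MODEL[ngram][0], order
    if logprob is None:
        raise KeyError(word)  # the toy has no <unk> handling
    # Charge backoff for each context longer than the one that matched.
    for length in range(matched, len(context) + 1):
        logprob += backoff_weight(context[len(context) - length:])
    return logprob


def score_longest_first(context, word):
    """Standard backoff: use the longest n-gram present, else back off."""
    penalty = 0.0
    for start in range(len(context) + 1):
        ngram = context[start:] + (word,)
        if ngram in MODEL:
            return MODEL[ngram][0] + penalty
        penalty += backoff_weight(context[start:])
    raise KeyError(word)  # the toy has no <unk> handling


if __name__ == "__main__":
    ctx = ("foo", "bar", "baz")
    print("increasing order:", score_increasing(ctx, "quux"))
    print("longest first:   ", score_longest_first(ctx, "quux"))
```

In this toy, the increasing-order search gives up at the missing "baz quux" and never reaches the 4-gram, so it ends up scoring the unigram plus backoff penalties, while the longest-first search returns the 4-gram's own probability. When no suffix is missing, the two functions agree.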