Excellent, sounds good Kenneth, thanks for the work, can't wait to try it
out.

Kind regards,

Lee Ball
Infrastructure Manager
lee.b...@appliedlanguage.com
Skype ID: lee.ball_appliedlanguage
Tel: +44 (0)844 854 8945

Applied Language Solutions
High quality language solutions delivered on time...with a smile!

www.appliedlanguage.com
Tel (UK): +44 (0)845 367 7000
Tel (US): +1 (800) 579-5010

Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ. UK
Registered in the UK 5122429

Pride in everything we do | Respect everyone like a friend
Think of the environment; please
don't print this e-mail unless you really need to.




On 25 January 2011 19:43, Kenneth Heafield <mo...@kheafield.com> wrote:

> I've checked in an updated kenlm as revision 3847.  This involves a
> binary format change, so you'll need to rebuild from your ARPA files,
> sorry.
>
> - There's an important correctness fix.  Some models contain n-grams
> like "foo bar baz quux" without their suffix n-grams, e.g. "bar baz
> quux" and "baz quux", because these were pruned.  In these cases, old
> kenlm probing returns an incorrect probability for "foo bar baz quux"
> because it searches for n-grams of increasing order.  In a stock SRI
> model with default settings, 1.2% of n-grams were affected, and these
> were unlikely n-grams anyway because pruning happened.  The trie data
> structure would silently break in this case.  I've fixed both data
> structures to correctly handle these n-grams.  Trie memory consumption
> will increase slightly (~1.2%) as a result.
>
> - There are now more opportunities for recombination.  Suppose "foo bar"
> appears in the model with zero backoff but no trigram begins with "foo
> bar".  Then a hypothesis ending with "foo bar" can recombine with other
> hypotheses conditioned solely on "bar" (or on even less, if "bar" is in
> the same situation).  SRI already does this when marked in the model
> (this is the difference between 0.0 and blank in SRI ARPA files).  But
> SRI will miss cases where filtering removes the n-grams that blocked
> recombination, or where IRST does not leave the backoff blank.  KenLM
> will catch these and allow recombination.  More recombination means you
> may see different results using KenLM, but they should have better
> model scores.
>
> - Models that contain <unk> inside n-grams are now supported.  These can
> be made using -vocab in SRI.
>
> - Trie building uses less memory.  It also takes longer, but that's to
> support the above features.  Making trie building faster is planned for
> a future release.
>
> - The parser will no longer throw exceptions when your words contain
> form-feed, carriage return, or vertical tab.  Previously, these were
> interpreted as spaces.  So if you have unclean training data, the model
> will still load.
>
> - The sorted data structure is disabled.  It was slower and larger than
> trie and would have been a pain to fix.
>
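
The correctness issue in the first item above can be illustrated with a toy sketch. This is not KenLM code and does not show how the fix was implemented (the message does not say). The model contents, probabilities, and function names are all made up for illustration. It only contrasts a lookup that extends the match one word at a time with one that checks every suffix.

```python
# Toy model: n-gram tuples mapped to log10 probabilities.
# Every entry and value here is hypothetical.
TOY_MODEL = {
    ("foo",): -1.5,
    ("bar",): -1.5,
    ("baz",): -1.5,
    ("quux",): -2.0,
    ("foo", "bar", "baz", "quux"): -0.3,  # survived pruning
    # ("baz", "quux") and ("bar", "baz", "quux") were pruned away
}


def longest_match_increasing(model, words):
    """Grow the matched suffix one word at a time; stop at the first miss."""
    best = None
    for order in range(1, len(words) + 1):
        ngram = tuple(words[-order:])
        if ngram not in model:
            break  # a pruned suffix hides any longer entry
        best = ngram
    return best


def longest_match_exhaustive(model, words):
    """Check every suffix, longest first, so gaps do not hide entries."""
    for order in range(len(words), 0, -1):
        ngram = tuple(words[-order:])
        if ngram in model:
            return ngram
    return None


query = ["foo", "bar", "baz", "quux"]
print(longest_match_increasing(TOY_MODEL, query))
print(longest_match_exhaustive(TOY_MODEL, query))
```

In this toy model the increasing-order search stops at the unigram "quux" because "baz quux" is missing, so it never reaches the 4-gram that is actually in the model.
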
> Lastly, a question to Moses developers: the "unsigned int* len"
> parameter to language models: should that be the length of the n-gram
> matched or the length of context that should be kept for purposes of
> recombination?
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
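
The recombination rule described in the quoted message can be sketched as follows. This is a minimal illustration under stated assumptions, not KenLM's implementation. The n-gram set, the backoff values, and the helper names `extends` and `kept_context` are hypothetical. The assumption is that a context can be shortened when it has zero (log10) backoff and no longer n-gram in the model begins with it.

```python
def extends(ngrams, context):
    """True if some longer n-gram in the model begins with `context`."""
    n = len(context)
    return any(len(g) > n and g[:n] == context for g in ngrams)


def kept_context(ngrams, backoff, history, max_order):
    """Return the shortest context suffix that can still affect scoring.

    `backoff` maps n-gram tuples to log10 backoff weights; a missing
    entry is treated as 0.0 (weight 1), which is an assumption of
    this sketch.
    """
    context = tuple(history[-(max_order - 1):]) if max_order > 1 else ()
    while context:
        if extends(ngrams, context) or backoff.get(context, 0.0) != 0.0:
            break  # this context can still change a future score
        context = context[1:]  # drop the oldest word and retry
    return context


# Hypothetical trigram model contents.
NGRAMS = {("foo",), ("bar",), ("baz",), ("foo", "bar"), ("bar", "baz")}
BACKOFF = {("foo",): -0.4, ("bar",): -0.2, ("foo", "bar"): 0.0}

print(kept_context(NGRAMS, BACKOFF, ["x", "foo", "bar"], 3))
print(kept_context(NGRAMS, BACKOFF, ["foo", "baz"], 3))
```

Under these assumptions, two hypotheses whose histories reduce to the same kept context could be recombined. The first call keeps only "bar", matching the "foo bar" example in the message; the second keeps nothing, which is the "even less" case.
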
