Dear Moses,

        KenLM now estimates modified Kneser-Ney language models from text.
This is done with streaming on-disk algorithms where you pick the memory
buffer size, so you can build much larger language models (e.g. on all
the data allowed by WMT 2013) without running out of RAM.

        It is available in Moses master as of commit fc5868d and as a
standalone download from http://kheafield.com/code/kenlm.tar.gz.  The
command line is relatively simple:

bin/lmplz -o 5 <text >text.arpa

The memory usage (-S 80%) and temporary file location (-T /tmp) options
use the same syntax as the corresponding GNU sort options.
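
As a rough sketch of how these options combine with the command above: the
variable names and the existence check below are only illustrative and are
not part of KenLM, and text / text.arpa are the same placeholder file names
used above.

```shell
#!/bin/sh
# Illustrative wrapper around lmplz; the variable names are hypothetical.
LMPLZ=bin/lmplz      # path to the lmplz binary, as in the command above
ORDER=5              # n-gram order (-o)
MEMORY=80%           # memory budget, GNU sort style (-S)
TMP_LOCATION=/tmp    # temporary file location, GNU sort style (-T)

# Assemble the options once so they can be inspected before running.
CMD="$LMPLZ -o $ORDER -S $MEMORY -T $TMP_LOCATION"
echo "$CMD <text >text.arpa"

# Only run when the binary and the input corpus are actually present.
if [ -x "$LMPLZ" ] && [ -f text ]; then
    $CMD <text >text.arpa
fi
```

Lowering the -S value trades speed for a smaller memory footprint, and -T
should point at a disk with enough free space for the temporary files.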

        There is NO PRUNING, so the comparable SRILM command line is

ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 \
    -gt5min 1 -text text -lm text.arpa

For more documentation, see http://kheafield.com/code/kenlm/estimation/ .

Kenneth
