Dear Moses,
KenLM now estimates modified Kneser-Ney language models from text.
This is done with streaming on-disk algorithms where you pick the memory
buffer size, enabling you to build much larger language models (e.g. all
the data allowed by WMT 2013) without running out of RAM.
It is in Moses master as of commit fc5868d and available standalone from
http://kheafield.com/code/kenlm.tar.gz. The command line is relatively
simple:
bin/lmplz -o 5 <text >text.arpa
The memory usage (-S 80%) and temporary file location (-T /tmp) options
take the same form as the corresponding GNU sort options.
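For example, here is a rough sketch that combines both options with the
command above. The values are simply the ones quoted in this message, and
the variable names and the check for bin/lmplz and the input file are only
for illustration, so the snippet prints the command and does nothing else
when the binary has not been built:

```shell
# Illustrative values only; adjust to your machine.
ORDER=5          # n-gram order, as in the command above
MEMORY=80%       # memory budget, passed to -S
TMP_DIR=/tmp     # temporary file location, passed to -T

CMD="bin/lmplz -o $ORDER -S $MEMORY -T $TMP_DIR"
echo "$CMD"

# Run only if the binary and the input file "text" are actually present.
if [ -x bin/lmplz ] && [ -r text ]; then
  $CMD <text >text.arpa
fi
```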
There is NO PRUNING, so the comparable SRILM command line is:
ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 \
-gt5min 1 -text text -lm text.arpa
For more documentation, see http://kheafield.com/code/kenlm/estimation/ .
Kenneth