Re: [Moses-support] KenLM distributed with Moses
Hi,

I've not seen it on this list, but what license is KenLM distributed under?

Kind regards,
Lee Ball
Infrastructure Manager, Applied Language Solutions

On 19 October 2010 01:31, Kenneth Heafield mo...@kheafield.com wrote:

Hi Moses,

Introducing kenlm in Moses trunk. You no longer need to download a separate language model to use Moses; it's distributed with Moses and compiled in by default on UNIX. This is threadsafe language model inference code that returns the same probabilities as SRI (up to floating point rounding). It loads ARPA files in 2/3 the time SRI takes and uses less memory too.

Using kenlm is simple: in your [lmodel-file] section, change the first digit to 8. For example,

    0 0 2 foo.arpa

changes to

    8 0 2 foo.arpa

For even faster loading, use the binary format:

    kenlm/build_binary foo.arpa foo.binary

then simply provide the binary filename in your moses.ini, e.g. 8 0 2 foo.binary; it auto-detects binary files using magic bytes at the beginning.

The code is ready for use and provides correct results. Inference is slower than it should be due to inefficiencies in the Moses-side wrapper code (it does a vocab lookup for all 5 words every time). I'm working on it, and once this is done I'll post some benchmarks against SRI and IRST. The binary format is subject to change but contains a version number, so on the very rare occasions when it changes, new versions will tell you to rebuild your binary files.

Windows is currently not supported (it uses mmap), though I welcome contributions using #ifdef and CreateFileMapping.

Have fun and let me know about your experiences with it.

Ken

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
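The moses.ini change described in the announcement (switching the first field of each [lmodel-file] entry to 8) could be scripted roughly as follows. This is only a sketch: the helper name use_kenlm and the sample configuration are made up for illustration, and it assumes each entry has exactly four whitespace-separated fields, as in the example from the announcement. The remaining fields are left untouched.

```python
def use_kenlm(ini_text: str, kenlm_code: str = "8") -> str:
    """Return ini_text with every [lmodel-file] entry switched to code 8."""
    out = []
    in_lm_section = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the language model section.
            in_lm_section = stripped == "[lmodel-file]"
            out.append(line)
            continue
        fields = stripped.split()
        # Assumption: an entry is four fields, e.g. "0 0 2 foo.arpa".
        if in_lm_section and len(fields) == 4 and not stripped.startswith("#"):
            fields[0] = kenlm_code
            out.append(" ".join(fields))
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical moses.ini fragment, used only to exercise the helper.
example = """[lmodel-file]
0 0 2 foo.arpa

[ttable-limit]
20"""
converted = use_kenlm(example)
print(converted)
```

Lines outside the [lmodel-file] section pass through unchanged, so the same helper can be run over a whole configuration file.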
Re: [Moses-support] KenLM distributed with Moses
Ken,

Your new enhancements ROCK! Here are some numbers using rev 3675 and IRSTLM 5.50.01.

Machine: Core2Quad, 2.4 GHz, 4 GB RAM
Data: EN-NL sample data, 37,500 segments (micro test sample)
Models: 3-gram LM, 3-gram tables (for fast testing)

1. Train LM with SRILM; train tables/tune/eval with Moses/SRILM, multi-threading enabled:
   75 minutes, BLEU score 0.2531
2. Train LM with IRSTLM; train tables/tune/eval with Moses/IRSTLM, binarized mmap, single thread:
   195 minutes, BLEU score 0.2496
3. Train LM with IRSTLM (ARPA); train tables/tune/eval with Moses/KenLM, binarized mmap, multi-threaded:
   50 minutes, BLEU score 0.2514
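For a quick reading of the three timings reported above, the ratios can be worked out as below. This is just arithmetic on the numbers in the message; the labels are shorthand chosen here, and the comparison is not like-for-like because the IRSTLM run was single-threaded while the other two were multi-threaded.

```python
# (minutes, BLEU) as reported in the message; labels are shorthand.
runs = {
    "SRILM, multi-threaded": (75, 0.2531),
    "IRSTLM, single thread": (195, 0.2496),
    "KenLM, multi-threaded": (50, 0.2514),
}

kenlm_minutes, kenlm_bleu = runs["KenLM, multi-threaded"]

# How many times longer each run took than the KenLM run.
time_ratio = {name: minutes / kenlm_minutes for name, (minutes, _) in runs.items()}

# BLEU difference relative to the KenLM run, rounded to four decimals.
bleu_gap = {name: round(bleu - kenlm_bleu, 4) for name, (_, bleu) in runs.items()}

for name in runs:
    print(f"{name}: {time_ratio[name]:.1f}x time, BLEU gap {bleu_gap[name]:+.4f}")
```

The BLEU gaps are all below 0.002, which is the scale of difference discussed in the follow-up about MERT randomness.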
Re: [Moses-support] KenLM distributed with Moses
Thanks for sharing! Looks like building my Moses system from scratch finally finished, so I'll be making some memory benchmarks today too.

Just so I understand, you ran a separate MERT for each of your three cases? Then MERT randomness should explain the insignificant difference in BLEU between result 1 and result 3.

Kenneth
Re: [Moses-support] KenLM distributed with Moses
Yes, all scores and times were from scratch without reusing anything.

Precision Translation Tools will announce a simpler solution to building a Moses system from scratch next week. Essentially, from minimal server configuration to a completely installed Moses system in four steps and a 30-minute wait. Stay tuned.

Tom
Re: [Moses-support] KenLM distributed with Moses
Thanks Ken for all your feedback.

One more question. I'm using Moses with Boost. I uncommented the line #define USE_BOOST in kenlm/util/string_piece.hh and recompiled Moses without problems. Then I uncommented #define USE_ICU, and ./configure fails with the error log below. libicu-dev and libicu42 are installed on my system. Also, each compile started with a clean Moses download. Is USE_ICU usable or necessary with Moses?

Thanks,
Tom

    configure: Using Boost library
    checking for boostlib >= 1.36.0... yes
    configure: Building threaded moses
    checking whether the Boost::Thread library is available... yes
    checking for exit in -lboost_thread-mt... yes
    checking Ngram.h usability... yes
    checking Ngram.h presence... yes
    checking for Ngram.h... yes
    checking for trigram_init in -loolm... yes
    checking n_gram.h usability... yes
    checking n_gram.h presence... yes
    checking for n_gram.h... yes
    checking lm/ngram.hh usability... no
    checking lm/ngram.hh presence... yes
    checking for lm/ngram.hh... no
    configure: WARNING: lm/ngram.hh: present but cannot be compiled
    configure: WARNING: lm/ngram.hh: check for missing prerequisite headers?
    configure: WARNING: lm/ngram.hh: see the Autoconf documentation
    configure: WARNING: lm/ngram.hh: section Present But Cannot Be Compiled
    configure: WARNING: lm/ngram.hh: proceeding with the compiler's result
    configure: error: Cannot find KEN-LM in yes
Re: [Moses-support] KenLM distributed with Moses
Revision 3671 introduces an updated version of kenlm. Queries are faster now (no more string vocab lookups, and state is kept so backoffs cost less). The binary format has changed as a result; please rebuild your binary files. Timing is forthcoming.

Kenneth
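The thread mentions that binary files are recognized by magic bytes and carry a version number, so that an outdated file triggers a request to rebuild. The general idea can be sketched as below. Everything specific here is an assumption made for illustration: the magic string, the 4-byte little-endian version field, and the function name are hypothetical and do not describe KenLM's actual header layout.

```python
import struct

# Hypothetical values for illustration only; not KenLM's real header layout.
MAGIC = b"EXAMPLE-LM-BINARY"
CURRENT_VERSION = 2


def classify_model_file(header: bytes) -> str:
    """Classify a model file from its first bytes.

    Returns "text" when the magic bytes are absent (treat as ARPA),
    "binary" when magic and version both match, and "rebuild" when the
    magic matches but the version does not.
    """
    if not header.startswith(MAGIC):
        return "text"
    version_field = header[len(MAGIC):len(MAGIC) + 4]
    if len(version_field) < 4:
        return "rebuild"  # truncated header: safest to regenerate
    (version,) = struct.unpack("<I", version_field)
    return "binary" if version == CURRENT_VERSION else "rebuild"


old_file = MAGIC + struct.pack("<I", 1)      # written by an older format
new_file = MAGIC + struct.pack("<I", 2)      # written by the current format
arpa_file = b"\\data\\\nngram 1=3\n"         # plain ARPA text
print(classify_model_file(old_file),
      classify_model_file(new_file),
      classify_model_file(arpa_file))
```

Checking the version before reading anything else lets a loader fail early with a clear "rebuild" message instead of misreading an old layout.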
Re: [Moses-support] KenLM distributed with Moses
Hi Ken,

I created an iARPA file with IRSTLM using the options -n 3 (3-grams), -b (include the <s> sentence boundary) and -d (subdictionary for n-grams). Then I used IRSTLM's compile-lm with --text yes to convert to ARPA format. Finally, I ran build_binary to binarize the ARPA file for KenLM. I got the following error:

    $ build_binary arpa.en.lm arpa.en.binary
    Reading lm.en.lm
    5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    terminate called after throwing an instance of 'lm::FormatLoadException'
    what(): Expected blank line after 3-grams at byte 22348989 in file arpa.en.lm
    Aborted

What am I missing?

Thanks,
Tom

On Fri, 22 Oct 2010 10:15:21 -0400, Kenneth Heafield mo...@kheafield.com wrote:

KenLM is inference-only. It cannot create ARPA files. So you'll need to use your favorite toolkit to generate the ARPA.

On 10/22/10 07:52, supp...@precisiontranslationtools.com wrote:

Thanks Ken. Nice work. Is there a way to train the ARPA-formatted LM with KenLM, or do we need to train with another tool, like SRILM, or convert IRSTLM to full ARPA format?

Thanks again,
Tom
Re: [Moses-support] KenLM distributed with Moses
The empty line after each n-gram block is not mandatory in the ARPA format (see http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html), and IRSTLM does not produce it.

Best regards,
Nicola Bertoldi
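Since the blank line after each n-gram block is optional, a reader can key on the "\N-grams:" section headers instead of on blank lines. The following is a minimal sketch of that idea, not the code used in KenLM or IRSTLM; the function name and the toy model are made up for illustration.

```python
def read_arpa_blocks(lines):
    """Group ARPA entries by n-gram order, ignoring blank lines entirely."""
    blocks = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # A header such as "\2-grams:" starts a new block.
            current = int(line[1:-len("-grams:")])
            blocks[current] = []
        elif line and current is not None:
            blocks[current].append(line)
    return blocks


# Toy ARPA text with no blank lines between blocks (hypothetical data).
sample = r"""
\data\
ngram 1=3
ngram 2=1
\1-grams:
-1.0 <s> -0.5
-1.0 </s>
-0.7 hello -0.3
\2-grams:
-0.2 <s> hello
\end\
"""
blocks = read_arpa_blocks(sample.splitlines())
print({order: len(entries) for order, entries in blocks.items()})
```

Because blank lines are simply skipped, the same reader accepts files with or without the separator.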
Re: [Moses-support] KenLM distributed with Moses
I've fixed this in revision 3657 and tested that it works with a toy IRSTLM example. Sorry about that.

Kenneth

P.S. A faster version is under code review and coming soon.
Re: [Moses-support] KenLM distributed with Moses
Thank you, Ken. I'll update my svn revision.

Tom
Re: [Moses-support] KenLM distributed with Moses
Thanks Ken. I tested it and it works.

FYI, on my first attempt there was a different error, something about the <s> token (word?) being missing. I added the <s> and </s> tags and re-ran IRSTLM's build-lm.sh script with option -b (include sentence boundary n-grams) and the error disappeared.

It's pretty fast now. I look forward to testing the optimized code.

Tom

On Tue, 26 Oct 2010 10:18:17 -0400, Kenneth Heafield mo...@kheafield.com wrote:

I've fixed this in revision 3657 and tested that it works with a toy IRSTLM example. Sorry about that,

Kenneth

P.S. A faster version is under code review and coming soon.

On 10/26/10 03:57, Nicola Bertoldi wrote:

The empty line after each n-gram block is not mandatory in the ARPA format (see http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html) and IRSTLM does not produce it.

Best regards,
Nicola Bertoldi

On Oct 26, 2010, at 9:42 AM, supp...@precisiontranslationtools.com wrote:

Hi Ken,

I created an iARPA file with IRSTLM using the options -n 3 (3-grams), -b (include the <s> sentence boundary) and -d (subdictionary for n-grams). Then I used IRSTLM's compile-lm with --text yes to convert to ARPA format. Finally, I ran build_binary to binarize the ARPA file for KenLM. I got the following error:

$ build_binary arpa.en.lm arpa.en.binary
Reading lm.en.lm
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
terminate called after throwing an instance of 'lm::FormatLoadException'
what(): Expected blank line after 3-grams at byte 22348989 in file arpa.en.lm
Aborted

What am I missing?

Thanks,
Tom
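For anyone on a build older than revision 3657, the "Expected blank line" error reported in this thread could in principle be worked around by preprocessing the ARPA file. The Python sketch below is only an illustration under stated assumptions: `add_section_blank_lines` is a hypothetical helper, not part of KenLM or IRSTLM, and it assumes section markers look like `\1-grams:` and `\end\` on lines of their own.

```python
import re

# Matches section headers such as "\1-grams:" or "\3-grams:".
SECTION_HEADER = re.compile(r"^\\\d+-grams:\s*$")


def add_section_blank_lines(lines):
    """Yield ARPA lines, ensuring a blank line precedes each section marker.

    `lines` is any iterable of strings without trailing newlines.
    """
    previous = ""  # treat the start of the file as blank so nothing is prepended
    for line in lines:
        is_marker = bool(SECTION_HEADER.match(line)) or line.strip() == "\\end\\"
        if is_marker and previous.strip():
            yield ""  # insert the separator the strict reader expected
        yield line
        previous = line
```

Running the generator over output that already has the blank lines leaves it unchanged, so it should be safe to apply to ARPA files from any toolkit.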
Re: [Moses-support] KenLM distributed with Moses
Yes, I require <s> and </s> to appear in your ARPA. These tags are important from an output quality perspective (BLEU etc). I'll put that in the documentation when I get around to writing it, but personally think IRST should include them by default.

Kenneth

On 10/26/10 12:30, supp...@precisiontranslationtools.com wrote:

Thanks Ken. I tested it and it works.

FYI, on my first attempt there was a different error, something about the <s> token (word?) being missing. I added the <s> and </s> tags and re-ran IRSTLM's build-lm.sh script with option -b (include sentence boundary n-grams) and the error disappeared.

It's pretty fast now. I look forward to testing the optimized code.

Tom
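Since the sentence boundary tokens `<s>` and `</s>` must be present in the ARPA file, it can be handy to check for them before binarizing. The following Python sketch is an assumption-laden illustration, not a KenLM tool: `missing_boundary_tokens` is a hypothetical name, and it assumes unigram entries have the usual layout of log probability, word, and optional backoff separated by whitespace.

```python
def missing_boundary_tokens(lines):
    """Return the sorted boundary tokens absent from the \\1-grams: section."""
    wanted = {"<s>", "</s>"}
    seen = set()
    in_unigrams = False
    for line in lines:
        stripped = line.strip()
        if stripped == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if stripped.startswith("\\"):
                break  # reached the next section marker or \end\
            fields = stripped.split()
            # Assumed entry layout: log10 probability, word, optional backoff.
            if len(fields) >= 2 and fields[1] in wanted:
                seen.add(fields[1])
    return sorted(wanted - seen)
```

An empty result means both tokens were found; otherwise the training toolkit likely needs its sentence-boundary option enabled, such as the -b flag for IRSTLM's build-lm.sh mentioned in this thread.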
Re: [Moses-support] KenLM distributed with Moses
Thanks Ken. Nice work.

Is there a way to train the ARPA-formatted LM with KenLM, or do we need to train with another tool, such as SRILM, or convert IRSTLM output to full ARPA format?

Thanks again,
Tom
Re: [Moses-support] KenLM distributed with Moses
KenLM is inference-only. It cannot create ARPA files, so you'll need to use your favorite toolkit to generate the ARPA.

On 10/22/10 07:52, supp...@precisiontranslationtools.com wrote:

Thanks Ken. Nice work.

Is there a way to train the ARPA-formatted LM with KenLM, or do we need to train with another tool, such as SRILM, or convert IRSTLM output to full ARPA format?

Thanks again,
Tom
Re: [Moses-support] KenLM distributed with Moses
Thanks, Ken.

Tom

On Fri, 22 Oct 2010 10:15:21 -0400, Kenneth Heafield mo...@kheafield.com wrote:

KenLM is inference-only. It cannot create ARPA files, so you'll need to use your favorite toolkit to generate the ARPA.
[Moses-support] KenLM distributed with Moses
Hi,

I saw that the KenLM source code is distributed from the Moses svn and can be set in configure. Is anybody here using it and willing to share some experiences? Is it thread-safe, and can it be used in Moses together with SRI and IRST? Any particular advantages? Is there any more information than just the README?

Any hints are very welcome.

Christof
[Moses-support] KenLM distributed with Moses
Hi Moses,

Introducing kenlm in Moses trunk. You no longer need to download a separate language model to use Moses; it's distributed with Moses and compiled in by default on UNIX. This is threadsafe language model inference code that returns the same probabilities as SRI (up to floating point rounding). It loads ARPA files in 2/3 the time SRI takes and uses less memory too.

Using kenlm is simple: in your [lmodel-file] section, change the first digit to 8. For example,
0 0 2 foo.arpa
changes to
8 0 2 foo.arpa

For even faster loading, use the binary format:
kenlm/build_binary foo.arpa foo.binary
then simply provide the binary filename in your moses.ini, e.g. 8 0 2 foo.binary; it auto-detects binary files using magic bytes at the beginning.

The code is ready for use and provides correct results. Inference is slower than it should be due to inefficiencies in the Moses-side wrapper code (it does a vocab lookup for all 5 words every time). I'm working on it and once this is done I'll post some benchmarks against SRI and IRST.

The binary format is subject to change, but contains a version number, so on the rare occasions when it does change, new versions will tell you to rebuild your binary files. Windows is currently not supported (it uses mmap), though I welcome contributions using #ifdef and CreateFileMapping.

Have fun and let me know about your experiences with it.

Ken
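The moses.ini change described in the announcement (first field of each [lmodel-file] entry set to 8) can be scripted. The Python sketch below is a hedged illustration only: `switch_to_kenlm` is a hypothetical helper, it assumes section headers are written in square brackets on their own line, that lines starting with `#` are comments, and that model filenames contain no spaces.

```python
def switch_to_kenlm(ini_text, kenlm_type="8"):
    """Return moses.ini text with the first field of each [lmodel-file] entry replaced."""
    out = []
    in_section = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track whether we are inside the language model section.
            in_section = stripped == "[lmodel-file]"
        elif in_section and stripped and not stripped.startswith("#"):
            fields = stripped.split()
            fields[0] = kenlm_type  # e.g. "0 0 2 foo.arpa" becomes "8 0 2 foo.arpa"
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out)
```

Only the first field is touched, so the remaining fields and the filename (ARPA or binary) pass through unchanged.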