Yes, I require <s> and </s> to appear in your ARPA. These tags matter for output quality (BLEU etc.). I'll put that in the documentation when I get around to writing it, but I personally think IRST should include them by default.
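A quick way to check an ARPA file for the two tags before handing it to build_binary is sketched below. This is not part of KenLM: the function name is hypothetical, and it assumes the usual ARPA layout in which each unigram line reads "logprob word [backoff]" inside the \1-grams: section.

```python
def missing_boundary_tokens(arpa_lines):
    """Return the sentence boundary tokens absent from the unigram section."""
    required = {"<s>", "</s>"}
    seen = set()
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            if in_unigrams:
                break  # the unigram section is over, nothing more to read
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()
            # Assumed layout: fields[0] is the log probability, fields[1] the word
            if len(fields) >= 2 and fields[1] in required:
                seen.add(fields[1])
    return sorted(required - seen)


# Toy example with an in-memory ARPA fragment that lacks <s>
sample = [
    "\\data\\",
    "ngram 1=2",
    "",
    "\\1-grams:",
    "-1.0\t</s>",
    "-0.5\thello\t-0.3",
    "",
    "\\end\\",
]
print(missing_boundary_tokens(sample))
```

For a real file, pass the open file object instead of a list, e.g. `with open("arpa.en.lm") as f: missing_boundary_tokens(f)`; an empty result means both tags are present.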
Kenneth

On 10/26/10 12:30, supp...@precisiontranslationtools.com wrote:
> Thanks Ken. I tested it and it works.
>
> FYI, on my first attempt there was a different error. Something about the
> <s> token (word?) was missing. I added the <s></s> tags and re-ran irstlm's
> build-lm.sh script with option -b (Include sentence boundary n-grams) and
> the error disappeared.
>
> It's pretty fast now. I look forward to testing the optimized code.
>
> Tom
>
> On Tue, 26 Oct 2010 10:18:17 -0400, Kenneth Heafield <mo...@kheafield.com>
> wrote:
>> I've fixed this in revision 3657 and tested that it works with a toy
>> IRSTLM example.
>>
>> Sorry about that,
>>
>> Kenneth
>>
>> P.S. a faster version is under code review and coming soon.
>>
>> On 10/26/10 03:57, Nicola Bertoldi wrote:
>>> the empty line after each ngram-block is not mandatory in the ARPA format
>>> (see
>>> http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html)
>>> and IRSTLM does not produce it.
>>>
>>> best regards,
>>> Nicola Bertoldi
>>>
>>> On Oct 26, 2010, at 9:42 AM, <supp...@precisiontranslationtools.com> wrote:
>>>
>>>> Hi Ken,
>>>>
>>>> I created an iARPA file with IRSTLM using the options -n 3 (3-grams), -b
>>>> (include the <s> sentence boundary) and -d (subdictionary for ngrams).
>>>> Then, I used IRSTLM's compile-lm with --text yes to convert to ARPA
>>>> format.
>>>>
>>>> Finally, I ran build_binary to binarize the ARPA format for KenLM. I got
>>>> the following error:
>>>>
>>>> $ build_binary arpa.en.lm arpa.en.binary
>>>> Reading lm.en.lm
>>>> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>>>>
>>>> terminate called after throwing an instance of 'lm::FormatLoadException'
>>>> what(): Expected blank line after 3-grams at byte 22348989 in file
>>>> arpa.en.lm
>>>> Aborted
>>>>
>>>> What am I missing?
>>>>
>>>> Thanks,
>>>> Tom
>>>>
>>>> On Fri, 22 Oct 2010 10:15:21 -0400, Kenneth Heafield
>>>> <mo...@kheafield.com> wrote:
>>>>> KenLM is inference-only. It cannot create ARPA files. So you'll need
>>>>> to use your favorite toolkit to generate the ARPA.
>>>>>
>>>>> On 10/22/10 07:52, supp...@precisiontranslationtools.com wrote:
>>>>>> Thanks Ken. Nice work.
>>>>>>
>>>>>> Is there a way to train the ARPA formatted LM with KenLM, or do we
>>>>>> need to train with another tool, like SRILM, or convert IRSTLM to
>>>>>> full ARPA format?
>>>>>>
>>>>>> Thanks again,
>>>>>> Tom
>>>>>>
>>>>>> On Mon, 18 Oct 2010 20:31:38 -0400, Kenneth Heafield
>>>>>> <mo...@kheafield.com> wrote:
>>>>>>> Hi Moses,
>>>>>>>
>>>>>>> Introducing kenlm in Moses trunk. You no longer need to download a
>>>>>>> separate language model to use Moses; it's distributed with Moses and
>>>>>>> compiled in by default on UNIX. This is threadsafe language model
>>>>>>> inference code that returns the same probabilities as SRI (up to
>>>>>>> floating point rounding). It loads ARPA files in 2/3 the time SRI
>>>>>>> takes and uses less memory too. Using kenlm is simple: in your
>>>>>>> [lmodel-file] section, change the first digit to 8. For example,
>>>>>>>
>>>>>>> "0 0 2 foo.arpa" changes to "8 0 2 foo.arpa"
>>>>>>>
>>>>>>> For even faster loading, use the binary format:
>>>>>>>
>>>>>>> kenlm/build_binary foo.arpa foo.binary
>>>>>>>
>>>>>>> then simply provide the binary filename in your moses.ini e.g.
>>>>>>> "8 0 2 foo.binary"; it auto detects binary files using magic bytes at
>>>>>>> the beginning.
>>>>>>>
>>>>>>> The code is ready for use and provides correct results. Inference is
>>>>>>> slower than it should be due to inefficiencies in the Moses-side
>>>>>>> wrapper code (it does a vocab lookup for all 5 words every time). I'm
>>>>>>> working on it and once this is done I'll post some benchmarks against
>>>>>>> SRI and IRST. The binary format is subject to change, but contains a
>>>>>>> version number, so on the very rare occasions when it changes, new
>>>>>>> versions will tell you to rebuild your binary files. Windows is
>>>>>>> currently not supported (it uses mmap) though I welcome contributions
>>>>>>> using #ifdef and CreateFileMapping.
>>>>>>>
>>>>>>> Have fun and let me know about your experiences with it.
>>>>>>>
>>>>>>> Kenneth
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> Moses-support@mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
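
For anyone following the original announcement, the [lmodel-file] change ("0 0 2 foo.arpa" becomes "8 0 2 foo.arpa") can be scripted. The sketch below is only an illustration: the function name is hypothetical, and it assumes moses.ini uses bracketed section headers on their own lines with "#" for comments.

```python
def switch_lm_to_kenlm(ini_text):
    """Set the first field of every [lmodel-file] entry to 8 (kenlm)."""
    out = []
    in_section = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new section header; only [lmodel-file] entries are rewritten
            in_section = (stripped == "[lmodel-file]")
        elif in_section and stripped and not stripped.startswith("#"):
            fields = stripped.split()
            fields[0] = "8"  # first field selects the LM implementation
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out)


# Toy moses.ini fragment; the [ttable-limit] section is only there to
# show that other sections are left untouched
example = "[lmodel-file]\n0 0 2 foo.arpa\n\n[ttable-limit]\n20"
print(switch_lm_to_kenlm(example))
```

The same entry works with a binarized model: point it at foo.binary instead of foo.arpa once build_binary has been run.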