Yes, I require <s> and </s> to appear in your ARPA.  These tags matter
for output quality (BLEU etc.).  I'll put that in the documentation
when I get around to writing it, but personally I think IRST should
include them by default.
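
A quick way to check an ARPA file for both tags before binarizing is to scan its \1-grams: section. The following is only a minimal sketch: the function name missing_boundary_tokens is hypothetical (it is not part of KenLM or IRSTLM), and it assumes the usual ARPA unigram layout of log probability, word, and optional backoff separated by whitespace.

```python
def missing_boundary_tokens(arpa_lines):
    """Return whichever of <s> and </s> is absent from the \\1-grams: section.

    arpa_lines is any iterable of text lines, e.g. an open file object.
    Hypothetical helper for illustration; not part of KenLM or IRSTLM.
    """
    wanted = {"<s>", "</s>"}
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            # Any other backslash header (\2-grams:, \end\) closes the section.
            if line.startswith("\\"):
                break
            fields = line.split()
            # Assumed unigram entry: log10 prob, word, optional backoff.
            if len(fields) >= 2:
                wanted.discard(fields[1])
    return wanted
```

For example, missing_boundary_tokens(open("arpa.en.lm", encoding="utf-8")) returns an empty set when both tags are present, and the set of missing tags otherwise.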

Kenneth

On 10/26/10 12:30, supp...@precisiontranslationtools.com wrote:
> Thanks Ken. I tested it and it works. 
> 
> FYI, my first attempt failed with a different error: something about the
> <s> token (word?) being missing. I added the <s></s> tags and re-ran irstlm's
> build-lm.sh script with option -b (Include sentence boundary n-grams) and
> the error disappeared.
> 
> It's pretty fast now. I look forward to testing the optimized code.
> 
> Tom
> 
> 
> 
> On Tue, 26 Oct 2010 10:18:17 -0400, Kenneth Heafield <mo...@kheafield.com>
> wrote:
>> I've fixed this in revision 3657 and tested that it works with a toy
>> IRSTLM example.
>>
>> Sorry about that,
>>
>> Kenneth
>>
>> P.S. a faster version is under code review and coming soon.
>>
>> On 10/26/10 03:57, Nicola Bertoldi wrote:
>>> The empty line after each ngram-block is not mandatory in the ARPA format
>>> (see
>>> http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html)
>>> and IRSTLM does not produce it.
>>>
>>>
>>> best regards,
>>> Nicola Bertoldi
>>>
>>> On Oct 26, 2010, at 9:42 AM, <supp...@precisiontranslationtools.com>
>>> <supp...@precisiontranslationtools.com> wrote:
>>>
>>>> Hi Ken,
>>>>
>>>> I created an iARPA file with IRSTLM using the options -n 3
>>>> (3-grams), -b (include the <s> sentence boundary) and -d
>>>> (subdictionary for ngrams). Then, I used IRSTLM's compile-lm with
>>>> --text yes to convert to ARPA format.
>>>>
>>>> Finally, I ran build_binary to binarize the ARPA format for KenLM. I
>>>> got the following error:
>>>>
>>>> $ build_binary arpa.en.lm arpa.en.binary
>>>> Reading lm.en.lm
>>>>
>>>> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>>>>
>>>> terminate called after throwing an instance of 'lm::FormatLoadException'
>>>>   what():  Expected blank line after 3-grams at byte 22348989 in file arpa.en.lm
>>>> Aborted
>>>>
>>>> What am I missing?
>>>>
>>>> Thanks,
>>>> Tom
>>>>
>>>>
>>>> On Fri, 22 Oct 2010 10:15:21 -0400, Kenneth Heafield
>>>> <mo...@kheafield.com>
>>>> wrote:
>>>>> KenLM is inference-only.  It cannot create ARPA files.  So you'll
>>>>> need to use your favorite toolkit to generate the ARPA.
>>>>>
>>>>> On 10/22/10 07:52, supp...@precisiontranslationtools.com wrote:
>>>>>> Thanks Ken. Nice work.
>>>>>>
>>>>>> Is there a way to train the ARPA-formatted LM with KenLM, or do we
>>>>>> need to train with another tool, like SRILM, or convert IRSTLM to
>>>>>> full ARPA format?
>>>>>>
>>>>>> Thanks again,
>>>>>> Tom
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 18 Oct 2010 20:31:38 -0400, Kenneth Heafield
>>>>>> <mo...@kheafield.com>
>>>>>> wrote:
>>>>>>> Hi Moses,
>>>>>>>
>>>>>>>     Introducing kenlm in Moses trunk.  You no longer need to
>>>>>>> download a separate language model to use Moses; it's distributed
>>>>>>> with Moses and compiled in by default on UNIX.  This is threadsafe
>>>>>>> language model inference code that returns the same probabilities
>>>>>>> as SRI (up to floating point rounding).  It loads ARPA files in 2/3
>>>>>>> the time SRI takes and uses less memory too.  Using kenlm is simple:
>>>>>>> in your [lmodel-file] section, change the first digit to 8.  For
>>>>>>> example,
>>>>>>>
>>>>>>> "0 0 2 foo.arpa" changes to "8 0 2 foo.arpa"
>>>>>>>
>>>>>>>     For even faster loading, use the binary format:
>>>>>>>
>>>>>>> kenlm/build_binary foo.arpa foo.binary
>>>>>>>
>>>>>>> then simply provide the binary filename in your moses.ini, e.g.
>>>>>>> "8 0 2 foo.binary"; it auto-detects binary files using magic bytes
>>>>>>> at the beginning.
>>>>>>>
>>>>>>>     The code is ready for use and provides correct results.
>>>>>>> Inference is slower than it should be due to inefficiencies in the
>>>>>>> Moses-side wrapper code (it does a vocab lookup for all 5 words
>>>>>>> every time).  I'm working on it and once this is done I'll post some
>>>>>>> benchmarks against SRI and IRST.  The binary format is subject to
>>>>>>> change, but contains a version number, so on the very rare occasions
>>>>>>> when it does change, new versions will tell you to rebuild your
>>>>>>> binary files.  Windows is currently not supported (it uses mmap),
>>>>>>> though I welcome contributions using #ifdef and CreateFileMapping.
>>>>>>>
>>>>>>>     Have fun and let me know about your experiences with it.
>>>>>>>
>>>>>>> "Ken"
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> Moses-support@mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
