Re: [Moses-support] Estimating probabilities with KenLM

Hieu Hoang Tue, 26 Nov 2013 07:43:13 -0800

It runs OK for me. Try a git pull on Moses if your code is a few months old.
There might be an incompatibility between the lmplz wrapper script
(trainlm-lmplz.perl) and the lmplz binary.


My command:
# trainlm-lmplz.perl -order 5  -lmplz
~/workspace/github/mosesdecoder/bin/lmplz  -T . -S 1G -text
lm/europarl.lowercased.1 -lm lm/europarl.lmplz
...
Chain sizes: 1:171084 2:2345088 3:6649660 4:10482672 5:13132616
=== 5/5 Writing ARPA model ===
RSSMax:219410432 kB    user:2.62476    sys:2.46472    CPU:5.08947    real:0



On 26 November 2013 14:57, Prasanth K <[email protected]> wrote:

> Ok. I have managed to re-create this error (no reason why it shouldn't
> come back, I knew exactly what I told moses to do). So, the exact command
> run to create the language model from the logs is as follows:
>
> scripts/generic/trainlm-lmplz.perl -lmplz bin/lmplz -order 5 -T 
> europarl.en-sv/phrase-based-dup/tmp
> -S 10G -text europarl.en-sv/phrase-based-dup/lm/europarl.lowercased.1 -lm
>  europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
>
> Of course, all paths in the command above were absolute paths (I just
> removed them for readability). When this is run, my log file from EMS
> gives the following in LM_europarl_train.id.STDERR
>
> EXECUTING bin/lmplz --order 5 -T europarl.en-sv/phrase-based-dup/tmp -S
> 10G < europarl.en-sv/phrase-based-dup/lm/europarl.lowercased.1 >
> europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
>
> === 1/5 Counting and sorting n-grams ===
>
> Reading stdin
>
>
> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>
>
> ****************************************************************************************************
>
> Function not implemented
>
> This does not cause the language model step to crash; instead, it creates
> an empty language model (0 lines). Below is the log file
> LM_europarl_binarize.id.STDERR
>
> Reading europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
>
>
> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>
> End of file Byte: 0 File: europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
>
> ERROR
>
> Clearly, something is wrong with my installation of KenLM that makes the
> estimation fail (decoding with KenLM works just fine; I have confirmed that
> now). The question is: where do I start to fix this?
>
> Thanks.
>
> - Regards,
>
> Prasanth
>
>
> On Tue, Nov 26, 2013 at 1:56 PM, Hieu Hoang <[email protected]> wrote:
>
>>  OK, I can't reproduce your error
>>   Function not implemented
>> You should find out exactly how lmplz is being run; it may be that you
>> have a slightly older version that doesn't know all the arguments you've
>> given it.
>>
>>
>> On 26/11/2013 06:47, Prasanth K wrote:
>>
>> Hello Hieu,
>>
>>  My first attempt was to specify the absolute amount of memory (10G) but
>> that gave an error saying function not implemented. Later, when I tried
>> specifying the relative size (80%), I got a similar parse error to what you
>> have given above. Strange that it should
>>
>>  @Kenneth, thanks for the code to estimate physical memory. I am going
>> to give it a shot and let you know how it goes.
>>
>>  - Regards,
>> Prasanth
>>
>>
>> On Mon, Nov 25, 2013 at 9:20 PM, Hieu Hoang <[email protected]> wrote:
>>
>>> Prasanth - what is the exact lmplz command that was run by the EMS?
>>>
>>>
>>> This works
>>>      .../lmplz --order 5 --text lm/europarl.lowercased.1 --arpa
>>> lm/europarl.lmplz -T /tmp -S 1G
>>> This doesn't
>>>     .../lmplz --order 5 --text lm/europarl.lowercased.1 --arpa
>>> lm/europarl.lmplz -T /tmp -S 80%
>>> It gives the error
>>>    util/usage.cc:220 in uint64_t util::<anonymous
>>> namespace>::ParseNum(const std::string &) [Num = double] threw
>>> SizeParseError because `!mem'.
>>> Failed to parse 80% into a memory size because % was specified but the
>>> physical memory size could not be determined.
>>>
>>>  However, it worked even with the source code from 4 days ago.
>>>
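
The -S 80% failure above comes down to a percentage needing a known physical memory size to scale against. A minimal sketch of that logic follows. ParseMemorySizeSketch is a hypothetical helper written for illustration, not KenLM's actual parser in util/usage.cc, and treating a bare number as bytes is an assumption made here.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT KenLM's util/usage.cc code.
// physical_memory: detected RAM in bytes, or 0 when detection failed.
uint64_t ParseMemorySizeSketch(const std::string &arg, uint64_t physical_memory) {
  char *end = nullptr;
  const double value = std::strtod(arg.c_str(), &end);
  if (end == arg.c_str() || !std::isfinite(value) || value < 0.0)
    throw std::invalid_argument("Failed to parse a number from '" + arg + "'");
  const std::string suffix(end);
  if (suffix == "%") {
    // A percentage is meaningless without a known total to scale.
    if (physical_memory == 0)
      throw std::runtime_error("% was specified but physical memory size is unknown");
    return static_cast<uint64_t>(value * static_cast<double>(physical_memory) / 100.0);
  }
  uint64_t unit;
  if (suffix.empty() || suffix == "b") unit = 1;  // assumption: bare number means bytes
  else if (suffix == "K") unit = 1ULL << 10;
  else if (suffix == "M") unit = 1ULL << 20;
  else if (suffix == "G") unit = 1ULL << 30;
  else if (suffix == "T") unit = 1ULL << 40;
  else throw std::invalid_argument("Unknown size suffix '" + suffix + "'");
  return static_cast<uint64_t>(value * static_cast<double>(unit));
}
```

With a detected size of 0, ParseMemorySizeSketch("80%", 0) throws while ParseMemorySizeSketch("1G", 0) still succeeds, which is the same split between the two commands reported above.
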
>>>
>>> On 25/11/2013 19:07, Kenneth Heafield wrote:
>>> > Hi,
>>> >
>>> >       I've taken a shot in the dark based on physmem.c to support
>>> > physical memory estimation on BSD and OS X.  Please clone
>>> >
>>> > github.com/kpu/kenlm
>>> >
>>> > and compile with
>>> >
>>> > ./bjam
>>> >
>>> > If that fails, please let Hieu and me know (maybe Hieu can help since he
>>> > has OS X).  If it doesn't fail, run
>>> >
>>> > bin/lmplz
>>> >
>>> > with no argument.  The help message will include a line e.g.
>>> >
>>> > "This machine has 135224176640 bytes of memory."
>>> >
>>> > or
>>> >
>>> > "Unable to determine the amount of memory on this machine."
>>> >
>>> > If it works, then I'll push to Moses.  Trying to not break Moses master
>>> > for OS X.
>>> >
>>> > Kenneth
>>> >
>>> > On 11/24/13 22:40, Prasanth K wrote:
>>> >> Hi Kenneth,
>>> >>
>>> >> Thanks for the clarification w.r.t. calculating the memory size. But I
>>> >> am running these on a Mac (10.9 Mavericks). Do you think I should still
>>> >> port the lmplz code to Mac for the estimation of probabilities?
>>> >>
>>> >> One thing though, I did change the default clang compiler that comes
>>> >> with this new Mac to a gcc-4.8 (not sure that changes anything in this
>>> >> context).
>>> >>
>>> >> - Prasanth
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Fri, Nov 22, 2013 at 6:50 PM, Kenneth Heafield <
>>> [email protected]
>>> >> <mailto:[email protected]>> wrote:
>>> >>
>>> >>      Hi,
>>> >>
>>> >>              What OS are you on?  Cygwin?  Apparently every OS reports
>>> >>      memory size in a different way:
>>> >>
>>> >>
>>> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/physmem.c;h=2629936146e3042f927523322f18aca76996cd7f;hb=HEAD
>>> >>
>>> >>      The good news is that the above code is LGPLv2:
>>> >>
>>> >>
>>> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=modules/physmem;h=9644522e0493a85a9fb4ae7c4449741c2c1500ea;hb=HEAD
>>> >>
>>> >>      But currently I'm just using this short function that will fail on
>>> >>      some platforms:
>>> >>
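>>> >>      #include <stdint.h>
>>> >>      #if !defined(_WIN32) && !defined(_WIN64)
>>> >>      #include <unistd.h>  // sysconf
>>> >>      #endif
>>> >>
>>> >>      // Returns physical memory in bytes, or 0 if it cannot be determined.
>>> >>      uint64_t GuessPhysicalMemory() {
>>> >>      #if defined(_WIN32) || defined(_WIN64)
>>> >>        return 0;
>>> >>      #elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
>>> >>        long pages = sysconf(_SC_PHYS_PAGES);
>>> >>        if (pages == -1) return 0;
>>> >>        long page_size = sysconf(_SC_PAGESIZE);
>>> >>        if (page_size == -1) return 0;
>>> >>        return static_cast<uint64_t>(pages) * static_cast<uint64_t>(page_size);
>>> >>      #else
>>> >>        // Platform does not expose these sysconf names.
>>> >>        return 0;
>>> >>      #endif
>>> >>      }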
>>> >>
>>> >>      If it fails, I just don't let users specify memory as a percentage.  So
>>> >>      one thing to fix is putting physmem.{h,c} in util then changing
>>> >>      calls to GuessPhysicalMemory.  But I'm also not a fan of the way the GNU
>>> >>      code gives up and makes up a number at the end.
>>> >>
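
For the OS X case discussed in this thread, one possible shape for a more portable version of the function above is sketched here. It is an illustration, not the code that went into KenLM: GuessPhysicalMemorySketch is a made-up name, and using sysctl with HW_MEMSIZE on OS X is an assumption about how one might do it. Like the original, it returns 0 rather than inventing a number when detection fails.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#elif !defined(_WIN32)
#include <unistd.h>
#endif

// Hypothetical sketch, not KenLM's implementation.
// Returns physical memory in bytes, or 0 if it cannot be determined.
uint64_t GuessPhysicalMemorySketch() {
#if defined(__APPLE__)
  // Assumption: on OS X, HW_MEMSIZE reports total RAM as a 64-bit byte count.
  int mib[2] = {CTL_HW, HW_MEMSIZE};
  uint64_t bytes = 0;
  std::size_t len = sizeof(bytes);
  if (sysctl(mib, 2, &bytes, &len, nullptr, 0) != 0 || len != sizeof(bytes)) return 0;
  return bytes;
#elif !defined(_WIN32) && defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
  // Same sysconf path as the function quoted above.
  long pages = sysconf(_SC_PHYS_PAGES);
  if (pages == -1) return 0;
  long page_size = sysconf(_SC_PAGESIZE);
  if (page_size == -1) return 0;
  return static_cast<uint64_t>(pages) * static_cast<uint64_t>(page_size);
#else
  // Unknown platform: report failure instead of guessing.
  return 0;
#endif
}
```

A caller would treat a return value of 0 as "unknown" and refuse percentage-based sizes, as described above.
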
>>> >>      The second porting issue is that lmplz makes parallel use of pread,
>>> >>      pwrite, and write.  Windows is unsafe in this regard (POSIX requires
>>> >>      that pread/pwrite not change the file pointer; Windows has no way to
>>> >>      implement that atomically).  To fix this, we'll always specify the file
>>> >>      offset in cases that happen concurrently.  Extend util/stream/io.* with
>>> >>      a PWrite class based on PWriteOrThrow then change FileBuffer to use
>>> >>      PWrite.  Then I guess one should rename PReadOrThrow/PWriteOrThrow to
>>> >>      something that indicates they're not-quite-POSIX on windows.  Also, the
>>> >>      macros in these functions should detect cygwin, bypassing cygwin's
>>> >>      "Function not implemented" and calling Windows APIs directly (they're
>>> >>      already there for _WIN32).
>>> >>
>>> >>      I don't have a windows box so I can say what should be changed at a high
>>> >>      level, but need an actual user to ensure it compiles and runs correctly.
>>> >>
>>> >>      Kenneth
>>> >>
>>> >>      On 11/22/13 06:49, Prasanth K wrote:
>>> >>      > Hi,
>>> >>      >
>>> >>      > I am trying to use KenLM for building a language model on the Europarl
>>> >>      > corpus. Following the instructions in
>>> >>      > (http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc19),
>>> >>      > I added the few lines for getting KenLM to estimate the LM probabilities
>>> >>      > (order/n=5) to my config file for the EMS. Language model training dies
>>> >>      > with "Function not implemented" at the counting and sorting n-grams
>>> >>      > stage (the very first stage). Does this mean there is something wrong
>>> >>      > with my installation? Or is it just insufficient memory?
>>> >>      >
>>> >>      > Incidentally, when I specified the amount of memory as a percentage
>>> >>      > (80%), there was an error: "Failed to parse .. into memory size because
>>> >>      > physical memory size could not be determined". I am also curious why
>>> >>      > this happens.
>>> >>      >
>>> >>      > Kenneth, can you shed some light on this? Thanks.
>>> >>      >
>>> >>      > - Regards,
>>> >>      > Prasanth
>>> >>      >
>>> >>      >
>>> >>      >
>>> >>      > --
>>> >>      > "Theories have four stages of acceptance. i) this is worthless
>>> >>      nonsense;
>>> >>      > ii) this is an interesting, but perverse, point of view, iii)
>>> this is
>>> >>      > true, but quite unimportant; iv) I always said so."
>>> >>      >
>>> >>      >   --- J.B.S. Haldane
>>> >>      >
>>> >>      >
>>> >>      > _______________________________________________
>>> >>      > Moses-support mailing list
>>> >>      > [email protected] <mailto:[email protected]>
>>> >>      > http://mailman.mit.edu/mailman/listinfo/moses-support
>>> >>      >
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>
>



-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
