Yep, it's a pain, and I've had to write a fair amount of code to work 
around this.  By default, SRI prunes n-grams of order 3 or above if the 
adjusted count is 1.  For the highest order, the adjusted count is the 
raw count.  For all other orders, the adjusted count is the number of 
unique words that extend the n-gram to the left, formally

a(w_1^N) = c(w_1^N)

and

a(w_1^n) = |{ w_0 : c(w_0 w_1^n) > 0 }|   for n < N

where c is the raw count and a is the adjusted count.  This also means 
it's possible to have n-grams A B C D E and C D E but not B C D E, which 
is why this is painful for me.
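
To make the pruning rule concrete, here is a minimal Python sketch of 
the adjusted-count computation described above (my own illustration, 
not SRILM's actual code; the toy corpus and the "order >= 3, adjusted 
count 1" threshold are assumptions matching the description, and 
sentence boundaries are ignored to keep it short):

    from collections import Counter, defaultdict

    def adjusted_counts(corpus, N):
        """Raw counts at the highest order N; unique left-extension
        counts at all lower orders."""
        raw = Counter()
        left = defaultdict(set)  # n-gram -> set of words preceding it
        for sent in corpus:
            for n in range(1, N + 1):
                for i in range(len(sent) - n + 1):
                    gram = tuple(sent[i:i + n])
                    raw[gram] += 1
                    if n < N and i > 0:
                        left[gram].add(sent[i - 1])
        return {g: (c if len(g) == N else len(left[g]))
                for g, c in raw.items()}

    # Toy corpus: A B C D E occurs twice, so the 5-gram survives, but
    # B C D E is only ever preceded by A (adjusted count 1 -> pruned),
    # while C D E is preceded by both B and X (adjusted count 2 -> kept).
    corpus = [list("ABCDE"), list("ABCDE"), list("XCDE")]
    a = adjusted_counts(corpus, N=5)
    for gram in [tuple("ABCDE"), tuple("BCDE"), tuple("CDE")]:
        pruned = len(gram) >= 3 and a[gram] <= 1
        print(" ".join(gram), a[gram], "pruned" if pruned else "kept")

This prints "kept" for A B C D E and C D E but "pruned" for B C D E, 
which is exactly the hole described above.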

Kenneth

On 11/14/12 15:50, Philipp Koehn wrote:
> Hi,
>
> I encountered the same problem when using "msb" and
> pruned singletons on large corpora (Europarl):
> SRILM's ngram complains about "no bow for prefix of ngram".
>
> Here a Czech example:
>
> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
> -2.35639      schválení těchto zpráv  -0.198088
> -0.390525     schválení těchto zpráv ,
> -0.390525     proti schválení těchto zpráv
>
> There should be an entry for the bigram "schválení těchto".
>
> I do not see how this could happen - the ngram occurs twice in the corpus:
>
>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
> zpráv , které se neomezují pouze na eurozónu .
> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
> zpráv , které se neomezují pouze na eurozónu .
>
> I suspect that the current implementation throws out higher order n-grams
> if they occur in _one_context_, not _once_.
>
> -phi
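
For anyone who wants to check an ARPA file for these holes, here is a 
quick sketch of a checker (illustrative only; it assumes the usual 
tab-separated ARPA layout that SRILM and IRSTLM write, and takes the 
file as its first argument):

    import sys
    from collections import defaultdict

    grams = defaultdict(set)  # order -> set of n-gram tuples
    order = 0
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("\\") and line.endswith("-grams:"):
                order = int(line[1:line.index("-")])  # e.g. "\3-grams:"
                continue
            if not line or line.startswith("\\") or order == 0:
                continue
            # data line: logprob <TAB> n-gram [<TAB> backoff]
            grams[order].add(tuple(line.split("\t")[1].split(" ")))

    for n in sorted(grams):
        for gram in grams[n]:
            if n > 1 and gram[:-1] not in grams[n - 1]:
                print("missing prefix:", " ".join(gram[:-1]),
                      "needed by:", " ".join(gram))

On the file above this should flag "schválení těchto zpráv", since its 
bigram prefix "schválení těchto" was pruned away.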
>
> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
> <junc...@amu.edu.pl>  wrote:
>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>> seems to help. Sometimes IRSTLM is better, other times I get better
>> results with SRILM, so on average they now seem to be on par.
>>
>> Interesting, however, that you say there should be no differences. I
>> never manage to get the same BLEU scores on a test set for IRSTLM and
>> SRILM. I have to do some reading on this dub issue and see what happens.
>>
>> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>>> From the FBK community...
>>>
>>> as already mentioned by ken,
>>>
>>> tlm correctly computes the "improved Kneser-Ney method" (-lm=msb)
>>>
>>> tlm can keep the singletons: set the parameter -ps=no
>>>
>>> As for OOV words: tlm computes the probability of the OOV as if it
>>> were a class of all possible unknown words.
>>> In order to get the actual probability of one single OOV token, tlm
>>> requires that a Dictionary Upper Bound (dub) is set.
>>> The Dictionary Upper Bound is intended to be a rough estimate of the
>>> dictionary size (a reasonable value is 10^7, which is also the default).
>>> Note that using the same Dictionary Upper Bound value is necessary to
>>> properly compare different LMs in terms of perplexity.
>>> Moreover, note that the dub value is not stored in the saved LM.
>>>
>>> In IRSTLM, you set this value with the parameter -dub when you
>>> compute the perplexity, either with tlm or with compile-lm.
>>> In Moses, you set this parameter with "-lmodel-dub".
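
A minimal sketch of what this description implies for the per-token 
OOV probability (my reading of the text above, not IRSTLM's actual 
code: the OOV class mass is spread over the dub - |dictionary| word 
types the model has never seen, so the score of an OOV token depends 
directly on the dub value):

    import math

    def oov_token_logprob(class_logprob, dict_size, dub=10**7):
        # Spread the OOV class mass uniformly over the unseen types;
        # log10 because ARPA files store base-10 log probabilities.
        assert dub > dict_size
        return class_logprob - math.log10(dub - dict_size)

    # The same LM scored with two different dub values gives different
    # OOV scores, hence different perplexities -- which is why LMs must
    # be compared with the same dub.
    print(oov_token_logprob(-3.0, 500_000, dub=10**7))   # ~ -9.98
    print(oov_token_logprob(-3.0, 500_000, dub=10**8))   # ~ -11.00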
>>>
>>> Remember that you can use an LM estimated with the IRSTLM toolkit
>>> directly in Moses, by setting the first field of the "-lmodel-file"
>>> parameter to "1", without transforming it with build-binary.
>>>
>>>
>>> As for the differences between IRSTLM and SRILM: they should not be
>>> there.
>>> Have you noticed differences in the perplexity as well?
>>> Maybe you can send us a tiny benchmark (data and the commands used) in
>>> which you experience such a difference, so that we can debug it.
>>>
>>>
>>>
>>> Nicola
>>>
>>>
>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>>
>>>> Hi Pratyush,
>>>> Thanks for the hint. That solved the problem I had with the arpa files
>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>> improve performance of IRSTLM much when compared to SRILM. So I guess I
>>>> will have to stick with SRILM for now.
>>>>
>>>> Kenneth, weren't you working on your own tool to produce language models?
>>>> Best,
>>>> Marcin
>>>>
>>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>>> Hi Marcin,
>>>>>
>>>>> I have used msb with IRSTLM... it seems to have worked fine for me...
>>>>>
>>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>>> complains of missing 4-grams, 3-grams, etc.?
>>>>> Have you tried using the -ps=no option with tlm?
>>>>>
>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>> size of the LM... (tlm has this on by default...)
>>>>>
>>>>> If you use this option, usually KenLM does not complain... I have also
>>>>> used such LMs with SRILM for further mixing and it went fine...
>>>>>
>>>>> I am sure somebody from the IRSTLM community could confirm this...
>>>>>
>>>>> Hope this resolves the issue...
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Pratyush
>>>>>
>>>>>
>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>> <junc...@amu.edu.pl>  wrote:
>>>>>
>>>>>      On the irstlm page it says:
>>>>>
>>>>>      'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>>
>>>>>      Unfortunately I cannot use "msb" because it seems to produce
>>>>>      faulty arpa files for 5-grams. So I am trying only "shift-beta",
>>>>>      whatever that means. Maybe that's the main problem?
>>>>>      Also, my data sets are not that small: the plain arpa files
>>>>>      currently exceed 20 GB.
>>>>>
>>>>>      Best,
>>>>>      Marcin
>>>>>
>>>>>      On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>> effective -- especially on smaller data sets.
>>>>>>
>>>>>> Cheers,
>>>>>> Jon
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>> <junc...@amu.edu.pl>  wrote:
>>>>>>> Hi,
>>>>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>>>>> what set of parameters I have to use with IRSTLM to create LMs that
>>>>>>> are equivalent to language models created with SRILM using the
>>>>>>> following command:
>>>>>>>
>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>> input.en -lm lm.en.arpa
>>>>>>>
>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>>
>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>>
>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>> quality and size. The translation results, however, are still
>>>>>>> consistently worse than with the SRILM models; the differences in
>>>>>>> BLEU are up to 1%.
>>>>>>>
>>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so
>>>>>>> this is not a code issue.
>>>>>>>
>>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>>> At least, KenLM complains that not all 4-grams are present although
>>>>>>> there are 5-grams that contain them.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>> Thanks,
>>>>>>> Marcin
>>>>>
>>>>>
>>>>
>>
>
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
