Hi,

I added a script ( scripts/generic/trainlm-irst2.perl ) that works with the latest version of IRSTLM, and added instructions to the example config files - even if training with pruned singletons causes follow-up steps (KenLM binarization and interpolation) to balk.
-phi

On Wed, Nov 14, 2012 at 4:46 PM, Jonathan Clark <[email protected]> wrote:
> Nicola,
>
> On an unrelated note, could you say why the smoothing technique is
> called Modified ShiftBeta in IRSTLM? I know it was originally called
> Improved Kneser-Ney and sometimes "Simplified" Kneser-Ney (Interspeech
> 2008), which hinted that it varied from the original description of
> Modified Kneser-Ney in some way. I've been curious about this for
> years and have never found a good opportunity to ask.
>
> Cheers,
> Jon
>
>
> On Wed, Nov 14, 2012 at 11:30 AM, Nicola Bertoldi <[email protected]> wrote:
>> Modified ShiftBeta (aka modified Kneser-Ney) does not consider the real
>> counts for computing probabilities, but the corrected counts, which are
>> basically the number of different successors of an n-gram.
>> Hence in this case your bigram "schválení těchto" always occurs before
>> "zpráv", and hence it behaves like a "singleton".
>>
>> Please refer to this paper for more details about this smoothing technique:
>> Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing
>> techniques for language modeling. Computer Speech and Language,
>> 13(4):359-394.
>>
>> Nicola
>>
>> On Nov 14, 2012, at 4:50 PM, Philipp Koehn wrote:
>>
>>> Hi,
>>>
>>> I encountered the same problem when using "msb" and
>>> pruned singletons on large corpora (Europarl).
>>> SRILM's ngram complains about "no bow for prefix of ngram".
>>>
>>> Here is a Czech example:
>>>
>>> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
>>> -2.35639 schválení těchto zpráv -0.198088
>>> -0.390525 schválení těchto zpráv ,
>>> -0.390525 proti schválení těchto zpráv
>>>
>>> There should be an entry for the bigram "schválení těchto".
>>>
>>> I do not see how this could happen - the n-gram occurs twice in the corpus:
>>>
>>>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
>>> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
>>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>>> zpráv , které se neomezují pouze na eurozónu .
>>> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
>>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>>> zpráv , které se neomezují pouze na eurozónu .
>>>
>>> I suspect that the current implementation throws out higher-order n-grams
>>> if they occur in _one_context_, not _once_.
>>>
>>> -phi
>>>
>>> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
>>> <[email protected]> wrote:
>>>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>>>> seems to help. Then one time IRSTLM is better, another time I have better
>>>> results with SRILM. So on average they seem to be on par now.
>>>>
>>>> Interesting, however, that you say there should be no differences. I
>>>> never manage to get the same BLEU scores on a test set for IRSTLM and
>>>> SRILM. I have to do some reading on this dub issue and see what happens.
>>>>
>>>> On 08.11.2012 at 09:20, Nicola Bertoldi wrote:
>>>>>> From the FBK community...
>>>>>
>>>>> As already mentioned by Ken,
>>>>>
>>>>> tlm correctly computes the "Improved Kneser-Ney method" (-lm=msb).
>>>>>
>>>>> tlm can keep the singletons: set the parameter -ps=no.
>>>>>
>>>>> As concerns OOV words, tlm computes the probability of the OOV as if it
>>>>> were a class of all possible unknown words.
>>>>> In order to get the actual probability of one single OOV token, tlm
>>>>> requires that a Dictionary Upper Bound is set.
>>>>> The Dictionary Upper Bound is intended to be a rough estimate of the
>>>>> dictionary size (a reasonable value could be 10e+7, which is also the
>>>>> default).
>>>>> Note that having the same Dictionary Upper Bound (dub) value is
>>>>> useful/mandatory to properly compare different LMs in terms of perplexity.
>>>>> Moreover, note that the dub value is not stored in the saved LM.
>>>>>
>>>>> In IRSTLM, you can/have to set this value with the parameter -dub
>>>>> when you compute the perplexity with either tlm or compile-lm.
>>>>> In Moses, you can/have to set this parameter with "-lmodel-dub".
>>>>>
>>>>> I remind you that you can use the LM estimated by means of the IRSTLM
>>>>> toolkit directly in Moses by setting the first field of the
>>>>> "-lmodel-file" parameter to "1",
>>>>> without transforming it with build-binary.
>>>>>
>>>>> As concerns the difference between IRSTLM and SRILM, there should not
>>>>> be any.
>>>>> Have you noticed a difference also in the perplexity?
>>>>> Maybe you can send us a tiny benchmark (data and commands used) in which
>>>>> you experience such a difference,
>>>>> so that we can debug it.
>>>>>
>>>>> Nicola
>>>>>
>>>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>>>>
>>>>>> Hi Pratyush,
>>>>>> Thanks for the hint. That solved the problem I had with the ARPA files
>>>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>>>> improve the performance of IRSTLM much when compared to SRILM. So I
>>>>>> guess I will have to stick with SRILM for now.
>>>>>>
>>>>>> Kenneth, weren't you working on your own tool to produce language models?
>>>>>> Best,
>>>>>> Marcin
>>>>>>
>>>>>> On 07.11.2012 at 11:18, Pratyush Banerjee wrote:
>>>>>>> Hi Marcin,
>>>>>>>
>>>>>>> I have used msb with IRSTLM... it seems to have worked fine for me...
>>>>>>>
>>>>>>> You mentioned faulty ARPA files for 5-grams... is it because KenLM
>>>>>>> complains of missing 4-grams, 3-grams, etc.?
>>>>>>> Have you tried using the -ps=no option with tlm?
>>>>>>>
>>>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>>>> size of the LM... (tlm has it on by default...)
>>>>>>>
>>>>>>> If you use this option, usually KenLM does not complain... I have also
>>>>>>> used such LMs with SRILM for further mixing and it went fine...
>>>>>>>
>>>>>>> I am sure somebody from the IRSTLM community could confirm this...
>>>>>>>
>>>>>>> Hope this resolves the issue...
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>>
>>>>>>> Pratyush
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> On the IRSTLM page it says:
>>>>>>>
>>>>>>> 'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>>>>
>>>>>>> Unfortunately I cannot use "msb" because it seems to produce
>>>>>>> faulty ARPA files for 5-grams. So I am trying only "shift-beta",
>>>>>>> whatever that means. Maybe that's the main problem?
>>>>>>> Also, my data sets are not that small; the plain ARPA files currently
>>>>>>> exceed 20 GB.
>>>>>>>
>>>>>>> Best,
>>>>>>> Marcin
>>>>>>>
>>>>>>> On 06.11.2012 at 22:15, Jonathan Clark wrote:
>>>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>>>> effective -- especially on smaller data sets.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jon
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> Slightly off-topic, but I am out of ideas.
>>>>>>>>> I am trying to figure out
>>>>>>>>> what set of parameters I have to use with IRSTLM to create LMs that
>>>>>>>>> are equivalent to language models created with SRILM using the
>>>>>>>>> following command:
>>>>>>>>>
>>>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>>>> input.en -lm lm.en.arpa
>>>>>>>>>
>>>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>>>>
>>>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>>>>
>>>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>>>> quality and size. The translation results, however, are still
>>>>>>>>> consistently worse than with SRILM models; differences in BLEU are
>>>>>>>>> up to 1%.
>>>>>>>>>
>>>>>>>>> I use KenLM with Moses to binarize the resulting ARPA files, so this
>>>>>>>>> is not a code issue.
>>>>>>>>>
>>>>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>>>>> At least KenLM complains that not all 4-grams are present although
>>>>>>>>> there are 5-grams that contain them.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> Thanks,
>>>>>>>>> Marcin
>>>>>>>>> _______________________________________________
>>>>>>>>> Moses-support mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
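[Editor's note] Nicola's explanation above, that singleton pruning operates on corrected counts (the number of distinct contexts) rather than raw counts, can be sketched in a few lines of Python. This is an illustration only, not IRSTLM's actual code; the function `corrected_counts` and the toy corpus are made up for the example.

```python
from collections import defaultdict

def corrected_counts(sentences, n=2):
    """For each n-gram, count the number of DISTINCT successor words.

    Mirrors the "corrected counts" described in the thread: pruning
    singletons on corrected counts drops n-grams seen in only one
    context, not n-grams seen only once.
    (Illustration only, not IRSTLM's implementation.)
    """
    successors = defaultdict(set)
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for i in range(len(toks) - n):
            ngram = tuple(toks[i:i + n])
            successors[ngram].add(toks[i + n])
    return {ng: len(s) for ng, s in successors.items()}

# Toy corpus modeled on the two Europarl sentences quoted above.
corpus = [
    "hlasovali jsme proti schválení těchto zpráv ,",
    "a hlasovali jsme proti schválení těchto zpráv .",
]
cc = corrected_counts(corpus, n=2)

# The bigram occurs twice in the corpus, but it is always followed by
# "zpráv", so its corrected count is 1 and singleton pruning drops it.
print(cc[("schválení", "těchto")])  # -> 1
```

On these two sentences the bigram ("schválení", "těchto") has a raw count of 2 but only one distinct successor, which is consistent with Philipp's observation that the implementation appears to discard higher-order n-grams that occur in one context rather than once.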
