Modified ShiftBeta (aka Modified Kneser-Ney) does not consider the real 
counts for computing probabilities, but the corrected counts, which are 
basically the number of different successors of an n-gram.
Hence in this case your bigram "schválení těchto" always occurs before "zpráv", 
and so it behaves like a "singleton".
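
To make this concrete, here is a toy Python sketch (illustrative only, not 
IRSTLM code) contrasting the real count with the corrected count for that 
bigram:

    from collections import defaultdict

    # Two sentences from the corpus, trimmed to the relevant words.
    sentences = [
        "proti schválení těchto zpráv".split(),
        "proti schválení těchto zpráv".split(),
    ]

    raw = defaultdict(int)    # real bigram frequency
    succ = defaultdict(set)   # distinct successors of each bigram

    for s in sentences:
        for i in range(len(s) - 2):
            bigram = (s[i], s[i + 1])
            raw[bigram] += 1            # counts occurrences (tokens)
            succ[bigram].add(s[i + 2])  # counts successor types

    bg = ("schválení", "těchto")
    print(raw[bg])        # 2 -> by real counts, not a singleton
    print(len(succ[bg]))  # 1 -> by corrected counts, it behaves like one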

Please refer to this paper for more details about this smoothing technique:
Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques 
for language modeling. Computer Speech and Language, 13(4):359–394.

Nicola

On Nov 14, 2012, at 4:50 PM, Philipp Koehn wrote:

> Hi,
> 
> I encountered the same problem when using "msb" and
> pruned singletons on large corpora (Europarl).
> SRILM's ngram complains about "no bow for prefix of ngram".
> 
> Here is a Czech example:
> 
> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
> -2.35639      schválení těchto zpráv  -0.198088
> -0.390525     schválení těchto zpráv ,
> -0.390525     proti schválení těchto zpráv
> 
> There should be an entry for the bigram "schválení těchto".
> 
> I do not see how this could happen - the ngram occurs twice in the corpus:
> 
>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
> zpráv , které se neomezují pouze na eurozónu .
> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
> zpráv , které se neomezují pouze na eurozónu .
> 
> I suspect that the current implementation throws out higher order n-grams
> if they occur in _one_context_, not _once_.
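> 
> In code terms, the suspicion looks like this (hypothetical Python, not the 
> actual IRSTLM source):
> 
>     raw = {("schválení", "těchto"): 2}               # occurs twice
>     contexts = {("schválení", "těchto"): {"proti"}}  # but in one context only
>     ngram = ("schválení", "těchto")
>     prune_if_once = raw[ngram] == 1                   # False: keep it
>     prune_if_one_context = len(contexts[ngram]) == 1  # True: it gets dropped
> 
> The second test would explain why the bigram is missing despite its count of 2.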
> 
> -phi
> 
> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
> <junc...@amu.edu.pl> wrote:
>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>> seems to help. Sometimes IRSTLM is better, other times I get better
>> results with SRILM, so on average they now seem to be on par.
>> 
>> Interesting, however, that you say there should be no differences. I
>> never manage to get the same BLEU scores on a test set for IRSTLM and
>> SRILM. I have to do some reading on this dub issue and see what happens.
>> 
>> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>>> From the FBK community...
>>> 
>>> As already mentioned by Ken,
>>> 
>>> tlm correctly computes the "Improved Kneser-Ney method" (-lm=msb)
>>> 
>>> tlm can keep the singletons: set parameter  -ps=no
>>> 
>>> As concerns OOV words, tlm computes the probability of the OOV as if it 
>>> were a class of all possible unknown words.
>>> In order to get the actual probability of one single OOV token, tlm requires 
>>> that a Dictionary Upper Bound is set.
>>> The Dictionary Upper Bound is intended to be a rough estimate of the 
>>> dictionary size (a reasonable value could be 10e+7, which is also the 
>>> default).
>>> Note that using the same Dictionary Upper Bound (dub) value is 
>>> useful/mandatory to properly compare different LMs in terms of perplexity.
>>> Moreover, note that the dub value is not stored in the saved LM.
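>>> 
>>> As a rough numerical sketch (the division formula below is my shorthand 
>>> for the idea, not a quote from the IRSTLM code):
>>> 
>>>     p_unk = 1e-4        # probability mass the LM gives the OOV class
>>>     vocab = 500_000     # words the model actually knows
>>>     dub = 10_000_000    # Dictionary Upper Bound (the 10e+7 default)
>>>     p_token = p_unk / (dub - vocab)  # prob of one specific unseen word
>>> 
>>> Changing dub changes p_token, and with it the perplexity of any text 
>>> containing OOVs, which is why two LMs must share the same dub value to 
>>> be comparable.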
>>> 
>>> In IRSTLM, you can/have to set this value with the parameter -dub when 
>>> you compute the perplexity, either with tlm or compile-lm.
>>> In Moses, you can/have to set this parameter with "-lmodel-dub".
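>>> 
>>> For example (the --eval flag spelling is from memory of compile-lm's 
>>> usage and may differ across IRSTLM versions; -dub is the parameter 
>>> named above):
>>> 
>>>     compile-lm lm.en.arpa --eval=test.en --dub=10000000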
>>> 
>>> I remind you that you can use the LM estimated by means of the IRSTLM 
>>> toolkit directly in Moses by setting the first field of the 
>>> "-lmodel-file" parameter to "1", 
>>> without transforming it with build-binary.
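>>> 
>>> For instance, a hypothetical moses.ini entry (field layout: type, factor, 
>>> order, file, per the usual Moses convention; the path is a placeholder):
>>> 
>>>     [lmodel-file]
>>>     1 0 5 /path/to/lm.en.arpa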
>>> 
>>> 
>>> As concerns the difference between IRSTLM and SRILM, there should not 
>>> be any.
>>> Have you noticed a difference also in the perplexity?
>>> Maybe you can send us a tiny benchmark (data and the commands used) in 
>>> which you experience such a difference,
>>> so that we can debug.
>>> 
>>> 
>>> 
>>> Nicola
>>> 
>>> 
>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>> 
>>>> Hi Pratyush,
>>>> Thanks for the hint. That solved the problem I had with the arpa files
>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>> improve performance of IRSTLM much when compared to SRILM. So I guess I
>>>> will have to stick with SRILM for now.
>>>> 
>>>> Kenneth, weren't you working on your own tool to produce language models?
>>>> Best,
>>>> Marcin
>>>> 
>>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>>> Hi Marcin,
>>>>> 
>>>>> I have used msb with IRSTLM... it seems to have worked fine for me...
>>>>> 
>>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>>> complains of missing 4-grams, 3-grams, etc.?
>>>>> Have you tried using -ps=no option with tlm ?
>>>>> 
>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>> size of the LM... (tlm has this on by default.)
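>>>>> 
>>>>> For example, adding it to the tlm call quoted at the bottom of this 
>>>>> thread:
>>>>> 
>>>>>     tlm -tr=lm.en.bin -lm=msb -ps=no -bo=yes -n=5 -o=lm.en.arpa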
>>>>> 
>>>>> If you use this option, usually KenLM does not complain... I have also
>>>>> used such LMs with SRILM for further mixing and it went fine...
>>>>> 
>>>>> I am sure somebody from the IRSTLM community could confirm this...
>>>>> 
>>>>> Hope this resolves the issue...
>>>>> 
>>>>> Thanks and Regards,
>>>>> 
>>>>> Pratyush
>>>>> 
>>>>> 
>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>> <junc...@amu.edu.pl> wrote:
>>>>> 
>>>>>    On the irstlm page it says:
>>>>> 
>>>>>    'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>> 
>>>>>    Unfortunately I cannot use "msb" because it seems to produce faulty
>>>>>    arpa files for 5-grams. So I am trying plain "shift-beta", whatever
>>>>>    that means. Maybe that's the main problem?
>>>>>    Also, my data sets are not that small; the plain arpa files
>>>>>    currently exceed 20 GB.
>>>>> 
>>>>>    Best,
>>>>>    Marcin
>>>>> 
>>>>>    On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>> effective -- especially on smaller data sets.
>>>>>> 
>>>>>> Cheers,
>>>>>> Jon
>>>>>> 
>>>>>> 
>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>> Hi,
>>>>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>>>>> what set of parameters I have to use with IRSTLM to create LMs that
>>>>>>> are equivalent to language models created with SRILM using the
>>>>>>> following command:
>>>>>>> 
>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>> input.en -lm lm.en.arpa
>>>>>>> 
>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>> 
>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>> 
>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>> quality and size. The translation results, however, are still
>>>>>>> consistently worse than with SRILM models; differences in BLEU are
>>>>>>> up to 1%.
>>>>>>> 
>>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so
>>>>>>> this is not a code issue.
>>>>>>> 
>>>>>>> Also it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>>> At least KenLM complains that not all 4-grams are present although
>>>>>>> there are 5-grams that contain them.
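>>>>>>> 
>>>>>>> A quick way to list the missing entries (a rough Python sketch; it 
>>>>>>> assumes a plain-text arpa file):
>>>>>>> 
>>>>>>>     from collections import defaultdict
>>>>>>> 
>>>>>>>     order, ngrams = 0, defaultdict(set)
>>>>>>>     with open("lm.en.arpa", encoding="utf-8") as f:
>>>>>>>         for line in f:
>>>>>>>             line = line.strip()
>>>>>>>             if line.endswith("-grams:"):  # section header, e.g. \3-grams:
>>>>>>>                 order = int(line[1:line.index("-")])
>>>>>>>             elif order and "\t" in line:  # logprob<TAB>ngram[<TAB>backoff]
>>>>>>>                 ngrams[order].add(tuple(line.split("\t")[1].split()))
>>>>>>> 
>>>>>>>     for n in sorted(ngrams):
>>>>>>>         if n == 1:
>>>>>>>             continue
>>>>>>>         for w in ngrams[n]:
>>>>>>>             if w[:-1] not in ngrams[n - 1]:
>>>>>>>                 print("missing prefix:", " ".join(w[:-1]))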
>>>>>>> 
>>>>>>> Any ideas?
>>>>>>> Thanks,
>>>>>>> Marcin
>>>>> 
>>>>> 
>>>> 
>> 
> 


_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
