Hi Nicola,

I am very familiar with the way smoothing works with Kneser-Ney, but
I have no idea how to properly handle singleton pruning.

But be that as it may:

In the example I cite, the trigram "schválení těchto zpráv" occurs only
in one context: following "proti". Why is it included in the n-gram model?

-phi

On Wed, Nov 14, 2012 at 11:30 AM, Nicola Bertoldi <berto...@fbk.eu> wrote:
> Modified ShiftBeta (aka modified Kneser-Ney) does not consider the real
> counts for computing probabilities, but the corrected counts, which basically
> are the number of different successors of an n-gram.
> Hence in this case your bigram "schválení těchto" always occurs before
> "zpráv", and so it behaves like a "singleton".
>
> Please refer to this paper for more details about this smoothing technique:
> Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing
> techniques for language modeling. Computer Speech and Language, 13(4):359-394.
>
> Nicola
>
> On Nov 14, 2012, at 4:50 PM, Philipp Koehn wrote:
>
>> Hi,
>>
>> I encountered the same problem when using "msb" and
>> pruned singletons on large corpora (Europarl):
>> SRILM's ngram complains about "no bow for prefix of ngram".
>>
>> Here is a Czech example:
>>
>> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
>> -2.35639      schválení těchto zpráv  -0.198088
>> -0.390525     schválení těchto zpráv ,
>> -0.390525     proti schválení těchto zpráv
>>
>> There should be an entry for the bigram "schválení těchto".
>>
>> I do not see how this could happen - the n-gram occurs twice in the corpus:
>>
>>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
>> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>> zpráv , které se neomezují pouze na eurozónu .
>> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>> zpráv , které se neomezují pouze na eurozónu .
>>
>> I suspect that the current implementation throws out higher-order n-grams
>> if they occur in _one_context_, not _once_.
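>>
>> A minimal sketch of the two conditions (my own illustration, with made-up
>> names):
>>
>>     def prune_if_occurs_once(ngram, raw_count):
>>         # what singleton pruning is usually expected to do:
>>         # drop the n-gram only if it occurs exactly once
>>         return raw_count[ngram] == 1
>>
>>     def prune_if_one_context(ngram, contexts):
>>         # what the implementation appears to do: contexts[ngram] is
>>         # the set of distinct words preceding the n-gram
>>         return len(contexts[ngram]) == 1
>>
>>     # "schválení těchto" has raw count 2 but only the single context
>>     # "proti", so the second test prunes it while the first would not.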
>>
>> -phi
>>
>> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
>> <junc...@amu.edu.pl> wrote:
>>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>>> seems to help. Sometimes IRSTLM is better, other times I have better
>>> results with SRILM, so on average they seem to be on par now.
>>>
>>> It is interesting, however, that you say there should be no differences. I
>>> never manage to get the same BLEU scores on a test set for IRSTLM and
>>> SRILM. I will have to do some reading on this dub issue and see what happens.
>>>
>>> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>>>> From the FBK community...
>>>>
>>>> as already mentioned by Ken,
>>>>
>>>> tlm correctly computes the "Improved Kneser-Ney" method (-lm=msb)
>>>>
>>>> tlm can keep the singletons: set the parameter -ps=no
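>>>>
>>>> For example (the file names here are just placeholders):
>>>>
>>>>     tlm -tr=train.bin -n=5 -lm=msb -ps=no -bo=yes -o=train.lm.arpa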
>>>>
>>>> As concerns OOV words, tlm computes the probability of the OOV as if it
>>>> were a class of all possible unknown words.
>>>> In order to get the actual probability of one single OOV token, tlm requires
>>>> that a Dictionary Upper Bound be set.
>>>> The Dictionary Upper Bound is intended to be a rough estimate of the
>>>> dictionary size (a reasonable value could be 10^7, which is also the
>>>> default).
>>>> Note that having the same Dictionary Upper Bound (dub) value is
>>>> useful/mandatory to properly compare different LMs in terms of perplexity.
>>>> Moreover, note that the dub value is not stored in the saved LM.
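>>>>
>>>> As a sketch of the idea (my own illustration; that the OOV class mass is
>>>> spread uniformly over the unseen part of the dictionary is an assumption):
>>>>
>>>>     import math
>>>>
>>>>     def oov_token_logprob(logprob_oov_class, dub, vocab_size):
>>>>         # log10 prob of one single unknown token, spreading the OOV
>>>>         # class mass over the (dub - |V|) words never seen in training
>>>>         return logprob_oov_class - math.log10(dub - vocab_size)
>>>>
>>>>     # e.g. with log10 P(OOV class) = -4, dub = 10**7 and |V| = 10**5:
>>>>     # -4 - log10(9.9e6) ≈ -11.0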
>>>>
>>>> In IRSTLM, you can/have to set this value with the parameter -dub when
>>>> you compute the perplexity, either with tlm or compile-lm.
>>>> In MOSES, you can/have to set this parameter with "-lmodel-dub".
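>>>>
>>>> For example, something along these lines (file names are placeholders):
>>>>
>>>>     compile-lm --eval=test.txt --dub=10000000 train.lm.arpa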
>>>>
>>>> I remind you that you can use an LM estimated by means of the IRSTLM
>>>> toolkit directly in MOSES by setting the first field of the "-lmodel-file"
>>>> parameter to "1", without transforming it with build-binary.
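>>>>
>>>> That is, something like this in moses.ini (assuming the usual four fields,
>>>> LM type, factor, order, file; the path is a placeholder):
>>>>
>>>>     [lmodel-file]
>>>>     1 0 5 /path/to/lm.en.arpa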
>>>>
>>>>
>>>> As concerns the differences between IRSTLM and SRILM, they should not be
>>>> there.
>>>> Have you noticed a difference in the perplexity as well?
>>>> Maybe you can send us a tiny benchmark (data and the commands used) in which
>>>> you experience such a difference, so that we can debug it.
>>>>
>>>>
>>>>
>>>> Nicola
>>>>
>>>>
>>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>>>
>>>>> Hi Pratyush,
>>>>> Thanks for the hint. That solved the problem I had with the arpa files
>>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>>> improve the performance of IRSTLM much when compared to SRILM. So I guess
>>>>> I will have to stick with SRILM for now.
>>>>>
>>>>> Kenneth, weren't you working on your own tool to produce language models?
>>>>> Best,
>>>>> Marcin
>>>>>
>>>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>>>> Hi Marcin,
>>>>>>
>>>>>> I have used msb with IRSTLM, and it seems to have worked fine for me.
>>>>>>
>>>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>>>> complains of missing 4-grams, 3-grams, etc.?
>>>>>> Have you tried using the -ps=no option with tlm?
>>>>>>
>>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>>> size of the LM (tlm has this on by default).
>>>>>>
>>>>>> If you use the -ps=no option, KenLM usually does not complain. I have also
>>>>>> used such LMs with SRILM for further mixing and it went fine.
>>>>>>
>>>>>> I am sure somebody from the IRSTLM community can confirm this.
>>>>>>
>>>>>> Hope this resolves the issue.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Pratyush
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>
>>>>>>    On the irstlm page it says:
>>>>>>
>>>>>>    'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>>>
>>>>>>    Unfortunately I cannot use "msb" because it seems to produce faulty
>>>>>>    arpa files for 5-grams. So I am trying only "shift-beta", whatever
>>>>>>    that means. Maybe that's the main problem?
>>>>>>    Also, my data sets are not that small, the plain arpa files currently
>>>>>>    exceed 20 GB.
>>>>>>
>>>>>>    Best,
>>>>>>    Marcin
>>>>>>
>>>>>>    On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>>> effective -- especially on smaller data sets.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jon
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>>> Hi,
>>>>>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>>>>>> what set of parameters I have to use with IRSTLM to create LMs that are
>>>>>>>> equivalent to language models created with SRILM using the following
>>>>>>>> command:
>>>>>>>>
>>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>>> input.en -lm lm.en.arpa
>>>>>>>>
>>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>>>
>>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>>>
>>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>>> quality and size. The translation results, however, are still
>>>>>>>> consistently worse than with SRILM models; differences in BLEU are up
>>>>>>>> to 1%.
>>>>>>>>
>>>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so this
>>>>>>>> is not a code issue.
>>>>>>>>
>>>>>>>> Also it seems IRSTLM has a bug with the modified shift-beta option. At
>>>>>>>> least KenLM complains that not all 4-grams are present although there
>>>>>>>> are 5-grams that contain them.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>> Thanks,
>>>>>>>> Marcin
>>>>>>
>>>>>>
>>>>>
>>>
>>
>

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
