Hi,

I encountered the same problem when using "msb" with
pruned singletons on a large corpus (Europarl).
SRILM's ngram complains about "no bow for prefix of ngram".

Here is a Czech example:

grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
-2.35639        schválení těchto zpráv  -0.198088
-0.390525       schválení těchto zpráv ,
-0.390525       proti schválení těchto zpráv

There should be an entry for the bigram "schválení těchto".
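
A direct check (a sketch, assuming GNU grep and the usual tab-separated
ARPA layout of log-prob, n-gram, optional back-off):

grep -P '\tschválení těchto(\t|$)' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38

This should print the bigram's own line from the \2-grams: section;
judging from the grep output above, there is none.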

I do not see how this could happen: the n-gram occurs twice in the corpus:

> grep 'schválení těchto zpráv' lm/europarl.truecased.16
zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
zpráv , které se neomezují pouze na eurozónu .
zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
zpráv , které se neomezují pouze na eurozónu .

I suspect that the current implementation throws out higher-order n-grams
if they occur in _one_context_, not _once_.
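
A quick way to tell the two apart (a sketch, using the same corpus file):

grep -o 'schválení těchto' lm/europarl.truecased.16 | wc -l
grep -oE '[^ ]+ schválení těchto' lm/europarl.truecased.16 | sort -u | wc -l

The first counts raw occurrences, the second distinct left contexts. In
both sentences above the bigram follows "proti", so if the counts come
out as 2 and 1, a pruning rule keyed to distinct contexts would drop the
bigram even though it is not a singleton.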

-phi

On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
<junc...@amu.edu.pl> wrote:
> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
> seems to help. Sometimes IRSTLM is better, other times SRILM, so on
> average they now seem to be on par.
>
> It is interesting, however, that you say there should be no differences.
> I have never managed to get the same BLEU scores on a test set for IRSTLM
> and SRILM. I will have to do some reading on this dub issue and see what
> happens.
>
> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>> From the FBK community...
>>
>> As already mentioned by Ken:
>>
>> tlm correctly computes the "improved Kneser-Ney" method (-lm=msb).
>>
>> tlm can keep singletons: set the parameter -ps=no.
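>>
>> For example, reusing the file names from Marcin's commands further down
>> in this thread (a sketch):
>>
>>     tlm -tr=lm.en.bin -lm=msb -ps=no -bo=yes -n=5 -o=lm.en.arpa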
>>
>> As for OOV words: tlm computes the probability of an OOV as if it were
>> a class of all possible unknown words.
>> In order to get the actual probability of one single OOV token, tlm
>> requires that a Dictionary Upper Bound be set.
>> The Dictionary Upper Bound is intended to be a rough estimate of the
>> dictionary size (a reasonable value is 10^7, which is also the default).
>> Note that using the same Dictionary Upper Bound (dub) value is mandatory
>> to properly compare different LMs in terms of perplexity.
>> Note also that the dub value is not stored in the saved LM.
>>
>> In IRSTLM, you set this value with the parameter -dub when you compute
>> the perplexity with either tlm or compile-lm.
>> In Moses, you set this parameter with "-lmodel-dub".
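>>
>> For example (a sketch; file names are placeholders, and I am assuming
>> the usual compile-lm option syntax here):
>>
>>     compile-lm lm.en.arpa --eval=test.en --dub=10000000
>>
>> and, for Moses, add "-lmodel-dub 10000000" to the decoder options.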
>>
>> I recall that you can use an LM estimated with the IRSTLM toolkit
>> directly in Moses by setting the first field of the "-lmodel-file"
>> parameter to "1", without transforming it with build-binary.
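>>
>> For example, in moses.ini (a sketch; the factor, order, and path are
>> placeholders):
>>
>>     [lmodel-file]
>>     1 0 5 /path/to/lm.en.arpa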
>>
>>
>> As for the differences between IRSTLM and SRILM: they should not be there.
>> Have you noticed a difference in the perplexity as well?
>> Perhaps you can send us a tiny benchmark (data and the commands used) in
>> which you experience such a difference, so that we can debug it.
>>
>>
>>
>> Nicola
>>
>>
>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>
>>> Hi Pratyush,
>>> Thanks for the hint. That solved the problem I had with the arpa files
>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>> improve the performance of IRSTLM much compared to SRILM, so I guess I
>>> will have to stick with SRILM for now.
>>>
>>> Kenneth, weren't you working on your own tool to produce language models?
>>> Best,
>>> Marcin
>>>
>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>> Hi Marcin,
>>>>
>>>> I have used msb with IRSTLM, and it seems to have worked fine for me.
>>>>
>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>> complains about missing 4-grams, 3-grams, etc.?
>>>> Have you tried the -ps=no option with tlm?
>>>>
>>>> IRSTLM is known to prune singleton n-grams in order to reduce the size
>>>> of the LM (tlm has this on by default).
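>>>>
>>>> A quick way to see the effect in the output (a sketch, assuming a
>>>> plain-text arpa file): the \data\ header lists the n-gram counts per
>>>> order, so
>>>>
>>>>     head lm.en.arpa
>>>>
>>>> shows at a glance whether e.g. the 4-gram count looks implausibly small
>>>> next to the 5-gram count.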
>>>>
>>>> If you use this option, KenLM usually does not complain. I have also
>>>> used such LMs with SRILM for further mixing, and it went fine.
>>>>
>>>> I am sure somebody from the IRSTLM community could confirm this...
>>>>
>>>> Hope this resolves the issue...
>>>>
>>>> Thanks and Regards,
>>>>
>>>> Pratyush
>>>>
>>>>
>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>> <junc...@amu.edu.pl> wrote:
>>>>
>>>>     On the irstlm page it says:
>>>>
>>>>     'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>
>>>>     Unfortunately, I cannot use "msb" because it seems to produce faulty
>>>>     arpa files for 5-grams. So I am trying plain "shift-beta", whatever
>>>>     that means. Maybe that's the main problem?
>>>>     Also, my data sets are not that small; the plain arpa files currently
>>>>     exceed 20 GB.
>>>>
>>>>     Best,
>>>>     Marcin
>>>>
>>>>     On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>> effective -- especially on smaller data sets.
>>>>>
>>>>> Cheers,
>>>>> Jon
>>>>>
>>>>>
>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>> <junc...@amu.edu.pl> wrote:
>>>>>> Hi,
>>>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>>>> what set of parameters I have to use with IRSTLM to create LMs that
>>>>>> are equivalent to language models created with SRILM using the
>>>>>> following command:
>>>>>>
>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>> input.en -lm lm.en.arpa
>>>>>>
>>>>>> So far, I have been using this chain of commands for IRSTLM:
>>>>>>
>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>
>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>> quality and size. The translation results, however, are still
>>>>>> consistently worse than with SRILM models; the differences in BLEU
>>>>>> are up to 1%.
>>>>>>
>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so this
>>>>>> is not a code issue.
>>>>>>
>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>> At least, KenLM complains that not all 4-grams are present, although
>>>>>> there are 5-grams that contain them.
>>>>>>
>>>>>> Any ideas?
>>>>>> Thanks,
>>>>>> Marcin

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
