Hi, I encountered the same problem when using "msb" with pruned singletons on large corpora (Europarl). SRILM's ngram complains about "no bow for prefix of ngram".
Here is a Czech example:

grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
-2.35639 schválení těchto zpráv -0.198088
-0.390525 schválení těchto zpráv ,
-0.390525 proti schválení těchto zpráv

There should be an entry for the bigram "schválení těchto". I do not see how this could happen - the ngram occurs twice in the corpus:

> grep 'schválení těchto zpráv' lm/europarl.truecased.16
zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s prostředky k jejich dosažení a hlasovali jsme proti schválení těchto zpráv , které se neomezují pouze na eurozónu .
zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s prostředky k jejich dosažení a hlasovali jsme proti schválení těchto zpráv , které se neomezují pouze na eurozónu .

I suspect that the current implementation throws out higher-order n-grams if they occur in _one_context_, not _once_.

-phi

On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
> seems to help. Then one time IRSTLM is better, another time I have better
> results with SRILM. So on average they seem to be on par now.
>
> Interesting, however, that you say there should be no differences. I
> never manage to get the same BLEU scores on a test set for IRSTLM and
> SRILM. I have to do some reading on this dub issue and see what happens.
>
> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>> From the FBK community...
>>
>> As already mentioned by Ken,
>>
>> tlm correctly computes the "Improved Kneser-Ney method" (-lm=msb)
>>
>> tlm can keep the singletons: set parameter -ps=no
>>
>> As concerns OOV words, tlm computes the probability of an OOV word as if
>> it were a class of all possible unknown words.
>> In order to get the actual probability of one single OOV token, tlm
>> requires that a Dictionary Upper Bound is set.
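phi's missing-prefix report can be checked mechanically. Below is a minimal sketch (my own, not part of any toolkit; the function name is made up) that scans the lines of an ARPA file and lists every (n-1)-gram prefix that a higher-order n-gram requires but that has no entry of its own -- the condition behind SRILM's "no bow for prefix of ngram" complaint:

```python
import re

def find_missing_prefixes(arpa_lines):
    """Return the (n-1)-gram prefixes that higher-order n-grams in an
    ARPA file need but that have no entry (and hence no backoff weight)
    of their own."""
    ngrams = {}  # order -> set of n-gram tuples seen in that section
    order = 0
    for line in arpa_lines:
        line = line.strip()
        m = re.match(r'\\(\d+)-grams:', line)
        if m:
            order = int(m.group(1))
            ngrams.setdefault(order, set())
            continue
        # skip blanks, \data\/\end\ markers, and the header counts
        if not line or line.startswith('\\') or order == 0:
            continue
        fields = line.split('\t')  # logprob <TAB> ngram [<TAB> backoff]
        if len(fields) < 2:
            continue
        ngrams[order].add(tuple(fields[1].split()))
    missing = set()
    for n in sorted(ngrams):
        for gram in ngrams[n]:
            if n > 1 and gram[:-1] not in ngrams.get(n - 1, set()):
                missing.add(' '.join(gram[:-1]))
    return sorted(missing)
```

On a toy ARPA file containing the trigram "schválení těchto zpráv" but no "schválení těchto" bigram entry, it reports exactly that missing prefix.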
>> The Dictionary Upper Bound is intended to be a rough estimate of the
>> dictionary size (a reasonable value could be 10e+7, which is also the
>> default).
>> Note that using the same Dictionary Upper Bound (dub) value is
>> useful/mandatory to properly compare different LMs in terms of perplexity.
>> Moreover, note that the dub value is not stored in the saved LM.
>>
>> In IRSTLM, you can/have to set this value with the parameter -dub when
>> you compute the perplexity with either tlm or compile-lm.
>> In Moses, you can/have to set this parameter with "-lmodel-dub".
>>
>> I remember you can use an LM estimated with the IRSTLM toolkit directly
>> in Moses by setting the first field of the "-lmodel-file" parameter to
>> "1", without transforming it with build-binary.
>>
>> As concerns the differences between IRSTLM and SRILM, they should not be
>> there.
>> Have you noticed differences in perplexity as well?
>> Maybe you can send us a tiny benchmark (data and commands used) in which
>> you experience such a difference, so that we can debug.
>>
>> Nicola
>>
>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>
>>> Hi Pratyush,
>>> Thanks for the hint. That solved the problem I had with the arpa files
>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>> improve the performance of IRSTLM much when compared to SRILM. So I
>>> guess I will have to stick with SRILM for now.
>>>
>>> Kenneth, weren't you working on your own tool to produce language models?
>>>
>>> Best,
>>> Marcin
>>>
>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>> Hi Marcin,
>>>>
>>>> I have used msb with irstlm... and it seems to have worked fine for me...
>>>>
>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>> complains of missing 4-grams, 3-grams, etc.?
>>>> Have you tried using the -ps=no option with tlm?
>>>>
>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>> size of the LM... (tlm has it on by default..)
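Nicola's point about the Dictionary Upper Bound can be made concrete with a toy calculation. The sketch below assumes, as his description suggests (this is my reading, not IRSTLM's actual code), that the <unk> probability mass is spread uniformly over the dub - |vocab| unseen word types; the function name is hypothetical:

```python
import math

def oov_logprob(unk_logprob, dub, vocab_size):
    """Per-token log10 probability of a single OOV word, assuming the
    <unk> class mass is divided uniformly among the dub - vocab_size
    word types the model has never seen. Rough intuition only."""
    return unk_logprob - math.log10(dub - vocab_size)
```

A larger dub spreads the mass thinner, so each OOV token gets a lower probability and the measured perplexity rises -- which is why two LMs must be evaluated with the same -dub value before their perplexities can be compared.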
>>>> If you use this option, usually KenLM does not complain... I have also
>>>> used such LMs with SRILM for further mixing and it went fine...
>>>>
>>>> I am sure somebody from the IRSTLM community can confirm this...
>>>>
>>>> Hope this resolves the issue...
>>>>
>>>> Thanks and Regards,
>>>>
>>>> Pratyush
>>>>
>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>> <junc...@amu.edu.pl> wrote:
>>>>
>>>> On the irstlm page it says:
>>>>
>>>> 'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>
>>>> Unfortunately I cannot use "msb" because it seems to produce faulty
>>>> arpa files for 5-grams. So I am trying only "shift-beta", whatever
>>>> that means. Maybe that's the main problem?
>>>> Also, my data sets are not that small; the plain arpa files currently
>>>> exceed 20 GB.
>>>>
>>>> Best,
>>>> Marcin
>>>>
>>>> On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>> effective -- especially on smaller data sets.
>>>>>
>>>>> Cheers,
>>>>> Jon
>>>>>
>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>> <junc...@amu.edu.pl> wrote:
>>>>>> Hi,
>>>>>> Slightly off-topic, but I am out of ideas.
>>>>>> I am trying to figure out what set of parameters I have to use with
>>>>>> IRSTLM to create LMs that are equivalent to language models created
>>>>>> with SRILM using the following command:
>>>>>>
>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>> input.en -lm lm.en.arpa
>>>>>>
>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>
>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>
>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>> quality and size. The translation results, however, are still
>>>>>> consistently worse than with SRILM models; the differences in BLEU
>>>>>> are up to 1%.
>>>>>>
>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so this
>>>>>> is not a code issue.
>>>>>>
>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>> At least KenLM complains that not all 4-grams are present although
>>>>>> there are 5-grams that contain them.
>>>>>>
>>>>>> Any ideas?
>>>>>> Thanks,
>>>>>> Marcin

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
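phi's closing hypothesis -- that the pruning may be counting distinct contexts rather than raw occurrences -- can be illustrated with a toy counter. This is a sketch of the distinction only, written for this thread (not IRSTLM's actual pruning code), using fragments of the Europarl sentences quoted above:

```python
from collections import Counter

def ngram_stats(sentences, n=3):
    """For each n-gram, return its total frequency and the number of
    distinct words immediately preceding it. Pruning 'singletons' by
    distinct-context count instead of raw frequency would drop an
    n-gram that occurs many times but always after the same word."""
    freq = Counter()
    contexts = {}
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            gram = tuple(toks[i:i + n])
            freq[gram] += 1
            left = toks[i - 1] if i > 0 else '<s>'
            contexts.setdefault(gram, set()).add(left)
    return freq, {g: len(c) for g, c in contexts.items()}
```

In both corpus sentences, "schválení těchto zpráv" occurs after "proti": frequency 2, but only one distinct context -- so frequency-based singleton pruning keeps it, while context-based pruning would discard it, matching the missing bigram entry phi observed.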