Modified ShiftBeta (aka Modified Kneser-Ney) does not consider the raw counts for computing probabilities, but the corrected counts, which are basically the number of distinct successors of an n-gram. Hence, in this case your bigram "schválení těchto" always occurs before "zpráv", and it therefore behaves like a "singleton".
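The corrected-count idea can be illustrated with a small sketch (a hypothetical helper, not IRSTLM code; it counts distinct successors, as described above):

```python
from collections import defaultdict

def corrected_counts(sentences, n=2):
    # For each n-gram, count its distinct successor words instead of its
    # raw occurrences -- the "corrected count" described above.
    successors = defaultdict(set)
    for sent in sentences:
        tokens = sent.split()
        for i in range(len(tokens) - n):
            gram = tuple(tokens[i:i + n])
            successors[gram].add(tokens[i + n])
    return {gram: len(s) for gram, s in successors.items()}

corpus = [
    "hlasovali jsme proti schválení těchto zpráv ,",
    "hlasovali jsme proti schválení těchto zpráv ,",
]
counts = corrected_counts(corpus)
# "schválení těchto" occurs twice in the corpus, but always before
# "zpráv", so its corrected count is 1 -- it looks like a singleton
# and is dropped when singleton pruning is enabled.
```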
Please refer to this paper for more details about this smoothing technique:

Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359–393.

Nicola

On Nov 14, 2012, at 4:50 PM, Philipp Koehn wrote:

> Hi,
>
> I encountered the same problem when using "msb" and
> pruned singletons on large corpora (Europarl).
> SRILM's ngram complains about "no bow for prefix of ngram".
>
> Here is a Czech example:
>
> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
> -2.35639 schválení těchto zpráv -0.198088
> -0.390525 schválení těchto zpráv ,
> -0.390525 proti schválení těchto zpráv
>
> There should be an entry for the bigram "schválení těchto".
>
> I do not see how this could happen - the ngram occurs twice in the corpus:
>
>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
> zpráv , které se neomezují pouze na eurozónu .
> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
> zpráv , které se neomezují pouze na eurozónu .
>
> I suspect that the current implementation throws out higher-order n-grams
> if they occur in _one_context_, not _once_.
>
> -phi
>
> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
> <junc...@amu.edu.pl> wrote:
>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>> seems to help. Then one time IRSTLM is better, another time I have better
>> results with SRILM. So on average they seem to be on par now.
>>
>> Interesting, however, that you say there should be no differences. I
>> never manage to get the same BLEU scores on a test set for IRSTLM and
>> SRILM. I have to do some reading on this dub issue and see what happens.
>>
>> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>>>> From the FBK community...
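The gap Philipp describes can be spotted mechanically. Below is a rough checker (a hypothetical helper, assuming a plain-text ARPA file with tab-separated fields) that flags n-grams whose (n-1)-gram prefix has no entry of its own, which is exactly the "no bow for prefix of ngram" situation SRILM warns about:

```python
def find_missing_prefixes(arpa_path):
    # Collect every n-gram in the ARPA file, then report each n-gram
    # whose (n-1)-gram prefix has no entry of its own.
    grams = set()
    entries = []
    order = 0
    with open(arpa_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("\\") and line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
                continue
            if not line or line.startswith("\\") or order == 0:
                continue
            fields = line.split("\t")
            ngram = tuple(fields[1].split())
            grams.add(ngram)
            entries.append(ngram)
    missing = set()
    for ngram in entries:
        if len(ngram) > 1 and ngram[:-1] not in grams:
            missing.add(ngram[:-1])
    return missing
```

Run against the model above, it would report the absent bigram "schválení těchto" even though the trigram "schválení těchto zpráv" is present.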
>>>
>>> As already mentioned by Ken,
>>>
>>> tlm computes the "Improved Kneser-Ney method" (-lm=msb) correctly.
>>>
>>> tlm can keep the singletons: set the parameter -ps=no
>>>
>>> As concerns OOV words, tlm computes the probability of the OOV as if it
>>> were a class of all possible unknown words.
>>> In order to get the actual probability of one single OOV token, tlm
>>> requires that a Dictionary Upper Bound be set.
>>> The Dictionary Upper Bound is intended to be a rough estimate of the
>>> dictionary size (a reasonable value could be 10e+7, which is also the
>>> default).
>>> Note that using the same Dictionary Upper Bound (dub) value is
>>> useful/mandatory to properly compare different LMs in terms of perplexity.
>>> Moreover, note that the dub value is not stored in the saved LM.
>>>
>>> In IRSTLM, you can/have to set this value with the parameter -dub when
>>> you compute the perplexity with either tlm or compile-lm.
>>> In Moses, you can/have to set this parameter with "-lmodel-dub".
>>>
>>> I remind you that you can use an LM estimated with the IRSTLM toolkit
>>> directly in Moses by setting the first field of the "-lmodel-file"
>>> parameter to "1", without transforming it with build-binary.
>>>
>>> As concerns the difference between IRSTLM and SRILM, there should not
>>> be any.
>>> Have you noticed a difference in perplexity as well?
>>> Maybe you can send us a tiny benchmark (data and the commands used) in
>>> which you experience such a difference, so that we can debug it.
>>>
>>> Nicola
>>>
>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>>
>>>> Hi Pratyush,
>>>> Thanks for the hint. That solved the problem I had with the arpa files
>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>> improve the performance of IRSTLM much compared to SRILM. So I guess I
>>>> will have to stick with SRILM for now.
>>>>
>>>> Kenneth, weren't you working on your own tool to produce language models?
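The Dictionary Upper Bound mechanism can be sketched roughly as follows (an illustration of the idea only, not tlm's actual formula): the log-probability of the OOV class is spread uniformly over the words the dictionary is assumed to contain but the training data never saw.

```python
import math

def oov_token_logprob(oov_class_logprob, dub, observed_vocab):
    # Spread the OOV-class probability mass uniformly over the unseen
    # part of a dictionary of assumed size `dub`.
    unseen = dub - observed_vocab
    return oov_class_logprob - math.log10(unseen)

# A larger dub yields a lower per-token OOV log-probability, which is
# why comparing the perplexities of two LMs only makes sense when both
# are evaluated with the same dub value.
```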
>>>> Best,
>>>> Marcin
>>>>
>>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>>> Hi Marcin,
>>>>>
>>>>> I have used msb with IRSTLM... and it seems to have worked fine for me...
>>>>>
>>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>>> complains of missing 4-grams, 3-grams etc.?
>>>>> Have you tried using the -ps=no option with tlm?
>>>>>
>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>> size of the LM... (tlm has it on by default..)
>>>>>
>>>>> If you use this option, usually KenLM does not complain... I have also
>>>>> used such LMs with SRILM for further mixing and it went fine...
>>>>>
>>>>> I am sure somebody from the IRSTLM community could confirm this...
>>>>>
>>>>> Hope this resolves the issue...
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Pratyush
>>>>>
>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>> <junc...@amu.edu.pl> wrote:
>>>>>
>>>>> On the IRSTLM page it says:
>>>>>
>>>>> 'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>>
>>>>> Unfortunately I cannot use "msb" because it seems to produce faulty
>>>>> arpa files for 5-grams. So I am trying only "shift-beta", whatever
>>>>> that means. Maybe that's the main problem?
>>>>> Also, my data sets are not that small; the plain arpa files currently
>>>>> exceed 20 GB.
>>>>>
>>>>> Best,
>>>>> Marcin
>>>>>
>>>>> On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>> effective -- especially on smaller data sets.
>>>>>>
>>>>>> Cheers,
>>>>>> Jon
>>>>>>
>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>> Hi,
>>>>>>> Slightly off-topic, but I am out of ideas.
>>>>>>> I am trying to figure out
>>>>>>> what set of parameters I have to use with IRSTLM to create LMs
>>>>>>> that are equivalent to language models created with SRILM using
>>>>>>> the following command:
>>>>>>>
>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>> input.en -lm lm.en.arpa
>>>>>>>
>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>>
>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>>
>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>> quality and size. The translation results, however, are still
>>>>>>> consistently worse than with SRILM models; differences in BLEU
>>>>>>> are up to 1%.
>>>>>>>
>>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so
>>>>>>> this is not a code issue.
>>>>>>>
>>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta
>>>>>>> option. At least KenLM complains that not all 4-grams are present
>>>>>>> although there are 5-grams that contain them.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>> Thanks,
>>>>>>> Marcin
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> Moses-support@mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support