Hi Nicola,

I am very familiar with the way smoothing works with Kneser-Ney, but I have no idea how to properly handle singleton pruning.

But be this as it may: in the example I cite, the trigram "schválení těchto zpráv" occurs in only one context: following "proti". Why is it included in the ngram model?

-phi
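To make the "one context" behaviour concrete before diving into the thread below, here is a minimal Python sketch (written for this summary, not IRSTLM code) of one continuation-style "corrected count": an n-gram is counted by the number of distinct words that precede it, rather than by its raw frequency. (Counting distinct successors of the prefix, as Nicola describes below, is the closely related variant.) The example corpus mirrors the Czech case discussed in the replies.

    from collections import defaultdict

    def continuation_counts(sentences, n):
        """For every n-gram, count the number of DISTINCT words that
        precede it, instead of its raw frequency.  This is the
        'corrected count' idea behind Kneser-Ney style smoothing."""
        left_contexts = defaultdict(set)
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            for i in range(1, len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                left_contexts[ngram].add(tokens[i - 1])
        return {g: len(ctx) for g, ctx in left_contexts.items()}

    # The trigram occurs twice in the corpus, but in a single context:
    corpus = [
        "hlasovali jsme proti schválení těchto zpráv",
        "hlasovali jsme proti schválení těchto zpráv",
    ]
    cc = continuation_counts(corpus, n=3)
    # Raw count of the trigram is 2, but it is only ever preceded by
    # "proti", so its corrected count is 1 -- a "singleton" that a
    # prune-singletons setting (tlm's default, see -ps=no below) drops.
    print(cc[("schválení", "těchto", "zpráv")])  # -> 1

Under such counts, pruning "singletons" removes exactly the n-grams that occur in a single context, which matches the suspicion Philipp voices below.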
On Wed, Nov 14, 2012 at 11:30 AM, Nicola Bertoldi <berto...@fbk.eu> wrote:
> Modified ShiftBeta (aka modified Kneser-Ney) does not consider the real
> counts for computing probabilities, but the corrected counts, which basically
> are the number of different successors of an n-gram.
> Hence in this case your bigram "schválení těchto" always occurs before
> "zpráv", and hence it behaves like a "singleton".
>
> Please refer to this paper for more details about this smoothing technique:
> Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing
> techniques for language modeling. Computer Speech and Language, 13(4):359-393.
>
> Nicola
>
> On Nov 14, 2012, at 4:50 PM, Philipp Koehn wrote:
>
>> Hi,
>>
>> I encountered the same problem when using "msb" and
>> pruning singletons on large corpora (Europarl).
>> SRILM's ngram complains about "no bow for prefix of ngram".
>>
>> Here is a Czech example:
>>
>> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
>> -2.35639 schválení těchto zpráv -0.198088
>> -0.390525 schválení těchto zpráv ,
>> -0.390525 proti schválení těchto zpráv
>>
>> There should be an entry for the bigram "schválení těchto".
>>
>> I do not see how this could happen - the ngram occurs twice in the corpus:
>>
>>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
>> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>> zpráv , které se neomezují pouze na eurozónu .
>> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>> zpráv , které se neomezují pouze na eurozónu .
>>
>> I suspect that the current implementation throws out higher-order n-grams
>> if they occur in _one_context_, not _once_.
>>
>> -phi
>>
>> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
>> <junc...@amu.edu.pl> wrote:
>>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>>> seems to help. Then one time IRSTLM is better, another time I have better
>>> results with SRILM. So on average they seem to be on par now.
>>>
>>> Interesting, however, that you say there should be no differences. I
>>> never manage to get the same BLEU scores on a test set for IRSTLM and
>>> SRILM. I have to do some reading on this dub issue and see what happens.
>>>
>>> On 08.11.2012 09:20, Nicola Bertoldi wrote:
>>>> From the FBK community...
>>>>
>>>> As already mentioned by Ken,
>>>>
>>>> tlm correctly computes the "Improved Kneser-Ney method" (-lm=msb).
>>>>
>>>> tlm can keep the singletons: set parameter -ps=no.
>>>>
>>>> As concerns OOV words, tlm computes the probability of the OOV as if it
>>>> were a class of all possible unknown words.
>>>> In order to get the actual prob of one single OOV token, tlm requires
>>>> that a Dictionary Upper Bound is set.
>>>> The Dictionary Upper Bound is intended to be a rough estimate of the
>>>> dictionary size (a reasonable value could be 10e+7, which is also the
>>>> default).
>>>> Note that using the same Dictionary Upper Bound (dub) value is
>>>> useful/mandatory to properly compare different LMs in terms of perplexity.
>>>> Moreover, note that the dub value is not stored in the saved LM.
>>>>
>>>> In IRSTLM, you can/have to set this value with the parameter -dub when
>>>> you compute the perplexity, either with tlm or compile-lm.
>>>> In MOSES, you can/have to set this parameter with "-lmodel-dub".
>>>>
>>>> Remember that you can use the LM estimated by means of the IRSTLM
>>>> toolkit directly in MOSES by setting the first field of the
>>>> "-lmodel-file" parameter to "1", without transforming it with
>>>> build-binary.
>>>>
>>>> As concerns the differences between IRSTLM and SRILM, they should not
>>>> be there.
>>>> Have you noticed differences also in the perplexity?
>>>> Maybe you can send us a tiny benchmark (data and commands used) in which
>>>> you experience such differences, so that we can debug.
>>>>
>>>> Nicola
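One plausible reading of the Dictionary Upper Bound mechanism Nicola describes above, as a small Python sketch (the exact formula is an assumption for illustration, not IRSTLM source code): the probability mass of the OOV class is spread uniformly over the (dub - |vocabulary|) word types the model has never seen.

    import math

    def oov_token_logprob(unk_class_logprob, vocab_size, dub=10_000_000):
        """Turn the probability of the OOV *class* into the probability
        of a single OOV *token* by spreading it uniformly over the
        (dub - vocab_size) word types the model has never seen.
        Illustrative assumption about the mechanism, not IRSTLM source."""
        assert dub > vocab_size, "dub must exceed the observed vocabulary"
        return unk_class_logprob - math.log10(dub - vocab_size)

    # With the same LM, a larger dub makes each single OOV token less
    # likely, shifting the perplexity accordingly.
    print(oov_token_logprob(-3.0, vocab_size=250_000))  # ~ -9.99

This also makes clear why two LMs have to be evaluated with the same dub before their perplexities are comparable: the value enters every OOV term of the product.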
>>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>>>
>>>>> Hi Pratyush,
>>>>> Thanks for the hint. That solved the problem I had with the arpa files
>>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>>> improve the performance of IRSTLM much when compared to SRILM. So I
>>>>> guess I will have to stick with SRILM for now.
>>>>>
>>>>> Kenneth, weren't you working on your own tool to produce language models?
>>>>>
>>>>> Best,
>>>>> Marcin
>>>>>
>>>>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>>>>> Hi Marcin,
>>>>>>
>>>>>> I have used msb with IRSTLM, and it seems to have worked fine for me.
>>>>>>
>>>>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>>>>> complains of missing 4-grams, 3-grams, etc.?
>>>>>> Have you tried using the -ps=no option with tlm?
>>>>>>
>>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>>> size of the LM (tlm has it on by default).
>>>>>>
>>>>>> If you use this option, usually KenLM does not complain. I have also
>>>>>> used such LMs with SRILM for further mixing and it went fine.
>>>>>>
>>>>>> I am sure somebody from the IRSTLM community could confirm this.
>>>>>>
>>>>>> Hope this resolves the issue.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Pratyush
>>>>>>
>>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>
>>>>>> On the irstlm page it says:
>>>>>>
>>>>>> 'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>>>
>>>>>> Unfortunately I cannot use "msb" because it seems to produce faulty
>>>>>> arpa files for 5-grams. So I am trying only "shift-beta", whatever
>>>>>> that means. Maybe that's the main problem?
>>>>>> Also, my data sets are not that small; the plain arpa files currently
>>>>>> exceed 20 GB.
>>>>>>
>>>>>> Best,
>>>>>> Marcin
>>>>>>
>>>>>> On 06.11.2012 22:15, Jonathan Clark wrote:
>>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>>> effective -- especially on smaller data sets.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jon
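For reference, here is what the plain shift-beta idea looks like, as a small Python sketch of textbook absolute discounting (an illustration of the family Jonathan is contrasting with exact modified Kneser-Ney, not IRSTLM's actual estimator; the discount values are placeholders): a constant beta is subtracted from every observed count, and the freed mass goes to the backoff distribution. The "modified" variant, following Chen and Goodman (1999), uses three separate discounts for counts of 1, 2, and 3 or more.

    def shift_beta_prob(word, history_counts, backoff_prob, beta=0.5):
        """Absolute discounting ('shift beta'): subtract a constant
        beta from each observed count and reserve the freed mass
        lambda(h) for the backoff distribution.  Textbook formula for
        illustration, not IRSTLM's implementation."""
        total = sum(history_counts.values())
        n_types = len(history_counts)
        lam = beta * n_types / total          # reserved backoff mass
        discounted = max(history_counts.get(word, 0) - beta, 0) / total
        return discounted + lam * backoff_prob(word)

    # "Modified" shift beta, like modified Kneser-Ney, replaces the
    # single beta with three discounts chosen by the raw count
    # (placeholder values, estimated from data in practice):
    def modified_discount(count, betas=(0.5, 1.0, 1.4)):
        if count == 0:
            return 0.0
        return betas[min(count, 3) - 1]

    # Example: counts seen after some history; the backoff here is a
    # hypothetical stand-in for the lower-order distribution.
    counts = {"zpráv": 2}
    uniform = lambda w: 1.0 / 1000
    print(shift_beta_prob("zpráv", counts, uniform, beta=0.5))  # 0.75025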
>>>>>>>
>>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>>> Hi,
>>>>>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>>>>>> what set of parameters I have to use with IRSTLM to create LMs that
>>>>>>>> are equivalent to language models created with SRILM using the
>>>>>>>> following command:
>>>>>>>>
>>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>>> input.en -lm lm.en.arpa
>>>>>>>>
>>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>>>
>>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>>>
>>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>>> quality and size. The translation results, however, are still
>>>>>>>> consistently worse than with SRILM models; differences in BLEU are
>>>>>>>> up to 1%.
>>>>>>>>
>>>>>>>> I use KenLM with Moses to binarize the resulting arpa files, so this
>>>>>>>> is not a code issue.
>>>>>>>>
>>>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>>>> At least KenLM complains that not all 4-grams are present although
>>>>>>>> there are 5-grams that contain them.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>> Thanks,
>>>>>>>> Marcin
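Both symptoms in this thread, SRILM's "no bow for prefix of ngram" and KenLM's complaint about 4-grams missing under existing 5-grams, come down to an ARPA file containing an n-gram whose (n-1)-gram prefix has no entry of its own. A short Python sketch of a consistency check one could run before binarizing (a simplified reader written for illustration; it assumes a tab-separated plain-text ARPA file and is not KenLM's actual validation):

    def check_arpa_prefixes(path):
        """Report every n-gram (n >= 2) whose (n-1)-gram prefix has no
        entry of its own -- the situation behind the KenLM and SRILM
        complaints discussed in this thread."""
        ngrams, order = set(), 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line.startswith("\\") and line.endswith("-grams:"):
                    order = int(line[1:line.index("-")])   # e.g. \3-grams:
                elif order and line and not line.startswith("\\"):
                    fields = line.split("\t")              # logprob, ngram[, bow]
                    ngrams.add((order, tuple(fields[1].split(" "))))
        for order, words in sorted(ngrams):
            if order > 1 and (order - 1, words[:-1]) not in ngrams:
                print("missing prefix:", " ".join(words[:-1]),
                      "needed by:", " ".join(words))

    # check_arpa_prefixes("lm.en.arpa")

On the files discussed above, this would flag the trigram "schválení těchto zpráv" as lacking its bigram prefix "schválení těchto".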