Hi,

I added a script ( scripts/generic/trainlm-irst2.perl ) that works with the latest version of IRSTLM, and added instructions to the example config files - even if training with pruned singletons causes follow-up steps (KenLM binarization and interpolation) to balk.
-phi

On Wed, Nov 14, 2012 at 4:46 PM, Jonathan Clark <[email protected]> wrote:
> Nicola,
>
> On an unrelated note, could you say why the smoothing technique is
> called Modified ShiftBeta in IRSTLM? I know it was originally called
> Improved Kneser-Ney and sometimes "Simplified" Kneser-Ney (Interspeech
> 2008), which hinted that it varied from the original description of
> Modified Kneser-Ney in some way. I've been curious about this for
> years and have never found a good opportunity to ask.
>
> Cheers,
> Jon
>
>
> On Wed, Nov 14, 2012 at 11:30 AM, Nicola Bertoldi <[email protected]> wrote:
>> Modified ShiftBeta (aka modified Kneser-Ney) does not consider the real
>> counts for computing probabilities, but the corrected counts, which are
>> basically the number of different successors of an n-gram.
>> Hence in this case your bigram "schválení těchto" always occurs before
>> "zpráv", and hence it behaves like a "singleton".
>>
>> Please refer to this paper for more details about this smoothing technique:
>> Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing
>> techniques for language modeling. Computer Speech and Language,
>> 13(4):359-394.
>>
>> Nicola
>>
>> On Nov 14, 2012, at 4:50 PM, Philipp Koehn wrote:
>>
>>> Hi,
>>>
>>> I encountered the same problem when using "msb" and
>>> pruned singletons on large corpora (Europarl).
>>> SRILM's ngram complains about "no bow for prefix of ngram".
>>>
>>> Here is a Czech example:
>>>
>>> grep 'schválení těchto' /home/pkoehn/experiment/wmt12-en-cs/lm/europarl.lm.38
>>> -2.35639 schválení těchto zpráv -0.198088
>>> -0.390525 schválení těchto zpráv ,
>>> -0.390525 proti schválení těchto zpráv
>>>
>>> There should be an entry for the bigram "schválení těchto".
>>>
>>> I do not see how this could happen - the n-gram occurs twice in the corpus:
>>>
>>>> grep 'schválení těchto zpráv' lm/europarl.truecased.16
>>> zatímco s dlouhodobými cíli souhlasíme , nemůžeme souhlasit s
>>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>>> zpráv , které se neomezují pouze na eurozónu .
>>> zatímco souhlasíme s dlouhodobými cíli , nemůžeme souhlasit s
>>> prostředky k jejich dosažení a hlasovali jsme proti schválení těchto
>>> zpráv , které se neomezují pouze na eurozónu .
>>>
>>> I suspect that the current implementation throws out higher-order n-grams
>>> if they occur in _one_context_, not _once_.
>>>
>>> -phi
>>>
>>> On Thu, Nov 8, 2012 at 3:31 AM, Marcin Junczys-Dowmunt
>>> <[email protected]> wrote:
>>>> Using -lm=msb instead of -lm=sb and testing on several evaluation sets
>>>> seems to help. Then one time IRSTLM is better, another time I have better
>>>> results with SRILM. So on average they seem to be on par now.
>>>>
>>>> Interesting, however, that you say there should be no differences. I
>>>> never manage to get the same BLEU scores on a test set for IRSTLM and
>>>> SRILM. I have to do some reading on this dub issue and see what happens.
>>>>
>>>> On 08.11.2012 at 09:20, Nicola Bertoldi wrote:
>>>>>> From the FBK community...
>>>>>
>>>>> As already mentioned by Ken,
>>>>>
>>>>> tlm correctly computes the "Improved Kneser-Ney method" (-lm=msb).
>>>>>
>>>>> tlm can keep the singletons: set the parameter -ps=no.
>>>>>
>>>>> As concerns OOV words, tlm computes the probability of the OOV as if it
>>>>> were a class of all possible unknown words.
>>>>> In order to get the actual probability of one single OOV token, tlm
>>>>> requires that a Dictionary Upper Bound is set.
>>>>> The Dictionary Upper Bound is intended to be a rough estimate of the
>>>>> dictionary size (a reasonable value could be 10e+7, which is also the
>>>>> default).
>>>>> Note that having the same Dictionary Upper Bound (dub) value is
>>>>> useful/mandatory to properly compare different LMs in terms of perplexity.
>>>>> Moreover, note that the dub value is not stored in the saved LM.
>>>>>
>>>>> In IRSTLM, you can/have to set this value with the parameter -dub
>>>>> when you compute the perplexity with either tlm or compile-lm.
>>>>> In Moses, you can/have to set this parameter with "-lmodel-dub".
>>>>>
>>>>> I remind you that you can use the LM estimated by means of the IRSTLM
>>>>> toolkit directly in Moses by setting the first field of the
>>>>> "-lmodel-file" parameter to "1",
>>>>> without transforming it with build-binary.
>>>>>
>>>>> As concerns the difference between IRSTLM and SRILM, there should not
>>>>> be any.
>>>>> Have you noticed a difference also in the perplexity?
>>>>> Maybe you can send us a tiny benchmark (data and commands used) in which
>>>>> you experience such a difference,
>>>>> so that we can debug it.
>>>>>
>>>>> Nicola
>>>>>
>>>>> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>>>>>
>>>>>> Hi Pratyush,
>>>>>> Thanks for the hint. That solved the problem I had with the ARPA files
>>>>>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>>>>>> improve the performance of IRSTLM much when compared to SRILM. So I
>>>>>> guess I will have to stick with SRILM for now.
>>>>>>
>>>>>> Kenneth, weren't you working on your own tool to produce language models?
>>>>>> Best,
>>>>>> Marcin
>>>>>>
>>>>>> On 07.11.2012 at 11:18, Pratyush Banerjee wrote:
>>>>>>> Hi Marcin,
>>>>>>>
>>>>>>> I have used msb with IRSTLM... it seems to have worked fine for me...
>>>>>>>
>>>>>>> You mentioned faulty ARPA files for 5-grams... is it because KenLM
>>>>>>> complains of missing 4-grams, 3-grams, etc.?
>>>>>>> Have you tried using the -ps=no option with tlm?
>>>>>>>
>>>>>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>>>>>> size of the LM... (tlm has it on by default...)
>>>>>>>
>>>>>>> If you use this option, usually KenLM does not complain... I have also
>>>>>>> used such LMs with SRILM for further mixing and it went fine...
>>>>>>>
>>>>>>> I am sure somebody from the IRSTLM community could confirm this...
>>>>>>>
>>>>>>> Hope this resolves the issue...
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>>
>>>>>>> Pratyush
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> On the IRSTLM page it says:
>>>>>>>
>>>>>>> 'Modified shift-beta, also known as "improved kneser-ney smoothing"'
>>>>>>>
>>>>>>> Unfortunately I cannot use "msb" because it seems to produce
>>>>>>> faulty ARPA files for 5-grams. So I am trying only "shift-beta",
>>>>>>> whatever that means. Maybe that's the main problem?
>>>>>>> Also, my data sets are not that small; the plain ARPA files currently
>>>>>>> exceed 20 GB.
>>>>>>>
>>>>>>> Best,
>>>>>>> Marcin
>>>>>>>
>>>>>>> On 06.11.2012 at 22:15, Jonathan Clark wrote:
>>>>>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>>>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>>>>>> implements modified shift-beta smoothing, which isn't quite as
>>>>>>>> effective -- especially on smaller data sets.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jon
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> Slightly off-topic, but I am out of ideas.
>>>>>>>>> I am trying to figure out
>>>>>>>>> what set of parameters I have to use with IRSTLM to create LMs that
>>>>>>>>> are equivalent to language models created with SRILM using the
>>>>>>>>> following command:
>>>>>>>>>
>>>>>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>>>>>>> input.en -lm lm.en.arpa
>>>>>>>>>
>>>>>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>>>>>
>>>>>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>>>>>
>>>>>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>>>>>> quality and size. The translation results, however, are still
>>>>>>>>> consistently worse than with SRILM models; differences in BLEU are
>>>>>>>>> up to 1%.
>>>>>>>>>
>>>>>>>>> I use KenLM with Moses to binarize the resulting ARPA files, so this
>>>>>>>>> is not a code issue.
>>>>>>>>>
>>>>>>>>> Also, it seems IRSTLM has a bug with the modified shift-beta option.
>>>>>>>>> At least KenLM complains that not all 4-grams are present although
>>>>>>>>> there are 5-grams that contain them.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> Thanks,
>>>>>>>>> Marcin
>>>>>>>>> _______________________________________________
>>>>>>>>> Moses-support mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
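[Editor's note] Nicola's explanation above, that singleton pruning operates on corrected counts (the number of distinct contexts) rather than raw counts, can be sketched in a few lines of Python. This is an illustration only, not IRSTLM's actual code; the function `corrected_counts` and the toy corpus are made up for the example.

```python
from collections import defaultdict

def corrected_counts(sentences, n=2):
    """For each n-gram, count the number of DISTINCT successor words.

    Mirrors the "corrected counts" described in the thread: pruning
    singletons on corrected counts drops n-grams seen in only one
    context, not n-grams seen only once.
    (Illustration only, not IRSTLM's implementation.)
    """
    successors = defaultdict(set)
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for i in range(len(toks) - n):
            ngram = tuple(toks[i:i + n])
            successors[ngram].add(toks[i + n])
    return {ng: len(s) for ng, s in successors.items()}

# Toy corpus modeled on the two Europarl sentences quoted above.
corpus = [
    "hlasovali jsme proti schválení těchto zpráv ,",
    "a hlasovali jsme proti schválení těchto zpráv .",
]
cc = corrected_counts(corpus, n=2)

# The bigram occurs twice in the corpus, but it is always followed by
# "zpráv", so its corrected count is 1 and singleton pruning drops it.
print(cc[("schválení", "těchto")])  # -> 1
```

On these two sentences the bigram ("schválení", "těchto") has a raw count of 2 but only one distinct successor, which is consistent with Philipp's observation that the implementation appears to discard higher-order n-grams that occur in one context rather than once.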
