Using -lm=msb instead of -lm=sb and testing on several evaluation sets seems to help. On some sets IRSTLM is better, on others I get better results with SRILM, so on average they now seem to be on par.
Interesting, however, that you say there should be no differences. I never manage to get the same BLEU scores on a test set with IRSTLM and SRILM. I will have to do some reading on this dub issue and see what happens.

On 08.11.2012 09:20, Nicola Bertoldi wrote:
> From the FBK community...
>
> As already mentioned by Ken,
>
> tlm correctly computes the "Improved Kneser-Ney method" (-lm=msb)
>
> tlm can keep the singletons: set parameter -ps=no
>
> As for OOV words, tlm computes the probability of the OOV as if it were
> a class of all possible unknown words.
> In order to get the actual probability of one single OOV token, tlm requires
> that a Dictionary Upper Bound is set.
> The Dictionary Upper Bound is intended to be a rough estimate of the
> dictionary size (a reasonable value could be 10e+7, which is also the default).
> Note that having the same Dictionary Upper Bound (dub) value is
> useful/mandatory to properly compare different LMs in terms of perplexity.
> Moreover, note that the dub value is not stored in the saved LM.
>
> In IRSTLM, you can/have to set this value with the parameter -dub when
> you compute the perplexity with either tlm or compile-lm.
> In MOSES, you can/have to set this parameter with "-lmodel-dub".
>
> Remember that you can use the LM estimated with the IRSTLM toolkit directly
> in MOSES by setting the first field of the "-lmodel-file" parameter to "1",
> without transforming it with build-binary.
>
>
> As for the differences between IRSTLM and SRILM, there should not be any.
> Have you noticed a difference in the perplexity as well?
> Maybe you can send us a tiny benchmark (data and commands used) in which you
> experience such a difference,
> so that we can debug.
>
>
>
> Nicola
>
>
> On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:
>
>> Hi Pratyush,
>> Thanks for the hint. That solved the problem I had with the arpa files
>> when using -lm=msb and KenLM.
>> Unfortunately, this does not seem to
>> improve performance of IRSTLM much when compared to SRILM. So I guess I
>> will have to stick with SRILM for now.
>>
>> Kenneth, weren't you working on your own tool to produce language models?
>> Best,
>> Marcin
>>
>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>> Hi Marcin,
>>>
>>> I have used msb with irstlm... but it seems to have worked fine for me...
>>>
>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>> complains of missing 4-grams, 3-grams etc.?
>>> Have you tried using the -ps=no option with tlm?
>>>
>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>> size of the LM... (tlm has it on by default...)
>>>
>>> If you use this option, usually KenLM does not complain... I have also
>>> used such LMs with SRILM for further mixing and it went fine...
>>>
>>> I am sure somebody from the IRSTLM community could confirm this...
>>>
>>> Hope this resolves the issue...
>>>
>>> Thanks and Regards,
>>>
>>> Pratyush
>>>
>>>
>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>> <junc...@amu.edu.pl> wrote:
>>>
>>> On the irstlm page it says:
>>>
>>> 'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>>>
>>> Unfortunately I cannot use "msb" because it seems to produce faulty arpa
>>> files for 5-grams. So I am trying only "shift-beta", whatever that means.
>>> Maybe that's the main problem?
>>> Also, my data sets are not that small, the plain arpa files currently
>>> exceed 20 GB.
>>>
>>> Best,
>>> Marcin
>>>
>>> On 06.11.2012 22:15, Jonathan Clark wrote:
>>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>>> implements modified shift-beta smoothing, which isn't quite as
>>>> effective -- especially on smaller data sets.
>>>>
>>>> Cheers,
>>>> Jon
>>>>
>>>>
>>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>> <junc...@amu.edu.pl> wrote:
>>>>> Hi,
>>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>>> what set of parameters I have to use with IRSTLM to create LMs that are
>>>>> equivalent to language models created with SRILM using the following
>>>>> command:
>>>>>
>>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text input.en -lm lm.en.arpa
>>>>>
>>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>>
>>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>>
>>>>> I know this is not quite the same, but it comes closest in terms of
>>>>> quality and size. The translation results, however, are still
>>>>> consistently worse than with SRILM models, differences in BLEU are up to
>>>>> 1%.
>>>>>
>>>>> I use KenLM with Moses to binarize the resulting arpa files, so this is
>>>>> not a code issue.
>>>>>
>>>>> Also it seems IRSTLM has a bug with the modified shift beta option. At
>>>>> least KenLM complains that not all 4-grams are present although there
>>>>> are 5-grams that contain them.
>>>>>
>>>>> Any ideas?
>>>>> Thanks,
>>>>> Marcin
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
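
For anyone reading the archive: the Dictionary Upper Bound arithmetic that Nicola describes in the thread above can be sketched roughly as follows. This is only an illustration under an assumption that is not spelled out in the thread, namely that the OOV class probability is shared uniformly among (dub - vocabulary size) unseen word types. The function names and all numbers are hypothetical, and none of this is IRSTLM code.

```python
import math


def oov_token_logprob(log10_p_oov_class, dub, vocab_size):
    """log10 probability of one single OOV token.

    ASSUMPTION (not confirmed in the thread): the mass of the OOV class is
    divided evenly among the (dub - vocab_size) word types that are assumed
    to exist but were never observed.
    """
    unseen_types = dub - vocab_size
    if unseen_types <= 0:
        raise ValueError("dub must be larger than the vocabulary size")
    return log10_p_oov_class - math.log10(unseen_types)


def perplexity(log10_probs):
    """Perplexity of a token sequence from its per-token log10 probabilities."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    log10_p_oov_class = -2.0          # the LM assigns 0.01 to the OOV class
    vocab_size = 100_000              # known word types in the LM
    in_vocab_scores = [-1.0, -2.0, -1.5]

    for dub in (1_000_000, 10_000_000):
        oov = oov_token_logprob(log10_p_oov_class, dub, vocab_size)
        ppl = perplexity(in_vocab_scores + [oov])
        print(f"dub={dub}: log10 P(OOV token)={oov:.3f}, perplexity={ppl:.1f}")
```

With the OOV class probability held fixed, a larger dub makes each single OOV token less probable and so raises the perplexity on any text containing OOVs. That is why two LMs can only be compared fairly in terms of perplexity when both are scored with the same dub value.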