From the FBK community...

As already mentioned by Ken, tlm correctly computes the "Improved
Kneser-Ney method" (-lm=msb).

tlm can keep the singletons: set the parameter -ps=no.

As for OOV words, tlm computes the probability of the OOV as if it were
a class of all possible unknown words. In order to get the actual
probability of one single OOV token, tlm requires that a Dictionary
Upper Bound is set. The Dictionary Upper Bound is intended to be a rough
estimate of the dictionary size (a reasonable value could be 10e+7,
which is also the default).

Note that using the same Dictionary Upper Bound (dub) value is useful,
if not mandatory, to properly compare different LMs in terms of
perplexity. Moreover, note that the dub value is not stored in the
saved LM.

In IRSTLM, you set this value with the parameter -dub when you compute
the perplexity with either tlm or compile-lm.
In MOSES, you set this parameter with "-lmodel-dub".

I remind you that you can use the LM estimated with the IRSTLM toolkit
directly in MOSES by setting the first field of the "-lmodel-file"
parameter to "1", without transforming it with build-binary.

As for the differences between IRSTLM and SRILM, there should not be
any. Have you noticed a difference in perplexity as well? Maybe you can
send us a tiny benchmark (data and the commands used) in which you see
such a difference, so that we can debug it.

Nicola

On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:

> Hi Pratyush,
> Thanks for the hint. That solved the problem I had with the arpa files
> when using -lm=msb and KenLM. Unfortunately, this does not seem to
> improve performance of IRSTLM much when compared to SRILM. So I guess I
> will have to stick with SRILM for now.
>
> Kenneth, weren't you working on your own tool to produce language models?
> Best,
> Marcin
>
> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>> Hi Marcin,
>>
>> I have used msb with irstlm... but seems to have worked fine for me...
>>
>> You mentioned faulty arpa files for 5-grams...
>> is it because KenLM complains of missing 4-grams, 3-grams etc.?
>> Have you tried using the -ps=no option with tlm?
>>
>> IRSTLM is known to prune singleton n-grams in order to reduce the
>> size of the LM... (tlm has it on by default..)
>>
>> If you use this option, usually KenLM does not complain... I have also
>> used such LMs with SRILM for further mixing and it went fine...
>>
>> I am sure somebody from the IRSTLM community could confirm this...
>>
>> Hope this resolves the issue...
>>
>> Thanks and Regards,
>>
>> Pratyush
>>
>>
>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>> <[email protected]> wrote:
>>
>> On the irstlm page it says:
>>
>> 'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>>
>> Unfortunately I cannot use "msb" because it seems to produce faulty
>> arpa files for 5-grams. So I am trying only "shift-beta", whatever
>> that means.
>> Maybe that's the main problem?
>> Also, my data sets are not that small, the plain arpa files currently
>> exceed 20 GB.
>>
>> Best,
>> Marcin
>>
>> On 06.11.2012 22:15, Jonathan Clark wrote:
>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>> implements modified shift-beta smoothing, which isn't quite as
>>> effective -- especially on smaller data sets.
>>>
>>> Cheers,
>>> Jon
>>>
>>>
>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>> <[email protected]> wrote:
>>>> Hi,
>>>> Slightly off-topic, but I am out of ideas.
>>>> I am trying to figure out
>>>> what set of parameters I have to use with IRSTLM to create LMs that are
>>>> equivalent to language models created with SRILM using the following
>>>> command:
>>>>
>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>> input.en -lm lm.en.arpa
>>>>
>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>>
>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>>
>>>> I know this is not quite the same, but it comes closest in terms of
>>>> quality and size. The translation results, however, are still
>>>> consistently worse than with SRILM models, differences in BLEU are
>>>> up to 1%.
>>>>
>>>> I use KenLM with Moses to binarize the resulting arpa files, so this
>>>> is not a code issue.
>>>>
>>>> Also it seems IRSTLM has a bug with the modified shift beta option.
>>>> At least KenLM complains that not all 4-grams are present although
>>>> there are 5-grams that contain them.
>>>>
>>>> Any ideas?
>>>> Thanks,
>>>> Marcin
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
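
A note on the Dictionary Upper Bound (dub) point made in this thread: the sketch below illustrates why perplexities are only comparable when the same dub is used. It assumes that the probability mass of the OOV class is spread uniformly over the (dub - vocabulary size) word types not in the dictionary; that rule, the function names, and all numbers are assumptions made for illustration, not taken from the IRSTLM code or from real data.

```python
import math


def single_oov_logprob(class_logprob10, vocab_size, dub):
    """Log10 probability of ONE unknown token.

    Assumption (not confirmed by the thread): the OOV class mass is
    divided uniformly among the (dub - vocab_size) unseen word types.
    """
    unseen_types = dub - vocab_size
    if unseen_types <= 0:
        raise ValueError("dub must be larger than the vocabulary size")
    return class_logprob10 - math.log10(unseen_types)


def perplexity(logprobs10):
    """Perplexity from a list of per-token log10 probabilities."""
    return 10 ** (-sum(logprobs10) / len(logprobs10))


if __name__ == "__main__":
    # Illustrative values only.
    vocab_size = 100_000
    oov_class_logprob = -2.0           # log10 P(OOV class)
    in_vocab_logprobs = [-1.5, -2.5, -1.0]

    # Same model, same test tokens, two different dub values.
    for dub in (1_000_000, 10_000_000):
        oov = single_oov_logprob(oov_class_logprob, vocab_size, dub)
        ppl = perplexity(in_vocab_logprobs + [oov])
        print(f"dub={dub}: perplexity={ppl:.1f}")
```

Under this assumption, a larger dub makes each OOV token less probable and raises the perplexity of any test set that contains OOV words, so two LMs scored with different dub values are not being measured on the same scale.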
