Hi Pratyush,

Thanks for the hint. That solved the problem I had with the arpa files when using -lm=msb and KenLM. Unfortunately, this does not seem to improve performance of IRSTLM much when compared to SRILM. So I guess I will have to stick with SRILM for now.

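For anyone hitting the same KenLM complaint, here is a minimal sketch of how one might check an arpa file for the inconsistency discussed in this thread. It assumes the complaint means that some n-gram's context (its first n-1 words) is not listed among the (n-1)-grams, and that the file follows the usual ARPA layout with "\N-grams:" sections in increasing order. The function name find_missing_contexts is made up for illustration and is not part of KenLM, IRSTLM or SRILM.

```python
from collections import defaultdict


def find_missing_contexts(arpa_lines):
    """Return (order, ngram) pairs whose (order-1)-word context is not listed.

    Assumption: sections appear in increasing order, as in a standard ARPA file.
    """
    seen = defaultdict(set)  # order -> set of n-grams (as word tuples)
    missing = []
    order = 0  # 0 means "not inside an n-gram section"
    for raw in arpa_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Section headers look like "\3-grams:"; "\data\" and "\end\" reset.
            if line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
            else:
                order = 0
            continue
        if order == 0:
            continue  # "ngram N=count" lines inside \data\
        # Entry layout: logprob, then the n words, then an optional backoff weight.
        fields = line.split()
        words = tuple(fields[1:1 + order])
        if len(words) < order:
            continue  # malformed entry, skipped in this sketch
        seen[order].add(words)
        if order > 1 and words[:-1] not in seen[order - 1]:
            missing.append((order, " ".join(words)))
    return missing


# Hypothetical usage on the file name used in this thread:
# with open("lm.en.arpa", encoding="utf-8") as f:
#     print(find_missing_contexts(f)[:10])
```

Note that this keeps every n-gram in memory, so it is only practical on small test models, not on arpa files in the 20 GB range.
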
Kenneth, weren't you working on your own tool to produce language models?

Best,
Marcin

On 07.11.2012 11:18, Pratyush Banerjee wrote:
> Hi Marcin,
>
> I have used msb with irstlm... but seems to have worked fine for me...
>
> You mentioned faulty arpa files for 5-grams... is it because KenLM
> complains of missing 4-grams, 3-grams etc ?
> Have you tried using -ps=no option with tlm ?
>
> IRSTLM is known to prune singleton n-grams in order to reduce the
> size of the LM... (tlm has it on by default..)
>
> If you use this option, usually KenLM does not complain... I have also
> used such LMs with SRILM for further mixing and it went fine...
>
> I am sure somebody from the IRSTLM community could confirm this...
>
> Hope this resolves the issue...
>
> Thanks and Regards,
>
> Pratyush
>
>
> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
> <[email protected]> wrote:
>
>     On the irstlm page it says:
>
>     'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>
>     Unfortunately I cannot use "msb" because it seems to produce
>     faulty arpa files for 5-grams. So I am trying only "shift-beta"
>     whatever that means.
>     Maybe that's the main problem?
>     Also, my data sets are not that small, the plain arpa files currently
>     exceed 20 GB.
>
>     Best,
>     Marcin
>
>     On 06.11.2012 22:15, Jonathan Clark wrote:
> > As far as I know, exact modified Kneser-Ney smoothing (the current
> > state of the art) is not supported by IRSTLM. IRSTLM instead
> > implements modified shift-beta smoothing, which isn't quite as
> > effective -- especially on smaller data sets.
> >
> > Cheers,
> > Jon
> >
> >
> > On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
> > <[email protected]> wrote:
> >> Hi,
> >> Slightly off-topic, but I am out of ideas. I am trying to figure out
> >> what set of parameters I have to use with IRSTLM to create LMs that are
> >> equivalent to language models created with SRILM using the following
> >> command:
> >>
> >> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
> >> input.en -lm lm.en.arpa
> >>
> >> Up to now, I am using this chain of commands for IRSTLM:
> >>
> >> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
> >> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
> >> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
> >>
> >> I know this is not quite the same, but it comes closest in terms of
> >> quality and size. The translation results, however, are still
> >> consistently worse than with SRILM models, differences in BLEU are up to
> >> 1%.
> >>
> >> I use KenLM with Moses to binarize the resulting arpa files, so this is
> >> not a code issue.
> >>
> >> Also it seems IRSTLM has a bug with the modified shift beta option. At
> >> least KenLM complains that not all 4-grams are present although there
> >> are 5-grams that contain them.
> >>
> >> Any ideas?
> >> Thanks,
> >> Marcin
> >> _______________________________________________
> >> Moses-support mailing list
> >> [email protected]
> >> http://mailman.mit.edu/mailman/listinfo/moses-support
