Hi Pratyush,
Thanks for the hint. That solved the problem I had with the arpa files 
when using -lm=msb and KenLM. Unfortunately, this does not seem to 
improve IRSTLM's performance much compared to SRILM, so I guess I 
will have to stick with SRILM for now.
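For the archives, the chain from my earlier mail with -lm=msb and your 
-ps=no suggestion would look roughly like this (same file names as in 
my example below; only tested on my own data, so treat it as a sketch):

```shell
# add sentence boundary markers (as in my earlier mail)
perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb

# build the binary 5-gram table
ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin

# estimate with modified shift-beta; -ps=no keeps singleton n-grams
# so KenLM sees every lower-order entry it expects
tlm -tr=lm.en.bin -lm=msb -ps=no -bo=yes -n=5 -o=lm.en.arpa
```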

Kenneth, weren't you working on your own tool to produce language models?
Best,
Marcin

On 07.11.2012 11:18, Pratyush Banerjee wrote:
> Hi Marcin,
>
> I have used msb with IRSTLM, and it seems to have worked fine for me.
>
> You mentioned faulty arpa files for 5-grams. Is that because KenLM 
> complains of missing 4-grams, 3-grams, etc.?
> Have you tried the -ps=no option with tlm?
>
> IRSTLM is known to prune singleton n-grams in order to reduce the 
> size of the LM (tlm has this on by default).
>
> If you use this option, KenLM usually does not complain. I have also 
> used such LMs with SRILM for further mixing and it went fine.
>
> I am sure somebody from the IRSTLM community could confirm this.
>
> Hope this resolves the issue.
>
> Thanks and Regards,
>
> Pratyush
>
>
> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt 
> <[email protected] <mailto:[email protected]>> wrote:
>
>     On the irstlm page it says:
>
>     'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>
>     Unfortunately I cannot use "msb" because it seems to produce
>     faulty arpa files for 5-grams, so I am trying plain "shift-beta",
>     whatever that means. Maybe that's the main problem?
>     Also, my data sets are not that small; the plain arpa files
>     currently exceed 20 GB.
>
>     Best,
>     Marcin
>
>     On 06.11.2012 22:15, Jonathan Clark wrote:
>     > As far as I know, exact modified Kneser-Ney smoothing (the current
>     > state of the art) is not supported by IRSTLM. IRSTLM instead
>     > implements modified shift-beta smoothing, which isn't quite as
>     > effective -- especially on smaller data sets.
>     >
>     > Cheers,
>     > Jon
>     >
>     >
>     > On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>     > <[email protected] <mailto:[email protected]>> wrote:
>     >> Hi,
>     >> Slightly off-topic, but I am out of ideas. I am trying to
>     >> figure out what set of parameters I have to use with IRSTLM
>     >> to create LMs that are equivalent to language models created
>     >> with SRILM using the following command:
>     >>
>     >> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>     >> input.en -lm lm.en.arpa
>     >>
>     >> Up to now, I am using this chain of commands for IRSTLM:
>     >>
>     >> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>     >> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>     >> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>     >>
>     >> I know this is not quite the same, but it comes closest in
>     >> terms of quality and size. The translation results, however,
>     >> are still consistently worse than with SRILM models; the
>     >> differences in BLEU are up to 1%.
>     >>
>     >> I use KenLM with Moses to binarize the resulting arpa files, so
>     this is
>     >> not a code issue.
>     >>
>     >> Also, it seems IRSTLM has a bug with the modified shift-beta
>     >> option: KenLM complains that not all 4-grams are present,
>     >> although there are 5-grams that contain them.
>     >>
>     >> Any ideas?
>     >> Thanks,
>     >> Marcin
>     >> _______________________________________________
>     >> Moses-support mailing list
>     >> [email protected] <mailto:[email protected]>
>     >> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>

