Hi,

I use the <unk> probability. I dare say that either way is wrong for MT,
and people should be using the LM OOV feature.
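
To make the distinction concrete, here is a minimal sketch in Python (not
Moses or KenLM code). It assumes a toy unigram table, and the names
`score_sentence`, `model_score`, `w_lm` and `w_oov` are made up for
illustration. The idea: whatever probability a toolkit assigns to <unk> is
folded into the LM score, while an OOV count kept as a separate feature lets
tuning decide how much an unknown word should cost.

```python
def score_sentence(tokens, logprob, unk_logprob):
    """Return (lm_score, oov_count) under a toy unigram model.

    logprob: dict mapping word -> log10 probability (hypothetical toy model)
    unk_logprob: log10 probability the model assigns to <unk>
    """
    lm_score = 0.0
    oov_count = 0
    for tok in tokens:
        if tok in logprob:
            lm_score += logprob[tok]
        else:
            # Unknown word: the LM falls back to its <unk> probability,
            # and we also count it for the separate OOV feature.
            lm_score += unk_logprob
            oov_count += 1
    return lm_score, oov_count


def model_score(lm_score, oov_count, w_lm, w_oov):
    """Log-linear combination of the LM score and the OOV count feature."""
    return w_lm * lm_score + w_oov * oov_count


toy = {"the": -1.0, "cat": -2.0, "sat": -2.5}
sentence = "the cat sat on the mat".split()

# Two toy models that differ only in how much mass they reserve for <unk>.
score_a, oov_a = score_sentence(sentence, toy, unk_logprob=-3.0)
score_b, oov_b = score_sentence(sentence, toy, unk_logprob=-7.0)

print(score_a, oov_a)
print(score_b, oov_b)
# With a separately weighted OOV feature the gap can be absorbed by tuning.
print(model_score(score_a, oov_a, 1.0, 0.0), model_score(score_b, oov_b, 1.0, 4.0))
```

In this toy case the two models differ only in the <unk> probability, and a
suitable OOV weight makes their totals agree, which is why the separate
feature is less sensitive to how each toolkit reserves OOV mass.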

Kenneth
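
On the KenLM complaint about missing 4-grams quoted below: a rough way to
look for the problem on a small ARPA file is to check that the context (the
first n-1 words) of every n-gram is listed among the (n-1)-grams. The sketch
below is a hypothetical helper, `missing_contexts`, not KenLM's actual check,
and it holds every n-gram in memory, so it is only meant for small test
models, not 20 GB files.

```python
def missing_contexts(arpa_lines):
    """Return (order, ngram) pairs whose (order-1)-word context is not listed.

    Assumes the usual ARPA layout: sections "\\1-grams:", "\\2-grams:", ...
    in ascending order, each entry being
    "logprob  w1 ... wn  [backoff]" separated by whitespace.
    """
    seen = {}      # order -> set of n-gram tuples read so far
    order = 0      # 0 while we are still in the \data\ header
    missing = []
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            seen[order] = set()
            continue
        if order == 0 or not line or line == "\\end\\":
            continue
        fields = line.split()
        words = tuple(fields[1:1 + order])   # skip logprob, drop backoff
        seen[order].add(words)
        if order > 1 and words[:-1] not in seen.get(order - 1, set()):
            missing.append((order, words))
    return missing


# Tiny hand-made example (real ARPA files use tabs; spaces work here too).
sample = r"""\data\
ngram 1=3
ngram 2=2
ngram 3=2

\1-grams:
-1.0 <s> -0.5
-1.0 a -0.5
-1.0 b -0.5

\2-grams:
-0.5 <s> a -0.3
-0.5 a b

\3-grams:
-0.2 <s> a b
-0.2 b a b

\end\
"""

for n, ngram in missing_contexts(sample.splitlines()):
    print(n, " ".join(ngram))
```

For a real file you would pass an open file handle instead of
`sample.splitlines()`; the trigram "b a b" is reported here because the
bigram "b a" was never listed.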

On 11/08/12 08:51, Nick Ruiz wrote:
> Hi Marcin,
>
> Have you done any perplexity tests on your trained LMs? For example, can
> you compute the perplexity on your evaluation set using IRSTLM and also
> using SRILM and compare the results? Also, keep in mind that IRSTLM
> reserves out-of-vocabulary probabilities based on the predefined
> vocabulary size of the LM. This is done using the `dub` parameter. I
> typically only use IRSTLM in Moses, so I'm not sure if the `dictionary
> upper-bound` could be something missing from KenLM that could cause a
> performance hit on IRSTLM-trained models. Just a guess.
>
> Best,
> Nick
>
>
> On 11/08/2012 08:22 AM, Marcin Junczys-Dowmunt wrote:
>> Hi Pratyush,
>> Thanks for the hint. That solved the problem I had with the arpa files
>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>> improve performance of IRSTLM much when compared to SRILM. So I guess I
>> will have to stick with SRILM for now.
>>
>> Kenneth, weren't you working on your own tool to produce language models?
>> Best,
>> Marcin
>>
>> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>>> Hi Marcin,
>>>
>>> I have used msb with IRSTLM, and it seems to have worked fine for me.
>>>
>>> You mentioned faulty arpa files for 5-grams. Is it because KenLM
>>> complains of missing 4-grams, 3-grams, etc.?
>>> Have you tried using the -ps=no option with tlm?
>>>
>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>> size of the LM (tlm has this on by default).
>>>
>>> If you use this option, KenLM usually does not complain. I have also
>>> used such LMs with SRILM for further mixing and it went fine.
>>>
>>> I am sure somebody from the IRSTLM community could confirm this...
>>>
>>> Hope this resolves the issue...
>>>
>>> Thanks and Regards,
>>>
>>> Pratyush
>>>
>>>
>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>> <[email protected]> wrote:
>>>
>>>       On the irstlm page it says:
>>>
>>>       'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>>>
>>>       Unfortunately I cannot use "msb" because it seems to produce
>>>       faulty arpa files for 5-grams. So I am trying only "shift-beta",
>>>       whatever that means. Maybe that's the main problem?
>>>       Also, my data sets are not that small; the plain arpa files
>>>       currently exceed 20 GB.
>>>
>>>       Best,
>>>       Marcin
>>>
>>>       On 06.11.2012 22:15, Jonathan Clark wrote:
>>>       >   As far as I know, exact modified Kneser-Ney smoothing (the current
>>>       >   state of the art) is not supported by IRSTLM. IRSTLM instead
>>>       >   implements modified shift-beta smoothing, which isn't quite as
>>>       >   effective -- especially on smaller data sets.
>>>       >
>>>       >   Cheers,
>>>       >   Jon
>>>       >
>>>       >
>>>       >   On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>>       >   <[email protected]> wrote:
>>>       >>   Hi,
>>>       >>   Slightly off-topic, but I am out of ideas. I am trying to figure
>>>       >>   out what set of parameters I have to use with IRSTLM to create
>>>       >>   LMs that are equivalent to language models created with SRILM
>>>       >>   using the following command:
>>>       >>
>>>       >>   (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>       >>   input.en -lm lm.en.arpa
>>>       >>
>>>       >>   Up to now, I am using this chain of commands for IRSTLM:
>>>       >>
>>>       >>   perl -C -pe 'chomp; $_ = "<s>   $_</s>\n"' < input.en > input.en.sb
>>>       >>   ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>       >>   tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>       >>
>>>       >>   I know this is not quite the same, but it comes closest in terms
>>>       >>   of quality and size. The translation results, however, are still
>>>       >>   consistently worse than with SRILM models; differences in BLEU
>>>       >>   are up to 1%.
>>>       >>
>>>       >>   I use KenLM with Moses to binarize the resulting arpa files, so
>>>       >>   this is not a code issue.
>>>       >>
>>>       >>   Also, it seems IRSTLM has a bug with the modified shift-beta
>>>       >>   option. At least KenLM complains that not all 4-grams are present
>>>       >>   although there are 5-grams that contain them.
>>>       >>
>>>       >>   Any ideas?
>>>       >>   Thanks,
>>>       >>   Marcin
>>>       >>   _______________________________________________
>>>       >>   Moses-support mailing list
>>>       >>   [email protected]
>>>       >>   http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>>
>>
>
