Hi,
I use the <unk> probability. I dare say that either way is wrong for MT,
and people should be using the LM OOV feature,
Kenneth
On 11/08/12 08:51, Nick Ruiz wrote:
> Hi Marcin,
>
> Have you done any perplexity tests on your trained LMs? For example, can
> you compute the perplexity on your evaluation set using IRSTLM and also
> using SRILM and compare the results? Also, keep in mind that IRSTLM
> reserves out-of-vocabulary probabilities based on the predefined
> vocabulary size of the LM. This is done using the `dub` parameter. I
> typically only use IRSTLM in Moses, so I'm not sure whether the missing
> `dictionary upper-bound` in KenLM could cause a performance hit on
> IRSTLM-trained models. Just a guess.
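[Nick's perplexity check could be sketched roughly as follows, assuming an LM file lm.en.arpa and a held-out set eval.en; the file names are hypothetical and exact flag spellings vary across SRILM/IRSTLM versions, so treat this as a sketch rather than a tested recipe:

```shell
# SRILM: perplexity of eval.en under the trained LM
ngram -order 5 -unk -lm lm.en.arpa -ppl eval.en

# IRSTLM: same evaluation, with a 10M-word dictionary upper bound (dub)
compile-lm lm.en.arpa --eval=eval.en --dub=10000000
```

If the two toolkits report very different perplexities on the same data, the gap is in the models themselves rather than in Moses/KenLM.]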
>
> Best,
> Nick
>
>
> On 11/08/2012 08:22 AM, Marcin Junczys-Dowmunt wrote:
>> Hi Pratyush,
>> Thanks for the hint. That solved the problem I had with the arpa files
>> when using -lm=msb and KenLM. Unfortunately, this does not seem to
>> improve performance of IRSTLM much when compared to SRILM. So I guess I
>> will have to stick with SRILM for now.
>>
>> Kenneth, weren't you working on your own tool to produce language models?
>> Best,
>> Marcin
>>
>> W dniu 07.11.2012 11:18, Pratyush Banerjee pisze:
>>> Hi Marcin,
>>>
>>> I have used msb with IRSTLM... it seems to have worked fine for me...
>>>
>>> You mentioned faulty arpa files for 5-grams... is it because KenLM
>>> complains of missing 4-grams, 3-grams etc ?
>>> Have you tried using -ps=no option with tlm ?
>>>
>>> IRSTLM is known to prune singleton n-grams in order to reduce the
>>> size of the LM... (tlm has this on by default...)
>>>
>>> If you use this option, usually KenLM does not complain... I have also
>>> used such LMs with SRILM for further mixing and it went fine...
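[Pratyush's suggestion, applied to the tlm step quoted later in this thread, would look roughly like this; the file names are taken from Marcin's commands and the -ps=no flag from this message, so consider it a sketch:

```shell
# Rebuild the LM with singleton pruning disabled (-ps=no), so every
# lower-order n-gram survives into the arpa file and KenLM finds the
# 4-grams, 3-grams etc. that it expects:
tlm -tr=lm.en.bin -lm=msb -bo=yes -n=5 -ps=no -o=lm.en.arpa
```
]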
>>>
>>> I am sure somebody from the IRSTLM community could confirm this...
>>>
>>> Hope this resolves the issue...
>>>
>>> Thanks and Regards,
>>>
>>> Pratyush
>>>
>>>
>>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt
>>> <[email protected]> wrote:
>>>
>>> On the irstlm page it says:
>>>
>>> 'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>>>
>>> Unfortunately I cannot use "msb" because it seems to produce faulty
>>> arpa files for 5-grams. So I am trying only "shift-beta", whatever
>>> that means. Maybe that's the main problem?
>>> Also, my data sets are not that small, the plain arpa files currently
>>> exceed 20 GB.
>>>
>>> Best,
>>> Marcin
>>>
>>> W dniu 06.11.2012 22:15, Jonathan Clark pisze:
>>> > As far as I know, exact modified Kneser-Ney smoothing (the current
>>> > state of the art) is not supported by IRSTLM. IRSTLM instead
>>> > implements modified shift-beta smoothing, which isn't quite as
>>> > effective -- especially on smaller data sets.
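[For context: the "modified" part of modified Kneser-Ney refers to using three separate discounts, estimated from the counts-of-counts n1..n4 of the training data (Chen & Goodman, 1998), rather than a single discount:

```latex
Y = \frac{n_1}{n_1 + 2 n_2}, \qquad
D_1 = 1 - 2Y\,\frac{n_2}{n_1}, \quad
D_2 = 2 - 3Y\,\frac{n_3}{n_2}, \quad
D_{3+} = 3 - 4Y\,\frac{n_4}{n_3}
```

where $D_1$, $D_2$, and $D_{3+}$ are applied to n-grams with count 1, 2, and 3 or more, respectively.]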
>>> >
>>> > Cheers,
>>> > Jon
>>> >
>>> >
>>> > On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>> > <[email protected]> wrote:
>>> >> Hi,
>>> >> Slightly off-topic, but I am out of ideas. I am trying to figure
>>> >> out what set of parameters I have to use with IRSTLM to create LMs
>>> >> that are equivalent to language models created with SRILM using
>>> >> the following command:
>>> >>
>>> >> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>> >> input.en -lm lm.en.arpa
>>> >>
>>> >> Up to now, I am using this chain of commands for IRSTLM:
>>> >>
>>> >> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>> >> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>> >> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>> >>
>>> >> I know this is not quite the same, but it comes closest in terms
>>> >> of quality and size. The translation results, however, are still
>>> >> consistently worse than with SRILM models; differences in BLEU are
>>> >> up to 1%.
>>> >>
>>> >> I use KenLM with Moses to binarize the resulting arpa files, so
>>> >> this is not a code issue.
>>> >>
>>> >> Also, it seems IRSTLM has a bug with the modified shift-beta
>>> >> option. At least KenLM complains that not all 4-grams are present
>>> >> although there are 5-grams that contain them.
>>> >>
>>> >> Any ideas?
>>> >> Thanks,
>>> >> Marcin
>>> >> _______________________________________________
>>> >> Moses-support mailing list
>>> >> [email protected]
>>> >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>