From the FBK community...

As already mentioned by Ken,

tlm correctly computes the "improved Kneser-Ney" method (-lm=msb).

tlm can keep singletons: set the parameter -ps=no.
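Putting those two flags together, a minimal sketch of the estimation pipeline (file names are placeholders; input.en.sb is assumed to be the corpus with sentence boundaries added) could be:

```shell
# Build the 5-gram table from the boundary-marked corpus
ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin

# Estimate with improved Kneser-Ney (-lm=msb), keep singletons (-ps=no),
# and write the result in ARPA format
tlm -tr=lm.en.bin -n=5 -lm=msb -ps=no -bo=yes -o=lm.en.arpa
```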

As for OOV words, tlm computes the probability of the OOV token as if it were a 
class of all possible unknown words.
To get the actual probability of a single OOV token, tlm requires that a 
Dictionary Upper Bound be set.
The Dictionary Upper Bound is intended as a rough estimate of the dictionary 
size (a reasonable value is 1e+7, which is also the default).
Note that using the same Dictionary Upper Bound (dub) value is 
mandatory to properly compare different LMs in terms of perplexity.
Moreover, note that the dub value is not stored in the saved LM.

In IRSTLM, you set this value with the parameter -dub when you compute the 
perplexity, with either tlm or compile-lm.
In Moses, you set this parameter with "-lmodel-dub".
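For instance (a sketch only; the exact option spellings may vary across IRSTLM versions, so check the usage output of tlm and compile-lm), perplexity evaluation with an explicit dub could look like:

```shell
# Perplexity with tlm: -te names the test file, -dub the dictionary upper bound
tlm -tr=lm.en.bin -n=5 -lm=msb -te=test.en.sb -dub=10000000

# Perplexity with compile-lm on the saved ARPA file, using the same dub value
compile-lm --eval=test.en.sb --dub=10000000 lm.en.arpa
```

Keeping the dub identical across the two runs (and across any LMs being compared) is what makes the perplexity figures comparable.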

Remember that you can use an LM estimated with the IRSTLM toolkit directly in 
Moses by setting the first field of the "-lmodel-file" parameter to "1",
without converting it with build-binary.
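As a hypothetical moses.ini fragment (the factor and order values here are illustrative), the fields are type, factor, order, and file name, with type 1 selecting IRSTLM:

```ini
[lmodel-file]
1 0 5 /path/to/lm.en.arpa
```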


As for the differences between IRSTLM and SRILM: they should not be there.
Do you also notice a difference in perplexity?
If so, please send us a small benchmark (data and the commands used) in which 
you observe the difference, so that we can debug it.



Nicola


On Nov 8, 2012, at 8:22 AM, Marcin Junczys-Dowmunt wrote:

> Hi Pratyush,
> Thanks for the hint. That solved the problem I had with the arpa files 
> when using -lm=msb and KenLM. Unfortunately, this does not seem to 
> improve performance of IRSTLM much when compared to SRILM. So I guess I 
> will have to stick with SRILM for now.
> 
> Kenneth, weren't you working on your own tool to produce language models?
> Best,
> Marcin
> 
> On 07.11.2012 11:18, Pratyush Banerjee wrote:
>> Hi Marcin,
>> 
>> I have used msb with irstlm... but seems to have worked fine for me...
>> 
>> You mentioned faulty arpa files for 5-grams... is it because KenLM 
>> complains of missing 4-grams, 3-grams etc ?
>> Have you tried using -ps=no option with tlm ?
>> 
>> IRSTLM is known to prune singleton n-grams in order to reduce the 
>> size of the LM... (tlm has it on by default..)
>> 
>> If you use this option, usually KenLM does not complain... I have also 
>> used such LMs with SRILM for further mixing and it went fine...
>> 
>> I am sure somebody from the IRSTLM community could confirm this...
>> 
>> Hope this resolves the issue...
>> 
>> Thanks and Regards,
>> 
>> Pratyush
>> 
>> 
>> On Tue, Nov 6, 2012 at 9:26 PM, Marcin Junczys-Dowmunt 
>> <[email protected] <mailto:[email protected]>> wrote:
>> 
>>    On the irstlm page it says:
>> 
>>    'Modified shift-beta, also known as “improved kneser-ney smoothing”'
>> 
>>    Unfortunately I cannot use "msb" because it seems to produce
>>    faulty arpa
>>    files for 5-grams. So I am trying only "shift-beta" whatever that
>>    means.
>>    Maybe that's the main problem?
>>    Also, my data sets are not that small, the plain arpa files currently
>>    exceed 20 GB.
>> 
>>    Best,
>>    Marcin
>> 
>>    On 06.11.2012 22:15, Jonathan Clark wrote:
>>> As far as I know, exact modified Kneser-Ney smoothing (the current
>>> state of the art) is not supported by IRSTLM. IRSTLM instead
>>> implements modified shift-beta smoothing, which isn't quite as
>>> effective -- especially on smaller data sets.
>>> 
>>> Cheers,
>>> Jon
>>> 
>>> 
>>> On Tue, Nov 6, 2012 at 1:08 PM, Marcin Junczys-Dowmunt
>>> <[email protected] <mailto:[email protected]>> wrote:
>>>> Hi,
>>>> Slightly off-topic, but I am out of ideas. I am trying to figure out
>>>> what set of parameters I have to use with IRSTLM to create LMs that are
>>>> equivalent to language models created with SRILM using the following
>>>> command:
>>>> 
>>>> (SRILM:) ngram-count -order 5 -unk -interpolate -kndiscount -text
>>>> input.en -lm lm.en.arpa
>>>> 
>>>> Up to now, I am using this chain of commands for IRSTLM:
>>>> 
>>>> perl -C -pe 'chomp; $_ = "<s> $_ </s>\n"' < input.en > input.en.sb
>>>> ngt -i=input.en.sb -n=5 -b=yes -o=lm.en.bin
>>>> tlm -tr=lm.en.bin -lm=sb -bo=yes -n=5 -o=lm.en.arpa
>>>> 
>>>> I know this is not quite the same, but it comes closest in terms of
>>>> quality and size. The translation results, however, are still
>>>> consistently worse than with SRILM models, differences in BLEU are up to
>>>> 1%.
>>>> 
>>>> I use KenLM with Moses to binarize the resulting arpa files, so this is
>>>> not a code issue.
>>>> 
>>>> Also it seems IRSTLM has a bug with the modified shift beta option. At
>>>> least KenLM complains that not all 4-grams are present although there
>>>> are 5-grams that contain them.
>>>> 
>>>> Any ideas?
>>>> Thanks,
>>>> Marcin
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected] <mailto:[email protected]>
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>> 
>> 
>> 
> 
> 


