Dear Sylvain,

I am starting to answer the questions in this thread.
- The most recent release of IRSTLM is 5.70.04, and it can be downloaded from SourceForge.

- The IRSTLM user guide can be found on the SourceForge website: https://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Main_Page We try to keep it as up to date as possible, and your suggestions for improving it are welcome.

- By default, tlm prunes n-gram singletons of order 3 or higher. To disable singleton pruning, use the parameter "-PruneSingletons=no" (or its short form "-ps=no"). Note that, for historical reasons, singleton pruning is off by default if you use "build-lm.sh" to build an LM; in that case, use "-p" to enable it.

- As for the original problem, it is not clear to me whether the 4-gram "to support them ." is present in the LM built with the IRSTLM tlm command. I would be glad to debug this if you could send me the input text you train the model on. In general, the Modified Shift Beta smoothing approach can behave oddly when the training data are scarce, and in that case a less sophisticated but more robust smoothing approach, such as Shift Beta or even Witten-Bell, is recommended.

- As for Ken's question, I have to double-check with the other developers; I will get back to you very soon.

best,
Nicola

On Feb 16, 2012, at 6:23 PM, Sylvain Raybaud wrote:

Hi,

No, I haven't turned on pruning. I looked through the IRSTLM manual to see whether it is on by default, but I couldn't find that information (and I couldn't find an up-to-date manual either, only one for version 5.60.something). Since it seems to depend on the smoothing method, maybe msb turns it on but sb does not?

The solution you propose would indeed make me happy :) Actually, I just need it to run with Moses and yield acceptable performance to be happy. I can even live with -lm=sb, since finding the best LM parameters isn't the core of my research :)

Thanks for your reply!

cheers,
Sylvain

On 16/02/12 17:46, Kenneth Heafield wrote:

Hi,

This is hopefully a stupid question. Did you turn on pruning?
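For reference, the default tlm behavior Nicola describes (dropping n-gram singletons of order 3 and higher) can be sketched in a few lines. This is only a conceptual illustration of the stated policy, not IRSTLM's actual implementation:

```python
# Conceptual sketch of the singleton-pruning policy described above:
# n-grams of order >= 3 that occur exactly once are dropped.
# This is NOT IRSTLM code, only an illustration of the default behavior.

def prune_singletons(counts, min_order=3):
    """counts: dict mapping an n-gram (tuple of words) to its count."""
    return {
        ngram: c
        for ngram, c in counts.items()
        if c > 1 or len(ngram) < min_order
    }

counts = {
    ("to",): 40,
    ("to", "support"): 2,
    ("to", "support", "them"): 1,  # trigram singleton: pruned by default
    ("to", "deal", "with"): 3,     # trigram seen three times: kept
}
pruned = prune_singletons(counts)
```

With "-ps=no" the singleton trigram would be kept instead; here it is removed, which is exactly how a higher-order entry can lose its lower-order prefix.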
I don't see it in the command line: "tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm". Or did IRSTLM make pruning the default in new releases? KenLM should accept pruned models, and I take responsibility for that. But I am also confused as to how "to support them" could be missing if pruning was off.

Kenneth

On 02/16/2012 10:16 AM, Kenneth Heafield wrote:

Hi,

Interesting. The only other person to run into this is David Chiang, who had some custom software to prune/build models. I have been requiring that property to make right-state minimization work correctly: if the model doesn't contain "to support them", then the right state contains at most "support them", rendering "to support them ." inaccessible.

I could reinsert "to support them" when this happens, with p(to support them) = b(to support)p(support them) and b(to support them) = 0. It's a bit of a pain to do this correctly. Would you be happy if only the default probing model supported it, while the trie continued to throw an error message?

The ARPA standard, to the extent that there is one, does not require this behavior, so IRSTLM is within its rights to prune these entries. Nicola, how does IRSTLM handle these cases at inference time?

Kenneth

On 02/16/2012 07:59 AM, Sylvain Raybaud wrote:

Hi,

LM stuff again! I've created a language model with IRSTLM (release 5.70.04):

tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm

When I specify type 1 (IRSTLM) in moses.ini it loads fine. But if I try to load it with KenLM I get:

The context of every 4-gram should appear as a 3-gram Byte: 471440 File: /global/markov/raybauds/DATA/TOY/toy.en.n5.lm

Byte 471440 seems to be the '\n' between the following lines:

-1.16894 to support them . -0.0679314
-0.836008 to deal with hamas

As a matter of fact, "to support them" does not appear as a trigram in the model. If I remove this 4-gram, the same problem arises with another one whose 3-gram prefix is also missing. I think that is the problem.
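The property KenLM is complaining about can be checked independently of the toolkit: for every n-gram with n >= 2, its context (the first n-1 words) must itself be listed as an (n-1)-gram. A minimal checker along those lines is sketched below; it is not KenLM's actual code, and the tiny model is invented to mimic the pruned trigram in Sylvain's report:

```python
# Minimal check of the property KenLM enforces: the context of every
# n-gram (its first n-1 words) must appear as an (n-1)-gram.
# This is a sketch, not KenLM's implementation; the model is made up.

def find_missing_contexts(ngrams_by_order):
    """ngrams_by_order: dict mapping order -> set of n-grams (word tuples)."""
    missing = []
    for order, ngrams in sorted(ngrams_by_order.items()):
        if order < 2:
            continue
        lower = ngrams_by_order.get(order - 1, set())
        for ngram in sorted(ngrams):
            context = ngram[:-1]
            if context not in lower:
                missing.append(ngram)
    return missing

model = {
    1: {("to",), ("support",), ("them",), (".",)},
    2: {("to", "support"), ("support", "them"), ("them", "."), ("to", "deal")},
    3: {("to", "deal", "with")},          # "to support them" was pruned away
    4: {("to", "support", "them", ".")},  # so this 4-gram's context is missing
}
bad = find_missing_contexts(model)  # -> [("to", "support", "them", ".")]
```

The flagged 4-gram is exactly the situation in the ARPA excerpt above: the entry survives pruning while its trigram context does not.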
If I change the smoothing method to "sb" instead of "msb" I get a usable LM. Is this normal behavior? Do you think it's a KenLM or an IRSTLM related problem?

cheers,
--
Sylvain Raybaud

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
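Kenneth's proposed fix, p(to support them) = b(to support)p(support them) with b(to support them) = 0, is simple arithmetic in the ARPA file's log10 domain, where a product of probabilities becomes a sum of logs. The sketch below assumes his b(...) = 0 means an ARPA log10 backoff of 0.0 (i.e. backoff weight 1); the numeric values are invented for illustration and do not come from Sylvain's model:

```python
# Sketch of the reinsertion rule proposed above, in ARPA log10 space:
#   p(w1 w2 w3) = b(w1 w2) * p(w2 w3)   =>   log p = log b + log p
#   b(w1 w2 w3) = 0 is read here as log10 backoff 0.0 (weight 1);
#   this reading is an assumption, as is every number below.

log_b_to_support = -0.30    # hypothetical log10 backoff of "to support"
log_p_support_them = -1.20  # hypothetical log10 prob of "support them"

# Reinserted trigram entry:
log_p_reinserted = log_b_to_support + log_p_support_them  # log10 of product
log_b_reinserted = 0.0                                    # backoff weight 1

p_reinserted = 10 ** log_p_reinserted                     # back to probability
```

The point of the zero backoff is that the reinserted entry changes nothing except making "to support them ." reachable: extending through it costs the same as backing off would have.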
