Hi Nicola,

Thank you for your answer!
On 17/02/12 09:02, Nicola Bertoldi wrote:
> Dear Sylvain,
>
> I am starting to answer the question in this thread.
>
> - The most recent release of IRSTLM is 5.70.04 and can be downloaded from SourceForge.

that's the one I'm using.

> - The IRSTLM user guide can be found on the SourceForge website:
> https://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Main_Page

thanks! I had missed it.

> We try to keep it updated as much as possible, and your suggestions to improve it are welcome.
>
> - By default tlm performs pruning of n-gram singletons of order larger than or equal to 3.
> To disable singleton pruning, use the parameter "-PruneSingletons=no" (or its short version "-ps=no").
>
> Note that, for historical reasons, singleton pruning is off by default if you use "build-lm.sh" to build a LM.
> To enable it in that case, please use "-p".
>
> - As concerns the original problem, it is not really clear to me whether the 4-gram "to support them ." is present or not in the LM built with the IRSTLM tlm command.
> I am glad to debug this if you could send me the input text you train the model on.
>
> In general, the Modified Shift Beta smoothing approach can have odd behavior if the training data are few, and it is recommended to use a less sophisticated but more robust smoothing approach, like Shift Beta or even Witten-Bell.

Turning off pruning fixes the problem indeed! It's strange, because 'to support them .' appears two times in the corpus:

grep 'to support them .' toy.sent_start_end.en | wc -l
2

and the 4-gram 'to support them .' does appear in the LM.

It's true that the training corpus is very small in this case (1000 sentences): it's a toy corpus I just use during development, but I train LMs with the same parameters I use for real corpora. I haven't tried with a bigger corpus yet.
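[Editor's note: the singleton pruning described above can be sketched roughly as follows. This is a toy illustration of the rule "drop n-grams of order >= 3 that occur exactly once", not IRSTLM's actual code, and the counts are made up; it also shows exactly the situation from this thread, where a 4-gram survives while its 3-gram context is pruned away.]

```python
from collections import Counter

def prune_singletons(ngram_counts, min_order=3):
    """Drop n-grams of order >= min_order that occur exactly once.

    A toy sketch of the singleton pruning tlm applies by default;
    IRSTLM's real implementation is more involved.
    """
    return Counter({ng: c for ng, c in ngram_counts.items()
                    if c > 1 or len(ng) < min_order})

counts = Counter({
    ("to", "support", "them"): 1,       # 3-gram singleton: pruned
    ("to", "support", "them", "."): 2,  # 4-gram with count 2: kept
    ("support", "them"): 1,             # 2-gram singleton: kept (order < 3)
})
pruned = prune_singletons(counts)
```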
You'll find the corpus here: http://perso.crans.org/raybaud/toy.sent_start_end.en.gz

cheers,
Sylvain

> - As concerns Ken's question, I have to double-check with the other developers; I will come back to you very soon.
>
> best,
> Nicola
>
> On Feb 16, 2012, at 6:23 PM, Sylvain Raybaud wrote:
> > Hi
> >
> > No, I haven't turned on pruning. I've been looking in the IRSTLM manual to see whether it is on by default, but I couldn't find the information (and I couldn't find an up-to-date manual either, only one for version 5.60.something).
> >
> > Since it seems to depend on the smoothing method, maybe msb turns it on, but not sb?
> >
> > The solution you propose would indeed make me happy :) Actually, I just need it to run with Moses and yield acceptable performance to be happy. I can even live with -lm=sb, since finding the best LM parameters isn't the core of my research :)
> >
> > thanks for your reply!
> >
> > cheers,
> >
> > Sylvain
> >
> > On 16/02/12 17:46, Kenneth Heafield wrote:
> > > Hi,
> > >
> > > This is hopefully a stupid question. Did you turn on pruning? I don't see it in the command line: "tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm". Or did IRSTLM make pruning the default in new releases?
> > >
> > > KenLM should be accepting pruned models, and I take responsibility for that. But I am also confused as to how "to support them" did not appear if pruning was off.
> > >
> > > Kenneth
> > >
> > > On 02/16/2012 10:16 AM, Kenneth Heafield wrote:
> > > > Hi,
> > > >
> > > > Interesting. The only other person to run into this is David Chiang, who had some custom software to prune/build models.
> > > >
> > > > I have been requiring that property to make right state minimization work correctly: if it doesn't match "to support them", then the right state contains at most "support them", rendering "to support them ." inaccessible. I could reinsert "to support them" when this happens, with p(to support them) = b(to support)p(support them) and b(to support them) = 0.
> > > >
> > > > It's a bit of a pain to do this correctly.
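[Editor's note: Kenneth's reinsertion identity p(to support them) = b(to support) p(support them) can be sketched as below. This assumes the values are log10 quantities as stored in an ARPA file, and that "b(to support them) = 0" means a log10 backoff of 0 (i.e. a backoff weight of 1) for the reinserted entry; the numeric values are made up for illustration, and KenLM's actual handling may differ in detail.]

```python
def reinsert_pruned(logp, backoff, ngram):
    """Reinsert a pruned n-gram via the backoff identity:
    p(w1..wn) = b(w1..wn-1) * p(w2..wn), giving the reinserted entry
    a backoff weight of 1 (log10 backoff 0.0).

    `logp` and `backoff` map word tuples to log10 values, as in an
    ARPA file; a missing backoff entry counts as log10(1) = 0.0.
    """
    prefix, suffix = ngram[:-1], ngram[1:]
    new_logp = backoff.get(prefix, 0.0) + logp[suffix]
    logp[ngram] = new_logp
    backoff[ngram] = 0.0
    return new_logp

# Hypothetical values: b("to support") and p("support them).
logp = {("support", "them"): -1.2}
backoff = {("to", "support"): -0.3}
lp = reinsert_pruned(logp, backoff, ("to", "support", "them"))
# lp is -0.3 + -1.2, i.e. -1.5 up to float rounding
```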
> > > > Would you be happy if only the default probing model supported it, but the trie continued to throw an error message?
> > > >
> > > > The ARPA standard, to the extent that there is one, does not require this behavior, so IRSTLM is within its rights to prune them.
> > > >
> > > > Nicola, how does IRSTLM handle these cases at inference time?
> > > >
> > > > Kenneth
> > > >
> > > > On 02/16/2012 07:59 AM, Sylvain Raybaud wrote:
> > > > > Hi
> > > > >
> > > > > LM stuff again!
> > > > >
> > > > > I've created a language model with IRSTLM (release 5.70.04):
> > > > > tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm
> > > > >
> > > > > When I specify type 1 (IRSTLM) in moses.ini it loads fine. But if I try to load it with KenLM I get:
> > > > >
> > > > > The context of every 4-gram should appear as a 3-gram Byte: 471440 File: /global/markov/raybauds/DATA/TOY/toy.en.n5.lm
> > > > >
> > > > > Byte 471440 seems to be the '\n' between the following lines:
> > > > > -1.16894 to support them . -0.0679314
> > > > > -0.836008 to deal with hamas
> > > > >
> > > > > As a matter of fact, "to support them" does not appear as a trigram in the model. If I remove this 4-gram, the same problem arises with another one whose 3-gram prefix is also missing. I think that is the problem. If I change the smoothing method to "sb" instead of "msb", I get a usable LM. Is this normal behavior? Do you think it's a KenLM or an IRSTLM related problem?
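[Editor's note: the invariant KenLM rejects here — the context (all but the last word) of every n-gram must itself appear as an (n-1)-gram — can be checked with a short script. This is a sketch, not KenLM's code, and assumes the ARPA file has already been parsed into per-order sets of word tuples; the example data mirrors the lines quoted above.]

```python
def check_context_closure(ngrams_by_order):
    """Return the n-grams whose context (all words but the last) is
    missing from the next-lower order -- the condition KenLM reports as
    "The context of every N-gram should appear as an (N-1)-gram".

    `ngrams_by_order` maps order (int) to a set of word tuples.
    Orders whose lower order is absent entirely are skipped.
    """
    bad = []
    for order in sorted(ngrams_by_order):
        if order < 2 or (order - 1) not in ngrams_by_order:
            continue
        lower = ngrams_by_order[order - 1]
        for ng in ngrams_by_order[order]:
            if ng[:-1] not in lower:
                bad.append(ng)
    return bad

# The situation from this thread: the 4-gram "to support them ." is
# present, but its 3-gram context "to support them" was pruned away.
model = {
    3: {("to", "deal", "with")},
    4: {("to", "support", "them", "."), ("to", "deal", "with", "hamas")},
}
violations = check_context_closure(model)
```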
> > > > > cheers,

--
Sylvain Raybaud
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
