Dear Sylvain,

I am starting to answer the questions in this thread.
- The most recent release of IRSTLM is 5.70.04, and it can be downloaded from SourceForge.

- The IRSTLM user guide can be found on the SourceForge website: https://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Main_Page We try to keep it as up to date as possible, and your suggestions for improving it are welcome.

- By default, tlm prunes n-gram singletons of order 3 or higher. To disable singleton pruning, use the parameter "-PruneSingletons=no" (or its short form "-ps=no"). Note that, for historical reasons, singleton pruning is off by default if you use "build-lm.sh" to build an LM; in that case, use "-p" to enable it.

- As for the original problem, it is not clear to me whether the 4-gram "to support them ." is present in the LM built with the IRSTLM tlm command. I would be glad to debug this if you could send me the input text you train the model on. In general, the Modified Shift Beta smoothing approach can behave oddly when the training data are scarce, and in that case a less sophisticated but more robust smoothing approach, such as Shift Beta or even Witten-Bell, is recommended.

- As for Ken's question, I have to double-check with the other developers; I will get back to you very soon.

best,
Nicola

On Feb 16, 2012, at 6:23 PM, Sylvain Raybaud wrote:

Hi,

No, I haven't turned on pruning. I looked through the IRSTLM manual to see whether it is on by default, but I couldn't find that information (and I couldn't find an up-to-date manual either, only one for version 5.60.something). Since it seems to depend on the smoothing method, maybe msb turns it on but sb does not?

The solution you propose would indeed make me happy :) Actually, I just need it to run with Moses and yield acceptable performance to be happy. I can even live with -lm=sb, since finding the best LM parameters isn't the core of my research :)

Thanks for your reply!

cheers,
Sylvain

On 16/02/12 17:46, Kenneth Heafield wrote:

Hi,

This is hopefully a stupid question. Did you turn on pruning?
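For reference, the default tlm behavior Nicola describes (dropping n-gram singletons of order 3 and higher) can be sketched in a few lines. This is only a conceptual illustration of the stated policy, not IRSTLM's actual implementation:

```python
# Conceptual sketch of the singleton-pruning policy described above:
# n-grams of order >= 3 that occur exactly once are dropped.
# This is NOT IRSTLM code, only an illustration of the default behavior.

def prune_singletons(counts, min_order=3):
    """counts: dict mapping an n-gram (tuple of words) to its count."""
    return {
        ngram: c
        for ngram, c in counts.items()
        if c > 1 or len(ngram) < min_order
    }

counts = {
    ("to",): 40,
    ("to", "support"): 2,
    ("to", "support", "them"): 1,  # trigram singleton: pruned by default
    ("to", "deal", "with"): 3,     # trigram seen three times: kept
}
pruned = prune_singletons(counts)
```

With "-ps=no" the singleton trigram would be kept instead; here it is removed, which is exactly how a higher-order entry can lose its lower-order prefix.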
I don't see it in the command line: "tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm". Or did IRSTLM make pruning the default in new releases? KenLM should accept pruned models, and I take responsibility for that. But I am also confused as to how "to support them" could be missing if pruning was off.

Kenneth

On 02/16/2012 10:16 AM, Kenneth Heafield wrote:

Hi,

Interesting. The only other person to run into this is David Chiang, who had some custom software to prune/build models. I have been requiring that property to make right-state minimization work correctly: if the model doesn't contain "to support them", then the right state contains at most "support them", rendering "to support them ." inaccessible.

I could reinsert "to support them" when this happens, with p(to support them) = b(to support)p(support them) and b(to support them) = 0. It's a bit of a pain to do this correctly. Would you be happy if only the default probing model supported it, while the trie continued to throw an error message?

The ARPA standard, to the extent that there is one, does not require this behavior, so IRSTLM is within its rights to prune these entries. Nicola, how does IRSTLM handle these cases at inference time?

Kenneth

On 02/16/2012 07:59 AM, Sylvain Raybaud wrote:

Hi,

LM stuff again! I've created a language model with IRSTLM (release 5.70.04):

tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm

When I specify type 1 (IRSTLM) in moses.ini it loads fine. But if I try to load it with KenLM I get:

The context of every 4-gram should appear as a 3-gram Byte: 471440 File: /global/markov/raybauds/DATA/TOY/toy.en.n5.lm

Byte 471440 seems to be the '\n' between the following lines:

-1.16894 to support them . -0.0679314
-0.836008 to deal with hamas

As a matter of fact, "to support them" does not appear as a trigram in the model. If I remove this 4-gram, the same problem arises with another one whose 3-gram prefix is also missing. I think that is the problem.
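The property KenLM is complaining about can be checked independently of the toolkit: for every n-gram with n >= 2, its context (the first n-1 words) must itself be listed as an (n-1)-gram. A minimal checker along those lines is sketched below; it is not KenLM's actual code, and the tiny model is invented to mimic the pruned trigram in Sylvain's report:

```python
# Minimal check of the property KenLM enforces: the context of every
# n-gram (its first n-1 words) must appear as an (n-1)-gram.
# This is a sketch, not KenLM's implementation; the model is made up.

def find_missing_contexts(ngrams_by_order):
    """ngrams_by_order: dict mapping order -> set of n-grams (word tuples)."""
    missing = []
    for order, ngrams in sorted(ngrams_by_order.items()):
        if order < 2:
            continue
        lower = ngrams_by_order.get(order - 1, set())
        for ngram in sorted(ngrams):
            context = ngram[:-1]
            if context not in lower:
                missing.append(ngram)
    return missing

model = {
    1: {("to",), ("support",), ("them",), (".",)},
    2: {("to", "support"), ("support", "them"), ("them", "."), ("to", "deal")},
    3: {("to", "deal", "with")},          # "to support them" was pruned away
    4: {("to", "support", "them", ".")},  # so this 4-gram's context is missing
}
bad = find_missing_contexts(model)  # -> [("to", "support", "them", ".")]
```

The flagged 4-gram is exactly the situation in the ARPA excerpt above: the entry survives pruning while its trigram context does not.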
If I change the smoothing method to "sb" instead of "msb" I get a usable LM. Is this normal behavior? Do you think it's a KenLM or an IRSTLM related problem?

cheers,
--
Sylvain Raybaud

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
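Kenneth's proposed fix, p(to support them) = b(to support)p(support them) with b(to support them) = 0, is simple arithmetic in the ARPA file's log10 domain, where a product of probabilities becomes a sum of logs. The sketch below assumes his b(...) = 0 means an ARPA log10 backoff of 0.0 (i.e. backoff weight 1); the numeric values are invented for illustration and do not come from Sylvain's model:

```python
# Sketch of the reinsertion rule proposed above, in ARPA log10 space:
#   p(w1 w2 w3) = b(w1 w2) * p(w2 w3)   =>   log p = log b + log p
#   b(w1 w2 w3) = 0 is read here as log10 backoff 0.0 (weight 1);
#   this reading is an assumption, as is every number below.

log_b_to_support = -0.30    # hypothetical log10 backoff of "to support"
log_p_support_them = -1.20  # hypothetical log10 prob of "support them"

# Reinserted trigram entry:
log_p_reinserted = log_b_to_support + log_p_support_them  # log10 of product
log_b_reinserted = 0.0                                    # backoff weight 1

p_reinserted = 10 ** log_p_reinserted                     # back to probability
```

The point of the zero backoff is that the reinserted entry changes nothing except making "to support them ." reachable: extending through it costs the same as backing off would have.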
