Hi Nicola,

Thank you for your answer!
On 17/02/12 09:02, Nicola Bertoldi wrote:
> Dear Sylvain,
>
> I am starting to answer the question in this thread.
>
> - The most recent release of IRSTLM is 5.70.04 and can be downloaded from SourceForge.

that's the one I'm using.

> - The IRSTLM user guide can be found on the SourceForge website:
> https://sourceforge.net/apps/mediawiki/irstlm/index.php?title=Main_Page

thanks! I had missed it.

> We try to keep it updated as much as possible, and your suggestions to improve it are welcome.
>
> - By default tlm performs pruning of n-gram singletons of order larger than or equal to 3.
> To disable singleton pruning, use the parameter "-PruneSingletons=no" (or its short version "-ps=no").
>
> Note that, for historical reasons, singleton pruning is off by default if you use "build-lm.sh" to build a LM.
> To enable it in that case, please use "-p".
>
> - As concerns the original problem, it is not really clear to me whether the 4-gram "to support them ." is present or not in the LM built with the IRSTLM tlm command.
> I am glad to debug this if you could send me the input text you train the model on.
>
> In general, the Modified Shift Beta smoothing approach can have odd behavior if the training data are few, and it is recommended to use a less sophisticated but more robust smoothing approach, like Shift Beta or even Witten-Bell.

Turning off pruning fixes the problem indeed! It's strange, because 'to support them .' appears two times in the corpus:

grep 'to support them .' toy.sent_start_end.en | wc -l
2

and the 4-gram 'to support them .' does appear in the LM.

It's true that the training corpus is very small in this case (1000 sentences): it's a toy corpus I just use during development, but I train LMs with the same parameters I use for real corpora. I haven't tried with a bigger corpus yet.
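[Editor's note: the singleton pruning described above can be sketched roughly as follows. This is a toy illustration of the rule "drop n-grams of order >= 3 that occur exactly once", not IRSTLM's actual code, and the counts are made up; it also shows exactly the situation from this thread, where a 4-gram survives while its 3-gram context is pruned away.]

```python
from collections import Counter

def prune_singletons(ngram_counts, min_order=3):
    """Drop n-grams of order >= min_order that occur exactly once.

    A toy sketch of the singleton pruning tlm applies by default;
    IRSTLM's real implementation is more involved.
    """
    return Counter({ng: c for ng, c in ngram_counts.items()
                    if c > 1 or len(ng) < min_order})

counts = Counter({
    ("to", "support", "them"): 1,       # 3-gram singleton: pruned
    ("to", "support", "them", "."): 2,  # 4-gram with count 2: kept
    ("support", "them"): 1,             # 2-gram singleton: kept (order < 3)
})
pruned = prune_singletons(counts)
```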
You'll find the corpus here: http://perso.crans.org/raybaud/toy.sent_start_end.en.gz

cheers,
Sylvain

> - As concerns Ken's question, I have to double-check with the other developers; I will come back to you very soon.
>
> best,
> Nicola
>
> On Feb 16, 2012, at 6:23 PM, Sylvain Raybaud wrote:
> > Hi
> >
> > No, I haven't turned on pruning. I've been looking in the IRSTLM manual to see whether it is on by default, but I couldn't find the information (and I couldn't find an up-to-date manual either, only one for version 5.60.something).
> >
> > Since it seems to depend on the smoothing method, maybe msb turns it on, but not sb?
> >
> > The solution you propose would indeed make me happy :) Actually, I just need it to run with Moses and yield acceptable performance to be happy. I can even live with -lm=sb, since finding the best LM parameters isn't the core of my research :)
> >
> > thanks for your reply!
> >
> > cheers,
> >
> > Sylvain
> >
> > On 16/02/12 17:46, Kenneth Heafield wrote:
> > > Hi,
> > >
> > > This is hopefully a stupid question. Did you turn on pruning? I don't see it in the command line: "tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm". Or did IRSTLM make pruning the default in new releases?
> > >
> > > KenLM should be accepting pruned models, and I take responsibility for that. But I am also confused as to how "to support them" did not appear if pruning was off.
> > >
> > > Kenneth
> > >
> > > On 02/16/2012 10:16 AM, Kenneth Heafield wrote:
> > > > Hi,
> > > >
> > > > Interesting. The only other person to run into this is David Chiang, who had some custom software to prune/build models.
> > > >
> > > > I have been requiring that property to make right state minimization work correctly: if it doesn't match "to support them", then the right state contains at most "support them", rendering "to support them ." inaccessible. I could reinsert "to support them" when this happens, with p(to support them) = b(to support)p(support them) and b(to support them) = 0.
> > > >
> > > > It's a bit of a pain to do this correctly.
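[Editor's note: Kenneth's reinsertion identity p(to support them) = b(to support) p(support them) can be sketched as below. This assumes the values are log10 quantities as stored in an ARPA file, and that "b(to support them) = 0" means a log10 backoff of 0 (i.e. a backoff weight of 1) for the reinserted entry; the numeric values are made up for illustration, and KenLM's actual handling may differ in detail.]

```python
def reinsert_pruned(logp, backoff, ngram):
    """Reinsert a pruned n-gram via the backoff identity:
    p(w1..wn) = b(w1..wn-1) * p(w2..wn), giving the reinserted entry
    a backoff weight of 1 (log10 backoff 0.0).

    `logp` and `backoff` map word tuples to log10 values, as in an
    ARPA file; a missing backoff entry counts as log10(1) = 0.0.
    """
    prefix, suffix = ngram[:-1], ngram[1:]
    new_logp = backoff.get(prefix, 0.0) + logp[suffix]
    logp[ngram] = new_logp
    backoff[ngram] = 0.0
    return new_logp

# Hypothetical values: b("to support") and p("support them).
logp = {("support", "them"): -1.2}
backoff = {("to", "support"): -0.3}
lp = reinsert_pruned(logp, backoff, ("to", "support", "them"))
# lp is -0.3 + -1.2, i.e. -1.5 up to float rounding
```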
> > > > Would you be happy if only the default probing model supported it, but the trie continued to throw an error message?
> > > >
> > > > The ARPA standard, to the extent that there is one, does not require this behavior, so IRSTLM is within its rights to prune them.
> > > >
> > > > Nicola, how does IRSTLM handle these cases at inference time?
> > > >
> > > > Kenneth
> > > >
> > > > On 02/16/2012 07:59 AM, Sylvain Raybaud wrote:
> > > > > Hi
> > > > >
> > > > > LM stuff again!
> > > > >
> > > > > I've created a language model with IRSTLM (release 5.70.04):
> > > > > tlm -tr=toy.sent_start_end.en -lm=msb -n=5 -o=toy.en.n5.lm
> > > > >
> > > > > When I specify type 1 (IRSTLM) in moses.ini it loads fine. But if I try to load it with KenLM I get:
> > > > >
> > > > > The context of every 4-gram should appear as a 3-gram Byte: 471440 File: /global/markov/raybauds/DATA/TOY/toy.en.n5.lm
> > > > >
> > > > > Byte 471440 seems to be the '\n' between the following lines:
> > > > > -1.16894 to support them . -0.0679314
> > > > > -0.836008 to deal with hamas
> > > > >
> > > > > As a matter of fact, "to support them" does not appear as a trigram in the model. If I remove this 4-gram, the same problem arises with another one whose 3-gram prefix is also missing. I think that is the problem. If I change the smoothing method to "sb" instead of "msb", I get a usable LM. Is this normal behavior? Do you think it's a KenLM or an IRSTLM related problem?
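[Editor's note: the invariant KenLM rejects here — the context (all but the last word) of every n-gram must itself appear as an (n-1)-gram — can be checked with a short script. This is a sketch, not KenLM's code, and assumes the ARPA file has already been parsed into per-order sets of word tuples; the example data mirrors the lines quoted above.]

```python
def check_context_closure(ngrams_by_order):
    """Return the n-grams whose context (all words but the last) is
    missing from the next-lower order -- the condition KenLM reports as
    "The context of every N-gram should appear as an (N-1)-gram".

    `ngrams_by_order` maps order (int) to a set of word tuples.
    Orders whose lower order is absent entirely are skipped.
    """
    bad = []
    for order in sorted(ngrams_by_order):
        if order < 2 or (order - 1) not in ngrams_by_order:
            continue
        lower = ngrams_by_order[order - 1]
        for ng in ngrams_by_order[order]:
            if ng[:-1] not in lower:
                bad.append(ng)
    return bad

# The situation from this thread: the 4-gram "to support them ." is
# present, but its 3-gram context "to support them" was pruned away.
model = {
    3: {("to", "deal", "with")},
    4: {("to", "support", "them", "."), ("to", "deal", "with", "hamas")},
}
violations = check_context_closure(model)
```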
> > > > > cheers,

--
Sylvain Raybaud
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
