The original behavior was to refuse to load any model without <unk>. Early on, Hieu asked me to change that. The default is now to substitute probability 0.0 and print this complaint to stderr:
The ARPA file is missing <unk>. Substituting probability 0.0.

SRI's ngram scoring tool skips OOVs, so a probability of 0.0 reproduces that behavior (though I still charge the backoff penalty from preceding words). I'm still not happy with it. Documentation like http://statmt.org/wmt11/baseline.html carries influence. Can you add -unk?

On 03/19/11 13:07, Philipp Koehn wrote:
> Hi,
>
> I have recently built all my language models with the "-unk" flag,
> so it creates probability mass for unseen words (there is a line
> for <unk> in the language model file).
>
> But I am actually not sure if the SRILM interface properly uses
> this probability. It may just fall back to a very low floor.
> So it may be that Alex's desired feature is just a bug, which can
> be reproduced with kenlm by not training with "-unk", hence
> also falling back to the floor probability (if that is what kenlm
> is doing).
>
> -phi
>
> On Sat, Mar 19, 2011 at 4:59 PM, Kenneth Heafield <[email protected]> wrote:
>> I believe the right answer to this is adding an OOV count feature to
>> Moses. In fact, I've gone through and made all the language models
>> return a struct indicating if the word just scored was OOV. However,
>> this needs to make it into the phrases and ultimately the features.
>> Also, there's the fun of adding a config option to moses.ini. Thoughts
>> on default behavior?
>>
>> You can control the unknown word probability by passing -u probability
>> to build_binary. Set that to something negative. It will only be
>> effective if the ARPA file was trained without <unk>.
>>
>> Also, is there any evidence out there for or against passing -unk to
>> SRILM?
>>
>> Kenneth
>>
>> On 03/19/11 12:51, Alexander Fraser wrote:
>>> Hi Folks,
>>>
>>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw a
>>> suggestion to create an open-vocab LM (I usually use closed-vocab) but
>>> I think this means that in some context an LM-OOV could be produced in
>>> preference to a non-LM-OOV. This should not be the case in standard
>>> phrase-based SMT (e.g., using the feature functions used in the Moses
>>> baseline for the shared task). Instead, Moses should produce the
>>> minimal number of LM-OOVs possible.
>>>
>>> There are exceptions to this when using different feature functions.
>>> For instance, we have a paper on trading off transliteration vs
>>> semantic translation (for Hindi to Urdu translation), where the
>>> transliterations are sometimes LM-OOV, but still a better choice than
>>> available semantic translations (which are not LM-OOV). But the
>>> overall SMT models we used support this specific trade-off (and it
>>> took work to make the models do this correctly; this is described in
>>> the paper).
>>>
>>> I believe for the other three LM packages used with Moses the minimal
>>> number of LM-OOVs is always produced. I've switched back to
>>> Moses+SRILM for now due to this issue. I think it may be the case that
>>> Moses+KenLM actually produces the maximal number of OOVs allowed by
>>> the phrases loaded, which would be highly undesirable. Empirically, it
>>> certainly produces more than Moses+SRILM in my experiments.
>>>
>>> Thanks and Cheers, Alex
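
For anyone who wants to check this behavior directly, the OOV question in the thread can be answered against KenLM's C++ query interface: <unk> always has word index 0, so a vocabulary lookup that returns 0 identifies an OOV, and Score() then returns whatever <unk> probability is in effect (the model's own, or the substituted value described above). The following is a minimal sketch along the lines of the standard query example; the file name and token list are placeholders, not anything from the thread.

    // Minimal sketch: score tokens with the KenLM C++ query API and count OOVs.
    // "example.arpa" and the token list are placeholders.
    #include "lm/model.hh"

    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
      lm::ngram::Model model("example.arpa");
      const auto &vocab = model.GetVocabulary();

      lm::ngram::State state(model.BeginSentenceState()), out_state;
      std::vector<std::string> tokens = {"this", "is", "a", "sentence"};

      unsigned oov_count = 0;
      double total_log10 = 0.0;
      for (const std::string &token : tokens) {
        const lm::WordIndex index = vocab.Index(token);
        // In KenLM, <unk> always has index 0, so a lookup returning 0 is an OOV.
        if (index == 0) ++oov_count;
        // Score returns the log10 probability of the token given the current state;
        // an OOV gets the model's <unk> probability (or the substituted value above).
        total_log10 += model.Score(state, index, out_state);
        state = out_state;
      }

      std::cout << "log10 p = " << total_log10 << ", OOVs = " << oov_count << std::endl;
      return 0;
    }

Counting OOVs per hypothesis in this way is essentially what the proposed Moses OOV count feature would expose to the decoder.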

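The -u flag to build_binary mentioned in the thread also has a programmatic counterpart when loading through the query API. The sketch below is an assumption-laden illustration: I believe lm/config.hh exposes a field for the log10 probability substituted when <unk> is missing (shown here as unknown_missing_logprob), but the exact field name may differ across KenLM versions, so treat this as a sketch rather than a reference.

    // Hedged sketch: set the log10 probability substituted for a missing <unk>
    // when loading through the query API, mirroring `build_binary -u`.
    // The Config field name (unknown_missing_logprob) is an assumption; check lm/config.hh.
    #include "lm/model.hh"

    int main() {
      lm::ngram::Config config;
      // Assumed field: only takes effect if the ARPA file was built without <unk>.
      config.unknown_missing_logprob = -10.0f;

      lm::ngram::Model model("example.arpa", config);
      return 0;
    }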