Hi, I have recently built all my language models with the "-unk" flag, so that probability mass is reserved for unseen words (there is a line for <unk> in the language model file).
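
Concretely, the build step looks roughly like this (a sketch: the file
names are placeholders, and -interpolate/-kndiscount are just the
smoothing options I happen to use):

  ngram-count -order 5 -interpolate -kndiscount -unk \
      -text train.txt -lm model.arpa

With -unk, <unk> is treated as a regular word, which is why the
resulting ARPA file contains an explicit unigram entry for it.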
But I am actually not sure whether the SRILM interface properly uses
this probability; it may just fall back to a very low floor. So it may
be that Alex's desired feature is actually a bug, one that can be
reproduced with KenLM by not training with "-unk", hence also falling
back to the floor probability (if that is what KenLM does).

-phi

On Sat, Mar 19, 2011 at 4:59 PM, Kenneth Heafield <[email protected]> wrote:
> I believe the right answer to this is adding an OOV count feature to
> Moses. In fact, I've gone through and made all the language models
> return a struct indicating whether the word just scored was OOV.
> However, this needs to make it into the phrases and ultimately the
> features. Also, there's the fun of adding a config option to
> moses.ini. Thoughts on default behavior?
>
> You can control the unknown word probability by passing -u probability
> to build_binary. Set that to something negative. It will only be
> effective if the ARPA file was trained without <unk>.
>
> Also, is there any evidence out there for or against passing -unk to
> SRILM?
>
> Kenneth
>
> On 03/19/11 12:51, Alexander Fraser wrote:
>> Hi Folks,
>>
>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw
>> a suggestion to create an open-vocab LM (I usually use closed-vocab),
>> but I think this means that in some context an LM-OOV could be
>> produced in preference to a non-LM-OOV. This should not be the case
>> in standard phrase-based SMT (e.g., with the feature functions used
>> in the Moses baseline for the shared task). Instead, Moses should
>> produce the minimal number of LM-OOVs possible.
>>
>> There are exceptions to this when using different feature functions.
>> For instance, we have a paper on trading off transliteration vs.
>> semantic translation (for Hindi-to-Urdu translation), where the
>> transliterations are sometimes LM-OOV but still a better choice than
>> the available semantic translations (which are not LM-OOV). But the
>> overall SMT model we used supports this specific trade-off (and it
>> took work to make the models do this correctly; this is described in
>> the paper).
>>
>> I believe that for the other three LM packages used with Moses, the
>> minimal number of LM-OOVs is always produced. I've switched back to
>> Moses+SRILM for now due to this issue. I think it may be the case
>> that Moses+KenLM actually produces the maximal number of OOVs allowed
>> by the phrases loaded, which would be highly undesirable.
>> Empirically, it certainly produces more than Moses+SRILM in my
>> experiments.
>>
>> Thanks and Cheers, Alex

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
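
For reference, the -u option Kenneth describes above is given to
build_binary when binarizing the model; the value is a log10
probability, and (per his note) it only takes effect when the ARPA
file was built without <unk>. File names here are placeholders:

  build_binary -u -10.0 model.arpa model.binlm

One way to check what <unk> actually receives at query time (again a
sketch with placeholder names) is SRILM's per-word debug output, which
prints the log probability assigned to each token, OOVs included:

  ngram -unk -lm model.arpa -ppl test.txt -debug 2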
