I believe -vocab takes a file containing the vocabulary and maps everything else in your training data to the OOV token, including producing n-grams that contain <unk>. Placing <unk> in the training data yourself will cause it to be treated like any other word in the corpus, which seems to be what you want.
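For instance, something like this (a command-line sketch; the file names and smoothing options are made up, not taken from this thread):

    ngram-count -order 3 -unk -vocab vocab.txt -text train.txt \
        -kndiscount -interpolate -lm open.arpa

Here -vocab vocab.txt rewrites every out-of-vocabulary token in train.txt as <unk>, and -unk keeps <unk> as a regular word so the n-grams containing it survive into the ARPA file.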
With the -100 penalty, all you're doing is forcing the OOV feature weight to be -100 * the LM weight; e.g., if MERT sets the LM weight to 0.5, every OOV costs a fixed 0.5 * -100 = -50 in model score. I suspect MERT can do a better job of determining the ratio of these weights for your particular data, but MERT is known to make mistakes.

Pass-through and language model OOV are close, but separate, issues. A passed-through phrase table OOV is often still found in the language model.
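One quick way to check whether your passed-through words are actually LM-OOVs is SRILM's perplexity tool; a sketch with made-up file names, assuming a closed-vocabulary ARPA model:

    ngram -lm foo.arpa -ppl moses.out

The -ppl summary line reports the number of OOV tokens alongside the logprob and perplexity figures.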
Kenneth

On 03/19/11 15:01, Alexander Fraser wrote:
> Cool, thanks for the explanation and fix.
>
> What does -vocab do? Is it a trick to replace things that are not in
> the vocab with <unk>? Does explicitly putting <unk> in the training
> data not work? I thought I could do that; the SRILM FAQ seems to
> indicate that this will work, but I haven't tried it yet.
>
> How exactly are you folks training your open-vocab LMs? Are you
> replacing something (singleton LM vocab?) with <unk>, or just adding a
> single line to the training data with <unk> in it? I think SRILM
> prunes singletons by default; does that affect <unk> at all?
>
> I agree in general about OOVs, but I still think it is questionable
> whether the addition of a single penalty is enough to let the baseline
> Moses model intelligently trade off between LM-OOV and LM-known
> (assuming that the parallel corpus is in the LM, which I
> experimentally verified is a good idea many years ago, and I think the
> result probably still holds). But perhaps Chris already has the
> results to prove me wrong. Anyway, I agree that adding this feature
> function is the right solution.
>
> BTW, if you think the Moses model with the addition of the penalty can
> do this trade-off correctly, then you should allow pass-through for
> *all* words, not just words that can wind up uncovered; you would then
> get a further improvement.
>
> Cheers, Alex
>
> On Sat, Mar 19, 2011 at 7:18 PM, Kenneth Heafield <[email protected]> wrote:
>> With a closed-vocabulary LM, SRILM returns -inf on OOV and Moses floors
>> this to LOWEST_SCORE, which is -100.0. If you want identical behavior
>> from KenLM,
>>
>> kenlm/build_binary -u -100.0 foo.arpa foo.binary
>>
>> Unless you passed -vocab to SRILM (and most people don't), <unk> never
>> appears except as a unigram. Therefore, Chris is not getting any gain
>> from additional conditioning.
>>
>> OOVs can be good: names of people who appear in the news, new product
>> names, etc.
>>
>> On 03/19/11 14:02, Alexander Fraser wrote:
>>> Hi Folks --
>>>
>>> An LM-OOV feature sounds like a good solution to me. Chris, have you
>>> tried pegging the LM-OOV feature weight at an extremely high value? I
>>> suspect the gains you are getting are due to the use of <unk> in LM
>>> conditioning, i.e., p(word|... <unk> ...), rather than due to allowing
>>> more LM-OOVs.
>>>
>>> If the LM-OOV feature were defaulted to an extremely high value, we
>>> would get the behavior that Moses+SRILM has, but people who wanted to
>>> could try training the weight.
>>>
>>> I think using an open-class LM without such a penalty is not a good
>>> idea. I guess maybe the Moses+SRILM code defaults to a log probability
>>> value of something like -20 for p(LM-OOV|any-context) regardless of
>>> whether <unk> is present in the LM, so that is why it is OK to use an
>>> open-class LM with SRILM.
>>>
>>> Cheers, Alex
>>>
>>> On Sat, Mar 19, 2011 at 6:03 PM, Chris Dyer <[email protected]> wrote:
>>>> I've started using an OOV feature (fires for each LM-OOV) together
>>>> with an open-vocabulary LM, and found that this improves the BLEU
>>>> score. Typically, the weight learned on the OOV feature (by MERT) is
>>>> quite a bit more negative than the default amount estimated during LM
>>>> training, but it is still far milder than the "avoid at all costs"
>>>> Moses/Joshua OOV default behavior. As a result, there is a small
>>>> increase in the number of OOVs in the output (I have not counted this
>>>> number). However, I find that the BLEU score increases a bit for
>>>> doing this (the magnitude depends on a number of factors), and the
>>>> "extra" OOVs typically occur in places where the possible English
>>>> translation would have been completely nonsensical.
>>>> -Chris
>>>>
>>>> On Sat, Mar 19, 2011 at 12:51 PM, Alexander Fraser <[email protected]> wrote:
>>>>> Hi Folks,
>>>>>
>>>>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw a
>>>>> suggestion to create an open-vocab LM (I usually use closed-vocab),
>>>>> but I think this means that in some context an LM-OOV could be
>>>>> produced in preference to a non-LM-OOV. This should not be the case
>>>>> in standard phrase-based SMT (e.g., using the feature functions of
>>>>> the Moses baseline for the shared task). Instead, Moses should
>>>>> produce the minimal number of LM-OOVs possible.
>>>>>
>>>>> There are exceptions to this when using different feature functions.
>>>>> For instance, we have a paper on trading off transliteration vs.
>>>>> semantic translation (for Hindi-to-Urdu translation), where the
>>>>> transliterations are sometimes LM-OOV but still a better choice than
>>>>> the available semantic translations (which are not LM-OOV). But the
>>>>> overall SMT model we used supports this specific trade-off (and it
>>>>> took work to make the models do this correctly; this is described in
>>>>> the paper).
>>>>>
>>>>> I believe the other three LM packages used with Moses always produce
>>>>> the minimal number of LM-OOVs. I've switched back to Moses+SRILM for
>>>>> now due to this issue. I think it may be the case that Moses+KenLM
>>>>> actually produces the maximal number of OOVs allowed by the phrases
>>>>> loaded, which would be highly undesirable. Empirically, it certainly
>>>>> produces more than Moses+SRILM in my experiments.
>>>>>
>>>>> Thanks and Cheers, Alex
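For reproducing the empirical comparison above, a rough LM-agnostic way to count LM-OOV tokens in decoder output (a sketch: it assumes a one-word-per-line vocabulary file and tokenized output, and the file names are made up):

    awk 'NR==FNR { vocab[$1]; next }
         { for (i = 1; i <= NF; i++) if (!($i in vocab)) oov++ }
         END { print oov + 0, "LM-OOV tokens" }' lm.vocab moses.out

Running the same command over Moses+SRILM and Moses+KenLM outputs gives directly comparable counts.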
