Revision 3933 changes the default <unk> log10 probability to -100.0 and updates the messages to clarify that the field is a log10 probability, not a probability (my mistake). Because KenLM's wrapper does not call FloorScore, any backoff penalty is still charged, so the score will go below -100.0 in many cases. I believe this behavior is better than the situation with SRI, where no backoff penalty is charged, and therefore you may see different results when using KenLM on any language model without <unk>.
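For anyone who wants to set the value explicitly rather than rely on the new default, the same number can be supplied at binarization time with the -u option quoted further down the thread (a sketch; as I understand it, -u only matters when the ARPA file has no <unk> entry of its own):

    # Give <unk> a log10 probability of -100.0 in the binary model.
    # Backoff penalties from the preceding words are still added on top,
    # so the total charged for an OOV can end up below -100.0.
    kenlm/build_binary -u -100.0 foo.arpa foo.binary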
Kenneth

On 03/21/11 09:56, Kenneth Heafield wrote:
> So, assuming the parallel data is part of the language model training data, the weight on <unk> shouldn't matter. However, a severe penalty can aggravate beam search's procrastination bias. Hypotheses that haven't translated the word will fill the beam and push out hypotheses that have translated the word, procrastinating translation of an unknown.
>
> Also, we're forgetting the penalty for backing off to unigram charged by the preceding n-grams. With the way we handle SRI, the value returned is -inf, the sum of -inf log probability and some finite penalty for backing off. Then Moses maps it to -100.0.
>
> By contrast, if you set a finite <unk> probability, then the score returned is this probability plus the backoff penalty from the preceding words. This biases unknown word placement toward places that prefer backoff.
>
> In either case, if Alexander included the parallel training data in the LM data, he should not be seeing more or fewer <unk> using SRI or KenLM as they currently are. The <unk> penalty should only impact relative ranking, but KenLM's inclusion of backoff at <unk> should produce better hypotheses on average.
>
> I agree that <unk> shouldn't have probability 1.0 (or log probability 0.0) as currently implemented, though this replicates the behavior of SRI's perplexity tool. What should the default be? If I make it -inf then we'll lose the backoff, like SRI currently does.
>
> Kenneth
>
> On 03/20/11 13:28, Philipp Koehn wrote:
>> Hi,
>>
>> Can I ask a dumb question: where do these unknown words come from?
>>
>> Obviously there are words that are unknown in the source, hence placed verbatim in the output, and these will likely be unknown to the language model. But there is really not much choice about having them or not (besides -drop-unknown). All translations will have them.
>>
>> Otherwise, all words in the translation model should be known.
>>
>> So, what is the choice here?
>>
>> -phi
>>
>> On Sat, Mar 19, 2011 at 7:19 PM, Kenneth Heafield <[email protected]> wrote:
>>> I believe -vocab takes a file containing the vocabulary and maps everything else in your training data to OOV, including producing n-grams that contain <unk>. Placing <unk> in the training data will cause it to be treated like any other word in the corpus, which seems to be what you want.
>>>
>>> With the -100 penalty all you're doing is forcing the OOV feature weight to be -100 * the LM weight. I suspect MERT can do a better job of determining the ratio of these weights for your particular data, but MERT is known to make mistakes.
>>>
>>> Pass-through and language model OOV are close, but separate, issues. A passed-through phrase table OOV is often still found in the language model.
>>>
>>> Kenneth
>>>
>>> On 03/19/11 15:01, Alexander Fraser wrote:
>>>> Cool, thanks for the explanation and fix.
>>>>
>>>> What does -vocab do? Is it a trick to replace things that are not in the vocab with <unk>? Does explicitly putting <unk> in the training data not work? I thought I could do that; the SRILM FAQ seems to indicate that this will work, but I haven't tried it yet.
>>>>
>>>> How exactly are you folks training your open-vocab LMs: are you replacing something (singleton LM vocab?) with <unk>, or just adding a single line to the training data with <unk> in it? I think SRILM prunes singletons by default; does that affect <unk> at all?
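For reference, the -vocab route Kenneth describes above looks roughly like the following with SRILM's ngram-count, adding -unk so that <unk> is kept as a regular word; the smoothing flags are only an illustrative choice, not a command anyone in this thread ran:

    # vocab.txt is a plain word list; training tokens outside it are mapped
    # to <unk>, so n-grams containing <unk> are estimated like any others.
    # -unk makes the resulting LM open-vocabulary.
    ngram-count -order 5 -unk -vocab vocab.txt -interpolate -kndiscount \
        -text corpus.txt -lm open-vocab.arpa

Writing <unk> tokens directly into the training text, as asked above, should also work, though I believe -unk is still needed so SRILM treats the model as open-vocabulary.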
>>>> I agree in general about OOVs, but I still think it is questionable whether the addition of a single penalty is enough to let the baseline Moses model intelligently trade off between LM-OOV and LM-known (assuming that the parallel corpus is in the LM, which I experimentally verified is a good idea many years ago, and I think the result probably still holds). But perhaps Chris already has the results to prove me wrong. Anyway, I agree that adding this feature function is the right solution.
>>>>
>>>> BTW, if you think the Moses model with the addition of the penalty can do this trade-off correctly, then you should allow pass-through for *all* words, not just words that can wind up uncovered; you would then get a further improvement.
>>>>
>>>> Cheers, Alex
>>>>
>>>> On Sat, Mar 19, 2011 at 7:18 PM, Kenneth Heafield <[email protected]> wrote:
>>>>> With a closed-vocabulary LM, SRILM returns -inf on OOV and Moses floors this to LOWEST_SCORE, which is -100.0. If you want identical behavior from KenLM:
>>>>>
>>>>> kenlm/build_binary -u -100.0 foo.arpa foo.binary
>>>>>
>>>>> Unless you passed -vocab to SRILM (and most people don't), <unk> never appears except as a unigram. Therefore, Chris is not getting any gain from additional conditioning.
>>>>>
>>>>> OOVs can be good: names of people who appear in the news, new product names, etc.
>>>>>
>>>>> On 03/19/11 14:02, Alexander Fraser wrote:
>>>>>> Hi Folks --
>>>>>>
>>>>>> An LM-OOV feature sounds like a good solution to me. Chris, have you tried pegging the LM-OOV feature weight at an extremely high value? I suspect the gains you are getting are due to the use of <unk> in LM conditioning, i.e., p(word | ... <unk> ...), rather than due to allowing more LM-OOVs.
>>>>>>
>>>>>> If the LM-OOV feature were defaulted to an extremely high value, we would get the behavior that Moses+SRILM has, but people who wanted to could try training the weight.
>>>>>>
>>>>>> I think using an open-class LM without such a penalty is not a good idea. I guess maybe the Moses+SRILM code defaults to a log probability value of something like -20 for p(LM-OOV | any-context) regardless of whether <unk> is present in the LM, so that is why it is OK to use an open-class LM with SRILM.
>>>>>>
>>>>>> Cheers, Alex
>>>>>>
>>>>>> On Sat, Mar 19, 2011 at 6:03 PM, Chris Dyer <[email protected]> wrote:
>>>>>>> I've started using an OOV feature (fires for each LM-OOV) together with an open-vocabulary LM, and found that this improves the BLEU score. Typically, the weight learned on the OOV feature (by MERT) is quite a bit more negative than the default amount estimated during LM training, but it is still far greater than the "avoid at all costs" Moses/Joshua OOV default behavior. As a result, there is a small increase in the number of OOVs in the output (I have not counted this number). However, I find that the BLEU score increases a bit for doing this (the magnitude depends on a number of factors), and the "extra" OOVs typically occur in places where the possible English translation would have been completely nonsensical.
>>>>>>>
>>>>>>> -Chris
>>>>>>>
>>>>>>> On Sat, Mar 19, 2011 at 12:51 PM, Alexander Fraser <[email protected]> wrote:
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> Is there some way to penalize LM-OOVs when using Moses+KenLM?
>>>>>>>> I saw a suggestion to create an open-vocab LM (I usually use a closed-vocab one), but I think this means that in some contexts an LM-OOV could be produced in preference to a non-LM-OOV. This should not be the case in standard phrase-based SMT (e.g., using the feature functions of the Moses baseline for the shared task). Instead, Moses should produce the minimal number of LM-OOVs possible.
>>>>>>>>
>>>>>>>> There are exceptions to this when using different feature functions. For instance, we have a paper on trading off transliteration vs. semantic translation (for Hindi-to-Urdu translation), where the transliterations are sometimes LM-OOV but still a better choice than the available semantic translations (which are not LM-OOV). But the overall SMT model we used supports this specific trade-off (and it took work to make the models do this correctly; this is described in the paper).
>>>>>>>>
>>>>>>>> I believe that for the other three LM packages used with Moses the minimal number of LM-OOVs is always produced. I've switched back to Moses+SRILM for now due to this issue. I think it may be the case that Moses+KenLM actually produces the maximal number of OOVs allowed by the phrases loaded, which would be highly undesirable. Empirically, it certainly produces more than Moses+SRILM in my experiments.
>>>>>>>>
>>>>>>>> Thanks and Cheers, Alex
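One quick way to check the empirical claim above, i.e., to count how many LM-OOVs end up in the decoder output, is SRILM's perplexity tool, which reports an OOV count in its summary (a sketch; foo.arpa, moses-output.txt, and the order are placeholders):

    # Prints words, OOVs, and perplexity for the output against the LM.
    ngram -order 5 -lm foo.arpa -ppl moses-output.txt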
