With Moses and a single phrase table, this can happen when a word has no single-word entry in the phrase table but is covered only as part of a longer phrase. In a typical shared-task dev or test set with grow-diag-final-and GIZA alignments, this affects only about 5 to 10 words. It is possible that for these 5 to 10 words pass-through directly competes with translation (in Moses), but I haven't checked this carefully. What I did notice is that KenLM liked to output things that were missing from my LM (this was not pass-through competing with translation), so this is similar to the first scenario Chris outlined.
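For what it's worth, the way I would count those words is something like the following -- a rough, untested sketch, assuming a gzipped Moses-style phrase table with " ||| " separators and a tokenized dev-set source file (the file names are placeholders):

  # source phrases that have a single-word entry in the phrase table
  zcat phrase-table.gz | sed 's/ |||.*//' | grep -v ' ' | sort -u > single-word-sources
  # words that occur inside some multi-word source phrase
  zcat phrase-table.gz | sed 's/ |||.*//' | grep ' ' | tr ' ' '\n' | sort -u > words-in-longer-phrases
  # dev word types with no single-word entry that are still covered by a longer phrase
  tr ' ' '\n' < dev.src | sort -u > dev-types
  comm -23 dev-types single-word-sources | comm -12 - words-in-longer-phrases | wc -l

It is slow on a big table, but fine for a one-off check; if the count comes out much larger than the 5 to 10 words above, the pass-through question matters more than I assumed.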
Chris -- with respect to the second scenario, it wasn't clear to me whether you have tried allowing pass-through for a larger set of words than these 5 to 10. How do you build your open-class LM? I assume this matters a lot (a guess at the kind of recipe I have in mind is sketched at the very bottom, below the quoted thread).

Cheers, Alex

On Sun, Mar 20, 2011 at 6:58 PM, Chris Dyer <[email protected]> wrote:
> There are two sources:
>
> 1) if you have multiple LMs, and one does not include the target side of the bitext, you'll have a different profile of OOVs that are actually in the language. Relatedly, I decided to exclude the Europarl text from the LM training data, since I knew we would be translating newsy genres.
>
> 2) there seems to be some evidence that some translations in the phrase table are so bad that leaving some words untranslated is "better" than using what's in the phrase table. I can see an argument that says that you should use the phrase table entries no matter what, but my limited experiments suggest that letting the LM make this call at least improves the BLEU score. Interpret that as you will.
>
> -C
>
> On Sun, Mar 20, 2011 at 1:28 PM, Philipp Koehn <[email protected]> wrote:
>> Hi,
>>
>> can I ask a dumb question - where do these unknown words come from?
>>
>> Obviously there are words that are unknown in the source, hence placed verbatim in the output, which will likely be unknown to the language model. But there is really not much choice about having them or not (besides -drop-unknown). All translations will have them.
>>
>> Otherwise, all words in the translation model should be known.
>>
>> So, what is the choice here?
>>
>> -phi
>>
>> On Sat, Mar 19, 2011 at 7:19 PM, Kenneth Heafield <[email protected]> wrote:
>>> I believe -vocab takes a file containing the vocabulary and maps everything else in your training data to OOV, including producing n-grams that contain <unk>. Placing <unk> in the training data will cause it to be treated like any other word in the corpus, which seems to be what you want.
>>>
>>> With the -100 penalty all you're doing is forcing the OOV feature weight to be -100 * the LM weight. I suspect MERT can do a better job of determining the ratio of these weights for your particular data, but MERT is known to make mistakes.
>>>
>>> Pass-through and language model OOV are close, but separate, issues. A passed-through phrase table OOV is often still found in the language model.
>>>
>>> Kenneth
>>>
>>> On 03/19/11 15:01, Alexander Fraser wrote:
>>>> Cool, thanks for the explanation and fix.
>>>>
>>>> What does -vocab do? Is it a trick to replace things that are not in the vocab with <unk>? Does explicitly putting <unk> in the training data not work? I thought I could do that; the SRILM FAQ seems to indicate that it will work, but I haven't tried it yet.
>>>>
>>>> How exactly are you folks training your open-vocab LMs? Are you replacing something (singleton LM vocab?) with <unk>, or just adding a single line to the training data with <unk> in it? I think SRILM prunes singletons by default; does that affect <unk> at all?
>>>>
>>>> I agree in general about OOVs, but I still think it is questionable whether the addition of a single penalty is enough to let the baseline Moses model intelligently trade off between LM-OOV and LM-known words (assuming that the parallel corpus is in the LM, which I experimentally verified is a good idea many years ago, and I think the result probably still holds).
>>>> But perhaps Chris already has the results to prove me wrong. Anyway, I agree that adding this feature function is the right solution.
>>>>
>>>> BTW, if you think the Moses model with the addition of the penalty can do this trade-off correctly, then you should allow pass-through for *all* words, not just words that can wind up uncovered; you would then get a further improvement.
>>>>
>>>> Cheers, Alex
>>>>
>>>> On Sat, Mar 19, 2011 at 7:18 PM, Kenneth Heafield <[email protected]> wrote:
>>>>> With a closed-vocabulary LM, SRILM returns -inf on OOV and Moses floors this to LOWEST_SCORE, which is -100.0. If you want identical behavior from KenLM,
>>>>>
>>>>> kenlm/build_binary -u -100.0 foo.arpa foo.binary
>>>>>
>>>>> Unless you passed -vocab to SRILM (and most people don't), <unk> never appears except as a unigram. Therefore, Chris is not getting any gain from additional conditioning.
>>>>>
>>>>> OOVs can be good: names of people who appear in the news, new product names, etc.
>>>>>
>>>>> On 03/19/11 14:02, Alexander Fraser wrote:
>>>>>> Hi Folks --
>>>>>>
>>>>>> An LM-OOV feature sounds like a good solution to me. Chris, have you tried pegging the LM-OOV feature weight at an extremely high value? I suspect the gains you are getting are due to the use of <unk> in LM conditioning, i.e., p(word|... <unk> ...), rather than due to allowing more LM-OOVs.
>>>>>>
>>>>>> If the LM-OOV feature were defaulted to an extremely high value, we would get the behavior that Moses+SRILM has, but people who wanted to could try training the weight.
>>>>>>
>>>>>> I think using an open-class LM without such a penalty is not a good idea. I guess maybe the Moses+SRILM code defaults to a log probability value of something like -20 for p(LM-OOV|any-context) regardless of whether <unk> is present in the LM, so that is why it is OK to use an open-class LM with SRILM.
>>>>>>
>>>>>> Cheers, Alex
>>>>>>
>>>>>> On Sat, Mar 19, 2011 at 6:03 PM, Chris Dyer <[email protected]> wrote:
>>>>>>> I've started using an OOV feature (fires for each LM-OOV) together with an open-vocabulary LM, and found that this improves the BLEU score. Typically, the weight learned on the OOV feature (by MERT) is quite a bit more negative than the default amount estimated during LM training, but it is still far greater than the "avoid at all costs" moses/joshua OOV default behavior. As a result, there is a small increase in the number of OOVs in the output (I have not counted this number). However, I find that the BLEU score increases a bit from doing this (the magnitude depends on a number of factors), and the "extra" OOVs typically occur in places where the possible English translation would have been completely nonsensical.
>>>>>>> -Chris
>>>>>>>
>>>>>>> On Sat, Mar 19, 2011 at 12:51 PM, Alexander Fraser <[email protected]> wrote:
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw a suggestion to create an open-vocab LM (I usually use closed-vocab), but I think this means that in some context an LM-OOV could be produced in preference to a non-LM-OOV. This should not be the case in standard phrase-based SMT (e.g., using the feature functions in the Moses baseline for the shared task).
>>>>>>>> Instead, Moses should produce the minimal number of LM-OOVs possible.
>>>>>>>>
>>>>>>>> There are exceptions to this when using different feature functions. For instance, we have a paper on trading off transliteration vs. semantic translation (for Hindi-to-Urdu translation), where the transliterations are sometimes LM-OOV but still a better choice than the available semantic translations (which are not LM-OOV). But the overall SMT model we used supports this specific trade-off (and it took work to make the models do this correctly; this is described in the paper).
>>>>>>>>
>>>>>>>> I believe that for the other three LM packages used with Moses the minimal number of LM-OOVs is always produced. I've switched back to Moses+SRILM for now due to this issue. I think it may be the case that Moses+KenLM actually produces the maximal number of OOVs allowed by the phrases loaded, which would be highly undesirable. Empirically, it certainly produces more than Moses+SRILM in my experiments.
>>>>>>>>
>>>>>>>> Thanks and Cheers, Alex
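P.S. To make my "open-class LM" question above concrete, the kind of recipe I had in mind is roughly the following -- just my guess at a typical SRILM setup, not a claim about what you actually run; file names and the order are placeholders:

  # open-vocabulary LM: keep the full training vocabulary, include <unk> as a regular unigram
  ngram-count -order 5 -unk -interpolate -kndiscount -text lm-train.en -lm open-vocab.arpa

  # alternative: map everything outside an explicit word list to <unk> before estimation,
  # so <unk> can also appear inside higher-order n-grams
  ngram-count -order 5 -unk -vocab wordlist.txt -interpolate -kndiscount -text lm-train.en -lm open-vocab.arpa

Which of these two you do presumably changes how much context <unk> ever gets conditioned on, which is exactly the part I am unsure about.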
