Chris,

It seems as though excluding the target side of the bitext is creating
more trouble than it solves. How about another approach that avoids
LM-OOVs:

1) Include the target side, but give it less weight, so the LM can
"make its call" with something instead of nothing. We've done this by
de-duping the target side and then doubling (sometimes tripling) the
non-target-side corpora (see the sketch after point 2).

2) "some translations in the phrase table are so bad..." We've found
that many of these are the duplicitous entries (such as formatting
labels) and the above approach helps. 
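
A rough sketch of the de-dupe-and-double recipe in point 1 (file names
are hypothetical, and the SRILM options shown are just one reasonable
choice):

  # keep each target-side sentence only once
  sort -u bitext.en > bitext.dedup.en

  # repeat the non-bitext corpora so they carry relatively more weight
  cat bitext.dedup.en news.mono.en news.mono.en > lm.train.en

  # train a single LM on the combined data
  ngram-count -order 5 -kndiscount -interpolate \
      -text lm.train.en -lm lm.weighted.arpa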

Tom


-----Original Message-----
From: Chris Dyer <[email protected]>
To: Philipp Koehn <[email protected]>
Cc: [email protected]
Subject: Re: [Moses-support] producing the minimal number of LM-OOVs
Date: Sun, 20 Mar 2011 13:58:59 -0400


There are two sources:

1) if you have multiple LMs, and one does not include the target side
of the bitext, that LM will have a different profile of OOVs: words
that are actually in the language. Relatedly, I decided to exclude the
Europarl text from the LM training data, since I knew we would be
translating newsy genres. (A sketch of building two such LM variants
follows after point 2.)

2) there seems to be some evidence that some translations in the
phrase table are so bad that leaving some words untranslated
is "better" than using what's in the phrase table. I can see an
argument that says that you should use the phrase table entries no
matter what, but my limited experiments suggest that letting the LM
make this call at least improves the BLEU score. Interpret that as you
will.
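
A sketch of the two-LM setup in point 1 (file names are hypothetical;
the SRILM options shown are just one common choice):

  # LM A: monolingual news only, bitext target side excluded
  ngram-count -order 5 -kndiscount -interpolate \
      -text news.mono.en -lm lm.news-only.arpa

  # LM B: the same data plus the target side of the bitext
  cat news.mono.en bitext.en > news-plus-bitext.en
  ngram-count -order 5 -kndiscount -interpolate \
      -text news-plus-bitext.en -lm lm.news-plus-bitext.arpa

  # words that LM A treats as OOV but LM B knows
  ngram-count -text news.mono.en -write-vocab vocab.A
  ngram-count -text news-plus-bitext.en -write-vocab vocab.B
  comm -13 <(sort vocab.A) <(sort vocab.B)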

-C

On Sun, Mar 20, 2011 at 1:28 PM, Philipp Koehn <[email protected]> wrote:
> Hi,
>
> can I ask a dumb question -
> where do these unknown words come from?
>
> Obviously there are words that are unknown in the source,
> hence placed verbatim in the output, which will likely be
> unknown to the language model. But there is really not
> much choice about having them or not (besides -drop-unknown).
> All translations will have them.
>
> Otherwise, all words in the translation model should be known.
>
> So, what is the choice here?
>
> -phi
>
> On Sat, Mar 19, 2011 at 7:19 PM, Kenneth Heafield <[email protected]> wrote:
>> I believe -vocab takes a file containing the vocabulary and maps
>> everything else in your training data to OOV, including producing
>> n-grams that contain <unk>.  Placing <unk> in the training data will
>> cause it to be treated like any other word in the corpus, which seems to
>> be what you want.
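>>
>> A sketch (file names hypothetical) of the two routes just described;
>> only the vocabulary-related options matter here:
>>
>>   # with -vocab: words outside vocab.txt become <unk>, so <unk> can
>>   # appear inside higher-order n-grams
>>   ngram-count -order 5 -unk -vocab vocab.txt -text train.en -lm open.arpa
>>
>>   # with -unk alone, <unk> shows up only as a unigram; putting literal
>>   # <unk> tokens in train.en instead makes it count like any other word
>>   ngram-count -order 5 -unk -text train.en -lm open.arpa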
>>
>> With the -100 penalty all you're doing is forcing the OOV feature weight
>> to be -100 * the LM weight.  I suspect MERT can do a better job of
>> determining the ratio of these weights for your particular data, but
>> MERT is known to make mistakes.
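>>
>> (To make the arithmetic concrete, with illustrative numbers: under the
>> -100 floor each LM-OOV adds w_LM * (-100) to the model score, e.g. -10
>> when w_LM = 0.1, so the effective OOV penalty moves in lockstep with
>> the LM weight.  A separate OOV count feature adds w_OOV per OOV
>> instead, and MERT can tune w_OOV on its own.)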
>>
>> Pass-through and language model OOV are close, but separate, issues.  A
>> passed-through phrase table OOV is often still found in the language
>> model.
>>
>> Kenneth
>>
>> On 03/19/11 15:01, Alexander Fraser wrote:
>>> Cool, thanks for the explanation and fix.
>>>
>>> What does -vocab do? Is it a trick to replace things that are not in
>>> the vocab with <unk>? Does explicitly putting <unk> in the training
>>> data not work? I thought I could do that, the SRILM FAQ seems to
>>> indicate that this will work, haven't tried it yet.
>>>
>>> How exactly are you folks training your open vocab LMs, are you
>>> replacing something (singleton LM vocab?) with <unk>, or just adding a
>>> single line to the training data with <unk> in it? I think SRILM
>>> prunes singletons by default, does that affect <unk> at all?
>>>
>>> I agree in general about OOVs, but I still think it is questionable
>>> whether the addition of a single penalty is enough to let the baseline
>>> Moses model intelligently trade off between LM-OOV and LM-Known
>>> (assuming that the parallel corpus is in the LM, which I
>>> experimentally verified is a good idea many years ago, and I think the
>>> result probably still holds). But perhaps Chris already has the
>>> results to prove me wrong. Anyway, I agree that adding this feature
>>> function is the right solution.
>>>
>>> BTW, if you think the Moses model with the addition of the penalty can
>>> do this trade-off correctly, then you should allow pass-through for
>>> *all* words, not just words that can wind up uncovered; you would then
>>> get a further improvement.
>>>
>>> Cheers, Alex
>>>
>>>
>>> On Sat, Mar 19, 2011 at 7:18 PM, Kenneth Heafield <[email protected]> 
>>> wrote:
>>>> With a closed vocabulary LM, SRILM returns -inf on OOV and moses floors
>>>> this to LOWEST_SCORE which is -100.0.  If you want identical behavior
>>>> from KenLM,
>>>>
>>>> kenlm/build_binary -u -100.0 foo.arpa foo.binary
>>>>
>>>> Unless you passed -vocab to SRILM (and most people don't), <unk> never
>>>> appears except as a unigram.  Therefore, Chris is not getting any gain
>>>> from additional conditioning.
>>>>
>>>> OOVs can be good: names of people who appear in the news, new product
>>>> names etc.
>>>>
>>>> On 03/19/11 14:02, Alexander Fraser wrote:
>>>>> Hi Folks --
>>>>>
>>>>> An LM-OOV feature sounds like a good solution to me. Chris, have you
>>>>> tried pegging the LM-OOV feature weight at an extremely high penalty? I
>>>>> suspect the gains you are getting are due to the use of <unk> in LM
>>>>> conditioning, i.e., p(word|... <unk> ...), rather than due to allowing
>>>>> more LM-OOVs.
>>>>>
>>>>> If the LM-OOV feature weight were defaulted to an extremely high penalty, we
>>>>> would get the behavior that Moses+SRILM has, but people who wanted to
>>>>> could try training the weight.
>>>>>
>>>>> I think using an open-class LM without such a penalty is not a good
>>>>> idea. I guess maybe the Moses+SRILM code defaults to a log probability
>>>>> value of something like -20 for p(LM-OOV|any-context) regardless of
>>>>> whether <unk> is present in the LM, so that is why it is OK to use an
>>>>> open-class LM with SRILM.
>>>>>
>>>>> Cheers, Alex
>>>>>
>>>>>
>>>>> On Sat, Mar 19, 2011 at 6:03 PM, Chris Dyer <[email protected]> wrote:
>>>>>> I've started using an OOV feature (fires for each LM-OOV) together
>>>>>> with an open-vocabulary LM, and found that this improves the BLEU
>>>>>> score. Typically, the weight learned on the OOV feature (by MERT) is
>>>>>> quite a bit more negative than the default amount estimated during LM
>>>>>> training, but it is still far greater than the "avoid at all costs"
>>>>>> moses/joshua OOV default behavior. As a result, there is a small
>>>>>> increase in the number of OOVs in the output (I have not counted this
>>>>>> number). However, I find that the BLEU score increases a bit for
>>>>>> doing this (magnitude depends on a number of factors), and the "extra"
>>>>>> OOVs typically occur in places where the possible English translation
>>>>>> would have been completely nonsensical.
>>>>>> -Chris
>>>>>>
>>>>>> On Sat, Mar 19, 2011 at 12:51 PM, Alexander Fraser
>>>>>> <[email protected]> wrote:
>>>>>>> Hi Folks,
>>>>>>>
>>>>>>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw a
>>>>>>> suggestion to create an open-vocab LM (I usually use closed-vocab) but
>>>>>>> I think this means that in some context a LM-OOV could be produced in
>>>>>>> preference to a non LM-OOV. This should not be the case in standard
>>>>>>> phrase-based SMT (e.g., using the feature functions in the Moses
>>>>>>> baseline for the shared task). Instead, Moses should
>>>>>>> produce the minimal number of LM-OOVs possible.
>>>>>>>
>>>>>>> There are exceptions to this when using different feature functions.
>>>>>>> For instance, we have a paper on trading off transliteration vs
>>>>>>> semantic translation (for Hindi to Urdu translation), where the
>>>>>>> transliterations are sometimes LM-OOV, but still a better choice than
>>>>>>> available semantic translations (which are not LM-OOV). But the
>>>>>>> overall SMT model we used supports this specific trade-off (and it
>>>>>>> took work to make the model do this correctly; this is described in
>>>>>>> the paper).
>>>>>>>
>>>>>>> I believe for the other three LM packages used with Moses the minimal
>>>>>>> number of LM-OOVs is always produced. I've switched back to
>>>>>>> Moses+SRILM for now due to this issue. I think it may be the case that
>>>>>>> Moses+KenLM actually produces the maximal number of OOVs allowed by
>>>>>>> the phrases loaded, which would be highly undesirable. Empirically, it
>>>>>>> certainly produces more than Moses+SRILM in my experiments.
>>>>>>>
>>>>>>> Thanks and Cheers, Alex

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support