Cool, thanks for the explanation and fix.

What does -vocab do? Is it a trick to replace words that are not in
the vocab with <unk>? Does explicitly putting <unk> in the training
data not work? I thought I could do that; the SRILM FAQ seems to
indicate that it will work, but I haven't tried it yet.

How exactly are you folks training your open-vocab LMs: are you
replacing something (the singleton LM vocab?) with <unk>, or just
adding a single line containing <unk> to the training data? I think
SRILM prunes singletons by default; does that affect <unk> at all?
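For what it's worth, the singleton-replacement idea above can be sketched as a small preprocessing step (this is a hypothetical illustration of the preprocessing only, not SRILM itself; the function name is my own):

```python
from collections import Counter

def replace_singletons_with_unk(sentences):
    """Map every word that occurs exactly once in the corpus to <unk>,
    so the LM toolkit estimates an explicit <unk> probability from data."""
    counts = Counter(w for s in sentences for w in s.split())
    return [" ".join(w if counts[w] > 1 else "<unk>" for w in s.split())
            for s in sentences]

corpus = ["the cat sat", "the dog sat", "a zebra ran"]
print(replace_singletons_with_unk(corpus))
# cat, dog, a, zebra, ran are singletons here, so they become <unk>
```

The resulting file would then be fed to the LM toolkit as usual, with <unk> now appearing in real n-gram contexts rather than only as a unigram.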

I agree in general about OOVs, but I still think it is questionable
whether the addition of a single penalty is enough to let the baseline
Moses model intelligently trade off between LM-OOV and LM-known words
(assuming that the parallel corpus is included in the LM training
data, which I experimentally verified is a good idea many years ago,
and I think the result probably still holds). But perhaps Chris
already has the results to prove me wrong. Anyway, I agree that adding
this feature function is the right solution.

BTW, if you think the Moses model with the addition of the penalty can
do this trade-off correctly, then you should allow pass-through for
*all* words, not just words that can wind up uncovered; you would then
get a further improvement.
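To make the trade-off concrete, here is a toy sketch (the scores, weights, and feature names are all hypothetical, not Moses internals) of how a tunable LM-OOV count feature enters the usual linear model, so that a sufficiently negative weight suppresses pass-through while a milder one permits it:

```python
def hypothesis_score(lm_logprob, num_lm_oovs, other_features, weights):
    """Linear model score: a very negative OOV weight pushes the decoder
    toward LM-known words; a milder weight lets pass-through win when
    the other features favor it."""
    return (weights["lm"] * lm_logprob
            + weights["oov"] * num_lm_oovs
            + sum(weights[k] * v for k, v in other_features.items()))

weights = {"lm": 1.0, "oov": -20.0, "tm": 1.0}
# Pass-through hypothesis: one LM-OOV, better translation-model score.
print(hypothesis_score(-8.0, 1, {"tm": -2.0}, weights))   # -30.0
# Covered hypothesis: no LM-OOVs, worse translation-model score.
print(hypothesis_score(-12.0, 0, {"tm": -6.0}, weights))  # -18.0
```

With the -20 weight the covered hypothesis wins; tuning the OOV weight (e.g. by MERT, as Chris describes) moves this break-even point instead of hard-coding "avoid at all costs".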

Cheers, Alex


On Sat, Mar 19, 2011 at 7:18 PM, Kenneth Heafield <[email protected]> wrote:
> With a closed vocabulary LM, SRILM returns -inf on OOV and moses floors
> this to LOWEST_SCORE which is -100.0.  If you want identical behavior
> from KenLM,
>
> kenlm/build_binary -u -100.0 foo.arpa foo.binary
>
> Unless you passed -vocab to SRILM (and most people don't), <unk> never
> appears except as a unigram.  Therefore, Chris is not getting any gain
> from additional conditioning.
>
> OOVs can be good: names of people who appear in the news, new product
> names etc.
>
> On 03/19/11 14:02, Alexander Fraser wrote:
>> Hi Folks --
>>
>> An LM-OOV feature sounds like a good solution to me. Chris, have you
>> tried pegging the LM-OOV feature weight at an extremely high value? I
>> suspect the gains you are getting are due to the use of <unk> in LM
>> conditioning, i.e., p(word|... <unk> ...), rather than due to allowing
>> more LM-OOVs.
>>
>> If the LM-OOV feature were defaulted to an extremely high value, we
>> would get the behavior that Moses+SRILM has, but people who wanted to
>> could try training the weight.
>>
>> I think using an open-class LM without such a penalty is not a good
>> idea. I guess maybe the Moses+SRILM code defaults to a log probability
>> value of something like -20 for p(LM-OOV|any-context) regardless of
>> whether <unk> is present in the LM, so that is why it is OK to use an
>> open-class LM with SRILM.
>>
>> Cheers, Alex
>>
>>
>> On Sat, Mar 19, 2011 at 6:03 PM, Chris Dyer <[email protected]> wrote:
>>> I've started using an OOV feature (fires for each LM-OOV) together
>>> with an open-vocabulary LM, and found that this improves the BLEU
>>> score. Typically, the weight learned on the OOV feature (by MERT) is
>>> quite a bit more negative than the default amount estimated during LM
>>> training, but it is still far greater than the "avoid at all costs"
>>> moses/joshua OOV default behavior. As a result, there is a small
>>> increase in the number of OOVs in the output (I have not counted this
>>> number). However, I find that the BLEU score increases a bit for
>>> doing this (the magnitude depends on a number of factors), and the "extra"
>>> OOVs typically occur in places where the possible English translation
>>> would have been completely nonsensical.
>>> -Chris
>>>
>>> On Sat, Mar 19, 2011 at 12:51 PM, Alexander Fraser
>>> <[email protected]> wrote:
>>>> Hi Folks,
>>>>
>>>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw a
>>>> suggestion to create an open-vocab LM (I usually use closed-vocab) but
>>>> I think this means that in some context a LM-OOV could be produced in
>>>> preference to a non LM-OOV. This should not be the case in standard
>>>> phrase-based SMT (e.g., using the feature functions used in the Moses
>>>> baseline for the shared task for instance). Instead, Moses should
>>>> produce the minimal number of LM-OOVs possible.
>>>>
>>>> There are exceptions to this when using different feature functions.
>>>> For instance, we have a paper on trading off transliteration vs
>>>> semantic translation (for Hindi to Urdu translation), where the
>>>> transliterations are sometimes LM-OOV, but still a better choice than
>>>> available semantic translations (which are not LM-OOV). But the
>>>> overall SMT model we used supports this specific trade-off (it
>>>> took work to make the model do this correctly; this is described
>>>> in the paper).
>>>>
>>>> I believe for the other three LM packages used with Moses the minimal
>>>> number of LM-OOVs is always produced. I've switched back to
>>>> Moses+SRILM for now due to this issue. I think it may be the case that
>>>> Moses+KenLM actually produces the maximal number of OOVs allowed by
>>>> the phrases loaded, which would be highly undesirable. Empirically, it
>>>> certainly produces more than Moses+SRILM in my experiments.
>>>>
>>>> Thanks and Cheers, Alex
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>
