The original behavior was to refuse to load any model without <unk>.
Early on, Hieu asked me to change that.  The default is now to
substitute probability 0.0 and print this complaint to stderr:

The ARPA file is missing <unk>.  Substituting probability 0.0.

SRI's ngram scoring tool skips OOVs, so a log10 probability of 0.0 reproduces
that behavior (though I still charge the backoff penalty from the preceding
words).  I'm still not happy with it.
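
If you want to see exactly what your model does, here is a minimal query
sketch against KenLM's C++ API (the ARPA path and the test word are just
placeholders):

  #include "lm/model.hh"
  #include <iostream>

  int main() {
    using namespace lm::ngram;
    Model model("file.arpa");                        // placeholder path
    const Vocabulary &vocab = model.GetVocabulary();
    State in(model.BeginSentenceState()), out;
    // Index 0 is always <unk>, so it doubles as the OOV indicator.
    lm::WordIndex w = vocab.Index("somerareword");
    std::cout << "log10 p = " << model.Score(in, w, out)
              << (w == 0 ? "  [OOV]" : "") << std::endl;
    return 0;
  }

With an ARPA file that lacks <unk> and the default above, the OOV case prints
the backoff charged by the preceding context plus 0.0 for the word itself.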

Documentation like http://statmt.org/wmt11/baseline.html carries
influence.  Can you add -unk?

On 03/19/11 13:07, Philipp Koehn wrote:
> Hi,
> 
> I have recently built all my language models with the "-unk" flag,
> so it creates probability mass for unseen words (there is a line
> for <unk> in the language model file).
> 
> But I am actually not sure if the SRILM interface properly uses
> this probability. It may just fall back to a very low floor.
> So it may be that Alex's desired feature is just a bug, which can
> be reproduced with kenlm by not training with "-unk", hence
> also falling back to the floor probability (if that is what kenlm
> is doing).
> 
> -phi
> 
> On Sat, Mar 19, 2011 at 4:59 PM, Kenneth Heafield <[email protected]> wrote:
>> I believe the right answer to this is adding an OOV count feature to
>> Moses.  In fact, I've gone through and made all the language models
>> return a struct indicating if the word just scored was OOV.  However,
>> this needs to make it into the phrases and ultimately the features.
>> Also, there's the fun of adding a config option to moses.ini.  Thoughts
>> on default behavior?
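>>
>> For concreteness, the struct I have in mind looks roughly like this (the
>> name and field names are still provisional):
>>
>>   struct LMResult {
>>     float score;    // log10 probability of the word just scored
>>     bool unknown;   // true if the word was not in the LM's vocabulary
>>   };
>>
>> so a separate feature function can simply count the unknowns.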
>>
>> You can control the unknown word probability by passing -u probability
>> to build_binary.  Set that to something negative.  It will only be
>> effective if the ARPA file was trained without <unk>.
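>>
>> For example (file names are placeholders):
>>
>>   build_binary -u -10.0 model.arpa model.binary
>>
>> would substitute log10 probability -10 for unknown words in the resulting
>> binary file.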
>>
>> Also, is there any evidence out there for or against passing -unk to
>> SRILM?
>>
>> Kenneth
>>
>> On 03/19/11 12:51, Alexander Fraser wrote:
>>> Hi Folks,
>>>
>>> Is there some way to penalize LM-OOVs when using Moses+KenLM? I saw a
>>> suggestion to create an open-vocab LM (I usually use closed-vocab) but
>>> I think this means that in some contexts an LM-OOV could be produced in
>>> preference to a non-LM-OOV. This should not be the case in standard
>>> phrase-based SMT (e.g., using the feature functions from the Moses
>>> baseline for the shared task). Instead, Moses should
>>> produce the minimal number of LM-OOVs possible.
>>>
>>> There are exceptions to this when using different feature functions.
>>> For instance, we have a paper on trading off transliteration vs
>>> semantic translation (for Hindi to Urdu translation), where the
>>> transliterations are sometimes LM-OOV, but still a better choice than
>>> available semantic translations (which are not LM-OOV). But the
>>> overall SMT models we used support this specific trade-off (and it
>>> took work to make the models do this correctly; this is described in
>>> the paper).
>>>
>>> I believe for the other three LM packages used with Moses the minimal
>>> number of LM-OOVs is always produced. I've switched back to
>>> Moses+SRILM for now due to this issue. I think it may be the case that
>>> Moses+KenLM actually produces the maximal number of OOVs allowed by
>>> the phrases loaded, which would be highly undesirable. Empirically, it
>>> certainly produces more than Moses+SRILM in my experiments.
>>>
>>> Thanks and Cheers, Alex
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
