Dear All,

Thank you all for your contributions.

Actually, I am using an LM trained on data that is not identical to the target side of the phrase table (it has a much more limited vocabulary, for my own purposes), so I don't think the -drop-unknown option would help. As Jie also emphasized, my objective is to jump one word further whenever <unk> is encountered, and so on. That would match "house <unk> in" to "house in" in the LM without doing anything else (e.g. backoff), and that is not exactly what the oov-feature=1 setting can do. I also observed that the -skipoovs option assigns zero probability to every n-gram containing an OOV and therefore does not count them in the overall sentence LM score.

So far I am more convinced that modifying the code is the way to accomplish my goal, although it is not straightforward for me at present.
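For concreteness, a minimal sketch of the scoring behaviour I am after, done offline with the kenlm Python bindings (the model path is a placeholder; inside Moses the same logic would have to live in the LM wrapper's scoring code):

    import kenlm  # assumes the kenlm Python module is installed

    model = kenlm.Model('lm.arpa')  # placeholder path to an ARPA/binary LM

    def score_skipping_unk(tokens):
        # Drop <unk> tokens entirely before querying, so that
        # "the <unk> house <unk> in" is scored exactly as "the house in",
        # with full context and no backoff penalties for the gaps.
        kept = [t for t in tokens if t != '<unk>']
        return model.score(' '.join(kept), bos=False, eos=False)

    print(score_skipping_unk(['the', '<unk>', 'house', '<unk>', 'in']))

Doing this on whole sentences is trivial; the difficulty is doing it incrementally during decoding, where the LM state would have to carry the last n-1 non-OOV words across hypothesis boundaries.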
Best,
Quang

On Fri, Jan 15, 2016 at 4:41 PM, Ergun Bicici <[email protected]> wrote:

> No comment.
>
> *Best Regards,*
> Ergun
>
> Ergun Biçici
> DFKI Projektbüro Berlin
>
> On Fri, Jan 15, 2016 at 4:20 PM, Jie Jiang <[email protected]> wrote:
>
>> Hi Ergun:
>>
>> I think the -skipoovs option just drops all the n-gram scores that have
>> an OOV in them, rather than using a skip-ngram LM.
>>
>> An easy way to test it is to run with that option and compute the log
>> probability of a sentence containing an OOV: it should come out with a
>> rather high score.
>>
>> Please correct me if I'm wrong...
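(Such a check might look like the following, using SRILM's ngram tool; the LM and test-file names are placeholders, and -debug 2 prints the per-n-gram log probabilities so the skipped contexts are visible:)

    ngram -lm lm.arpa -ppl test.txt -skipoovs -debug 2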
>> 2016-01-15 14:07 GMT+00:00 Ergun Bicici <[email protected]>:
>>
>>> Dear Jie,
>>>
>>> There may be an option in SRILM:
>>> - http://www.speech.sri.com/pipermail/srilm-user/2013q2/001509.html
>>> - http://www.speech.sri.com/projects/srilm/manpages/ngram.1.html:
>>>
>>> *-skipoovs*
>>> Instruct the LM to skip over contexts that contain out-of-vocabulary
>>> words, instead of using a backoff strategy in these cases.
>>>
>>> If it is not there, maybe it is for a reason...
>>>
>>> Bing appears to be fast at indexing this thread:
>>> http://comments.gmane.org/gmane.comp.nlp.moses.user/14570
>>>
>>> *Best Regards,*
>>> Ergun
>>>
>>> Ergun Biçici
>>> DFKI Projektbüro Berlin
>>>
>>> On Fri, Jan 15, 2016 at 2:37 PM, Jie Jiang <[email protected]> wrote:
>>>
>>>> Hi Ergun:
>>>>
>>>> The original request in Quang's post was:
>>>>
>>>> *For instance, with the n-gram "the <unk> house <unk> in", I would
>>>> like the decoder to assign it the probability of the phrase "the house
>>>> in" (which exists in the LM).*
>>>>
>>>> So each time there is a <unk> when calculating the LM score, you need
>>>> to look one word further.
>>>>
>>>> I believe this cannot be achieved with the current LM tools without
>>>> modifying the source code, as Kenneth has already clarified.
>>>>
>>>> 2016-01-15 13:20 GMT+00:00 Ergun Bicici <[email protected]>:
>>>>
>>>>> Dear Kenneth,
>>>>>
>>>>> The Moses manual mentions the -drop-unknown switch:
>>>>>
>>>>> 4.7.2 Handling Unknown Words
>>>>> Unknown words are copied verbatim to the output. They are also scored
>>>>> by the language model, and may be placed out of order. Alternatively,
>>>>> you may want to drop unknown words. To do so add the switch
>>>>> -drop-unknown.
>>>>>
>>>>> Alternatively, you can write a script that replaces all OOV tokens
>>>>> with some OOV-token identifier such as <unk> before sending the text
>>>>> for translation.
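(A minimal sketch of such a preprocessing script, assuming the LM vocabulary is available as a plain-text file with one word per line; the file name and the <unk> token are illustrative:)

    import sys

    # Load the LM vocabulary (assumed format: one word per line).
    with open('vocab.txt', encoding='utf-8') as f:
        vocab = set(line.strip() for line in f)

    # Replace every OOV token with <unk> before sending text to the decoder.
    for line in sys.stdin:
        tokens = [t if t in vocab else '<unk>' for t in line.split()]
        print(' '.join(tokens))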
>>>>>
>>>>> *Best Regards,*
>>>>> Ergun
>>>>>
>>>>> Ergun Biçici
>>>>> DFKI Projektbüro Berlin
>>>>>
>>>>> On Fri, Jan 15, 2016 at 12:22 AM, Kenneth Heafield <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think oov-feature=1 just activates the OOV count feature while
>>>>>> leaving the LM score unchanged, so it would still include p(<unk> | in).
>>>>>>
>>>>>> One might try setting the OOV feature weight to -weight_LM *
>>>>>> weird_moses_internal_constant * log p(<unk>) in an attempt to cancel
>>>>>> out the log p(<unk>) terms. However, that won't work either because:
>>>>>>
>>>>>> 1) It will still charge backoff penalties, b(the)b(house) in the
>>>>>> example.
>>>>>>
>>>>>> 2) The context will be lost each time, so it's p(house), not
>>>>>> p(house | the).
>>>>>>
>>>>>> If the <unk>s follow a pattern, such as appearing every other word,
>>>>>> one could insert them into the ARPA file, though that would waste
>>>>>> memory.
>>>>>>
>>>>>> I don't think there's any way to accomplish exactly what the OP asked
>>>>>> for without coding (though it wouldn't be that hard once one
>>>>>> understands how the LM infrastructure works).
>>>>>>
>>>>>> Kenneth
>>>>>>
>>>>>> On 01/14/2016 11:07 PM, Philipp Koehn wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > You may get the behavior you want by adding "oov-feature=1" to your
>>>>>> > LM specification line in moses.ini and also adding a second weight
>>>>>> > with value "0" to the corresponding LM weight setting.
>>>>>> >
>>>>>> > This will then only use the scores
>>>>>> > p(the|<s>)
>>>>>> > p(house|<s>,the,<unk>) ---> backoff to p(house)
>>>>>> > p(in|<s>,the,<unk>,house,<unk>) ---> backoff to p(in)
>>>>>> >
>>>>>> > -phi
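(A sketch of the configuration Philipp describes, with a placeholder feature name, path, order, and first weight; the exact feature line depends on which LM implementation is in use, so treat the syntax as illustrative. The second value under [weight] is the 0 weight for the added OOV feature:)

    [feature]
    KENLM name=LM0 factor=0 order=5 path=lm.arpa oov-feature=1

    [weight]
    LM0= 0.5 0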
>>>>>> >
>>>>>> > On Thu, Jan 14, 2016 at 8:25 AM, LUONG NGOC Quang
>>>>>> > <[email protected]> wrote:
>>>>>> >
>>>>>> > Dear All,
>>>>>> >
>>>>>> > I am currently using an SRILM language model (LM) in my Moses
>>>>>> > decoder. Does anyone know how I can ask the decoder, at decoding
>>>>>> > time, to skip all out-of-vocabulary words when computing the LM
>>>>>> > score (instead of backing off)?
>>>>>> >
>>>>>> > For instance, with the n-gram "the <unk> house <unk> in", I would
>>>>>> > like the decoder to assign it the probability of the phrase "the
>>>>>> > house in" (which exists in the LM).
>>>>>> >
>>>>>> > Do I need more options/declarations in the moses.ini file?
>>>>>> >
>>>>>> > Any help is very much appreciated,
>>>>>> >
>>>>>> > Best,
>>>>>> > Quang
>>>>
>>>> --
>>>> Best regards!
>>>> Jie Jiang
>>
>> --
>> Best regards!
>> Jie Jiang

--
Luong Ngoc Quang

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
