________________________________________
From: Els Lefever
Sent: Friday, 3 May 2013 16:27
To: Kenneth Heafield
Subject: RE: [Moses-support] FW: how is calculation of the language model
costs performed?

Thanks a lot for your answers, Kenneth!

- we built the model with the SRILM toolkit (with the -unk option)

- I have a couple of related questions:

1. If I understand correctly, all OOV words are mapped to "<unk>".
When you talk about changing the weights for the OOV words, do you mean
we should manually change the log probability scores in the resulting
LM file?
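For what it's worth, one way to experiment with that is to rewrite the <unk> entry, which sits in the \1-grams: section of the ARPA file. A minimal sketch, where the toy model and the new score of -10.0 are invented for illustration:

```python
import os
import tempfile

def lower_unk(in_path, out_path, new_log10=-10.0):
    """Copy an ARPA file, replacing the log10 score on the <unk> unigram line."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        in_unigrams = False
        for line in fin:
            s = line.strip()
            if s == "\\1-grams:":
                in_unigrams = True
            elif s.startswith("\\"):              # next section header or \end\
                in_unigrams = False
            elif in_unigrams and s.split("\t")[1:2] == ["<unk>"]:
                parts = line.rstrip("\n").split("\t")
                parts[0] = str(new_log10)         # overwrite the log10 probability
                line = "\t".join(parts) + "\n"
            fout.write(line)

# Demonstrate on a two-word toy model (scores are made up).
arpa = "\\data\\\nngram 1=2\n\n\\1-grams:\n-1.2\t<unk>\n-0.5\tvandaag\n\n\\end\\\n"
src = tempfile.NamedTemporaryFile("w", suffix=".arpa", delete=False)
src.write(arpa)
src.close()
out_path = src.name + ".out"
lower_unk(src.name, out_path)
with open(out_path) as f:
    new_text = f.read()
print(new_text)
os.remove(src.name)
os.remove(out_path)
```

Note that if the model was built with trained <unk> probability mass, hand-editing this one line changes only the unigram score, not the smoothing of the rest of the model.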

2. How exactly are the LM costs calculated?
There seems to be no correspondence between the values in the LM file for
the n-grams and the actual LM costs we see in the decoder logging ...

Thanks a lot in advance!

Best,
Els Lefever.


________________________________________
From: [email protected] [[email protected]] on behalf of
Kenneth Heafield [[email protected]]
Sent: Tuesday, 30 April 2013 18:32
To: [email protected]
Subject: Re: [Moses-support] FW: how is calculation of the language model
costs performed?

Hi,

        It sounds like you could just call the language model directly, i.e.
with bin/query or through http://kheafield.com/code/kenlm/developers/ .

        You haven't said how you estimated the model.  But you can get a lower
and more theoretically justified probability by using the
--interpolate_unigrams option to bin/lmplz .
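For concreteness, the two suggestions above might look like this on the command line (the file names and the order 5 are illustrative, not from the thread):

```shell
# Query the model directly: prints per-word log10 probabilities,
# so you can see exactly which scores the decoder is summing.
bin/query model.arpa < sentences.txt

# Re-estimate the model with unigram interpolation, which gives
# <unk> a lower, better-justified probability.
bin/lmplz -o 5 --interpolate_unigrams < corpus.txt > model.arpa
```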

        But the real solution here is to look at the OOV count and log
probability yourself and decide on a weighting.  Even if the language
model did a good job at estimating OOV probability (which is
questionable), this would be the OOV rate for the training data, not the
data you're querying.
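To make the cost calculation concrete: the LM cost of a sentence is the sum of per-word conditional log10 probabilities, with backoff applied when an n-gram is missing and OOVs mapped to <unk>. A minimal sketch against a toy bigram ARPA model (all words and scores are invented for illustration; a real model would also include <s> and </s>):

```python
# A toy ARPA file (log10 probabilities) standing in for a real SRILM model.
ARPA = """\
\\data\\
ngram 1=4
ngram 2=2

\\1-grams:
-1.2\t<unk>
-0.5\tvandaag\t-0.3
-0.6\thet\t-0.2
-0.7\tis\t-0.2

\\2-grams:
-0.2\thet vandaag
-0.3\tvandaag is

\\end\\
"""

def parse_arpa(text):
    """Return {order: {ngram: (logprob, backoff)}} parsed from an ARPA string."""
    tables, order = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("-grams:"):
            order = int(line[1])              # "\1-grams:" -> 1
            tables[order] = {}
        elif line == "\\end\\":
            order = None
        elif order and "\t" in line:
            parts = line.split("\t")
            backoff = float(parts[2]) if len(parts) > 2 else 0.0
            tables[order][parts[1]] = (float(parts[0]), backoff)
    return tables

def score_word(tables, history, word):
    """log10 P(word | history) for a bigram model, with backoff.

    OOVs are mapped to <unk>, as SRILM's -unk option arranges at
    training time."""
    vocab = tables[1]
    if word not in vocab:
        word = "<unk>"
    if history is not None:
        bigram = f"{history} {word}"
        if bigram in tables[2]:
            return tables[2][bigram][0]
        # Back off: backoff weight of the history plus the unigram score.
        return vocab.get(history, (0.0, 0.0))[1] + vocab[word][0]
    return vocab[word][0]

tables = parse_arpa(ARPA)
# Total LM cost of "het vandaag" versus "het vndg" (vndg is OOV):
cost_real = score_word(tables, None, "het") + score_word(tables, "het", "vandaag")
cost_oov = score_word(tables, None, "het") + score_word(tables, "het", "vndg")
print(cost_real, cost_oov)
```

Note that the OOV string still receives a finite cost through <unk> and the backoff weight, which is exactly the "smoothing" effect asked about earlier in the thread.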

Kenneth

On 04/30/13 14:59, Els Lefever wrote:
> Subject: how is calculation of the language model costs performed?
>> Date: 30 April 2013 14:18:53 GMT+02:00
>> To: [email protected]
>>
>> Hi,
>>
>> we are using the Moses decoder to normalise sloppy text input.
>> We have manually built a phrase table containing different possible
>> normalised versions of our input words, and assigned equal
>> probabilities to all of these alternatives,
>> in order to let the language model decide on the best normalised version.
>>
>> As a toy example, we made a phrase table containing three alternatives
>> for the Dutch word "vndg" (an abbreviation of "vandaag"):
>>
>> vndg ||| vaandag
>> vndg ||| vandaag
>> vndg ||| vndg
>>
>> In the output and logging, we see that the language model cost for the
>> correct normalisation (vandaag) is always higher than for the other
>> two alternatives (vaandag/vndg), even though those do not appear in the
>> language model at all (they are non-existent Dutch words and do not
>> occur in the training corpus used to build the language model).
>> This seems very strange ... is there some kind of bias towards a lower
>> LM cost for words that do not appear in the language model at all
>> (some kind of smoothing, maybe)?
>> If this is the case, how can we tune Moses to assign higher
>> probabilities to words that do occur in the language model?
>>
>> Thanks in advance!
>> Els Lefever.
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>