Subject: How is the calculation of the language model costs performed?
Date: 30 April 2013 14:18:53 GMT+02:00
To: [email protected]

Hi,

we are using the Moses decoder to normalise sloppy text input.
We have manually built a phrase table containing different possible normalised
versions of our input words, and assigned equal probabilities to all of these
alternatives, in order to let the language model decide on the best normalised
version.

As a toy example, we made a phrase table containing three alternatives for the
Dutch word "vndg" (an abbreviation of "vandaag"):

vndg ||| vaandag
vndg ||| vandaag
vndg ||| vndg
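
For what it's worth, this is roughly how we generate such a table. The script below is only a sketch of our setup: the file name and the choice of four equal feature scores per entry are assumptions about how a Moses phrase table is typically laid out, not necessarily the only valid layout.

```python
# Sketch: write one Moses phrase-table line per normalisation alternative,
# giving every alternative the same translation probability so that the
# language model alone decides between them.
alternatives = {"vndg": ["vaandag", "vandaag", "vndg"]}

with open("phrase-table", "w") as f:
    for src, tgts in alternatives.items():
        p = 1.0 / len(tgts)  # equal probability for each alternative
        for tgt in tgts:
            # Four feature scores per entry (assumed layout), all equal.
            f.write(f"{src} ||| {tgt} ||| {p} {p} {p} {p}\n")
```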

In the output and logging, we see that the language model cost for the correct
normalisation (vandaag) is always higher than for the other two alternatives
(vaandag/vndg), even though those do not appear in the language model at all
(they are non-existent Dutch words and do not occur in the corpus the language
model was trained on).
This seems very strange ... is there some kind of bias towards a lower LM cost
for words that do not appear in the language model at all (some kind of
smoothing, maybe)?
If so, how can we tune Moses to assign higher probabilities to words that do
occur in the language model?
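
To make concrete what we mean by "cost": as we understand it, the LM cost of a hypothesis is the negative sum of the word log-probabilities, with out-of-vocabulary words scored through some unknown-word fallback. The toy sketch below illustrates the behaviour we are seeing; all the numbers in it are made up for illustration, not taken from our actual model.

```python
# Toy log10 probabilities (assumed values, purely illustrative).
# "<unk>" stands in for however the model scores an unseen word.
logprob = {"vandaag": -3.2, "<unk>": -2.5}

def lm_cost(sentence):
    # LM cost = negative sum of word log-probabilities;
    # words not in the vocabulary fall back to the <unk> score.
    return -sum(logprob.get(w, logprob["<unk>"]) for w in sentence.split())
```

With these toy numbers the OOV word "vaandag" gets cost 2.5 while the real word "vandaag" gets cost 3.2, which mirrors the surprising ranking we observe.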

Thanks in advance!
Els Lefever.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
