Philipp Koehn <pkoehn@...> writes:
> Hi,
>
> can I ask a dumb question -
> where do these unknown words come from?
>
> Obviously there are words that are unknown in the source,
> hence placed verbatim in the output, which will likely
> be unknown to the language model. But there is really not
> much choice about having them or not (besides -drop-unknown).
> All translations will have them.
>
> Otherwise, all words in the translation model should be known.
>
> So, what is the choice here?
>
> -phi

Hi Philipp,

I can give you another instance where <unk> matters. I played around with integrating external knowledge through additional translation models, along the lines of

Chen et al. (2007). Multi-Engine Machine Translation with an Open-Source Decoder for Statistical Machine Translation. WMT 2007.

With this approach, the translation model(s) *do* produce words unknown to the language model, and the probability of <unk> has quite a big effect. In one experiment, setting <unk> artificially low (-100) produced better results (by about 0.5 BLEU percentage points) than just passing the "-unk" parameter to SRILM.

best,
Rico
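
To illustrate the effect described above, here is a minimal sketch of how the score assigned to <unk> can change which hypothesis the language model prefers. It is not taken from the experiment: the toy unigram model, its values, the example sentences, and the function names are all hypothetical, and it assumes the -100 above is a log10 probability. A real decoder would also weight the LM score and combine it with other features.

```python
# Toy unigram LM with log10 probabilities (hypothetical values).
TOY_LM = {"the": -1.0, "house": -2.5, "is": -1.2, "tiny": -3.5}


def lm_score(tokens, unk_logprob):
    """Sum log10 probabilities; tokens missing from TOY_LM get unk_logprob."""
    return sum(TOY_LM.get(tok, unk_logprob) for tok in tokens)


def best_hypothesis(hypotheses, unk_logprob):
    """Return the hypothesis with the highest LM score."""
    return max(hypotheses, key=lambda hyp: lm_score(hyp, unk_logprob))


# One hypothesis uses only words known to the LM; the other contains a word
# ("smallish") that an additional translation model proposed but the LM
# has never seen.
known_only = ["the", "house", "is", "tiny"]
with_oov = ["the", "house", "is", "smallish"]
hypotheses = [known_only, with_oov]

# A moderate <unk> score (hypothetical -2.0) makes the OOV word cheaper
# than the rare known word, so the hypothesis with the OOV wins.
print(best_hypothesis(hypotheses, unk_logprob=-2.0))

# A very low <unk> score (-100) makes any hypothesis with an OOV word
# lose against alternatives built from known words.
print(best_hypothesis(hypotheses, unk_logprob=-100.0))
```

In this sketch, a low <unk> score only changes the ranking when competing hypotheses differ in their number of unknown words, which matches the situation where extra translation models introduce words the LM has not seen.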
