Philipp Koehn <[EMAIL PROTECTED]> writes:
> In truecasing, only the words at the start of the sentence
> are converted to their natural case.
That does 90% of the job. Other cases(*) you may want to consider are things like:

- Headers Where Each Word Is Capitalized
- casing (mis-)used for EMPHASIS
- casing of honorific titles: "Mr President" (eng), where "President" should be treated as a capitalized common noun, i.e. lowercased, translated and recapitalized ==> "M. le Président" (fr); whereas "Mr White" (eng) should be treated as a proper noun (untranslated) ==> "M. White" (fr), not "M. Blanc" (fr).
- etc.

Of course, these tokenizing/normalizing steps are a never-ending story, where perfection just doesn't exist, and at some point one has to claim "good enough is good enough" and "let the statistics sort it all out"... which is probably true for preparing the training corpus. However, assuming you're using the same front-end tokenizer/normalizer at translation time, errors or oversights in this area will likely show up blatantly in the final output. E.g. in the first example above (a capitalized header), if the words are not lowercased, they will look like unknown proper nouns and will end up untranslated.

As I discover and consider all this, I find many commonalities with Speech Recognition: building a good statistical model is of course paramount, so we tend to focus on raw accuracy figures on well-prepared test data. But adding all the little niceties to take care of all the little idiosyncrasies is also a very large piece of work, and it has an extremely large impact on the perceived accuracy of the whole system in the real world.

(*) unintentional pun, please forgive me ;-)

--
Hubert Crépy

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
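[Editor's note] The capitalized-header case described above can be sketched as a small pre-translation normalizer. This is a hypothetical illustration, not Moses code: the function names and the toy vocabulary (word forms seen lowercased in a training corpus) are assumptions made for the example.

```python
def is_capitalized_header(tokens):
    """Heuristic: more than one alphabetic token, all starting uppercase."""
    words = [t for t in tokens if t.isalpha()]
    return len(words) > 1 and all(w[0].isupper() for w in words)

def normalize_header(tokens, lowercase_vocab):
    """Lowercase header words the corpus knows in lowercase form;
    leave unknown tokens (likely proper nouns) untouched."""
    if not is_capitalized_header(tokens):
        return tokens
    return [t.lower() if t.lower() in lowercase_vocab else t for t in tokens]

# Toy vocabulary standing in for lowercased forms observed in training data.
vocab = {"introduction", "to", "machine", "translation"}

print(normalize_header("Introduction To Machine Translation".split(), vocab))
# -> ['introduction', 'to', 'machine', 'translation']
```

Without this step, "Machine" and "Translation" would reach the decoder capitalized, look like out-of-vocabulary proper nouns, and come out untranslated, which is exactly the blatant-error case mentioned above.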
