Philipp Koehn <[EMAIL PROTECTED]> writes:
> In truecasing, only the words at the start of the sentence
> are converted to their natural case.

That does 90% of the job.
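To make the quoted idea concrete, here is a minimal truecasing sketch (my own illustration, not the actual Moses truecaser): train a table of each word's most frequent surface form from a tokenized corpus, skipping sentence-initial tokens whose case is dictated by position, then map every token back to that form at runtime.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count surface forms per lowercased word, ignoring sentence-initial tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        # skip tokens[0]: its capitalization is positional, not "natural"
        for tok in tokens[1:]:
            counts[tok.lower()][tok] += 1
    # keep the most frequent form seen for each word
    return {w: forms.most_common(1)[0][0] for w, forms in counts.items()}

def truecase(tokens, model):
    """Restore each token's most frequent training-corpus form."""
    return [model.get(tok.lower(), tok) for tok in tokens]

# toy corpus, purely for illustration
corpus = [["The", "president", "met", "Mr", "White"],
          ["Our", "president", "is", "in", "Paris"],
          ["He", "saw", "Paris", "and", "the", "president"]]
model = train_truecaser(corpus)
print(truecase(["The", "President", "visited", "Paris"], model))
# → ['the', 'president', 'visited', 'Paris']
```

Sentence-initial "The" and mid-sentence "President" come back as lowercase, while "Paris" keeps its capital because that is its dominant form in the training data.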
Other cases(*) you may want to consider are things like:
- Headers Where Each Word Is Capitalized
- casing (mis-)used for EMPHASIS
- casing of honorific titles: in "Mr President"(eng), "President" should be 
treated as a capitalized common noun, i.e. lowercased, translated and 
recapitalized ==> "M. le Président"(fr). Whereas "Mr White"(eng) should be 
treated as a proper noun (left untranslated) ==> "M. White"(fr), not 
"M. Blanc"(fr).
- etc.
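The first two cases above can be handled by simple pre-processing heuristics. A hedged sketch (my own illustration, with hypothetical function names): detect Title-Case headers and ALL-CAPS emphasis, and lowercase them so the downstream steps see ordinary words rather than apparent proper nouns.

```python
def looks_like_header(tokens):
    """Heuristic: more than one alphabetic token, all capitalized (Title Case)."""
    words = [t for t in tokens if t.isalpha()]
    return len(words) > 1 and all(t[0].isupper() for t in words)

def looks_like_emphasis(tokens):
    """Heuristic: every alphabetic token is fully upper-cased."""
    words = [t for t in tokens if t.isalpha()]
    return len(words) > 0 and all(t.isupper() for t in words)

def normalize_case(tokens):
    """Lowercase headers and shouted text; leave normal sentences alone."""
    if looks_like_header(tokens) or looks_like_emphasis(tokens):
        return [t.lower() for t in tokens]
    return tokens

print(normalize_case(["Getting", "Started", "With", "Moses"]))
# → ['getting', 'started', 'with', 'moses']
print(normalize_case(["READ", "THIS", "FIRST"]))
# → ['read', 'this', 'first']
```

These thresholds are crude by design; as noted below, such heuristics never reach perfection, but they catch the cases that would otherwise surface verbatim in the output.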

Of course, these tokenizing/normalizing steps are a never-ending story, where 
perfection just doesn't exist, and at some point one has to claim "good enough 
is good enough", and "let the statistics sort it all out"...  Which is 
probably true for preparing the training corpus.

However, assuming you're using the same front-end tokenizer/normalizer at 
translation time, errors/oversights in this area will likely show blatantly 
in the final output.  E.g. in the first example above (capitalized header), 
if the words are not lowercased, they will look like unknown proper nouns 
and will end up untranslated.

As I discover and consider all this, I find many commonalities with Speech 
Recognition: building a good statistical model is of course paramount, so we 
tend to focus on raw accuracy figures on well-prepared test data.
But adding all the little niceties to take care of all the little 
idiosyncrasies is also a very large piece of work, and it has an extremely 
large impact on the perceived accuracy of the whole system in the real world.

(*) unintentional pun, please forgive me ;-)

--
Hubert Crépy

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support