Hieu Hoang <[EMAIL PROTECTED]> writes:

>> For the WMT task recasing is done by training another Moses model. Is this a
>> suitable approach for the online system – would you train the recaser using
>> all of the Europarl data? Another option would be to train the original models
>> on data that has not been lowercased, and simply remove that step. I’d welcome
>> your thoughts on what you think the best approach would be.

> dunno much about the relative performance of recasing techniques so will let
> others answer that
I've been wondering about this "lowercasing" business since the very beginning (a couple of weeks ago, <grin>).

My background is in speech recognition, and I built a fair number of statistical language models for that. In that field we always tried to keep as much RELEVANT information as possible about our tokens. Specifically, our tokenizers attempted (as best they could) to remove the irrelevant uppercasing that comes from typography rules (most often, capitalizing the first word of a sentence) and to restore words to their "natural" casing (keeping initial capitals where they are supposed to be).

By lowercasing everything, we lose the useful distinction between some words (e.g. the color "white" as opposed to "Barry White", or, in German, the distinction between adjectives and nouns). Recasing after translation seems like a poor fix for something we shouldn't have broken in the first place: translating the English "mister Bill White" into the French "monsieur Facture Blanc", even with the initial capitals properly restored, doesn't seem quite right...

- apologies if this is all well known, understood and dealt with; as I said, I'm a newcomer to the translation field -
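In case a concrete sketch helps, here is roughly what I mean by keeping the "natural" casing: a minimal frequency-based truecaser that learns each word's most common surface form from cased text, ignoring sentence-initial positions (where the capital is usually just typography), and then uses that to restore sentence-initial words. This is only an illustration of the idea from my speech-recognition days; the class and names below are mine, not anything shipped with Moses.

    from collections import Counter, defaultdict

    class SimpleTruecaser:
        """Frequency-based truecaser: for each word, remember the casing it
        most often has when it is NOT the first token of a sentence, since
        sentence-initial capitals are typography rather than information."""

        def __init__(self):
            self._counts = defaultdict(Counter)

        def train(self, sentences):
            """sentences: iterable of token lists in their original casing."""
            for tokens in sentences:
                # Skip position 0: sentence-initial capitals are unreliable evidence.
                for tok in tokens[1:]:
                    self._counts[tok.lower()][tok] += 1

        def truecase(self, tokens):
            """Restore each word's most frequent casing; lowercase the first
            token if the word was never seen in a non-initial position."""
            out = []
            for i, tok in enumerate(tokens):
                forms = self._counts.get(tok.lower())
                if forms:
                    out.append(forms.most_common(1)[0][0])
                elif i == 0:
                    out.append(tok.lower())
                else:
                    out.append(tok)
            return out

    # Toy example (made-up sentences): the typographic capital on "The" is
    # removed, while the name "Barry White" keeps its capitals.
    tc = SimpleTruecaser()
    tc.train([
        ["I", "met", "Barry", "White", "in", "Paris", "."],
        ["We", "saw", "Barry", "White", "on", "stage", "."],
        ["He", "painted", "the", "fence", "white", "."],
    ])
    print(tc.truecase(["The", "singer", "Barry", "White", "smiled", "."]))
    # -> ['the', 'singer', 'Barry', 'White', 'smiled', '.']

Of course a unigram model like this cannot tell the color "white" from Mr. White by context, which is exactly why I'd rather not throw the distinction away in the training data and then try to reconstruct it afterwards.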
