There have been some advocates of preserving case information as you describe, although I've only seen the approach discussed in the context of small-coverage systems, such as those built for the IWSLT task. See, for example, the system description for Carnegie Mellon University's IWSLT 2006 entry: http://www.cs.cmu.edu/~zollmann/publications/iwslt06.pdf
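To make the idea concrete, here is a minimal sketch of the "natural casing" preprocessing described in the quoted message below: learn each word's most frequent surface form from positions that are not sentence-initial, then use that form to undo the typographic capital at the start of a sentence. The function names (train_truecaser, truecase) and the toy corpus are made up for illustration; this is not how the Moses recaser or the CMU system works, just one simple way the idea could be implemented.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the most frequent casing of each word (hypothetical helper).

    Sentence-initial tokens are skipped because their capitalization
    usually comes from typography rather than from the word itself.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep only the most frequent surface form per lowercased word.
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(tokens, model):
    """Restore the learned casing of the first token; leave the rest as-is."""
    if not tokens:
        return []
    first = tokens[0]
    # Unknown words keep whatever casing they arrived with.
    return [model.get(first.lower(), first)] + list(tokens[1:])


# Toy training data, invented for this example.
corpus = [
    "The white house is old .".split(),
    "He painted the wall white .".split(),
    "I met Barry White yesterday .".split(),
    "She saw the car .".split(),
    "We like Barry a lot .".split(),
]
model = train_truecaser(corpus)

print(truecase("The car is white .".split(), model))
print(truecase("Barry White sings .".split(), model))
```

With this toy data, the sentence-initial "The" is lowered to "the", while "Barry White" is left alone, so the name stays distinct from the color "white" in the training data.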
One argument against preserving case information is that some of what you may want to translate with a large-coverage system may be incorrectly cased to begin with (e.g., informal text such as that found in emails, newsgroups, etc.). For something like a Europarl-only system, where the style is mostly consistent and the evaluation set will be in the same style, a "true casing" approach might have some benefit.

Best,
Chris

On Wed, Mar 5, 2008 at 9:50 AM, Hubert Crépy <[EMAIL PROTECTED]> wrote:

> Hieu Hoang <[EMAIL PROTECTED]> writes:
>
> >> For the WMT task recasing is done by training another Moses model. Is
> >> this a suitable approach for the online system – would you train the
> >> recaser using all of the Europarl data?? Another option would be to
> >> train the original models on data that has not been lowercased, and
> >> simply remove that step. I'd welcome your thoughts on what you think
> >> the best approach would be.
> >
> > dunno much about the relative performance of recasing techniques so
> > will let others answer that
>
> I've been wondering about this "lowercasing" business since the very
> beginning (a couple of weeks ago, <grin>).
> My background is in speech recognition, and I built a fair number of
> statistical language models for that.
> In that landscape, we always tried to maintain as much RELEVANT
> information as possible about our tokens.
>
> Specifically, our tokenizers attempted (as best as possible) to remove
> the irrelevant uppercasing that came from typography rules (most often,
> capitalizing the first word of a sentence) and restore the words in
> their "natural" casing (keeping initial capitals where they are supposed
> to be).
> By lowercasing everything, we lose the useful distinction between some
> words (e.g. the color "white" as opposed to "Barry White", or in German,
> the distinction between adjectives and nouns).
> Recasing after translation seems like a poor fix to something we
> shouldn't have broken in the first place: translating English "mister
> Bill White" into French "monsieur Facture Blanc", even with the initial
> capitals properly restored, doesn't seem quite right...
>
> - apologies if this is all well known, understood and dealt with, as I
> said, I'm a newcomer in the translation field -
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
