I always thought that lowercasing was about the sparse data problem and not about poor input data. But actually I'm not sure whether the GIZA alignments on lowercased Europarl data are any better than those on the original forms. Has anyone carried out a thorough comparison for various language pairs yet?
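
To make the sparse data point concrete, here is a minimal sketch of what lowercasing does to the vocabulary. It only counts word types before and after lowercasing; it says nothing about alignment quality, which is the open question. The sentences are a toy sample invented for illustration (not Europarl data), and `type_token_stats` is a hypothetical helper name.

```python
from collections import Counter


def type_token_stats(sentences, lowercase=False):
    """Return (number of distinct word types, number of tokens)."""
    counts = Counter()
    for sentence in sentences:
        if lowercase:
            sentence = sentence.lower()
        # Assumes the text is already tokenized and whitespace-separated.
        counts.update(sentence.split())
    return len(counts), sum(counts.values())


# Toy sample, invented for illustration; not real corpus data.
sample = [
    "The committee approved the report .",
    "the report was approved by The Committee .",
    "Report of the committee",
]

cased = type_token_stats(sample)
lowered = type_token_stats(sample, lowercase=True)

print("cased      (types, tokens):", cased)    # (11, 18)
print("lowercased (types, tokens):", lowered)  # (8, 18)
```

With the same number of tokens spread over fewer types, each word form is observed more often, which is the usual argument for lowercasing before alignment.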
jörg

On Wed, 05 Mar 2008 16:42:44 +0100 Hubert Crépy <[EMAIL PROTECTED]> wrote:
> Chris Dyer wrote:
>> One argument against preserving case information is that some of what
>> you may want to translate in a large-coverage system may be
>> incorrectly cased to begin with (e.g., informal text, such as what is
>> found in emails, newsgroups, etc).
>>
> Good point, one that I hadn't considered: "poor quality" input (in other
> words: real world input).
> I just wonder how much harm we do to the translation of "good quality"
> input, in the hopes of fixing problems with "poor quality" input...
> Some would call me rigid, but I personally would try to favor users who
> provide good input, and not worry too much about those who don't.
>
> Faced with improper input, would it not make more sense to try and "fix
> it" in the source language before translation, rather than distorting
> the translation with the induced errors, then trying to fix the
> translation?
>
> --
> Hubert Crépy
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
