Hi, see Section 2.2 in our WMT 2009 submission: http://www.statmt.org/wmt09/pdf/WMT-0929.pdf
One practical reason to avoid recasing is the need for a second large cased language model. But there is of course also the practical issue of having a unique truecasing scheme for each data condition, handling of headlines, all-caps emphasis, etc.

It would be worth revisiting this issue under different data conditions / language pairs. Both options are readily available in EMS.

Each of the two alternative methods could be improved as well. See for instance: http://www.aclweb.org/anthology/N06-1001

-phi

On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <[email protected]> wrote:
> Philipp (and others),
>
> I'm wondering what people's experience is regarding when truecasing is
> applied.
>
> One option is to truecase the training data, then train your TM and LM
> using that truecased data. Another option would be to lowercase the data,
> train TM and LM on the lowercased data, and then perform truecasing after
> decoding.
>
> I assume that the former gives better results, but the latter approach has
> an advantage in terms of extensibility (namely if you get more data and
> update your truecase model, you don't have to re-train all of your TMs and
> LMs).
>
> Does anyone have any insights they would care to share on this?
>
> Thanks,
> Lane
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
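
To make the two options in the question concrete, here is a minimal Python sketch. It is an illustration only, not the Moses scripts: the function names are hypothetical, the "model" is simply the most frequent casing of each word, and the recasing step is a plain lookup rather than the cased-language-model recaser discussed above.

```python
from collections import Counter, defaultdict


def train_truecase_model(cased_sentences):
    """Learn the most frequent surface form of each word (toy model)."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        tokens = sentence.split()
        # Skip the first token: its capital letter may only mark the
        # start of the sentence and says little about the word itself.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(sentence, model):
    """Option 1: normalise only the sentence-initial token before training."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


def lowercase(sentence):
    """Option 2, step 1: train TM and LM on fully lowercased text."""
    return sentence.lower()


def recase(lowercased_output, model):
    """Option 2, step 2: restore casing after decoding (lookup only)."""
    tokens = [model.get(token, token) for token in lowercased_output.split()]
    if tokens:
        # Re-capitalise the first letter of the sentence.
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)
```

With option 1 the casing model is baked into the training data, so updating it means retraining; with option 2 only `recase` depends on the model, which is the extensibility advantage raised in the question.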
