Hi,

see Section 2.2 in our WMT 2009 submission:
http://www.statmt.org/wmt09/pdf/WMT-0929.pdf

One practical reason to avoid recasing is the need
for a second large cased language model.

But there is of course also the practical issue of
having a unique truecasing scheme for each data
condition, and of handling headlines, all-caps
emphasis, etc.
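
To make the truecasing side concrete, here is a minimal sketch in Python. It assumes a simple frequency-based model: each word is mapped to the casing it most often has when it is not sentence-initial, and only the first token of a sentence is rewritten. This is an illustration and not the Moses truecaser; the function names and the toy corpus are hypothetical, and headlines and all-caps emphasis are deliberately left unhandled.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count the casings of each word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # The first token is capitalized by position, so it is not evidence.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep the most frequent surface form for each lowercased word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the sentence-initial token; unknown words are kept as-is."""
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        tokens[0] = model.get(first.lower(), first)
    return " ".join(tokens)


# Hypothetical toy corpus (tokenized, one sentence per string).
corpus = [
    "We saw the house .",
    "Then we met Peter in the house .",
    "I think we saw Peter .",
]
model = train_truecaser(corpus)

print(truecase("We saw Peter .", model))
print(truecase("Peter saw the house .", model))
```

With this sketch, a sentence-initial "We" is mapped to "we" because that is the form seen mid-sentence, while "Peter" keeps its capital. Any change to the model changes the training data itself, which is why this option requires retraining the TM and LM.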

It would be worth revisiting this issue under
different data conditions / language pairs. Both
options are readily available in EMS.

Each of the two alternative methods could be
improved as well. See for instance:
http://www.aclweb.org/anthology/N06-1001

-phi


On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <[email protected]> wrote:

> Philipp (and others),
>
> I'm wondering what people's experience is regarding when truecasing is
> applied.
>
> One option is to truecase the training data, then train your TM and LM
> using that truecased data. Another option would be to lowercase the data,
> train TM and LM on the lowercased data, and then perform truecasing after
> decoding.
>
> I assume that the former gives better results, but the latter approach has
> an advantage in terms of extensibility (namely if you get more data and
> update your truecase model, you don't have to re-train all of your TMs and
> LMs).
>
> Does anyone have any insights they would care to share on this?
>
> Thanks,
> Lane
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>