Hi,

If your system output is lowercase, you could try SRILM's `disambig` tool for predicting the correct casing in a postprocessing step.
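As a rough illustration of what that involves: `disambig` reads a map file (its `-map` option) listing, for each lowercase token, the cased forms it may be rewritten to, and scores the candidates with a language model trained on cased text (`-lm`). The sketch below builds such a map from cased training sentences. It assumes the map-line layout described in the manpage (the source token followed by its candidate forms, with the optional probabilities omitted). The function names `build_case_map` and `format_map` and the toy corpus are made up for the example.

```python
from collections import defaultdict


def build_case_map(cased_sentences):
    """Collect, for each lowercased token, the cased variants seen in the data."""
    variants = defaultdict(set)
    for sentence in cased_sentences:
        for token in sentence.split():
            variants[token.lower()].add(token)
    return variants


def format_map(variants):
    """Render one line per lowercase token: the token, then its cased variants."""
    lines = []
    for lower in sorted(variants):
        lines.append(" ".join([lower] + sorted(variants[lower])))
    return "\n".join(lines)


# Toy corpus (hypothetical); in practice this would be the cased target-side data.
corpus = ["The cat saw Paris", "the Cat sat"]
case_map = build_case_map(corpus)
print(format_map(case_map))
```

The map only constrains which casings are possible. Choosing among them in context is left to the cased language model.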
http://www.speech.sri.com/projects/srilm/manpages/disambig.1.html

Cheers,
Matthias

On Fri, 2015-05-22 at 11:20 +0200, Ondrej Bojar wrote:
> Hi,
>
> we also have an experiment on truecasing, see Table 1 in
> http://www.statmt.org/wmt13/pdf/WMT08.pdf
>
> What works best for us is relying on the casing as guessed by the
> lemmatizer. (Our lemmatizer recognizes names as separate lemmas and
> keeps the lemma upcased, which we then cast to the token in the
> sentence.)
>
> The Moses recaser was the worst option; it was actually better to
> lowercase only the source side of the parallel data, i.e. have the
> main search also pick the casing.
>
> Cheers, O.
>
> ----- Original Message -----
> > From: "Lane Schwartz" <dowob...@gmail.com>
> > To: "Philipp Koehn" <p...@jhu.edu>
> > Cc: moses-support@mit.edu
> > Sent: Wednesday, 20 May, 2015 20:50:41
> > Subject: Re: [Moses-support] When to truecase
> >
> > Got it. So then, how was casing handled in the "mbr/mp" column? Was all of
> > the data lowercased, then models trained, then recasing applied after
> > decoding? Or something else?
> >
> > On Wed, May 20, 2015 at 1:30 PM, Philipp Koehn <p...@jhu.edu> wrote:
> >
> >> Hi,
> >>
> >> no, the changes are made incrementally.
> >>
> >> So the recased "baseline" is the previous "mbr/mp" column.
> >>
> >> -phi
> >>
> >> On Wed, May 20, 2015 at 2:01 PM, Lane Schwartz <dowob...@gmail.com> wrote:
> >>
> >>> Philipp,
> >>>
> >>> In Table 2 of the WMT 2009 paper, are the "baseline" and "truecased"
> >>> columns directly comparable? In other words, do the two columns indicate
> >>> identical conditions other than a single variable (how and/or when casing
> >>> was handled)?
> >>>
> >>> In the baseline condition, how and when was casing handled?
> >>>
> >>> Thanks,
> >>> Lane
> >>>
> >>>
> >>> On Wed, May 20, 2015 at 12:43 PM, Philipp Koehn <p...@jhu.edu> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> see Section 2.2 in our WMT 2009 submission:
> >>>> http://www.statmt.org/wmt09/pdf/WMT-0929.pdf
> >>>>
> >>>> One practical reason to avoid recasing is the need
> >>>> for a second large cased language model.
> >>>>
> >>>> But there is of course also the practical issue of
> >>>> having a unique truecasing scheme for each data
> >>>> condition, handling of headlines, all-caps emphasis,
> >>>> etc.
> >>>>
> >>>> It would be worth revisiting this issue under
> >>>> different data conditions / language pairs. Both
> >>>> options are readily available in EMS.
> >>>>
> >>>> Each of the two alternative methods could be
> >>>> improved as well. See for instance:
> >>>> http://www.aclweb.org/anthology/N06-1001
> >>>>
> >>>> -phi
> >>>>
> >>>>
> >>>> On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <dowob...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Philipp (and others),
> >>>>>
> >>>>> I'm wondering what people's experience is regarding when truecasing is
> >>>>> applied.
> >>>>>
> >>>>> One option is to truecase the training data, then train your TM and LM
> >>>>> using that truecased data. Another option would be to lowercase the
> >>>>> data, train TM and LM on the lowercased data, and then perform
> >>>>> truecasing after decoding.
> >>>>>
> >>>>> I assume that the former gives better results, but the latter approach
> >>>>> has an advantage in terms of extensibility (namely if you get more data
> >>>>> and update your truecase model, you don't have to re-train all of your
> >>>>> TMs and LMs).
> >>>>>
> >>>>> Does anyone have any insights they would care to share on this?
> >>>>>
> >>>>> Thanks,
> >>>>> Lane
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Moses-support mailing list
> >>>>> Moses-support@mit.edu
> >>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
> >>>
> >>> --
> >>> When a place gets crowded enough to require ID's, social collapse is not
> >>> far away. It is time to go elsewhere. The best thing about space travel
> >>> is that it made it possible to go elsewhere.
> >>> -- R.A. Heinlein, "Time Enough For Love"

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
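
Both setups in Lane's original question (truecase the data before training, or lowercase everything and restore case after decoding) need some model of each word's preferred casing. The following is a minimal sketch of a frequency-based one, assuming whitespace-tokenized text and ignoring sentence-initial tokens, whose capitalization is uninformative. It is not the Moses truecaser or recaser, and the names `train_truecaser`, `truecase` and `recase` are made up for the example.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Map each lowercased word to its most frequent casing.

    Sentence-initial tokens are skipped because they are capitalized
    regardless of the word's usual form.
    """
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {lower: c.most_common(1)[0][0] for lower, c in counts.items()}


def truecase(sentence, model):
    """Option 1: normalize cased text before training TM and LM.

    Words missing from the model are lowercased (an arbitrary choice here).
    """
    return " ".join(model.get(tok.lower(), tok.lower()) for tok in sentence.split())


def recase(lowercased_output, model):
    """Option 2: restore case on lowercased decoder output."""
    tokens = [model.get(tok, tok) for tok in lowercased_output.split()]
    if tokens:
        # Capitalize the first character of the sentence.
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)


# Toy training data (hypothetical).
training = [
    "We visited Paris in May",
    "They may visit Paris again",
    "It may rain in Paris",
]
model = train_truecaser(training)
print(truecase("May we visit paris", model))
print(recase("they may visit paris", model))
```

A context-free lookup like this cannot tell the month "May" from the verb "may", which is where a language-model-based approach such as `disambig` or the casing decisions of the main search come in.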