Re: [Moses-support] When to truecase

Matthias Huck Fri, 22 May 2015 05:29:46 -0700

Hi,

If your system output is lowercase, you could try SRILM's `disambig`
tool for predicting the correct casing in a postprocessing step.


http://www.speech.sri.com/projects/srilm/manpages/disambig.1.html
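
As a rough illustration (a sketch under my own assumptions, not a tested recipe): `disambig` needs a map file that lists, for each lowercased word, the cased forms it may be rewritten to. As far as I understand the man page, each line holds the input word followed by its candidates, with optional probabilities that I leave out here. The helper names `build_case_map` and `format_map_lines` are made up for this example.

```python
from collections import defaultdict


def build_case_map(cased_sentences):
    """Map each lowercased token to the sorted list of cased forms seen for it."""
    variants = defaultdict(set)
    for sentence in cased_sentences:
        for token in sentence.split():
            variants[token.lower()].add(token)
    return {lower: sorted(forms) for lower, forms in variants.items()}


def format_map_lines(case_map):
    """One line per lowercased word: the word, then its cased candidates."""
    return [" ".join([lower] + forms) for lower, forms in sorted(case_map.items())]


if __name__ == "__main__":
    # Toy cased corpus; in practice this would be the cased target-side
    # training data, already tokenized.
    corpus = ["The cat sat .", "the US and us ."]
    for line in format_map_lines(build_case_map(corpus)):
        print(line)
    # The resulting file would then be passed to something like
    #   disambig -text lower.txt -map case.map -lm cased.lm -keep-unk
    # (file names are placeholders; check the man page for the exact options).
```

Note that sentence-initial capitalization ends up in the map as well ("The" above), so the cased language model has to do the work of preferring "the" in mid-sentence positions.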

Cheers,
Matthias


On Fri, 2015-05-22 at 11:20 +0200, Ondrej Bojar wrote:
> Hi,
> 
> we also have an experiment on truecasing, see Table 1 in
> http://www.statmt.org/wmt13/pdf/WMT08.pdf
> 
> What works best for us is relying on the casing as guessed by the
> lemmatizer. (Our lemmatizer recognizes names as separate lemmas and
> keeps the lemma upcased; which we then cast to the token in the
> sentence.)
> 
> The Moses recaser was the worst option; it was actually better to
> lowercase only the source side of the parallel data, i.e. to have the
> main search also pick the casing.
> 
> Cheers, O.
> 
> ----- Original Message -----
> > From: "Lane Schwartz" <dowob...@gmail.com>
> > To: "Philipp Koehn" <p...@jhu.edu>
> > Cc: moses-support@mit.edu
> > Sent: Wednesday, 20 May, 2015 20:50:41
> > Subject: Re: [Moses-support] When to truecase
> 
> > Got it. So then, how was casing handled in the "mbr/mp" column? Was all of
> > the data lowercased, then models trained, then recasing applied after
> > decoding? Or something else?
> > 
> > On Wed, May 20, 2015 at 1:30 PM, Philipp Koehn <p...@jhu.edu> wrote:
> > 
> >> Hi,
> >>
> >> no, the changes are made incrementally.
> >>
> >> So the recased "baseline" is the previous "mbr/mp" column.
> >>
> >> -phi
> >>
> >> On Wed, May 20, 2015 at 2:01 PM, Lane Schwartz <dowob...@gmail.com> wrote:
> >>
> >>> Philipp,
> >>>
> >>> In Table 2 of the WMT 2009 paper, are the "baseline" and "truecased"
> >>> columns directly comparable? In other words, do the two columns indicate
> >>> identical conditions other than a single variable (how and/or when casing
> >>> was handled)?
> >>>
> >>> In the baseline condition, how and when was casing handled?
> >>>
> >>> Thanks,
> >>> Lane
> >>>
> >>>
> >>> On Wed, May 20, 2015 at 12:43 PM, Philipp Koehn <p...@jhu.edu> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> see Section 2.2 in our WMT 2009 submission:
> >>>> http://www.statmt.org/wmt09/pdf/WMT-0929.pdf
> >>>>
> >>>> One practical reason to avoid recasing is the need
> >>>> for a second large cased language model.
> >>>>
> >>>> But there is of course also the practical issue of
> >>>> having a unique truecasing scheme for each data
> >>>> condition, handling of headlines, all-caps emphasis,
> >>>> etc.
> >>>>
> >>>> It would be worth revisiting this issue under
> >>>> different data conditions / language pairs. Both
> >>>> options are readily available in EMS.
> >>>>
> >>>> Each of the two alternative methods could be
> >>>> improved as well. See for instance:
> >>>> http://www.aclweb.org/anthology/N06-1001
> >>>>
> >>>> -phi
> >>>>
> >>>>
> >>>> On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <dowob...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Philipp (and others),
> >>>>>
> >>>>> I'm wondering what people's experience is regarding when truecasing is
> >>>>> applied.
> >>>>>
> >>>>> One option is to truecase the training data, then train your TM and LM
> >>>>> using that truecased data. Another option would be to lowercase the
> >>>>> data, train TM and LM on the lowercased data, and then perform
> >>>>> truecasing after decoding.
> >>>>>
> >>>>> I assume that the former gives better results, but the latter approach
> >>>>> has an advantage in terms of extensibility (namely if you get more data
> >>>>> and update your truecase model, you don't have to re-train all of your
> >>>>> TMs and LMs).
> >>>>>
> >>>>> Does anyone have any insights they would care to share on this?
> >>>>>
> >>>>> Thanks,
> >>>>> Lane
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Moses-support mailing list
> >>>>> Moses-support@mit.edu
> >>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> When a place gets crowded enough to require ID's, social collapse is not
> >>> far away.  It is time to go elsewhere.  The best thing about space travel
> >>> is that it made it possible to go elsewhere.
> >>>                 -- R.A. Heinlein, "Time Enough For Love"
> >>>
> >>>
> >>
> > 
> > 
> 



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
