Re: [Moses-support] lowercasing/recasing

Chris Dyer Wed, 05 Mar 2008 07:08:51 -0800

There have been some advocates of preserving case information as you
describe, although I've only seen the approach discussed in the context
of small-coverage systems, such as in the IWSLT task.  See, for example,
the system description of Carnegie Mellon University's entry in IWSLT
2006:
  http://www.cs.cmu.edu/~zollmann/publications/iwslt06.pdf


One argument against preserving case information is that some of the
input to a large-coverage system may be incorrectly cased to begin with
(e.g., informal text such as emails, newsgroup posts, etc.).

For something like a Europarl-only system, where the style is mostly
consistent, and where the evaluation set will also be in that style, a
"true casing" approach might have some benefit.
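
To make the "true casing" idea concrete, here is a minimal sketch in
Python.  It is only an illustration, not the Moses recaser or any
existing tool, and the names train_truecaser and truecase are made up
for this example.  It assumes whitespace-tokenized, consistently cased
training text, learns the most frequent surface form of each word from
non-sentence-initial positions, and then rewrites only the first token
of a sentence:

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the most frequent casing of each word (hypothetical helper).

    Sentence-initial tokens are skipped, because their capital letter
    may come from typography rather than from the word itself.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Map each lowercased word to its most frequent surface form.
    return {
        lower: forms.most_common(1)[0][0]
        for lower, forms in counts.items()
    }


def truecase(sentence, model):
    """Restore the 'natural' casing of the first token only.

    Words never seen in training are left exactly as they are.
    """
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        tokens[0] = model.get(first.lower(), first)
    return " ".join(tokens)


if __name__ == "__main__":
    corpus = [
        "The white house is white .",
        "We met Barry White and the president .",
        "He said the session is closed .",
    ]
    model = train_truecaser(corpus)
    print(truecase("The session is open .", model))
    print(truecase("Barry White sings .", model))
```

This only touches the sentence-initial token, so it relies on the rest
of the sentence already being cased correctly, which is exactly the
assumption that breaks down for informal text.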

Best,
Chris

On Wed, Mar 5, 2008 at 9:50 AM, Hubert Crépy <[EMAIL PROTECTED]> wrote:
> Hieu Hoang <[EMAIL PROTECTED]> writes:
>  >> For the WMT task recasing is done by training another Moses model.
>  >> Is this a suitable approach for the online system – would you train
>  >> the recaser using all of the Europarl data?? Another option would be
>  >> to train the original models on data that has not been lowercased,
>  >> and simply remove that step.  I'd welcome your thoughts on what you
>  >> think the best approach would be.
>  > dunno much about the relative performance of recasing techniques so
>  > will let others answer that
>
>  I've been wondering about this "lowercasing" business since the very
>  beginning (a couple of weeks ago, <grin>).
>  My background is in speech recognition, and I built a fair number of
>  statistical language models for that.
>  In that landscape, we always tried to maintain as much RELEVANT
>  information as possible about our tokens.
>
>  Specifically, our tokenizers attempted (as best as possible) to remove the
>  irrelevant uppercasing that came from typography rules (most often,
>  capitalizing the first word of a sentence) and restore the words in
>  their "natural" casing (keeping initial capitals where they are supposed to
>  be).
>  By lowercasing everything, we lose the useful distinction between some
>  words (e.g. the color "white" as opposed to "Barry White", or in German,
>  the distinction between adjectives and nouns).  Recasing after
>  translation seems like a poor fix to something we shouldn't have broken
>  in the first place: translating English "mister Bill White" into French
>  "monsieur Facture Blanc", even with the initial capitals properly
>  restored, doesn't seem quite right...
>
>  - apologies if this is all well known, understood and dealt with, as I said,
>  I'm a newcomer in the translation field -
>
>  _______________________________________________
>  Moses-support mailing list
>  [email protected]
>  http://mailman.mit.edu/mailman/listinfo/moses-support
>
