Hieu Hoang <[EMAIL PROTECTED]> writes:
>> For the WMT task recasing is done by training another Moses model. Is this a
>> suitable approach for the online system – would you train the recaser using all
>> of the Europarl data? Another option would be to train the original models on
>> data that has not been lowercased, and simply remove that step. I’d welcome
>> your thoughts on what you think the best approach would be.
> dunno much about the relative performance of recasing techniques so will let
> others answer that

I've been wondering about this "lowercasing" business since the very beginning 
(a couple of weeks ago, <grin>).
My background is in speech recognition, and I built a fair number of  
statistical language models for that.
In that landscape, we always tried to maintain as much RELEVANT information as 
possible about our tokens.

Specifically, our tokenizers attempted (as well as they could) to remove the 
irrelevant uppercasing introduced by typography rules (most often, capitalization 
of the first word of a sentence) and to restore words to their "natural" casing 
(keeping initial capitals where they are supposed to be).
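To make concrete what I mean by restoring "natural" casing, here is a rough 
sketch in Python (my own illustration, not anything from the Moses tooling; the 
corpus-frequency heuristic and the function names are just assumptions): lowercase 
a sentence-initial token only when the rest of the corpus shows the word is 
normally written in lowercase.

    from collections import Counter

    def build_case_counts(sentences):
        # Count surface forms seen mid-sentence, where capitals are "natural"
        # rather than typographic.
        counts = Counter()
        for sent in sentences:
            for tok in sent.split()[1:]:   # skip the (typographic) first token
                counts[tok] += 1
        return counts

    def truecase_first_token(sentence, counts):
        # Undo the typographic sentence-initial capital only if the lowercase
        # form dominates elsewhere in the corpus.
        toks = sentence.split()
        if not toks:
            return sentence
        first = toks[0]
        if counts[first.lower()] > counts[first]:
            toks[0] = first.lower()
        return " ".join(toks)

With counts built from enough text, a sentence starting with "The" would be 
rewritten to start with "the", while one starting with "Germany" would keep its 
capital, since that form is almost always capitalized mid-sentence.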
By lowercasing everything, we lose the useful distinction between some words 
(e.g. the color "white" as opposed to "Barry White", or, in German, the 
distinction between adjectives and nouns).  Recasing after translation seems 
like a poor fix for something we shouldn't have broken in the first place: 
translating the English "mister Bill White" into the French "monsieur Facture 
Blanc", even with the initial capitals properly restored, doesn't seem quite 
right...

- apologies if this is all well known, understood, and dealt with; as I said, 
I'm a newcomer to the translation field -
