Hi,

Are truecase models still widely in use?

I have a proposal for a tweak to the train-truecaser.perl script.

Currently, we don't take the first token of a segment as evidence for the
true casing of that type.  The first token of a segment is always assumed
to be the first word of a sentence, and the first word of a sentence is
capitalized regardless of its natural case.

However, if a given segment is only one token long, then the segment is
probably not a sentence, and the token is quite possibly in its natural
case.  So my proposal is to take the sole token of one-token segments as
evidence for true casing.

I attach the code change.

Any objections?  If not, I'll check it in.

Ben

Attachment: train-truecaser.perl
Description: Binary data

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support