Thanks, Philipp and Miles, for the feedback, and sorry for the delayed
reaction on my part.

I agree, the change should be gated.  My new proposal adds a boolean option
called --possiblyUseFirstToken.

I also refined the idea for the change itself, as follows:

Even if a segment is only one token long and is therefore probably not a
sentence, it still might be, say, a heading, so the token should still be
treated with suspicion as evidence for truecasing.  Therefore, such tokens
are taken into account (if the --possiblyUseFirstToken option is selected)
but are given only 10% as much weight as normal tokens.

On the other hand, if a token is sentence-initial and is *not* capitalized,
then there's no reason to suspect that its casing is merely a result of its
position.  Therefore, non-capitalized tokens are taken into account, and
given full weight, regardless of whether they are sentence-initial (again,
only if the --possiblyUseFirstToken option is selected).
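
To make the weighting concrete, here is a rough sketch of the rule in Python.
It is not the attached Perl code, only an illustration under a few
assumptions: the function and constant names are made up for this example,
tokens are split on whitespace, and only the first token of a segment is
treated as sentence-initial.

```python
from collections import defaultdict

# Hypothetical name; the 10% figure is the one proposed above.
ONE_TOKEN_SEGMENT_WEIGHT = 0.1


def token_weight(tokens, index, possibly_use_first_token):
    """Return how much tokens[index] counts as casing evidence."""
    if index > 0:
        # Not segment-initial: normal evidence, full weight.
        return 1.0
    if not possibly_use_first_token:
        # Existing behaviour: the first token is never used.
        return 0.0
    first_char = tokens[0][:1]
    if first_char == first_char.lower():
        # Not capitalized, so the casing cannot be due to position.
        # (Digits and punctuation also land here in this sketch.)
        return 1.0
    if len(tokens) == 1:
        # Capitalized sole token: possibly a heading, so weak evidence.
        return ONE_TOKEN_SEGMENT_WEIGHT
    # Capitalized first token of a longer segment: still ignored.
    return 0.0


def count_casing(segments, possibly_use_first_token=False):
    """Accumulate weighted counts: lowercased type -> observed form -> weight."""
    counts = defaultdict(lambda: defaultdict(float))
    for segment in segments:
        tokens = segment.split()
        for i, token in enumerate(tokens):
            weight = token_weight(tokens, i, possibly_use_first_token)
            if weight > 0:
                counts[token.lower()][token] += weight
    return counts
```

With the option on, a one-token segment "Introduction" contributes 0.1 to the
capitalized form, a segment starting with lowercase "the" contributes a full
1.0, and the "The" at the start of a longer segment is still ignored.  With
the option off, no first token contributes anything.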

The proposed code change is attached.  Any comments?  If there are no
objections, I'll check it in.

Regards,
Ben

On Mon, Oct 25, 2010 at 7:44 PM, Philipp Koehn <[email protected]> wrote:

> Hi,
>
> Sounds reasonable to me, but it would be good to have this as an option, as
> Miles suggested.
>
> -phi
>
> On 25 Oct 2010 17:40, "Ben Gottesman" <[email protected]> wrote:
> > Hi,
> >
> > Are truecase models still widely in use?
> >
> > I have a proposal for a tweak to the train-truecaser.perl script.
> >
> > Currently, we don't take the first token of a sentence as evidence for the
> > true casing of that type, on the basis that the first word of a sentence is
> > always capitalized. The first token of a segment is always assumed to be
> > the first word of a sentence, and thus is never taken as casing evidence.
> >
> > However, if a given segment is only one token long, then the segment is
> > probably not a sentence, and the token is quite possibly in its natural
> > case. So my proposal is to take the sole token of one-token segments as
> > evidence for true casing.
> >
> > I attach the code change.
> >
> > Any objections? If not, I'll check it in.
> >
> > Ben
>

Attachment: train-truecaser.perl

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
