Thanks, Philipp and Miles, for the feedback, and sorry for the delayed reaction on my part.

I agree, the change should be gated. My new proposal adds a boolean option called --possiblyUseFirstToken.

I also refined the idea for the change itself, as follows:

Even if a segment is only one token long and is therefore probably not a sentence, it still might be, say, a heading, so the token should still be treated with suspicion as evidence for truecasing. Therefore, such tokens are taken into account (if the --possiblyUseFirstToken option is selected) but are given only 10% as much weight as normal tokens.

On the other hand, if a token is sentence-initial and is *not* capitalized, then there's no reason to be wary that the given casing is only due to it being sentence-initial. Therefore, non-capitalized tokens are taken into account, and given full weight, regardless of whether they are sentence-initial (again, if the --possiblyUseFirstToken option is selected).

The proposed code change is attached.

Any comments? If there are no objections, I'll check it in.

Regards,
Ben

On Mon, Oct 25, 2010 at 7:44 PM, Philipp Koehn <[email protected]> wrote:

> Hi,
>
> Sounds reasonable to me, but it would be good to have this as an option, as
> Miles suggested.
>
> -phi
>
> On 25 Oct 2010 17:40, "Ben Gottesman" <[email protected]> wrote:
> > Hi,
> >
> > Are truecase models still widely in use?
> >
> > I have a proposal for a tweak to the train-truecaser.perl script.
> >
> > Currently, we don't take the first token of a sentence as evidence for the
> > true casing of that type, on the basis that the first word of a sentence is
> > always capitalized. The first token of a segment is always assumed to be
> > the first word of a sentence, and thus is never taken as casing evidence.
> >
> > However, if a given segment is only one token long, then the segment is
> > probably not a sentence, and the token is quite possibly in its natural
> > case. So my proposal is to take the sole token of one-token segments as
> > evidence for true casing.
> >
> > I attach the code change.
> >
> > Any objections? If not, I'll check it in.
> >
> > Ben
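
For concreteness, the weighting rule described in the new proposal can be sketched roughly as follows. This is an illustrative Python sketch only, not the attached Perl change: the function name `token_weights`, the parameter `possibly_use_first_token` (standing in for the --possiblyUseFirstToken option), and the constant name are all assumptions made for the example. It also assumes that "capitalized" means the first character is upper case, and that a non-capitalized sole token of a one-token segment gets full weight rather than 10%.

```python
# Hypothetical sketch of the proposed evidence weighting; names are illustrative.
SOLE_TOKEN_WEIGHT = 0.1  # "only 10% as much weight as normal tokens"


def token_weights(tokens, possibly_use_first_token=False):
    """Return (token, weight) pairs for one segment, omitting ignored tokens."""
    weighted = []
    for index, token in enumerate(tokens):
        if index > 0:
            # Non-initial tokens always count as full casing evidence.
            weight = 1.0
        elif not possibly_use_first_token:
            # Current behaviour: the first token is never used as evidence.
            weight = 0.0
        elif not token[:1].isupper():
            # Non-capitalized first token: its casing cannot be due to position.
            weight = 1.0
        elif len(tokens) == 1:
            # Capitalized sole token (maybe a heading): weak evidence only.
            weight = SOLE_TOKEN_WEIGHT
        else:
            # Capitalized first token of a longer segment: still ignored.
            weight = 0.0
        if weight > 0:
            weighted.append((token, weight))
    return weighted


print(token_weights(["The", "cat"], possibly_use_first_token=True))
print(token_weights(["Introduction"], possibly_use_first_token=True))
print(token_weights(["iPhone", "sales"], possibly_use_first_token=True))
```

With the option off, the first token is skipped in every case, which matches the behaviour described in the original message below the proposal.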
train-truecaser.perl
Description: Binary data
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
