Ken,

We have abandoned the recaser/truecaser + detokenize.perl combination 
altogether. Instead, we developed a proprietary tokenization + 
statistical model approach that restores both tokenization and casing to 
the expected natural state (never tokenized/never lower-cased). Best of 
all, it's language-independent, so there's no need for language-specific 
detokenize scripts.

If you're interested, I'm happy to take the discussion off-list.

Tom


On 02/07/2015 04:34 AM, Kenneth Heafield wrote:
> Dear Moses,
>
>       What are the experiences with truecasing vs. the recaser?  It seems the
> recaser's default does:
>
> 1) Train a truecaser
> 2) Truecase the monolingual data
> 3) Train an LM on the truecased data
>
> There's an option to just directly go to LM training.  Any thoughts on
> which is better?
>
> It just feels weird to use the truecaser, which applies a unigram
> popularity model in some cases, to filter the training data for an
> n-gram model (so it won't be able to make n-gram decisions about those
> words).
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
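
To illustrate the concern raised in the quoted message, here is a minimal 
Python sketch of a unigram "most frequent casing" truecaser. It is not the 
Moses truecaser; the function names (train_unigram_truecaser, truecase) and 
the toy corpus are hypothetical, chosen only to show that a purely unigram 
model picks one casing per word regardless of context.

```python
from collections import Counter, defaultdict


def train_unigram_truecaser(sentences):
    """Count surface forms per lower-cased word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Sentence-initial tokens are capitalized regardless of the word,
        # so they carry no evidence about its natural casing.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep only the single most frequent casing for each word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(model, sentence):
    """Replace each token with its most frequent casing; unknown words pass through."""
    return " ".join(model.get(tok.lower(), tok) for tok in sentence.split())


# Hypothetical toy corpus: "apple" appears twice in lower case, once capitalized.
corpus = [
    "We bought an Apple laptop",
    "She ate an apple today",
    "He likes apple pie",
]
model = train_unigram_truecaser(corpus)
print(truecase(model, "They sell Apple laptops"))  # prints "They sell apple laptops"
```

Because this sketch stores a single casing per word, "Apple" the company is 
lower-cased here. An LM trained on data filtered this way would never see 
the capitalized form of such a word, so it could not use n-gram context to 
choose between the two casings.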
