Ken,

We have abandoned the recaser/truecaser + detokenizer.perl combination altogether. Instead, we developed a proprietary approach, combining tokenization with a statistical model, that restores both tokenization and casing to the expected natural state (never tokenized, never lower-cased). Best of all, it's language-independent, so there's no need for language-specific detokenizer scripts.
If you're interested, I'm happy to take the discussion off-list.

Tom

On 02/07/2015 04:34 AM, Kenneth Heafield wrote:
> Dear Moses,
>
> What are the experiences with truecasing vs. the recaser? It seems the
> recaser's default does:
>
> 1) Train a truecaser
> 2) Truecase the monolingual data
> 3) Train an LM on the truecased data
>
> There's an option to just directly go to LM training. Any thoughts on
> which is better?
>
> It just feels weird to use the truecaser, which applies a unigram
> popularity model in some cases, to filter the training data for an
> n-gram model (so it won't be able to make n-gram decisions about those
> words).
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
