Hi Per,

The standard workflow is to run a postprocessing step on the output, e.g. with scripts/tokenizer/detokenizer.perl in Moses.
Usage ./detokenizer.perl (-l [en|fr|it|cs|...]) < tokenizedfile > detokenizedfile
Options:
  -u     ... uppercase the first char in the final sentence.
  -q     ... don't report detokenizer revision.
  -b     ... disable Perl buffering.
  -penn  ... assume input is tokenized as per tokenizer.perl's -penn option.

If you are using EMS, you might want to integrate this into your pipeline in the following way:

[EVALUATION]
detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"

Cheers,
Matthias

On Fri, 2014-02-14 at 13:14 +0100, Per Tunedal wrote:
> Hi,
> following the baseline instructions I've tokenized and recased the text
> before training. And consequently I get similar output when translating.
>
> Are there any scripts available to get back a normal text from the
> output? Especially the html-encoding for some characters e.g. the french
> é, è and ê makes reading uncomfortable. A production system would have
> to produce readable output anyway.
>
> What's the standard work flow?
>
> Yours,
> Per Tunedal
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
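
For anyone who wants to call the detokenizer from a script rather than from the shell or EMS, the following is a minimal sketch in Python. It only uses the -l, -u and -q options from the usage text above. The helper names (detokenizer_command, detokenize_file), the file paths, and the choice to launch the script through a "perl" executable on the PATH are assumptions made for illustration; they are not part of Moses.

```python
import subprocess


def detokenizer_command(script, lang="en", uppercase_first=False, quiet=True):
    """Build the argument list for detokenizer.perl.

    `script` is the path to scripts/tokenizer/detokenizer.perl in a local
    Moses checkout (hypothetical here; adjust to your installation).
    """
    cmd = ["perl", str(script), "-l", lang]
    if uppercase_first:
        cmd.append("-u")  # uppercase the first char in the final sentence
    if quiet:
        cmd.append("-q")  # don't report detokenizer revision
    return cmd


def detokenize_file(script, tokenized_path, detokenized_path, lang="en"):
    """Equivalent of: detokenizer.perl -l LANG < tokenizedfile > detokenizedfile"""
    with open(tokenized_path, "rb") as fin, open(detokenized_path, "wb") as fout:
        # check=True raises CalledProcessError if the Perl script fails
        subprocess.run(
            detokenizer_command(script, lang),
            stdin=fin,
            stdout=fout,
            check=True,
        )
```

A call such as detokenize_file("scripts/tokenizer/detokenizer.perl", "output.tok.fr", "output.fr", lang="fr") would then mirror the shell redirection in the usage line; the file names are placeholders.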
