Hi Per,

The standard workflow is to run a postprocessing step on the output,
e.g. with scripts/tokenizer/detokenizer.perl in Moses.

Usage ./detokenizer.perl (-l [en|fr|it|cs|...]) < tokenizedfile > detokenizedfile
Options:
  -u     ... uppercase the first char in the final sentence.
  -q     ... don't report detokenizer revision.
  -b     ... disable Perl buffering.
  -penn  ... assume input is tokenized as per tokenizer.perl's -penn option.
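
If you would rather call the script from your own code than from the shell, something along these lines might work. It is only a minimal sketch: the helper names detok_command and detokenize_file and all paths are made up for illustration, and it assumes perl is on your PATH. It uses only the -l, -q and -u options listed above.

```python
import subprocess
from pathlib import Path


def detok_command(scripts_dir, lang, uppercase_first=False):
    """Build the argument list for Moses' detokenizer.perl.

    scripts_dir is assumed to be the Moses scripts directory, i.e. the
    one that contains tokenizer/detokenizer.perl.
    """
    script = Path(scripts_dir) / "tokenizer" / "detokenizer.perl"
    cmd = ["perl", str(script), "-l", lang, "-q"]  # -q: don't report revision
    if uppercase_first:
        cmd.append("-u")  # -u: see the option description in the usage text
    return cmd


def detokenize_file(scripts_dir, lang, src, dst):
    """Feed src to the detokenizer on stdin and write its stdout to dst."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        subprocess.run(
            detok_command(scripts_dir, lang),
            stdin=fin,
            stdout=fout,
            check=True,  # raise if the script exits with an error
        )


# Hypothetical call, with placeholder paths:
# detokenize_file("/path/to/mosesdecoder/scripts", "fr",
#                 "output.tok.fr", "output.detok.fr")
```

This does the same thing as the shell redirection in the usage line, just wrapped in a function.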


If you are using EMS, you might want to integrate this into your
pipeline in the following way:

[EVALUATION]
detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"

Cheers,
Matthias


On Fri, 2014-02-14 at 13:14 +0100, Per Tunedal wrote:
> Hi,
> following the baseline instructions I've tokenized and recased the text
> before training. And consequently I get similar output when translating.
> 
> Are there any scripts available to get back a normal text from the
> output? Especially the HTML encoding of some characters, e.g. the French
> é, è and ê, makes reading uncomfortable. A production system would have
> to produce readable output anyway.
> 
> What's the standard workflow?
> 
> Yours,
> Per Tunedal
> 
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
