Dear Moses support,
This is somewhat of a follow-up question to my earlier mail
"Problems with segmentation mismatch and many unknown words for Chinese
translation".
I have now run some experiments with MultiUN data, which is already in
Simplified Chinese, and while the results have improved somewhat, I still
have problems, particularly with numbers and punctuation.
As Vincent Wang pointed out, there is a script "escape-special-chars.perl" in
the mosesdecoder/scripts/tokenizer directory which could make a difference.
Actually there are more scripts there:
  deescape-special-chars.perl
  detokenizer.perl
  escape-special-chars.perl
  lowercase.perl
  normalize-punctuation.perl
  replace-unicode-punctuation.perl
  tokenizer.perl
I was wondering if anybody could tell me which of these scripts to use, and at
what stage in the preprocessing pipeline.
My own best guess is that normalize-punctuation.perl may be essential for
Moses but optional for other decoders such as Joshua, given their different
grammar formats.
I also guess that it is helpful to run replace-unicode-punctuation.perl
followed by normalize-punctuation.perl on the lowercased input
before feeding it to the segmenter or tokenizer.
Does anybody know if this understanding is right, or whether there is another
way these scripts should be used, in particular for Chinese-English
translation?
(Documentation on these scripts seems to be limited; I also searched with
"grep" but could not find any larger preprocessing script in the Moses
codebase that calls them.)
Thanks in advance.
Kind regards,
Gideon Wenniger
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support