Dear Moses support,
This is somewhat of a follow-up question to my earlier mail
"Problems with segmentation mismatch and many unknown words for Chinese
translation".
I have now run some experiments with MultiUN data, which is already in
Simplified Chinese, and results have improved a bit, but I still have
problems, particularly with numbers and punctuation.

As Vincent Wang pointed out, there is a script "escape-special-chars.perl" in
the /mosesdecoder/scripts/tokenizer directory, which could make a difference.
Actually there are more scripts there:
deescape-special-chars.perl  detokenizer.perl  escape-special-chars.perl  
lowercase.perl  normalize-punctuation.perl  replace-unicode-punctuation.perl  
tokenizer.perl  

I was wondering if anybody could tell me which of these scripts to use, and at 
what stage in the preprocessing pipeline.

My own best guess is that normalize-punctuation.perl is possibly essential for
Moses but optional for other decoders such as Joshua, due to their
different grammar formats.

I also guess that it is helpful to run replace-unicode-punctuation.perl
followed by normalize-punctuation.perl on the lowercased input
before feeding it to the segmenter or tokenizer.
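For what it's worth, my understanding of what replace-unicode-punctuation.perl does is roughly the following: it maps fullwidth/CJK punctuation marks to their ASCII equivalents before tokenization. Here is a minimal sketch of that idea (the mapping below is an illustrative subset I wrote myself, not the actual table from the Moses script):

```python
# Illustrative subset of a fullwidth-to-ASCII punctuation mapping;
# NOT the actual table used by replace-unicode-punctuation.perl.
FULLWIDTH_TO_ASCII = {
    "\uff0c": ",",   # fullwidth comma
    "\u3002": ".",   # ideographic full stop
    "\uff01": "!",   # fullwidth exclamation mark
    "\uff1f": "?",   # fullwidth question mark
    "\uff08": "(",   # fullwidth left parenthesis
    "\uff09": ")",   # fullwidth right parenthesis
    "\uff1a": ":",   # fullwidth colon
    "\uff1b": ";",   # fullwidth semicolon
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def replace_unicode_punct(line: str) -> str:
    """Replace each known fullwidth punctuation mark with its ASCII counterpart."""
    return "".join(FULLWIDTH_TO_ASCII.get(ch, ch) for ch in line)

print(replace_unicode_punct("\u4f60\u597d\uff0c\u4e16\u754c\u3002"))
```

If that is indeed the behaviour, then running it before the segmenter makes sense, since the segmenter would otherwise see two variants of every punctuation mark.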

Does anybody know if this understanding is right, or whether there is another
way these scripts should be used, in particular for Chinese-English
translation?
(Documentation on these scripts seems to be limited; I also searched with
"grep" but could not find where these scripts are called by any larger
preprocessing script in the Moses codebase.)
Thanks in advance.

Kind regards,

Gideon Wenniger
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
