I've been looking at the parallel Chinese-English data the LDC scraped
from the UN some time ago, and it appears that there are systematic
encoding problems, at least on the English side. Somewhere along the
line, the data seems to have been decoded with the wrong character
encoding, probably by reading Latin-1 as if it were Big5 or a GB variant.
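
To illustrate the kind of error I suspect, here is a minimal Python sketch.
It is only a guess at the mechanism, not a confirmed diagnosis: the choice of
the "gbk" codec, the sample word, and the helper names garble and try_repair
are all assumptions made up for this example.

```python
def garble(text, wrong_codec="gbk"):
    """Simulate the suspected error: Latin-1 bytes decoded with a Chinese codec.

    The codec name is an assumption; the real pipeline may have used Big5
    or another GB variant.
    """
    raw = text.encode("latin-1")
    # Bytes that do not form a valid sequence become U+FFFD.
    return raw.decode(wrong_codec, errors="replace")


def try_repair(garbled, wrong_codec="gbk"):
    """Try to undo the error; return None when the damage is not reversible."""
    try:
        return garbled.encode(wrong_codec).decode("latin-1")
    except UnicodeEncodeError:
        # U+FFFD (or any character outside the codec) means bytes were lost.
        return None


sample = "résumés"           # each "é" (0xE9) is followed by "s", a valid trail byte
broken = garble(sample)      # "és" pairs collapse into single CJK characters
repaired = try_repair(broken)

lossy = garble("café")       # trailing 0xE9 has no trail byte, so it is replaced
unrecoverable = try_repair(lossy)
```

If something like this is what happened, the damage is reversible only where
the accented byte and the byte after it happened to form a valid two-byte
sequence; elsewhere the original bytes are gone.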
The problem is partly visible in the odd \x{} escape sequences I've
asked about before, which some geniuses on the Unicode list recently
figured out for me - see here if you're interested: http://tr.im/i6h6
In addition to these, as far as I can tell, every non-ASCII character
on the English side is wrong, although I haven't figured out the
sequence of steps that led to this.
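
For anyone who wants to check their own copy, a rough way to find the
suspect lines is to flag every English-side line containing a non-ASCII
character. This is just a sketch: the function names are made up, and it
assumes the English side is available as an iterable of already-decoded
strings.

```python
def non_ascii_lines(lines):
    """Yield (line_number, line) for each line with a non-ASCII character."""
    for number, line in enumerate(lines, start=1):
        if not line.isascii():  # str.isascii() needs Python 3.7+
            yield number, line


def affected_share(lines):
    """Return the fraction of lines that contain a non-ASCII character."""
    lines = list(lines)
    if not lines:
        return 0.0
    hits = sum(1 for _ in non_ascii_lines(lines))
    return hits / len(lines)


# Toy input standing in for the English side of the corpus.
toy = ["plain text", "na\u00efve", "more plain text"]
flagged = list(non_ascii_lines(toy))
share = affected_share(toy)
```

This only locates candidate lines; it says nothing about what the correct
characters should have been.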
These problems affect only about 1% of the data, but it'd be nice to
get it cleaned up. All of this is preface to asking whether anyone else
has fixed these things, and whether they'd be willing to share a new
version or a script that corrects the old data.
Thanks!
- John D. Burger
MITRE
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support