[Moses-support] UN data (LDC2004E12)

John Burger Wed, 01 Apr 2009 11:10:36 -0700

I've been looking at the parallel Chinese-English data the LDC scraped
from the UN some time ago, and it appears that there are systematic
encoding problems, at least on the English side.  Somewhere along the
line, the data was processed in the wrong character encoding, probably
by reading Latin-1 as if it were Big5 or a GB variant.
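
To make the suspected failure concrete, here is a minimal Python sketch
of bytes being decoded with the wrong codec, and of an attempt to
reverse that step.  The helper name try_undo_mojibake and the choice of
GBK as the wrong codec are assumptions for illustration only; the actual
sequence of steps behind the LDC data is not known.

```python
def try_undo_mojibake(text, wrong="gbk", right="latin-1"):
    """Hypothetical helper: reverse one wrong decoding step.

    Re-encodes `text` with the codec assumed to have been used by
    mistake, then decodes the resulting bytes with the codec assumed
    to be correct. Returns None if the text cannot make the round trip.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


if __name__ == "__main__":
    # Simulate the damage: Latin-1 bytes read as if they were GBK.
    intended = "\u00c4\u00e3\u00ba\u00c3"      # four Latin-1 characters
    damaged = intended.encode("latin-1").decode("gbk")
    print(repr(damaged))                        # Chinese characters instead
    print(try_undo_mojibake(damaged) == intended)
```

A single reversal like this only helps if exactly one wrong decoding
step occurred and no bytes were lost along the way, which may well not
hold for this corpus.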


This is partly observable in the funny \x{} escape sequences I've  
asked about before, which some geniuses on the Unicode list recently  
figured out for me - see here if you're interested: http://tr.im/i6h6

In addition to these, as far as I can tell, every non-ASCII character  
on the English side is wrong, although I haven't figured out the  
sequence of steps that led to this.
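
Since every non-ASCII character on the English side looks suspect, a
simple way to locate the affected lines is to flag any line that is not
pure ASCII.  The sketch below does that and reports the share of lines
flagged; the function names are hypothetical, and it assumes the
English side has already been read as a sequence of decoded text lines.

```python
def non_ascii_lines(lines):
    """Yield (line_number, line) for each line with a non-ASCII character."""
    for number, line in enumerate(lines, start=1):
        if not line.isascii():  # str.isascii() requires Python 3.7+
            yield number, line


def affected_fraction(lines):
    """Return the fraction of lines containing a non-ASCII character."""
    lines = list(lines)
    if not lines:
        return 0.0
    flagged = sum(1 for _ in non_ascii_lines(lines))
    return flagged / len(lines)


if __name__ == "__main__":
    sample = ["The General Assembly,", "Recalling its resolution caf\u00e9"]
    for number, line in non_ascii_lines(sample):
        print(number, repr(line))
    print(affected_fraction(sample))
```

This only identifies candidate lines; it does not say what the correct
characters should have been.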

These problems only affect about 1% of the data, but it'd be nice to  
get it cleaned up.  All of this is preface to asking if anyone else  
has fixed these things, and whether they'd be willing to share a new  
version or a script to effect corrections on the old data.

Thanks!

- John D. Burger
   MITRE

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
