Ham, Michael wrote:

> Those escape numbers are Unicode characters.  The Chinese character set
> does not exist in ASCII, so you have to use UTF-8.

Sorry if I wasn't clear:  I'm talking about the Chinese side of
LDC2004E12, which is not in ASCII or Unicode; it's in GB18030.
Apparently, there were some characters in the source data which could  
not be encoded in GB18030.  After my first message, I found the  
following in an LDC README:

> Known problems:
>
> There're encodings that defined by WordPrect but not recognized by
> other word processing softwares and converters. We've written scripts
> to correct most of them, but not all. For those which had not been
> corrected, they're in the format of '\x{####}', where '####' is a
> four-digit hexadecimal number. We do look forward to correcting this
> problem in the next release.

(I don't think "the next release" ever happened.)

You may be right, though, that these are Unicode references - many of  
them do indeed seem to map to Chinese characters in Unicode.   
However, some of them map to characters which could, in fact, be  
represented in GB18030, and some map to odd places like currently  
unassigned points in the Arabic blocks.
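
One way to check this is sketched below. It is only a rough illustration and assumes that the four hex digits in each '\x{####}' escape are a Unicode code point, which is exactly the part that is uncertain here. The function names (classify_escapes, replace_assigned) are hypothetical and are not part of Moses or any LDC tooling.

```python
import re
import unicodedata

# Escape format described in the README: \x{####}, where #### is four hex digits.
ESCAPE_RE = re.compile(r"\\x\{([0-9A-Fa-f]{4})\}")

# General categories treated as "not a usable character":
# Cn = unassigned, Cs = surrogate, Co = private use.
UNUSABLE = {"Cn", "Cs", "Co"}


def classify_escapes(text):
    """List (escape, code point, Unicode name or None) for every escape in text.

    ASSUMPTION: the hex digits are a Unicode code point.
    """
    results = []
    for match in ESCAPE_RE.finditer(text):
        code_point = int(match.group(1), 16)
        # unicodedata.name returns the default (None) when no name is defined,
        # for example for unassigned code points.
        name = unicodedata.name(chr(code_point), None)
        results.append((match.group(0), code_point, name))
    return results


def replace_assigned(text):
    """Replace escapes that map to assigned characters; leave the rest alone."""
    def substitute(match):
        ch = chr(int(match.group(1), 16))
        if unicodedata.category(ch) in UNUSABLE:
            return match.group(0)  # keep the escape so it can be inspected later
        return ch
    return ESCAPE_RE.sub(substitute, text)
```

The corpus file itself could be read with open(path, encoding="gb18030") before being passed to these functions. Escapes that come back with no name are the ones landing in unassigned places, and are probably worth examining by hand rather than converting blindly.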

As for fonts, I don't think I need any particular font to run Moses.   
I can't read Chinese anyway, that's why I'm working on MT!  :)

Thanks for your reply.

- John Burger
   MITRE
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
