On Thu, Feb 28, 2002 at 11:02:43PM +0900, SADAHIRO Tomoyuki wrote:
> Gb2312.enc is used for HZ (type H) and ISO-2022-CN (type E)
> (they use 7-bit encoding)
> as one of their sub-charsets, isn't it?

No. GB2312 isn't really one encoding specification; instead it's a
charset that can be encoded in one of three ways:

- 'euc-cn', the preferred encoding; it's available both as 'euc-cn'
  and 'gb2312' in gnu libiconv.
- 'hz', a 7-bit escaped encoding. The "raw" doublebyte representation
  is escaped with ~{...~} sequences.
- 'iso-2022-cn', similar to 'hz', but with ISO 2022 escape sequences
  instead of ~{...~}.

The current gb2312.enc seems to map to the "raw" doublebyte
representation instead of any of the above; I tested it with gnu
libiconv 1.7, and it can't be parsed as any one of these encodings.
Similarly, the text generated by '>:encoding(gb2312)' seems to be a
doublebyte stream that can't be read as euc-cn, hz or iso-2022-cn.

(To further complicate the matter, what Windows means by 'GB2312' is
really GBK (the 'extended' GB2312, including Traditional Chinese
characters), which is not yet supported in the .enc files.)

Executive summary:

* Simplified Chinese in Encode.pm may be considered 'working' for
  what most people use (gb2312's euc-cn).
* Traditional Chinese in Encode.pm may also be considered 'working'
  for the basic big-5 range; its punctuation mappings were fixed and
  patched according to the big5p spec.
* The gb2312.enc is very broken. Afaik nobody uses the raw/unencoded
  GB2312, since it's not interoperable with 7-bit ascii. We should
  either make it synonymous with euc-cn, or remove it.

For Chinese usage, the following 7 encodings are not here yet, but we
can also add them if desired:

- 'hz' and 'iso-2022-cn', two different encodings of gb2312,
  described above.
- 'gb18030', used in glibc2.2, is a superset of gbk, which is a
  superset of gb2312; we should use that instead of 'gbk' if we want
  gbk support.
- 'iso-ir-165', a different extension to gb2312, adding gb6345 and
  gb8565 support. Not in wide use.
- 'iso-2022-cn-ext', the iso-2022'ized version of all characters in
  gb(2312|12345|7589|7590), iso-ir-165, and cns-11643-*. It's a sort
  of 'unified chinese code'.
- 'big5p', the Big5+ Traditional Chinese encoding, is similarly a
  superset of 'big5'; it provides a more complete unicode mapping
  and covers most of Taiwan's uses.
- 'big5-hkscs', a different extension to big5, adding characters used
  in Hong Kong; incompatible with big5p.

Gnu libiconv has most of the above mappings other than big5p; I'm
willing to supply their maps if it's ok with the list.

/Autrijus/
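
To illustrate how the "raw" doublebyte form relates to 'euc-cn' and 'hz' as described above, here is a minimal sketch in Python. It is an assumption-laden illustration, not Encode.pm or gb2312.enc: it relies on Python's own codec names, where "gb2312" means EUC-CN, and the sample string is chosen arbitrarily.

```python
# Illustration only: uses Python's built-in codecs, not Perl's Encode.
text = "中文"

# Python's "gb2312" codec emits EUC-CN: both bytes of each pair
# have the high bit set, so they never collide with 7-bit ASCII.
euc_cn = text.encode("gb2312")

# Clearing the high bit yields the "raw" doublebyte form. These bytes
# fall in the printable ASCII range, which is why the raw form cannot
# be mixed safely with ordinary ASCII text.
raw = bytes(b & 0x7F for b in euc_cn)

# HZ carries the same raw bytes, but brackets them with ~{ and ~}
# so a reader can tell them apart from ASCII.
hz = text.encode("hz")

print(euc_cn.hex(), raw, hz)
```

The raw bytes for this sample are themselves plain ASCII letters, which shows why an unescaped doublebyte stream is ambiguous next to ASCII text.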