Hi all,

This may be of interest to someone. See Whistler's comments below on GB18030 mapping table problems.

Thomas Chan
[EMAIL PROTECTED]

---------- Forwarded message ----------
Date: Sun, 12 Nov 2000 16:07:49 -0800
From: Katsuhiko Momoi <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: GB18030 support in Mozilla
Resent-Date: Sun, 12 Nov 2000 16:09:08 -0800 (PST)
Resent-From: [EMAIL PROTECTED]

Yueheng,

We need to resolve some issues concerning GB18030 first. Questions have been raised by knowledgeable people about the details of this standard. Please consult the following 2 messages for more information. Frank Tang is on vacation now and we want him to participate in this discussion also. The link to the GB18030 info file in English (in PDF format) appears in the first message:

- Kat

=====================
Message 1:

-------- Original Message --------
Subject: GB18030 summary and issues
Date: Fri, 13 Oct 2000 09:57:00 -0800 (GMT-0800)
From: Markus Scherer <[EMAIL PROTECTED]>
To: "Unicode List" <[email protected]>

Dear Uni-encoders and -decoders,

Dirk Meyer from Adobe has put together an extensive summary of the Chinese GB 18030 encoding standard that was published on 2000-mar-17. Ken Lunde and I assisted Dirk with reviews and comments.

The summary is on the web site of Ken's famous CJKV book "with the fish":
ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

To summarize the summary, we now have an English text describing the new encoding in its details. There are a few apparent errors, typos, and inconsistencies in the Chinese standard text that need to be resolved. For implementers, there is enough information in the summary to describe the encoding structure and to prepare an implementation.

What is still missing - aside from the resolution of the issues mentioned here - is a precise mapping table for how to map between at least the one-byte and two-byte portions of GB 18030 to and from Unicode. In theory, it should be almost the same as GBK, but to be sure, we need precise, complete, and machine-readable mappings.
Given the one-byte and two-byte portions and the description in the standard and in the summary, the four-byte portion can be derived with a little bit of Perl or similar.

Anyone who needs to implement or know about GB 18030 should probably read this text. Anyone who can contribute precise mapping tables and/or can help resolve the open issues, please do so.

Best regards,
markus

=======================================
Message 2:

-------- Original Message --------
Subject: [li18nux:753] Fwd: RE: GB18030 summary and issues
Resent-Date: Thu, 19 Oct 2000 20:01:09 -0700 (PDT)
Resent-From: [EMAIL PROTECTED]
Date: Fri, 20 Oct 2000 11:56:55 +0900
From: "Martin J. Duerst" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

With the permission of the author, I'm sending you a comment on the GB18030 mapping table that appeared on this list some time ago.

Regards, Martin.

>X-UML-Sequence: 5977 (2000-10-17 00:36:44 GMT)
>From: Kenneth Whistler <[EMAIL PROTECTED]>
>Date: Mon, 16 Oct 2000 16:36:41 -0800 (GMT-0800)
>Subject: RE: GB18030 summary and issues
>
>I've taken a look at the GB18030.TXT you provided, and unfortunately,
>as it stands, the mapping table has *major* problems.
>
>Most of these problems really derive from the serious flaws in GB 18030-2000
>itself, so I'm not sure exactly what implementers are going to
>do about them, but so you can focus in on the issues, here is some
>of what I turned up.
>
>A. GB 18030's encoding and mapping of Annex B (p. 91) -- ideographic
>variation indicator, and the ideographic description characters, is
>flat-out wrong. The same thing applies to Annex C (p. 92), the CJK
>radicals supplement.
>Essentially, the relevant Chinese committee rushed this
>thing to publication without having determined where these characters
>were encoded in 10646, *despite* the fact that GB 18030 then makes
>normative mappings to the entirety of 10646-1:2000 (actually to
>GB 13000.1, but that is just a pointer to 10646-1:2000, unless they
>printed *that* wrong, too, in which case we are even more screwed up).
>The result is just out-and-out errors. To wit:
>
>   1. U+303E (GB18030 A989) is mapped to U+E7E7 (user-defined)
>
>The net result in GB18030.TXT is that GB A989 is mapped into private use,
>even though in the chart it is shown as U+303E. But U+303E, as a *code
>position*, is mapped to the 4-byte form 0x8139A634.
>
>   2. U+2FF0..U+2FFB (GB18030 A98A..A995) are mapped correctly in the
>      main tables of GB18030 (p. 82), but are mapped again incorrectly
>      in Annex C (U+E7E8..U+E7F3, user-defined).
>
>The net result in GB18030.TXT is that all the ideographic description
>characters are double-mapped.
>
>   3. U+2E80..2EF3, the CJK radicals supplement, are mapped haphazardly,
>      from an earlier draft, apparently: GB18030 FE50..FEA0 is mapped
>      to U+E815..U+E864, instead of the actual Unicode code points. In
>      addition, some of the characters in Annex C, are actually in
>      Vertical Extension A, resulting in gapping in the tables.
>
>The net result in GB18030.TXT is that all the CJK radicals and
>other characters in Annex C are double-mapped.
>
>B. GB 18030 makes the mistake of trying to encode all code positions
>in GB 13000.1 (= 10646-1:2000), regardless of their status. That
>means, among other things, that all private use code positions
>in Unicode on the BMP are given GB 18030 code assignments --
>*regardless* of their status in GB 18030 as assigned characters or
>not. This makes a complete hash, compounded by the fact that all the
>characters mentioned in A above are erroneously assigned to private
>use codes in Unicode. That renders the mapping of the rest of user
>space trash.
>
>C. As an extension of B., GB 18030 also maps surrogate code positions
>to GB 18030 4-byte codes, *as if* they were characters. Thus U+D800
>(a surrogate code point, not an unassigned character) is mapped to
>0x8336C739, indifferently from U+D7FF (an unassigned character
>position) being mapped to 0x8336C738.
>
>Incidentally, there appears to be an off-by-one error in this area in
>GB18030.TXT as well: GB18030.TXT shows 0x8336c830 = U+D800, whereas
>the printed text of the GB18030-2000 standard itself shows
>0x8336C739 = U+D800.
>
>I'm not sure what the solution here is, other than to encourage China
>to fix its $@&#*^! standard. But if the tables you posted have in
>fact already been rolled out in Linux implementations in China, then
>we are all going to have to live with horrendous interoperability
>problems resulting from bad mapping tables for bad standards.
>
>Here it is the year 2000, and having lived with the yen/backslash
>problem and the fullwidth tilde problem, and the not sign problem for
>decades in East Asian implementations, I guess everybody has decided
>that we should start off the new century with a brand-spanking new
>set of ways to shoot ourselves in both feet at the same time for
>Chinese implementations.
>
>--Ken

========================= End of 2 messages quoted ====================

Yueheng Xu wrote:
>
> The mandatory Chinese national standard GB18030 is coming by the
> end of 2000. Do we (the Mozilla community) have any plans to add
> support for it in our browser?
>
> Currently the largest Chinese character set we support in Mozilla is
> GBK, which has a little over 20,000 characters.
>
> The new GB18030 is a superset of it and has about 27,000 characters. It
> contains one-byte, two-byte and four-byte characters.
>
> GB18030 is a mandatory standard that takes effect by the end of 2000.
> After that date, no information system that does not support GB18030
> may be marketed in China.
>
> With GB18030, all the simplified Chinese and traditional Chinese
> characters available in GB2312, GBK, BIG5 etc. are included as a subset,
> and GB18030 is backward compatible with GB2312 (and possibly also GBK?).
> I don't have a character set table with me.
>
> If anyone can send me a GB18030 table, I can find time to add support
> for it in Mozilla.
>
> Yueheng Xu,
> CEO
> Network 2000, Inc.
> http://www.n2k.net
> email: [EMAIL PROTECTED]

--
Katsuhiko Momoi
Netscape International Client Products Group
[EMAIL PROTECTED]

What is expressed here is my personal opinion and does not reflect official Netscape views.
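
For anyone checking the four-byte code values quoted in Whistler's message, here is a minimal sketch of how the four-byte portion can be treated as one linear sequence. It assumes the byte ranges usually described for GB 18030 four-byte codes (first and third bytes 0x81..0xFE, second and fourth bytes 0x30..0x39); the function names are hypothetical and belong to no library. It does not map anything to Unicode. It only shows that 0x8336C738, 0x8336C739 and 0x8336C830 are three consecutive positions, which is consistent with the GB18030.TXT discrepancy being off by exactly one.

```python
# Sketch only: linear position of a GB 18030 four-byte code, assuming
# bytes 1 and 3 run over 0x81..0xFE (126 values) and bytes 2 and 4 run
# over 0x30..0x39 (10 values). Function names are hypothetical.

def four_byte_to_linear(code: int) -> int:
    """Return the zero-based position of a four-byte code such as 0x8336C739."""
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError(f"not a four-byte code under the assumed ranges: {code:#010x}")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_byte(index: int) -> bytes:
    """Inverse of four_byte_to_linear: position back to the four bytes."""
    if not 0 <= index < 126 * 10 * 126 * 10:
        raise ValueError(f"position out of range: {index}")
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


if __name__ == "__main__":
    # The three code values mentioned in point C of the quoted message.
    for code in (0x8336C738, 0x8336C739, 0x8336C830):
        print(f"{code:#010X} -> position {four_byte_to_linear(code)}")
    # prints positions 33468, 33469 and 33470
```

Because the fourth byte only runs from 0x30 to 0x39, the code after 0x8336C739 is 0x8336C830 rather than 0x8336C73A, so a table that is shifted by one position would show exactly the difference described in the message.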

