Folks, I am almost ready to release Encode-1.00. While I am waiting for Anton to submit his patch for Encode::Supported, I have written the pod that follows, which explains how CJK encodings are made. I would appreciate it if you could give me some feedback. There are many good pages on this subject in Japanese but not so many in English....
Dan the Encode Maintainer

=head1 NAME

Encode::CJKguide -- a guide to CJK encodings

=head1 SYNOPSIS

This POD document describes how various CJK encodings are structured
and their underlying history.

=head1 The Evolution of CJK Encodings

This section describes how CJK encodings evolved before Unicode.

=head2 The history before CJK

First there was ASCII.  ASCII is a seven-bit encoding that looks like
this:

=over 2

=item The ASCII Table

          0123456789abcdef0123456789abcdef
    0x00: <!----- Control Characters ---->
    0x20:  !"#$%&'()*+,-./0123456789:;<=>?
    0x40: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    0x60: `abcdefghijklmnopqrstuvwxyz{|}~

=back

The last code point (0x7F) is DEL (*1).  ASCII was already prevalent
before any CJK encoding was introduced.  ASCII is also needed to
implement the very programs that handle CJK, so virtually all CJK
encodings are designed to coexist with ASCII (*2).

=over 2

=item *1

Why DEL is assigned not within 0x00-0x1F but at 0x7F is a funny
story.  Back when ASCII was designed, punchcards were still widely
used.  To conserve forest :), instead of throwing out mispunched
cards, they decided to punch out all the holes for a mistyped byte.
So 0x7F, with all 7 holes punched, became DEL.

=item *2

I have heard of EBCDIC-Kanji, but does anyone know more about it than
its name?

=back

=head2 Escaped vs. Extended -- two ways to implement multibyte

The history of multi-byte character encodings began when the Japanese
Industrial Standard (JIS) body published JIS C 6226 (which later
became JIS X 0208:1978).  It contained 6353 characters, which
naturally cannot fit in single bytes.

Since one byte is not enough, each character must be encoded with a
pair of bytes.  But how are you going to do that and still keep ASCII
available?

One way is to somehow tell the computer where a run of double-byte
characters begins and ends.  In practice, we use an I<escape
sequence> to do so.  When the computer catches a sequence of bytes
beginning with the escape character (\x1b), it switches state,
determining whether the following bytes are ASCII or halves of
double-byte characters.

But there are many computers (still today) that have no idea of
escape sequences.  To coexist with those, you should avoid any case
where a given byte, including either half of a double-byte character,
is mistaken for a control character.  In other words, avoid 0x00-0x20
(controls and space) and 0x7F (DEL).  The resulting double-byte
characters can thus map up to 94 x 94, or 8836, code points.  Now you
can fit everything in JIS C 6226.  Thus 7bit-JIS was born.

=over 2

=item The JIS Character Table

        21 22 .... 7E     First Byte
      +-----------------------------
    21| You can now map up to
    22| 8836 characters here
     .|
     .|
    7E|
    Second Byte

=back

Escape-based double-byte implementations are great as transfer
encodings.  But once you need to develop, say, a text editor, they
become a pain in the neck, because you cannot tell whether the byte
you are looking at is a whole ASCII character or half of a
double-byte character simply by looking at the byte itself.

Fortunately, ASCII uses only 7 bits out of 8, and most computers back
then already used the octet as the byte.  So a byte has one extra
bit.  Why not use that bit as the double-byte indicator?  Instead of
using escape sequences, you just add 0x8080 to a JIS code point.
That is what Extended Unix Code (EUC) does.  In a way, EUC
I<extends> ASCII rather than I<escapes> it.

=over 2

=item The EUC Character Table

        A1 A2 .... FE     First Byte
      +-----------------------------
    A1| You can map up to
    A2| 8836 characters here
     .|
     .|
    FE|
    Second Byte

=back
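To show how mechanical this extension is, here is a minimal sketch of
mine (not code from Encode; it ignores ASCII pass-through and the
single-shift zones of real EUC-JP) that turns raw JIS X 0208 double
bytes into EUC form simply by setting the MSB of every byte:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Turn raw JIS X 0208 double bytes (each byte 0x21-0x7E, no escape
  # sequences) into EUC by setting the MSB of every byte -- that is,
  # by adding 0x8080 to every two-byte character.
  sub jis0208_to_euc {
      my $jis = shift;
      $jis =~ tr/\x21-\x7e/\xa1-\xfe/;   # 0x21-0x7E -> 0xA1-0xFE
      return $jis;
  }

  # "\x30\x21" is the first Kanji code point in JIS X 0208;
  # in EUC-JP it becomes "\xB0\xA1".
  printf "%vX\n", jis0208_to_euc("\x30\x21");   # prints B0.A1

Going the other way is just as mechanical: strip the MSB and you are
back in 7bit-JIS territory.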
This concept of a 94 x 94 planar table quickly became the standard in
the rest of the CJK world as well.  The People's Republic of China
(just "China" hereafter) set the GB 2312 national standard in 1980,
and the Republic of Korea (South Korea; simply "Korea" hereafter) set
KS C 5601 in 1987.  They are both modeled upon JIS C 6226, which
could be one of the reasons why these character sets contain Kana,
the phonetic characters used only in Japan.  Though there are
escape-based encodings for these two (ISO-2022-CN and ISO-2022-KR,
respectively), they are hardly used, in favor of EUC.  When you say
gb2312 or ksc5601, an EUC-based encoding is assumed.

=head2 Scattered JIS? -- An acrobat called Shift JIS

So we have escape-based encodings (ISO-2022-XX) and extension-based
encodings (EUC-XX).  They coexist with old systems very well,
especially EUC.  In most cases, programs developed by people who know
nothing but ASCII run unmodified (perl has long been one of them).
And they lived happily ever after...?

NO!  Mitsubishi and ASCII-Microsoft (now Microsoft Japan) were in
trouble when they tried to introduce support for Han ideographs (I
simply call them "Kanji" hereafter) in MBASICPlus, which ran on the
Multi 16, Mitsubishi's first 16-bit personal computer.

Before JIS X 0208 was first introduced in 1978, JIS had already
introduced in 1976 what is now called JIS X 0201.  The Japanese tried
to teach computers how to handle their language before double-byte
characters became available.  Unlike Chinese, which is purely
ideographic (*1), Japanese also has two variations of Kana, a
phonetic representation of the language.  So they decided to squeeze
Katakana into the upper half of the byte.

=over 2

=item The JIS X 0201 Table

          0123456789abcdef0123456789abcdef
    0x00: <!----- Control Characters ---->
    0x20: <!--
    0x40:       ASCII isprint() zone
    0x60:                              -->
    0x80:
    0xa0: <!--  Katakana zone
    0xc0:                              -->
    0xe0:

=back

Mitsubishi, among other companies, had already used this Katakana
extension of ASCII.  So they could not apply EUC, for backward
compatibility's sake.  Their answer was nothing but acrobatic.

=over 2

=item *

Use 0x81-0x9F and 0xE0-0xEF for the first byte (47 code points);
these are the gaps left by JIS X 0201.

=item *

Use 0x40-0x7E and 0x80-0xFC for the second byte (188 code points);
the ASCII control codes are still avoided, and CP/M (the OS of the
Multi 16) uses 0xFD-0xFF.

=back

Coincidentally, 47 x 188 is also 8836, exactly the same as 94 x 94.
Now all you have to do is lay each character of JIS X 0208 therein.

=over 2

=item The MS Kanji Table

               First Byte           Second Byte
          0123456789abcdef     0123456789abcdef
    0x00: cccccccccccccccc     cccccccccccccccc
    0x10: cccccccccccccccc     cccccccccccccccc
    0x20: aaaaaaaaaaaaaaaa
    0x30: aaaaaaaaaaaaaaaa
    0x40: aaaaaaaaaaaaaaaa     JJJJJJJJJJJJJJJJ
    0x50: aaaaaaaaaaaaaaaa     JJJJJJJJJJJJJJJJ
    0x60: aaaaaaaaaaaaaaaa     JJJJJJJJJJJJJJJJ
    0x70: aaaaaaaaaaaaaaac     JJJJJJJJJJJJJJJc
    0x80:  JJJJJJJJJJJJJJJ     JJJJJJJJJJJJJJJJ
    0x90: JJJJJJJJJJJJJJJJ     JJJJJJJJJJJJJJJJ
    0xa0:  kkkkkkkkkkkkkkk     JJJJJJJJJJJJJJJJ
    0xb0: kkkkkkkkkkkkkkkk     JJJJJJJJJJJJJJJJ
    0xc0: kkkkkkkkkkkkkkkk     JJJJJJJJJJJJJJJJ
    0xd0: kkkkkkkkkkkkkkkk     JJJJJJJJJJJJJJJJ
    0xe0: JJJJJJJJJJJJJJJJ     JJJJJJJJJJJJJJJJ
    0xf0:                      JJJJJJJJJJJJJXXX

    c = ASCII control      J = MS Kanji
    a = ASCII printable    X = CP/M control
    k = JIS X 0201 kana

=back

Simply put, MS Kanji made double-byte characters possible by giving
up ASCII/JIS X 0201 compliance for the second byte.  Ugly as it may
be, backward compatibility with their previous code was now
guaranteed.
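The acrobatics boil down to a few lines of arithmetic.  Here is a
sketch of my own devising (jis2sjis is my name for it, not any
vendor's; it does no validation of the input range) that drops one
JIS X 0208 code point into the 47 x 188 grid:

  # Map one JIS X 0208 code point (row and cell, each 0x21-0x7E) to
  # its Shift JIS byte pair.
  sub jis2sjis {
      my ($row, $cell) = @_;
      my $t  = ($row - 0x21) >> 1;        # two JIS rows per lead byte
      my $s1 = $t < 31 ? $t + 0x81        # lead byte: 0x81-0x9F,
                       : $t - 31 + 0xE0;  #        then 0xE0-0xEF
      my $u  = (($row - 0x21) & 1) * 94   # odd rows go to the upper
             + ($cell - 0x21);            # half of the 188-cell range
      my $s2 = $u < 63 ? $u + 0x40        # trail byte: 0x40-0x7E,
                       : $u - 63 + 0x80;  # skip 0x7F, then 0x80-0xFC
      return ($s1, $s2);
  }

  # JIS 0x2422 is Hiragana "a"; its Shift JIS form is 0x82 0xA0.
  printf "%02X %02X\n", jis2sjis(0x24, 0x22);   # prints 82 A0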
NEC also adopted this new Kanji code when they introduced MS-DOS
ver. 2.0 to the PC-9801, the most popular line of personal computers
in Japan until the AT compatible (the same "PC" as anywhere else)
finally took over its reign with the help of Windows 95.  So did
Apple when they introduced KanjiTalk.  With the support of the two
most popular operating systems for personal computers, this acrobatic
encoding, later called Shift JIS, became the most popular encoding in
Japan.

But there was a price to be paid.  Applications are harder to port
than with EUC, because the second byte may look like ASCII whenever
it falls in 0x40-0x7E.  Shift JIS also lacks the extensibility that
EUC has (EUC now supports JIS X 0212-1990, the extended Kanji set,
which is theoretically impossible in Shift JIS).

The name "Shift" JIS comes from the fact that the JIS character sets
are "shifted" when mapped.  IMHO, this is more like "realigning", but
the word "Realigned" is hardly appealing to most Japanese speakers.
When we talk about "shift"ing, EUC is far more like shifting, with
the MSB acting as the shift....

As you see, Shift JIS is more vendor-driven than the other JIS
encodings.  And this was also the case for Big5, the most popular
encoding for Traditional Chinese.  The name Big5 comes from the fact
that the five major PC vendors in Taiwan worked on the encoding.
Well, for Big5 there was a better reason to do so, because 8836
characters were hardly enough.  And fortunately for them, they had no
Katakana to silly-walk around.  Here is how Big5 maps (a byte-level
sketch follows below):

=over 2

=item *

First byte: 0xA1-0xC6, 0xC9-0xF9

=item *

Second byte: 0x40-0x7E, 0xA1-0xFE

=item *

Source character set: proprietary

=back

Back then there was no equivalent of JIS X 0208 that they could refer
to.  The Taiwanese were aware of the weaknesses of this Shift-JIS-ish
encoding, so they decided to build yet another one, this time by the
government.  The result was CNS 11643.  CNS 11643 consists of seven
94 x 94 planes.  The first two planes derive from Big5, but tidied
up, with duplicate characters removed.  CNS 11643 is EUC-safe and is
used in EUC-TW.
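For illustration, here is a sketch of mine (count_big5_chars is a
hypothetical helper, not part of any library) that walks a byte
string using the Big5 ranges listed above, telling lead bytes, trail
bytes and plain ASCII apart:

  # Walk a Big5 byte string and count characters, using the byte
  # ranges listed above.  A sketch with no error recovery: a stray
  # byte simply dies.
  sub count_big5_chars {
      my @bytes = unpack "C*", shift;
      my $count = 0;
      while (@bytes) {
          my $b = shift @bytes;
          if (($b >= 0xA1 && $b <= 0xC6) || ($b >= 0xC9 && $b <= 0xF9)) {
              my $trail = shift @bytes;   # lead byte: expect a trail
              die "truncated or malformed Big5"
                  unless defined $trail
                      && (($trail >= 0x40 && $trail <= 0x7E)
                       || ($trail >= 0xA1 && $trail <= 0xFE));
              $count++;                   # one double-byte character
          }
          elsif ($b < 0x80) {
              $count++;                   # plain ASCII
          }
          else {
              die sprintf "stray byte 0x%02X", $b;
          }
      }
      return $count;
  }

  # 0xA4 0x40 is a valid Big5 lead/trail pair, so this prints 5.
  print count_big5_chars("Perl\xA4\x40"), "\n";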
=head1 CJK, Unicode and ISO-2022

This section describes Unicode and its impact on CJK.  It also
describes ISO-2022, the biggest contender to Unicode today.

=head2 Write once, read everywhere? -- Unicode

Back in the time before Unicode, virtually every encoding was merely
bilingual, or rather biscript: ASCII plus something local.  With so
many encodings emerging, it was only natural to try to make an
encoding that covers as many written languages as possible, if not
all.  And Unicode was one of the answers.

I carefully said "one of" because ISO-2022 already existed.  ISO-2022
is an escape-based encoding (*1) (7bit-JIS is one of them), and by
assigning an escape sequence to each existing character set,
ISO-2022 can, in theory, swallow as many character sets as needed to
form a universal encoding.  ISO-2022-JP-2 adopts this idea: in
addition to JIS X 0208 and JIS X 0212, it contains GB 2312 and KS C
5601.

=over 2

=item *1

Strictly speaking, this is not true.  As a matter of fact, EUC is
ISO-2022-compliant.  I will discuss this later.

=back

However, what many people, especially vendors and programmers, were
waiting for was a fixed-width encoding, so that you can manipulate
each character statelessly.  That is Unicode -- or its first goal,
which is now somewhat diverted.  Back in 1987, when the word Unicode
was coined, 16 bits were thought to be the practical maximum for a
single character; memory was too expensive and no 32-bit OS was
available on the desktop.  In order to squeeze in all the (still
increasing) CJK character sets, they found that simply realigning the
existing character sets would not work.

They came up with arguably the most controversial idea: Han
Unification.  Many of the ideographs used in China, Japan, and Korea
not only I<look> the same but also I<mean> the same.  Then why not
give a single code point to those in common and save code points?

There are two cases to consider: characters that look different but
mean the same (Case 1), and vice versa (Case 2).  Han Unification
decided to unify Case 2: let's unify the ones with the same shape!
As a result, something funny happened.  For example, U+673A means "a
machine" in Simplified Chinese but "a desk" in Japanese ("a machine"
in Japanese is U+6A5F).  So you cannot tell what it means just by
looking at the code point.

But the controversy did not stop there.  Han Unification also decided
to unify Case 1 characters whose origin is the same.  Characters
that are shaped differently but mean the same and share the same
origin are called I<Itaiji>.  Unicode does not differentiate Itaiji;
should you need to differentiate, use different fonts.  The problem
is that Itaiji are very common in proper nouns, especially surnames,
in Japan.  "Watanabe", written with the two characters "Wata"
(U+6E21) and "Nabe", is a very popular family name in Japan, but
there are at least 31 different "Nabe" in existence.  Unicode lists
only U+8FBA, U+908A, and U+9089 in its code set.  (*2)

=over 2

=item *2

Actually, Unicode is less to blame for the Itaiji problem than the
Japanese domestic character sets, because JIS X 0208 also contains
only 3 of them.  But the point is that Unicode has shut the door on
Itaiji even though there is room for expansion and updates -- at
least for the time being.

=back

For better or for worse, Unicode is still practical, at least as
practical as most regional character sets, thanks to a third rule
(call it Case 3): if an existing character set says two characters
are different, give them different code points, so that you can
convert a string to Unicode and back and get the same string.  That
is why "Nabe" has 3 code points, not one; as noted above, JIS X 0208
has three of them.

Ironically, this move toward Han Unification reduced the number of
code points but bloated the size of Unicode encoders and decoders,
including Encode.  For instance, you can convert from Shift JIS to
EUC-JP programmatically, because they share the same character set.
This is impossible with Unicode: you have to have a giant mapping
table.  As a result, 50% of a statically linked perl consists of
Encode!
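To show what "programmatically" means, here is a sketch that undoes
the Shift JIS realignment with bare arithmetic -- the inverse of the
jis2sjis sketch earlier.  It covers only the JIS X 0208 double-byte
zone and, again, does no validation:

  # Convert one Shift JIS byte pair to EUC-JP with bare arithmetic,
  # no mapping table.
  sub sjis2euc {
      my ($s1, $s2) = @_;
      # Fold the two lead ranges (0x81-0x9F, 0xE0-0xEF) into 0..46
      my $t = $s1 <= 0x9F ? $s1 - 0x81 : $s1 - 0xE0 + 31;
      # Fold the two trail ranges (0x40-0x7E, 0x80-0xFC) into 0..187
      my $u = $s2 <= 0x7E ? $s2 - 0x40 : $s2 - 0x80 + 63;
      # Each lead byte carries two 94-cell JIS rows
      my $row  = $t * 2 + int($u / 94) + 0x21;
      my $cell = $u % 94 + 0x21;
      return ($row | 0x80, $cell | 0x80);  # EUC = JIS with the MSB set
  }

  # Shift JIS 0x82 0xA0 (Hiragana "a") is EUC-JP 0xA4 0xA2.
  printf "%02X %02X\n", sjis2euc(0x82, 0xA0);   # prints A4 A2

A Unicode-based converter, by contrast, has to carry the full
round-trip tables in both directions.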
=head2 The "Multicode"? -- ISO-2022

While Unicode makes multilingualization possible by providing a
single, unified character set, ISO-2022 tries to achieve the same
goal by supplying glue for multiple character sets.  Here is how it
does that.

=over 2

=item 0. In-Use Table

Divide a table with 256 elements into 4 sections:

    0x00-0x1F  C0
    0x20-0x7F  GL
    0x80-0x9F  C1
    0xA0-0xFF  GR

The whole table is called the I<in-use table>.  Note that C0 and GL
correspond to the ASCII controls and printables, respectively.

=item 1. G0-G3 Buffers

Prepare 4 tables, each with a size equal to GL.  We call them buffers
and they are named G0 to G3.

=item 2. Single Shift and Charset Invocation

When you receive a certain control character, swap GR with either G2
or G3.  This is called charset I<invocation>.  When a whole character
is complete, restore the previous state of GR.  Since GR may change
on a character-by-character basis, the control characters used here
are called "Single Shift" characters, or SS for short.  SS2 and SS3
invoke G2 and G3, respectively.

=item 3. Locking Shift and Charset Designation

When you receive an escape sequence, swap GL with the character set
the escape sequence specifies.  This is called charset
I<designation>.  You don't have to restore GL until the next escape
sequence; thus this action is called a "Locking Shift".

=item 4. Character Set Specifications

The character sets that can be invoked or designated must contain
96**n or 94**n characters (94 being 96 minus space and DEL).

=back

Whoa.  Complicated?  Maybe.  But let me show you two examples, EUC-JP
and ISO-2022-JP-1, so you get the picture.

=over 2

=item EUC-JP

                                          sizeof(charset)  ESC. seq.
    ----------------------------------------------------------------
    GL  G0: US-ASCII                        96 ** 1
    GR  G1: JIS X 0208-1983                 94 ** 2
        G2: JIS X 0201 (Katakana only)      94 ** 1
        G3: JIS X 0212-1990                 94 ** 2

    SS2 = 0x8E, SS3 = 0x8F.  No escape sequences are used.

=item ISO-2022-JP-1 [RFC2237]

                                          sizeof(charset)  ESC. seq.
    ----------------------------------------------------------------
    GL  G0: US-ASCII                        96 ** 1        \e ( B
            JIS X 0208-1978                 94 ** 2        \e $ @
            JIS X 0208-1983                 94 ** 2        \e $ B
            JIS X 0201-Roman                96 ** 1        \e ( J
            JIS X 0212-1990                 94 ** 2        \e $ ( D

    GR and G1-G3 are unused.

=back

As you see, you can call EUC "single-shift-based ISO-2022" or even
"ISO-2022-8bit".  You may not know this, but ISO-8859-X, also known
as Latin X, is also ISO-2022-compliant.  It can be defined as
follows:

                                          sizeof(charset)  ESC. seq.
    ----------------------------------------------------------------
    GL  G0: US-ASCII                        96 ** 1
    GR  G1: varies                          96 ** 1

    No G2 or G3, no escape sequences, no single shifts.
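To make the single-shift machinery concrete, here is a sketch of mine
(not Encode's implementation; it assumes well-formed input, where a
real decoder would validate every trail byte) that splits an EUC-JP
byte string into characters according to the G0-G3 assignments above:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Split an EUC-JP byte string into characters, honoring the G0-G3
  # assignments in the EUC-JP table above.
  sub split_euc_jp {
      my @bytes = unpack "C*", shift;
      my @chars;
      while (@bytes) {
          my $b = shift @bytes;
          my $more = $b <  0x80 ? 0   # G0: US-ASCII, single byte
                   : $b == 0x8E ? 1   # SS2 invokes G2: JIS X 0201 kana
                   : $b == 0x8F ? 2   # SS3 invokes G3: JIS X 0212
                   :              1;  # plain GR byte, G1: JIS X 0208
          push @chars, pack "C*", $b, splice(@bytes, 0, $more);
      }
      return @chars;
  }

  # An ASCII letter, an SS2-prefixed kana byte, and a G1 Kanji pair:
  my @c = split_euc_jp("A\x8E\xB1\xA4\xA2");
  print scalar @c, "\n";              # prints 3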
ISO-2022 has the following advantages over Unicode:

=over 2

=item *

ISO-2022 strictly differentiates I<charset> and I<encoding>, and it
specifies only the encoding.  Charsets are left to regional
government bodies; all that ECMA, which maintains ISO-2022, has to do
is register them.  This makes work-sharing much easier.  The Unicode
Consortium, on the other hand, has to work on both charsets and
encodings (even though some of the work is delegated to other
parties, such as the IRG), so a new character takes more time and
more argument to be introduced.

=item *

It has no practical size limit, even in EUC form (EUC-TW is already 4
bytes at maximum).  And if you are happy with escape sequences, you
can swallow as many charsets as you wish.

=item *

You have to *pay* the Consortium to become a member, and ultimately
to vote on what Unicode will be.  It is not Open Source :)

=back

At the same time, Unicode does have advantages over ISO-2022:

=over 2

=item *

You have one and only one authority, the Unicode Consortium.  You
don't have to worry about whom to ask whether and how a given
character is mapped.

=item *

It has a concise set of characters that covers the most popular
languages.  You may not be able to express what you have to say to
the fullest extent, but you can say most of it.

=item *

More support from vendors.  Unicode started its life to make vendors
happier (or lazier), not poets or linguists, by tidying up charsets
and encodings.  Well, it turned out not to be as easy as they first
hoped, with heaping demands for new code points and surrogate pairs.
But it is still bliss enough that you only have to know one charset
(though Unicode does have several different encodings).  That is,
except for those who hack converters like Encode :)

=item *

You *ONLY* have to pay the Consortium to become a member and vote on
what Unicode will be.  You don't have to be knowledgeable, you don't
have to be respected, you don't even have to be a native user of the
language you want to poke your nose into.  It is not Open Source :)

=back

=head1 Will Character Sets and Encodings Ever Be Unified?

This section discusses the future of charsets and encodings.  In
doing so, I decided to grok the philosophy of perl one more time.

=head2 Character sets and encodings should be designed to make easy writings easy, without making hard writings impossible

Does Unicode meet this criterion?  It first opted for the first part,
making easy writings easy by squeezing everything you need into 16
bits.  But Unicode today seems more focused on making hard writings
possible.  The problem is that this move toward making hard writings
possible is making Unicode trickier and trickier as time passes.
Surrogate pairs were introduced in 1996, but I have yet to see an
application that makes use of them.

ISO-2022, on the other hand, seems to care little about this.  EUC is
easy, yet it stops when you try hard writings (multiple CJK scripts
in a single file).

I have to conclude that there is no silver bullet here yet.  Unicode
has tried hard to be one, but it is quicksilver at best.  Quicksilver
it may be, that's the bullet we have.  That's the bullet Larry has
decided to use, so I forged the gunmetal to shoot it.  And the result
was Encode.

=head2 There is more than one way to encode it

In spite of all the advocacy for a unified character set and
encoding, legacy data is here to last.  So at the very least, you
still need "legacy" encodings, for the very same reason you can't
trash your tape drives while you have a terabyte RAID at your
fingertips.  Also remember that EBCDIC is still in use (and coexists
with Unicode!  See L<perlebcdic>).

And don't forget that there are many scripts that have no character
set at all, waiting to be coded one way or another.  Nor are all
scripts accepted or approved by the Unicode Consortium.  If you want
to spell in Klingon, you have to find your own encoding.

=head2 A module for getting your job done

Whether you are a die-hard Unicode advocate who wants to tidy up the
world by converting everything in it into Unicode, or a nihilistic
anti-Unicode activist who accepts nothing but Mule ctext, or a
postmodernist who thinks the classic is cool and the modern rocks,
this module is for you.  Perl 5.6 tackled the modern when it added
Unicode support internally.  Now in Perl 5.8 we have tackled the
classic by adding support for other encodings externally.  I hope you
like it.

=head1 Author(s)

By Dan Kogai E<lt>[EMAIL PROTECTED]E<gt>.  Send your comments to
E<lt>[EMAIL PROTECTED]E<gt>.  You can subscribe via
L<http://lists.perl.org/showlist.cgi?name=perl-unicode>.

=cut