On Wed, Mar 27, 2002 at 07:35:18AM +0900, Dan Kogai wrote:
> as follows. That explains how CJK encodings are made. I would
> appreciate if you give me some feedback. There are many good pages on
> this subject in Japanese but not so many in English....
Thanks a lot for explaining this matter so eloquently. :-) I've patched it with the usual spellchecking, podchecking and nitpicking, and corrected the Case 1 / Case 2 reversed mistake, and added some Trad. Chinese-related info.

Thanks,
/Autrijus/

--- bb Wed Mar 27 11:50:31 2002
+++ aa Wed Mar 27 12:21:11 2002
@@ -5,7 +5,7 @@
=head1 SYNOPSIS

This POD document describes how various CJK encodings are structured
-and its underling history.
+and its underlying history.

=head1 The Evolution of CJK encodings

@@ -30,7 +30,7 @@
The last one (0x7E) is DEL (*1). ASCII had already been prevalent before any CJK encoding was introduced. ASCII is also needed to
-implement very CJK handling programs so virtually all CJK encodings
+implement various CJK handling programs, so virtually all CJK encodings
are designed to coexist with ASCII (*2)

=over 2

@@ -38,12 +38,12 @@
=item *1

Why DEL is not assigned to 0x00-0x1f but 0x7F is a funny story. Back
-when ASCII was designed, punchcards were still widely used. To
+when ASCII was designed, punch cards were still widely used. To
conserve forest :), instead of throwing out mispunched cards they decided to punch all holes for a mistyped byte. So 0x7F, with all 7 holes punched, became DEL.

-=item *2
+=item *2

I have heard of EBCDIC-Kanji but does anyone know about it more than its name?

@@ -53,7 +53,7 @@
=head2 Escaped vs. Extended -- two ways to implement multibyte

The history of multi-byte character encoding began when Japan
-Industorial Standard (JIS) has published JIS C 6226 (later became JIS
+Industrial Standard (JIS) has published JIS C 6226 (later became JIS
X 0208:1978). It contained 6353 characters and it won't naturally fit in a byte. Since it wouldn't fit in a byte, it must be encoded with a pair of bytes. But how are you going to do so and still make ASCII
@@ -62,7 +62,7 @@
One way to do so is that you somehow tell the computer the beginning and ending of double-byte character. In practice, we use I<escape sequence> to do so.
When the computer catches the sequence of bytes
-beginning with an escape character (\x1b), it changes the state
+beginning with an escape character (C<\x1b>), it changes the state
whether the following bytes are in ASCII or double-byte character. But there are many computers (still today) that have no idea of escape
@@ -96,7 +96,7 @@
Fortunately, ASCII uses only 7 bits out of 8. Most computers back then already used Octet for byte. So a byte has one more extra bit.
-Why not use that bit double-byte inidicator?
+Why not use that bit as a double-byte indicator?
Instead of escape sequence, you just add 0x8080 to a JIS character set. That is what Extended Unix Character does. In a way, EUC
@@ -118,36 +118,37 @@
=back

This concept of 94x94 planar table quickly became standard in other
-CJK world as well. People's Republic of China (just "China" as
+CJK world as well. People's Republic of China (just I<China> as
follows) has set GB 2312 national standard in 1980 and Republic of
-Korea (South Korea; simply "Korea" as follows) has set KS C 5601 in
-1989. They are both based upon JIS C 6226, could be one of the
+Korea (South Korea; simply I<Korea> as follows) has set KS C 5601 in
+1987. They are both based upon JIS C 6226, which could be one of the
reasons why these character codes contain Kana, phonetic characters used only in Japan. Though there are escape-based encodings for these two (ISO-2022-CN and ISO-2022-KR, respectively), they are hardly used in favor of EUC.
-When you say gb2312 and ksc5601, EUC-based encoding is assumed.
+When you specify C<gb2312> and C<ksc5601> in B<Encode>, EUC-based
+encoding is assumed.

=head2 Scattered JIS? -- An acrobat called Shift JIS

-So we have escape-based encodings (ISO-2022-XX) and extention-based
-encodings (EUC-XX). They coexist with old systems very well,
+So we have escape-based encodings (ISO-2022-XX) and extension-based
+encodings (EUC-XX). They both coexist with old systems very well,
especially EUC.
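To see the two approaches side by side in running code, here is an illustrative sketch. It is written in Python rather than Perl, simply because Python's standard C<iso2022_jp> and C<euc_jp> codecs make the byte values easy to inspect; the character 亜 (the first kanji in JIS X 0208, kuten 16-01) serves as the guinea pig. It shows that EUC-JP is just the 7-bit JIS pair with 0x8080 added, while ISO-2022-JP wraps the very same pair in escape sequences.

```python
# Contrast escape-based and extension-based encodings using the first
# kanji of JIS X 0208, "亜" (kuten 16-01, JIS bytes 0x30 0x21).
text = "亜"

escaped = text.encode("iso2022_jp")   # escape-based: ESC $ B ... ESC ( B
extended = text.encode("euc_jp")      # extension-based: MSB set on each byte

print(escaped)    # b'\x1b$B0!\x1b(B' -- the pair wrapped in escape sequences
print(extended)   # b'\xb0\xa1'

# The EUC form is the bare 7-bit JIS pair plus 0x8080 (MSB set on both).
jis_pair = escaped[3:5]               # strip the surrounding escape sequences
assert bytes(b | 0x80 for b in jis_pair) == extended
```

The same check could of course be written with Perl's Encode; the point is only that the two encodings carry the identical JIS pair underneath.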
For most cases, programs developed by people who know nothing but ASCII run unmodified (perl has been one of them). And they lived happily ever after...? NO!
-Mitsubishi, and ASCII-Microsoft (Now Microsoft Japan) was in troble
-when they try to introduce Han ideographic character (I simply call
-them "Kanji" as follows) support in MBASICPlus that runs on Multi 16,
+Mitsubishi and ASCII-Microsoft (now Microsoft Japan) were in trouble
+when they tried to introduce Han ideographic character (I'll simply call
+them I<Kanji> as follows) support in MBASICPlus that runs on Multi 16,
Mitsubishi's first 16-bit personal computer. Before JIS X 0208 was first introduced in 1978, JIS had already introduced what is called JIS X 0201 in 1976.
-The Japanese try to teach computers how to handle thier language
+The Japanese tried to teach computers how to handle their language
before double-byte character became available. Unlike Chinese which is purely ideographic (*1), Japanese had two variations of Kana, a
-phonetic representation of thier language as well. So they decided to
+phonetic representation of their language as well. So they decided to
squeeze Katakana into the upper half of the byte.

=over 2

@@ -167,7 +168,7 @@
=back

Mitsubishi, among other companies, had already used this
-Katakana extention of ASCII. So you can't apply EUC for backward
+Katakana extension of ASCII. So you can't apply EUC for backward
compatibility's sake. Their answer was nothing but acrobatic.
@@ -204,7 +205,7 @@
0xe0: JJJJJJJJJJJJJJJJ JJJJJJJJJJJJJJJJ
0xf0: JJJJJJJJJJJJJJXX

- c = ASCII control J = MS Kaji
+ c = ASCII control J = MS Kanji
a = ASCII printable K = CP/M control k = JIS X 0201 kana
@@ -212,12 +213,11 @@
Simply put, MS Kanji has made double-byte character possible by giving up ASCII/JISX0201 compliance of the second byte. Uglier it may be, now
-the
-backward compatibility to thier previous code was promised.
+the backward compatibility to their previous code was promised.
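The cost of this acrobatics is easy to demonstrate in a few lines. The sketch below (again Python, purely for illustration; the lead-byte ranges given are the standard Shift JIS ones) shows why the second byte is dangerous: the trail byte of 表 is 0x5C, the ASCII backslash, a classic trap for byte-naive programs.

```python
# "表" in Shift JIS is 0x95 0x5C -- and 0x5C is "\" in ASCII, so a
# program scanning byte by byte mistakes it for an escape character.
sjis = "表".encode("shift_jis")
print(sjis)

def is_sjis_lead(b: int) -> bool:
    """Lead-byte ranges of Shift JIS (MS Kanji)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

assert is_sjis_lead(sjis[0])     # first byte is a kanji lead byte
assert sjis[1] == 0x5C           # second byte collides with ASCII "\"
```

A byte-oriented tool that treats 0x5C as a backslash will therefore mangle perfectly valid Shift JIS text, which is exactly the porting hazard described above.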
NEC has also adopted this new Kanji code when they introduced MS-DOS ver. 2.0 to PC-9801, the most popular line of personal computers in
-Japan until AT compatible (or the same "PC" as anywhere) finally takes
+Japan until AT compatibles (or the same I<PC> as anywhere) finally took
over its reign with the help of Windows 95. So did Apple when they introduced KanjiTalk.

@@ -228,23 +228,24 @@
But there were prices to be paid. It is harder to port applications than EUC because the second byte may look like ASCII when the second byte is in 0x40-0xFE. It also lacks the expandability that EUC had (EUC
-now suports JIS X 0212-1990, extended Kanji, which is theoretically
+now supports JIS X 0212-1990, extended Kanji, which is theoretically
impossible in Shift JIS).

-The name "Shift" JIS came from the fact that JIS character sets are
-"Shifted" when mapped. IMHO, this is more like "Realigned" but the
-word "Realigned" is hardly appealing to most Japanese speakers. When
-we talk about "Shift"ing, EUC is far more like shifting, with MSB
+The name I<Shift> JIS came from the fact that JIS character sets are
+I<Shifted> when mapped. IMHO, this is more like I<Realigned>, but
+that word is hardly appealing to most Japanese speakers. When
+we talk about I<Shift>ing, EUC is far more like shifting, with MSB
acting as the shift....

As you see, Shift JIS is more vendor-driven than other JIS encodings. And this was the case for Big5, the most popular encoding for Traditional Chinese. The name Big5 came from the fact that the 5
-major PC vendors in Taiwan has worked on the encoding.
+major PC vendors in Taiwan (Acer, Eten, FIC, Mitac, and Zerone) have
+composed the encoding together.

-Well, for Big5, there were better reason to do
-so because 8836 characters were hardly enough. And fortunately for
-them, they have no katakana to to silly-walk.
+Well, for Big5, there was a better reason to do so, because 8836 characters
+were hardly enough.
And fortunately for them, they have no katakana to
+silly-walk.

Here is how Big5 maps.

@@ -268,19 +269,25 @@
Back in the time before Unicode, virtually all given encoding is mere bilingual or biscript, ASCII plus local. With so many encodings
-emerging, it is only natural to try to set an encoding that coveres as
+emerging, it is only natural to try to set an encoding that covers as
many, if not all, written languages. And Unicode was one of the answers.

-I carefully said "one of" because ISO-2022 has already existed.
+I carefully said I<one of>, because ISO-2022 has already existed.
ISO-2022 is an escape-based encoding (*1) (7bit-JIS is one of them) and by assigning escape sequence to existing character set, ISO-2022, in theory, can swallow as many character sets as possible to form a universal encoding. ISO-2022-JP-2 adopts this idea. In addition to JIS X 0208 and JIS X 0212, it contains GB 2312 and KS C 5601.

- *1 Exactly speaking, this is not true. As a matter of fact,
- EUC is ISO-2022-compliant. I'll discuss this later
+=over 2
+
+=item *1
+
+Precisely speaking, this is not true. As a matter of fact, EUC is
+ISO-2022-compliant. I'll discuss this later.
+
+=back

However, what many people, especially vendors and programmers were waiting for, was a fixed-width encoding so you can manipulate each
@@ -292,55 +299,62 @@
expensive and 32-bit OS was not available on desktop. In order to squeeze all (and increasing) CJK encodings, they found that simply realigning the existing character sets would not work. They came up
-with arguablly the most controversial idea; Han Unification.
+with arguably the most controversial idea: B<Han Unification>.

Many of ideographs used in China, Japan, and Korea not only I<look> the same but also I<mean> the same. Then why not give the same code point for those in common and save the code points?

-There are two cases to consider. Those they look different but means
-the same (Case 1) and vise varsa (Case 2).
The Han Unification of
+There are two cases to consider. Those that look the same but have
+different meanings (Case 1), or vice versa (Case 2). The Han Unification of
Unicode decided to unify based upon Case 1; let's unify the ones with the same shape!

-As a result, something funny has happed. For example, U+673A means "a
-machine" in Simplified Chinese but "a desk" in Japanese. "a machine"
-in Japanese. U+6A5F. So you can't tell what it means just by looking
-at the code.
+As a result, something funny has happened. For example, U+673A means I<a
+machine> in Simplified Chinese but I<a desk> in Japanese. The character
+that means I<a machine> is U+6A5F in Japanese and Traditional Chinese.
+So you can't tell what it means just by looking at the code.

But the controversy didn't stop there. Han Unification also decided to apply Case 2 for those characters whose origin was the same. These
-characters that are sheped different but means the same with the same
-origin is called I<Itaiji>. Unicode does not differenciate Itaiji;
-should you need to differenciate, use different fonts.
+characters that are shaped differently but mean the same, with the same
+origin, are called I<Itaiji> (characters with alternative bodies).
+Unicode does not differentiate Itaiji; should you need to differentiate,
+use different fonts.

The problem is that Itaiji is very common in proper nouns, especially
-surnames in Japan. "Watanabe", with two characters "Wata" (U+6E21)
-and "Nabe", is a very popular family name in Japan but there are at
-least 31 different "Nabe" in existence. But Unicode lists only
+surnames in Japan. For example: I<Watanabe>, with two characters I<Wata>
+(U+6E21) and I<Nabe>, is a very popular family name in Japan -- but there
+are at least 31 different I<Nabe> in existence. But Unicode lists only
U+8FBA, U+908A, and U+9089 in the code set.
(*2)

- *2 Actually, Unicode is less to blame on itaiji problem than the
- Japanese domestic character sets, Because JIS X 0208 only contains 3
- of them also. But the point is that Unicode has shut the door for
- itaiji even when there are rooms for expansions and updates -- at
- least for the time being.
+=over 2
+
+=item *2
+
+Actually, Unicode is less to blame on the itaiji problem than the
+Japanese domestic character sets, because JIS X 0208 only contains 3
+of them too. But the point is that Unicode has shut the door for
+Itaiji even when there is room for expansions and updates -- at
+least for the time being.
+
+=back

For the better or for the worse, Unicode is still practical, at least
-as practical as most regional character sets, thanks to Case 3. If
+as practical as most regional character sets, thanks to Case 3: If
the existing character set says they are different, give them different code points so you can convert the string to Unicode then
-back and get the same string. That is why "Nabe" has 3, not one, code
-points; In the case above, JIS X 0208 had three of them.
+back and get the same string. That is why I<Nabe> has 3, not one, code
+points; in the case above, JIS X 0208 had three of them.

Ironically, this move toward Han Unification has reduced the number of code points but bloated the size of Unicode encoders and decoders,
-including the Encode. For instance, you can convert from Shift JIS to
-EUC-JP programatically because they both share the same charset. This
+including the Encode module. For instance, you can convert from Shift JIS to
+EUC-JP programmatically because they both share the same charset. This
is impossible in Unicode and you have to have a giant table to do so.
-As a result, 50% of statically linked perl consists of Encode!
+As a result, 50% of statically linked perl consists of B<Encode>!

-=head2 the "Multicode" ? -- the ISO-2022
+=head2 the I<Multicode> ?
-- the ISO-2022

While Unicode makes multilingualization possible by providing a single, unified character set, ISO-2022 tries to achieve the same
@@ -366,13 +380,13 @@
Prepare 4 tables with the size equal to GL. We call them buffers and they are named from G0 to G3.

-=item 2. Single Shift and Charset Invocation
+=item 2. Single Shift and Charset Invocation

When you receive a certain control character, swap GR with either G2 or G3. This is called Character Table I<Invocation>. When a whole character is complete, restore the state of GR. Since GR may change on a character-to-character basis, the control character used here is
-called "Single Shift Character", or SS for short. SS2 and SS3
+called I<Single Shift Character>, or SS for short. SS2 and SS3
correspond to G2 and G3, respectively.

=item 3. Locking Shift and Charset Designation
@@ -380,9 +394,9 @@
When you receive an escape sequence, swap GL with the character set the escape sequence specifies. This is called Character Set I<Designation>. You don't have to restore GL until the next escape
-sequence. Thus this action is called "Locking Shift".
+sequence. Thus this action is called I<Locking Shift>.

-=item 4. Character Set Spesifications
+=item 4. Character Set Specifications

The character sets that can be invoked or designated must be in 96**n or 94**n (96 - space and DEL).
@@ -420,7 +434,7 @@
=back

-As you see, can call EUC "Single-shift based ISO-2022" or even
+As you see, we can call EUC I<Single-shift based ISO-2022> or even
ISO-2022-8bit. You may not know this but ISO-8859-X, also known as Latin X, is also ISO-2022-compliant. They can be defined as follows:
@@ -432,13 +446,13 @@
No G2 and G3, no escape sequence and single shifts.

-ISO-2022 has advantages over Unicode as follows;
+ISO-2022 has advantages over Unicode as follows:

=over 2

=item *

-ISO-2022 differenciates I<charset> and I<encoding> strictly and it
+ISO-2022 differentiates I<charset> and I<encoding> strictly and it
specifies only encoding.
Charsets are up to regional government bodies. All that ECMA, which maintains ISO-2022, has to do is to register them. This makes work sharing much easier.
@@ -450,19 +464,19 @@
=item *

-Has no practal size limit, even in EUC-form. EUC-TW is already 4
+Has no practical size limit, even in EUC-form. EUC-TW is already 4
bytes max. And if you are happy with escape sequences, you can swallow as many charsets as you wish.

=item *

-You have to *pay* the Consortium to become a member, ultimately to
+You have to B<pay> the Consortium to become a member, ultimately to
vote on what Unicode will be. It is not Open Source :)

=back

At the same time, Unicode does have advantages over ISO-2022 as
-follows;
+follows:

=over 2

@@ -474,7 +488,7 @@
=item *

-Have a consise set of characters that covers most popular languages.
+Have a concise set of characters that covers most popular languages.
You may not be able to express what you have to say to the fullest extent but you can say most of it.

@@ -486,16 +500,16 @@
=item *

More support from vendors. Unicode started its life to make vendors
-happier (or lazier), not poets or liguists by tidying the charset and
+happier (or lazier), not poets or linguists by tidying the charset and
encoding. Well, it turned out not to be as easy as they first hoped,
-with heaping demand for new codepoints and sarrogate pair. But it is
+with heaping demand for new codepoints and surrogate pairs. But it is
still bliss enough that you only have to know one charset (Unicode does have several different encodings). That is, except for those who
-hack converters like Encode :)
+hack converters like B<Encode>. :)

=item *

-You *ONLY* have to pay the Consortium to become a member and vote on
+You B<ONLY> have to pay the Consortium to become a member and vote on
what Unicode will be. You don't have to be knowledgeable, you don't have to be respected, you don't even have to be a native user of the language you want to poke your nose on.
It is not Open Source :)

@@ -505,14 +519,14 @@
=head1 Will Character Sets and Encodings ever be Unified?

This section discusses the future of charset and encodings. In doing
-so, I decided to grok the philosophy of perl one more time
+so, I decided to grok the philosophy of perl one more time.

=head2 Character Sets and Encodings should be designed to make easy writings easy, without making hard writings impossible

Does Unicode meet this criterion? It first opted for the first part, to make easy writings easy by squeezing all you need to 16 bits. But
-Unicode today seems more forcused on making hard writings possible.
+Unicode today seems more focused on making hard writings possible.

The problem is that this move toward making hard writings possible is making Unicode trickier and trickier as time passes. Surrogate pair
@@ -526,36 +540,36 @@
I have to conclude there is no silver bullet here yet. Unicode has tried hard to be but it is quicksilver at best. Quicksilver it may be, that's the bullet we have. That's the bullet Larry has decided to
-use so I forge the gunmetal to shoot it. And the result was Encode.
+use so I forge the gunmetal to shoot it. And the result was B<Encode>.

=head2 There are more than one way to encode it

In spite of all advocacies for the Unified Character Set and Encoding,
-legacy data are there to last. So at very least, you still need
-"legacy" encodings for the very same reason you can't trash your tape
+legacy data are there to last. So at the very least, you still need
+I<legacy> encodings for the very same reason you can't trash your tape
drives while you have a terabyte RAID at your fingertip.

-Also remeber EBCDIC is still in use (and coexists with Unicode! see
+Also remember EBCDIC is still in use (and coexists with Unicode! see
L<perlebcdic>).

And don't forget there are many scripts which have no character set at all that are waiting to be coded one way or another. And not all
-scripts are accepted or approved by Unicode Consotium.
If you want to
-spell in Klingon, you have to find your own encoding.
+scripts are accepted or approved by the Unicode Consortium. If you want to
+spell in the Klingon alphabet, you have to find your own encoding.

=head2 A Module for getting your job done

-If you are a die-hard Unicode advocate who want to tidy the world by
+If you are a die-hard Unicode advocate who wants to tidy the world by
converting everything there into Unicode, or a nihilistic anti-Unicode
-activist who accept nothing but Mule ctext, or a postmodernistist who
-think the classic is cool and the modern rocks, this module is for
+activist who accepts nothing but Mule ctext, or a postmodernist who
+thinks the classic is cool and the modern rocks, this module is for
you.

Perl 5.6 tackled the modern when it added Unicode support internally.
-Now in Perl 5.8 we tackled the classic by adding supoprt for other
+Now in Perl 5.8 we tackled the classic by adding support for other
encodings externally. I hope you like it.

-=head1 Author(s)
+=head1 AUTHORS

By Dan Kogai E<lt>[EMAIL PROTECTED]<gt>. Send your comments to E<lt>[EMAIL PROTECTED]<gt>. You can subscribe via