[reply embedded] peternilsson42 wrote:
> Languages are not multibyte. Even character codings are not
> multibyte.

I'm not quite following you. We may be talking about totally different things, so I'll explain what I meant.

Leaving Unicode out of the discussion for the moment, Chinese (traditional and simplified), Japanese, and Korean are multibyte. That is, their encodings mix single- and double-byte codes, each code representing a single character. I'm pretty familiar with this stuff, because a fair part of my career has been spent in localization and internationalization. At one point, I wrote a library for Sony to display Kanji (Chinese characters) on English Windows, so I got pretty intimate with GB, Big5, and the like.

> In C, the term multibyte only comes into play when a character
> code is too large to fit into a single byte on a given
> implementation. Specifically, it only comes into play for
> characters in the extended character set.

Again, I'm not quite following. Before I go off on another tangent, what do you mean by "extended character set"? To me, extended characters are single-byte characters with the high bit set--i.e., >127 and <=255.

> C requires that all the characters in the basic character set
> (source and execution) fit into 1 byte. Even in Unicode
> character sets, they do, because all the required characters
> of the basic character set have a value less than 128. Thus
> they fit into 1 byte of _any_ conforming C implementation
> using unicode as a character set.

I'm not as familiar with Unicode as I should be, so I'll leave Unicode for other experts. Also, the following is pretty Windows-oriented. Mac has its own encoding scheme, and I don't know Unix well enough to discuss it in this forum.

ASCII is a 7-bit standard, with every character represented by a value less than 128. However, ASCII is an outdated standard. Windows uses ANSI, an 8-bit standard that encompasses Western European languages and uses the full range of byte values up to 255.
ANSI is very nearly the ISO standard 8859-1 (Latin 1); Windows code page 1252 differs from it only in the 0x80-0x9F range, where it adds printable characters. While it's true that English characters reside below 128, German umlauted characters, French and Spanish accented characters, and others--part of the basic character set, by my definition--live up above 128.

Other single-byte languages are represented by other ISO 8859 standards. For example, 8859-2 covers Central European languages like Czech and Polish; 8859-5 defines Cyrillic; and so on. All use the full range up to 255.

I said I'll leave Unicode to other experts, but I will point out that there are languages whose basic character set contains far more than 128 or 256 characters. As noted, Chinese, Japanese, and Korean fall in this category. An educated Japanese speaker is expected to know close to 2,000 Chinese characters (Kanji), plus Hiragana, Katakana, and Romaji. You need to know about 2,500 characters to read a Chinese newspaper, and a highly-educated Chinese speaker may know upwards of 8,000 characters. These all fall within my definition of a "basic character set."

> But the more important question is: "Can you have a 0 byte
> in a multibyte character coding?" The answer is no...

By golly, I think you're right. I did some further research in my specialty, Asian languages, and found that, at least in GB and Big5 encodings, the second byte of double-byte characters starts at hex 40. I didn't dig further, but I'll assume that holds true for other encodings.

Cordially,

Kerry Thompson
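
For anyone who wants something concrete, here is a minimal sketch in C of the lead-byte/trail-byte idea discussed above. It is an illustration under stated assumptions, not a real code-page table: the lead-byte range 0x81-0xFE and the "trail byte is 0x40 or higher" rule are loosely modeled on Big5, and the names is_lead_byte and dbcs_strlen are made up for this example. Because no trail byte is below 0x40, a zero byte can only ever be the string terminator. In portable code, the standard mblen() with a suitable locale is the usual way to step through multibyte text.

```c
#include <stddef.h>

/* Assumption: Big5-style layout. Bytes 0x81..0xFE start a two-byte
   character; everything else is a single-byte character. Real code
   pages differ, so check the table for the encoding actually in use. */
int is_lead_byte(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Count characters (not bytes) in a NUL-terminated double-byte string.
   Returns (size_t)-1 if a lead byte is followed by a byte below 0x40,
   which also covers a lead byte cut off by the terminator. */
size_t dbcs_strlen(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != '\0') {
        if (is_lead_byte(*p)) {
            if (p[1] < 0x40)
                return (size_t)-1;  /* malformed or truncated pair */
            p += 2;                 /* skip lead byte and trail byte */
        } else {
            p += 1;                 /* plain single-byte character */
        }
        count++;
    }
    return count;
}
```

For example, dbcs_strlen("A\xA4\x40" "B") walks four bytes but returns 3, since the pair 0xA4 0x40 counts as one character. The literal is split in two so that the B is not read as part of the hex escape.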
