[reply embedded]

peternilsson42 wrote:

> Languages are not multibyte. Even character codings are not
> multibyte.

I'm not quite following you. We may be talking about totally different
things, so I'll explain what I meant.

Leaving Unicode out of the discussion for the moment, Chinese (traditional
and simplified), Japanese, and Korean are multibyte. That is, their encodings
mix single- and double-byte codes, each of which represents a single
character.
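To make the single/double-byte mix concrete, here is a minimal sketch in C of counting characters rather than bytes in such a string. The lead-byte range 0x81..0xFE is an assumption modeled on Big5; other double-byte encodings use different ranges. The names is_lead_byte and dbcs_char_count are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Big5-style lead bytes lie in 0x81..0xFE.
   Other double-byte encodings use different ranges. */
static int is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Count characters (not bytes) in a NUL-terminated string that
   mixes single-byte and double-byte codes. */
size_t dbcs_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        if (is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;   /* lead byte plus trail byte: one character */
        else
            s += 1;   /* single-byte code, or a truncated lead byte */
        count++;
    }
    return count;
}
```

For the four bytes 'A', 0xA4, 0x40, 'B', this counts three characters, because 0xA4 0x40 is treated as one double-byte code.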

I'm pretty familiar with this stuff, because a fair part of my career has
been spent in localization and internationalization. At one point, I wrote a
library for Sony to display Kanji (Chinese characters) on English Windows,
so I got pretty intimate with GB, Big 5, and the like.

> In C, the term multibyte only comes into play when a character
> code is too large to fit into a single byte on a given
> implementation. Specifically, it only comes into play for
> characters in the extended character set.

Again, I'm not quite following. Before I go off on another tangent, what do
you mean by "extended character set"? To me, extended characters are
single-byte characters with the high bit set--i.e., >127 and <=255.

> C requires that all the characters in the basic character set
> (source and execution) fit into 1 byte. Even in Unicode
> character sets, they do, because all the required characters
> of the basic character set have a value less than 128. Thus
> they fit into 1 byte of _any_ conforming C implementation
> using unicode as a character set.

I'm not as familiar with Unicode as I should be, so I'll leave Unicode for
other experts. Also, the following is pretty Windows-oriented. Mac has its
own encoding scheme, and I don't know Unix well enough to discuss it in this
forum.

ASCII is a 7-bit standard, with every character represented by a value less
than 128. However, ASCII is an outdated standard. Windows uses what it calls
ANSI (code page 1252), an 8-bit encoding that covers Western European
languages and uses the full range of byte values up to 255. It is nearly
identical to ISO 8859-1 (Latin-1); the two differ only in the range
0x80-0x9F, where Windows places printable characters instead of control
codes. While it's true that English characters reside below 128, German
umlauted characters, French and Spanish accented characters, and
others--part of the basic character set, by my definition--live above 127.
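As a rough illustration of the 7-bit versus 8-bit distinction, the sketch below checks whether a string stays within the ASCII range. The function name is_pure_ascii is hypothetical, and the example byte 0xFC is the Latin-1 code for the u-umlaut.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if every byte of the NUL-terminated string is 7-bit
   ASCII (value below 128), otherwise 0. */
int is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast first: plain char may be signed, so a high-bit
           byte could otherwise compare as negative. */
        if ((unsigned char)*s > 127)
            return 0;
    }
    return 1;
}
```

Called on "Muenchen" this returns 1; called on "M\xFCnchen" (the same name with a Latin-1 u-umlaut) it returns 0.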

Languages with other single-byte encodings are covered by other parts of
ISO 8859. For example, 8859-2 covers Central European languages like Czech
and Polish; 8859-5 defines Cyrillic; and so on. All use the full range up to
255.

I said I'll leave Unicode to other experts, but I will point out that there
are languages whose basic character set contains far more than 128 or 256
characters. As noted, Chinese, Japanese, and Korean fall in this category.
An educated Japanese speaker is expected to know close to 2,000 Chinese
characters (Kanji), plus Hiragana, Katakana, and Romaji. You need to know
about 2,500 characters to read a Chinese newspaper, and a highly-educated
Chinese speaker may know upwards of 8,000 characters. These all fall within
my definition of a "basic character set."

> But the more important question is: "Can you have a 0 byte
> in a multibyte character coding?" The answer is no...

By golly, I think you're right. I did some further research in my specialty,
Asian languages, and found that, at least in GB and Big-5 encodings, the
second byte of double-byte characters starts at hex 40. I didn't dig
further, but I'll assume that holds true for other encodings.
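Here is a small sketch of why that matters in C. The trail-byte ranges below (0x40..0x7E and 0xA1..0xFE) are the ones commonly documented for Big5 and are an assumption here, not something I verified against every variant; both function names are made up for illustration. Since no trail byte can be 0, the ordinary string functions still find the real end of the string.

```c
#include <assert.h>
#include <string.h>

/* Assumption: Big5 trail bytes fall in 0x40..0x7E or 0xA1..0xFE.
   Note that 0x00 is in neither range. */
int big5_is_trail_byte(unsigned char b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
}

/* Because neither byte of a double-byte code is 0, strlen stops
   only at the real terminator. The result is a byte count, not a
   character count. */
size_t big5_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

For the two-byte code 0xA4 0x40, big5_byte_length reports 2, and big5_is_trail_byte(0x00) is false.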

Cordially,

Kerry Thompson



