"Kerry Thompson" <[EMAIL PROTECTED]> wrote:
> ...
> K&R's strcpy function, the one I posted, is byte-oriented.

As is every other strcpy.

> (Yeah, I know, a char isn't always a byte,

Actually it is: in C a char is one byte by definition. If you
mean character (not char), that's a different story.
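
A minimal sketch of that guarantee, for illustration only. The
helper name bits_in is hypothetical, not anything from the
standard library:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: width in bits of an object whose size was
   obtained with sizeof. sizeof counts in units of char (bytes),
   and each byte is CHAR_BIT bits wide. */
size_t bits_in(size_t size_in_bytes)
{
    assert(CHAR_BIT >= 8);   /* minimum the standard allows */
    return size_in_bytes * CHAR_BIT;
}
```

So bits_in(sizeof(char)) is CHAR_BIT on every implementation,
whatever CHAR_BIT happens to be.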

> and a byte isn't always 8 bits, but I'll use those
> definitions as reference points).

Why bother?

> That strcpy function breaks down with multibyte languages like
> Chinese. The second byte of a double-byte character could be 0,
> so a bytewise copy could easily stop before the end of a string.

Languages are not multibyte. Even character codings are not
multibyte.

In C, the term multibyte only comes into play when a character
code is too large to fit into a single byte on a given
implementation. Specifically, it only comes into play for
characters in the extended character set.

C requires that all the characters in the basic character set
(source and execution) fit into 1 byte. Even with Unicode
character sets they do, because all the required characters
of the basic character set have a value less than 128. Thus
they fit into 1 byte of _any_ conforming C implementation
using Unicode as a character set.

Note that functions like printf are _required_ to work with
multibyte character strings. The fact that a character may
span more than one char (byte) doesn't matter. So long as
the strings being manipulated begin and end with the initial
shift state, everything is fine.

Now, if for example Unicode character codings are used for
extended characters, then a mechanism is required to encode
characters with codings outside the byte limit into multiple
bytes. In such cases, an encoding like UTF-8 can be used;
UTF-16 plays the same role for 16-bit wide characters.
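
As a rough sketch of how such an encoding spreads one code over
several bytes, here is a UTF-8 encoder. The function name
utf8_encode is my own, and the code assumes 8-bit bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (not a standard function): encode code point
   cp as UTF-8 into out, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if cp is not encodable. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Every byte of a multi-byte sequence has its top bit set, so the
only code that ever produces a zero byte is code 0 itself.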

> I wonder what the ramifications will be for Windows Vista.

C implementations on Windows machines tend to use 16-bit
wchar_t. But even wchar_t characters can be 'multibyte',
i.e. a character spans more than one wchar_t.
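
To illustrate, here is a sketch of the UTF-16 surrogate-pair
scheme. The name utf16_encode is hypothetical, and unsigned short
merely stands in for a 16-bit wchar_t:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (not a standard function): encode code point
   cp as UTF-16 into out, which must have room for 2 units.
   unsigned short stands in for a 16-bit wchar_t (an assumption).
   Returns the number of 16-bit units written, or 0 on error. */
size_t utf16_encode(unsigned long cp, unsigned short *out)
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* reserved for surrogates */
    if (cp < 0x10000) {                    /* fits in a single unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                  /* needs a surrogate pair */
        cp -= 0x10000;
        out[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high */
        out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low  */
        return 2;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Any code above 0xFFFF takes two units, which is the "spans more
than one wchar_t" case.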

[Note that wchar_t needn't be 16-bit.]

> Will a char still be an 8-bit byte?

Who knows?

With good programming practices, you don't need to care.

> And how about Unicode? Can you have a 0 byte in a
> Unicode character?

In your sense, yes. Latin Capital Letter A With Macron has the
code 0x100, which contains a zero byte when stored as a 16-bit
value on a machine with 8-bit bytes.

But the more important question is: "Can you have a 0 byte
in a multibyte character coding?" The answer is no...

  "A byte with all bits zero shall not occur in the
   second or subsequent bytes of a multibyte character."
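
To make that concrete, here is a byte-wise copy in the style of
the K&R strcpy under discussion. It is a sketch of the idea, not
necessarily the exact code posted, and is named kr_strcpy only to
avoid clashing with the library function:

```c
#include <assert.h>
#include <string.h>

/* Byte-wise string copy in the K&R style: copy each byte,
   including the terminating null, then stop.
   dst must be large enough to hold src. */
void kr_strcpy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    while ((*dst++ = *src++) != '\0')
        ;
}
```

Given a UTF-8 string such as "A\xC4\x80\xE2\x82\xAC" (three
characters in six bytes), the loop copies all six bytes and the
terminator, because no byte inside a multibyte character is zero.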

> I suspect there is a fair amount of strcpy rewriting
> in our future.

I suspect closer to none. ;)

-- 
Peter





