On Tue, Dec 18, 2001 at 04:49:08PM -0500, Richard, Francois M wrote:
> > Don't be--there's been a lot of work done to make glibc honor locales.
> >
> OK.
> If strncpy() recognizes n characters encoded in utf-8, it means that when it
strncpy doesn't and can't recognize the locale:
The strncpy() function is similar, except that not more than n
bytes of src are copied. Thus, if there is no null byte among the
first n bytes of src, the result will not be null-terminated.
It's defined in terms of bytes, and it is often used like this:
char buf[256];
strncpy(buf, src, sizeof(buf)-1);
buf[sizeof(buf)-1] = 0;
so this can't be changed. (It's arguable that it shouldn't copy only part of
a UTF-8 character at the end; I don't know if it does this.)
This function sucks, anyway (I don't remember the last time I used it
without having to follow it up to make sure the buffer is terminated.
It's a string function; it should *terminate the string at all times*.)
So, if it doesn't do this, I don't mind using my own function anyway.
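For illustration, here is a minimal sketch of what such a replacement might
look like. The name copy_terminated and its return value are hypothetical
(this is not a glibc function), and it assumes src holds valid UTF-8. It
always terminates the destination and, when it has to truncate, backs up so
that a multibyte character is not cut in half:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard library.
 * Copies src into dst (which has room for dstsize bytes) and always
 * null-terminates dst when dstsize > 0. If src does not fit, the copy is
 * shortened to a UTF-8 character boundary. Assumes src is valid UTF-8.
 * Returns the number of bytes copied, not counting the terminator. */
size_t copy_terminated(char *dst, const char *src, size_t dstsize)
{
    size_t len;

    if (dstsize == 0)
        return 0;

    len = strlen(src);
    if (len >= dstsize) {
        len = dstsize - 1;
        /* src[len] is the first byte that will not be copied. If it is a
         * continuation byte (10xxxxxx), the character it belongs to would
         * be split, so back up until that whole character is dropped. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

Called as copy_terminated(buf, src, sizeof(buf)), it needs no follow-up
line to terminate the buffer.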
> reads the bytes, leading and trailing bytes are detected/understood. There
> is some utf-8 decoding operation going on.
> In this case, why strlen() can count only bytes?
http://www.cl.cam.ac.uk/~mgk25/unicode.html:
"A small modification will be necessary for all programs that determine
the number of characters in a string by counting the bytes. In UTF-8
mode, they must not count any bytes in the range 0x80 - 0xBF, because
these are just continuation bytes and not characters of their own. C's
strlen(s) counts the number of bytes, but not necessarily the number of
characters in a string correctly. Instead, mbstowcs(NULL,s,0) can be
used to count characters if a UTF-8 locale has been selected."
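The byte-skipping rule in that quote can be sketched in a few lines. The
function name utf8_char_count is made up for this example, and the code
assumes the input is valid UTF-8; it counts code points, which is not
necessarily the same as what a user would perceive as characters:

```c
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a
 * null-terminated UTF-8 string by counting every byte that is not a
 * continuation byte (0x80 - 0xBF, bit pattern 10xxxxxx).
 * Assumes s is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string such as "h\xC3\xA9" "llo" (an e-acute encoded as two bytes),
strlen reports 6 bytes while this function reports 5 characters.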
strlen is often used in this style:
char *str = (char *) malloc(strlen(buf)+1);
memcpy(str, buf, strlen(buf)+1);
and so it can't be locale-specific. (Yes, I'm aware of strdup; this is,
in fact, only an example. :)
--
Glenn Maynard
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/