Re: Utf-8 support in C functions on Linux

Glenn Maynard Tue, 18 Dec 2001 14:04:02 -0800

On Tue, Dec 18, 2001 at 04:49:08PM -0500, Richard, Francois M wrote:
> > Don't be--there's been a lot of work done to make glibc honor locales.
> > 
> OK. 
> If strncpy() recognizes n characters encoded in utf-8, it means that when it


strncpy doesn't and can't recognize the locale:

       The strncpy() function is similar, except that not more than n
       bytes of src are copied. Thus, if there is no null byte among the
       first n bytes of src, the result will not be null-terminated.

It's defined in terms of bytes.  This is often used in this way:
char buf[256];
strncpy(buf, src, sizeof(buf)-1);
buf[sizeof(buf)-1] = 0;
so this can't be changed.  (It's arguable that it shouldn't copy only part of
a UTF-8 character at the end; I don't know if it does this.)

This function sucks, anyway (I don't remember the last time I used it
without having to follow it up to make sure the buffer is terminated.
It's a string function; it should *terminate the string at all times*.)
So, if it doesn't do this, I don't mind using my own function anyway.
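
For what it's worth, a replacement along those lines might look like the
sketch below.  utf8_copy is a hypothetical name, not a libc function.  It
assumes src is valid UTF-8, always terminates dst, and, since the limit is
a byte count, backs up so the copy never ends partway through a multi-byte
character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard library.
 * Copies src into dst (capacity dstsize bytes), always null-terminates,
 * and never leaves a partial UTF-8 sequence at the end of dst.
 * Assumes src is valid UTF-8.
 * Returns the number of bytes copied, not counting the terminator. */
size_t utf8_copy(char *dst, const char *src, size_t dstsize)
{
    assert(dst != NULL && src != NULL);
    if (dstsize == 0)
        return 0;               /* no room even for the terminator */

    size_t len = strlen(src);
    if (len >= dstsize) {
        len = dstsize - 1;      /* leave room for the terminator */
        /* src[len] is the first byte that will NOT be copied.  If it is
         * a continuation byte (10xxxxxx), the cut falls inside a
         * character, so back up to that character's lead byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With a 3-byte buffer, copying "a" followed by the two-byte sequence C3 A9
gives just "a", rather than "a" plus a dangling C3 byte.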

> reads the bytes, leading and trailing bytes are detected/understood. There
> is some utf-8 decoding operation going on.
> In this case, why strlen() can count only bytes?

http://www.cl.cam.ac.uk/~mgk25/unicode.html:

"A small modification will be necessary for all programs that determine
the number of characters in a string by counting the bytes. In UTF-8
mode, they must not count any bytes in the range 0x80 - 0xBF, because
these are just continuation bytes and not characters of their own. C's
strlen(s) counts the number of bytes, but not necessarily the number of
characters in a string correctly. Instead, mbstowcs(NULL,s,0) can be
used to count characters if a UTF-8 locale has been selected."

strlen is often used in this style:
char *str = (char *) malloc(strlen(buf)+1);
memcpy(str, buf, strlen(buf)+1);
and so it can't be locale-specific.  (Yes, I'm aware of strdup; this is,
in fact, only an example. :)
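
The continuation-byte rule from the quoted passage can be sketched as
follows.  utf8_strlen is a hypothetical helper, not a libc function; it
assumes valid UTF-8 input and, unlike mbstowcs(NULL,s,0), does not depend
on which locale has been selected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters (code points) in a UTF-8
 * string by skipping continuation bytes in the range 0x80 - 0xBF.
 * Assumes s is valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

strlen() stays the right call for sizing buffers, as in the malloc example;
a function like this is only for when you actually want a character count.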

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
