RE: Utf-8 support in C functions on Linux

Richard, Francois M Wed, 19 Dec 2001 06:04:14 -0800

> strncpy doesn't and can't recognize the locale:
> 
>        The strncpy() function is similar, except that not more than n
>        bytes of src are copied. Thus, if there is no null byte among the
>        first n bytes of src, the result will not be null-terminated.
>


Here is what I read in "The GNU C Library" manual
(http://www.gnu.org/manual/glibc-2.2.3/html_mono/libc.html), in which I
thought the words "bytes" and "characters" were carefully chosen:

Function: char * strncpy (char *restrict to, const char *restrict from,
size_t size) 
This function is similar to strcpy but always copies exactly size CHARACTERS
into to. 
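
To make the bytes-versus-characters question concrete, here is a minimal
sketch that wraps the strncpy idiom quoted in this thread. The helper name
copy_bytes is made up for illustration; it is not part of any library.
Whatever the manual wording suggests, the size argument is a count of bytes,
so the copy can stop in the middle of a UTF-8 sequence:

```c
#include <string.h>

/* Hypothetical helper (not a library function): copy at most
 * dstsize - 1 bytes of src into dst and always terminate dst.
 * Returns the number of bytes stored, not counting the terminator. */
size_t copy_bytes(char *dst, size_t dstsize, const char *src)
{
    if (dstsize == 0)
        return 0;

    /* strncpy counts bytes; it pads with '\0' when src is shorter
     * and leaves dst unterminated when src is longer. */
    strncpy(dst, src, dstsize - 1);
    dst[dstsize - 1] = '\0';

    return strlen(dst);
}
```

With a 3-byte buffer and the source "a\xC3\xA9" (an "a" followed by e-acute
in UTF-8), this stores 'a' followed by the lone lead byte 0xC3, which is not
valid UTF-8 on its own.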

> It's defined in terms of bytes.  This is often used in this way:
> char buf[256];
> strncpy(buf, src, sizeof(buf)-1); buf[255]=0;
> so this can't be changed.  (It's arguable that it shouldn't copy only
> part of a UTF-8 character at the end; I don't know if it does this.)
> 
> This function sucks, anyway (I don't remember the last time I used it
> without having to follow it up to make sure the buffer is terminated.
> It's a string function; it should *terminate the string at all times*.)
> So, if it doesn't do this, I don't mind using my own function anyway.
> 
> > reads the bytes, leading and trailing bytes are detected/understood.
> > There is some utf-8 decoding operation going on.
> > In this case, why strlen() can count only bytes?
> 
> http://www.cl.cam.ac.uk/~mgk25/unicode.html:
> 
> "A small modification will be necessary for all programs that determine
> the number of characters in a string by counting the bytes. In UTF-8
> mode, they must not count any bytes in the range 0x80 - 0xBF, because
> these are just continuation bytes and not characters of their own. C's
> strlen(s) counts the number of bytes, but not necessarily the number of
> characters in a string correctly. Instead, mbstowcs(NULL,s,0) can be
> used to count characters if a UTF-8 locale has been selected."
> 
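
The rule quoted above (do not count bytes in the range 0x80 - 0xBF) can be
sketched without any locale support. The function name utf8_count is my own
choice for illustration, and the sketch assumes the input is already valid
UTF-8; it does no validation:

```c
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a
 * UTF-8 string by skipping continuation bytes, i.e. bytes of the
 * form 10xxxxxx (0x80..0xBF). Assumes s is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Every byte that is not a continuation byte starts a
         * new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For "Fran\xC3\xA7ois", strlen() reports 9 bytes while utf8_count() reports 8
characters. The mbstowcs(NULL,s,0) alternative from the quote should give the
same answer, but only after setlocale() has selected a UTF-8 locale that is
actually installed on the system.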

So what about saying:

C string functions process only bytes.
If character boundaries are not an issue (in particular when the function
only needs to detect the end of the string), they will work correctly even
on utf-8 data.
None of these C functions knows what a leading byte or a continuation byte
is in the utf-8 encoding.
As a result, any C function that takes a "size_t size" or "int c" argument
does not handle utf-8 data properly: a byte count can cut a multi-byte
character in half, and a single "int c" cannot hold a multi-byte character.
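
For the "size_t size" case, one possible workaround is a copy function that
knows about continuation bytes. The following is only a sketch under the
assumption that the source is valid UTF-8; utf8_copy_trunc is a made-up name,
not an existing library function:

```c
#include <string.h>

/* Hypothetical helper: copy src into dst (capacity dstsize bytes),
 * always terminate dst, and never leave a partial UTF-8 sequence
 * at the end. Assumes src is valid UTF-8.
 * Returns the number of bytes copied, not counting the terminator. */
size_t utf8_copy_trunc(char *dst, size_t dstsize, const char *src)
{
    size_t len;

    if (dstsize == 0)
        return 0;

    len = strlen(src);
    if (len >= dstsize) {
        len = dstsize - 1;
        /* src[len] is the first byte that will NOT be copied. If it
         * is a continuation byte (10xxxxxx), the cut falls inside a
         * character, so back up to that character's lead byte. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With a 3-byte buffer and the source "a\xC3\xA9", this copies only "a" instead
of leaving a dangling 0xC3 byte; with a 4-byte buffer the whole string fits.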

Examples:
Locale-dependent, with utf-8 support (as a result, works properly in any
locale, including the "utf-8" ones):
char * strcpy (char *restrict to, const char *restrict from)
int strcoll (const char *s1, const char *s2)
char * strcat (char *restrict s1, const char *restrict s2)

Locale-dependent, but no utf-8 support (as a result, works properly only in
"8-bit" locales):
char * strncpy (char *restrict to, const char *restrict from, size_t size) 
size_t strxfrm (char *restrict to, const char *restrict from, size_t size)
char *strncat(char * restrict s1, const char * restrict s2, size_t n);
int strncasecmp (const char *s1, const char *s2, size_t n)
int islower (int c) 

Not locale-dependent, with utf-8 support:
int strcmp(const char *s1, const char *s2);

Not locale-dependent, and no utf-8 support:
int strncmp(const char *s1, const char *s2, size_t n);
size_t strspn (const char *string, const char *skipset)
char * strpbrk (const char *string, const char *stopset)
char * strtok (char *restrict newstring, const char *restrict delimiters) 

/François
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
