Kaixo!

On Sun, Aug 24, 2003 at 06:41:53PM -0300, Alex J. Dam wrote:

> >>>   $ echo 'ABÇ' | tr [:upper:] [:lower:]
> >>>   abÇ
> >>>   (the last character is an uppercase cedilla)
> >>>   I expected its output to be:
> >>>   abç

>>But sed and tr and other utilities just use the locale data provided
>>on the system by glibc among other places.  These programs are table

Most probably the bug is due to the use of the str* functions.
str* functions ARE NOT SUITABLE to deal with locales; sometimes
they work, but most often they fail.

In particular, the tolower() and toupper() functions only work for
8-bit encodings, because they look at a single byte at a time, and
they fail miserably for multibyte ones:

[EMAIL PROTECTED] root]# locale charmap ; echo 'ABÇ' | tr [:upper:] [:lower:]
UTF-8
abÇ

[EMAIL PROTECTED] root]# locale charmap ; echo 'ABÇ' | tr [:upper:] [:lower:]
ISO-8859-15
abç

As you can see, the lowercasing works for an 8-bit encoding like ISO-8859-15
but fails with UTF-8.

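tolower() must never be used on text that may be multibyte: it only ever
sees one byte, so it is broken there.

Instead, the text should be converted to wide characters and towlower()
should be used. It works.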
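To make the idea concrete, here is a minimal sketch of lowercasing a whole multibyte string through wide characters. The name lowercase_mb and its interface are made up for illustration and are not taken from tr or sed. It assumes the POSIX behaviour where mbstowcs() and wcstombs() return the required length when given a null destination, and that the caller has already selected the locale with setlocale(LC_CTYPE, "").

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: lowercase the multibyte string IN according to the
   current LC_CTYPE locale and store the result in OUT (OUTSIZE bytes).
   Returns 0 on success, -1 on invalid input or insufficient space.  */
int
lowercase_mb (const char *in, char *out, size_t outsize)
{
  size_t n = mbstowcs (NULL, in, 0);      /* count wide characters (POSIX) */
  if (n == (size_t) -1)
    return -1;                            /* invalid multibyte sequence */

  wchar_t *w = malloc ((n + 1) * sizeof *w);
  if (w == NULL)
    return -1;
  mbstowcs (w, in, n + 1);                /* decode, including the L'\0' */

  for (size_t i = 0; i < n; i++)
    w[i] = (wchar_t) towlower ((wint_t) w[i]);   /* one character at a time */

  size_t need = wcstombs (NULL, w, 0);    /* bytes needed, without the NUL */
  int ok = need != (size_t) -1 && need < outsize;
  if (ok)
    wcstombs (out, w, outsize);           /* encode back to multibyte */
  free (w);
  return ok ? 0 : -1;
}
```

The point of the detour is that towlower() is handed a complete character, whatever number of bytes it occupied in the input, while tolower() would be handed the bytes one by one.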

> Looking at sed 4.0.7 source code, execute.c:
> 
>  /* Now do the required modifications.  First \[lu]... */
>  if (type & repl_uppercase_first)
>    {
>      *start = toupper(*start);
>      start++;
>      type &= ~repl_uppercase_first;
>    }

Yes, that is what I expected.
It should instead be something like:

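  if (type & repl_uppercase_first)
    {
      /* Needs <limits.h>, <stdlib.h>, <string.h>, <wchar.h>, <wctype.h>.  */
      wchar_t wc;
      char buf[MB_LEN_MAX];
      int len = mbtowc(&wc, start, MB_CUR_MAX);   /* decode one character */
      if (len > 0)
        {
          int newlen = wctomb(buf, (wchar_t) towupper((wint_t) wc));
          if (newlen == len)   /* in place only if the byte length is unchanged */
            memcpy(start, buf, len);
          start += len;
        }
      else
        start++;   /* invalid sequence or NUL: skip one byte as before */
      type &= ~repl_uppercase_first;
    }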

(well, it still needs to be improved and corrected, but you get the idea:
towupper must be used)
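As a standalone illustration of the same idea, here is a small sketch that uppercases only the first character of a multibyte string in place. The function name upcase_first_mb is hypothetical and this is not sed's code; it assumes a NUL-terminated buffer (which sed's internal buffers may not be) and a locale already selected by the caller.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not sed's code: uppercase the first character of the
   NUL-terminated multibyte string S in place, using the current LC_CTYPE
   locale.  Returns the number of bytes that character occupies, so the
   caller can advance past it.  */
size_t
upcase_first_mb (char *s)
{
  mbstate_t st;
  wchar_t wc;
  char buf[MB_LEN_MAX];

  memset (&st, 0, sizeof st);                    /* initial shift state */
  size_t len = mbrtowc (&wc, s, strlen (s), &st);
  if (len == 0 || len == (size_t) -1 || len == (size_t) -2)
    return *s != '\0';                           /* empty or invalid: skip a byte */

  memset (&st, 0, sizeof st);
  size_t newlen = wcrtomb (buf, (wchar_t) towupper ((wint_t) wc), &st);
  if (newlen == len)                             /* same width: safe in place */
    memcpy (s, buf, len);
  return len;
}
```

When the uppercase form needs a different number of bytes than the original, this sketch leaves the character untouched; a real fix would have to build the result in a separate buffer instead.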
 
> Ok, as I said above, I am NOT a Linux programmer and this could be 
> nonsense.

You pointed out exactly what the problem is.
 
> Alex
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.walon.org/pablo/          PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
