Kaixo! On Sun, Aug 24, 2003 at 06:41:53PM -0300, Alex J. Dam wrote:
> >>> $ echo 'ABÇ' | tr [:upper:] [:lower:]
> >>> abÇ
> >>> (the last character is an uppercase cedilla)
> >>> I expected its output to be:
> >>> abç
>>But sed and tr and other utilities just use the locale data provided
>>on the system by glibc among other places. These programs are table
Most probably the bug is due to the use of byte-oriented functions.
The str* and ctype functions ARE NOT SUITABLE for locale-aware text;
sometimes they work, but with multibyte encodings they usually fail.
In particular, tolower() and toupper() operate on a single byte, so they
only work for 8-bit encodings and fail miserably for multibyte ones:
[EMAIL PROTECTED] root]# locale charmap ; echo 'ABÇ' | tr [:upper:] [:lower:]
UTF-8
abÇ
[EMAIL PROTECTED] root]# locale charmap ; echo 'ABÇ' | tr [:upper:] [:lower:]
ISO-8859-15
abç
As you can see, lowercasing works for an 8-bit encoding like ISO-8859-15
but fails with UTF-8.
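
To make the failure concrete, here is a minimal sketch of byte-wise lowercasing. lower_bytes is a hypothetical helper written for illustration, not code from tr or sed. In UTF-8 the character Ç is the two bytes 0xC3 0x87, and in the default C locale neither byte is an uppercase letter for tolower(), so both pass through untouched.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Lowercase a string one byte at a time, the way byte-oriented code does.
   lower_bytes is a hypothetical helper, not taken from tr or sed. */
void lower_bytes(char *s)
{
    assert(s != NULL);

    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++) {
        /* The cast avoids undefined behaviour for bytes >= 0x80 when
           plain char is signed. */
        s[i] = (char) tolower((unsigned char) s[i]);
    }
}
```

Called on the UTF-8 bytes of 'ABÇ', this lowercases A and B and leaves the two bytes of Ç as they were, which matches the tr output above.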
tolower() must never be used on text that may be multibyte; it only
sees one byte at a time. Use towlower() on wide characters instead; it works.
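
A rough sketch of the wide-character route, assuming the caller has already selected the user's locale with setlocale(LC_CTYPE, ""). lower_mb is a hypothetical name, and the error handling is only one possible choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase the multibyte string `in` into `out` (capacity `outsz` bytes),
   interpreting it in the current LC_CTYPE locale.
   Returns 0 on success, -1 on invalid input or insufficient space.
   lower_mb is a hypothetical helper, not part of tr or sed. */
int lower_mb(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);

    /* A multibyte string never has more characters than bytes. */
    size_t cap = strlen(in) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (w == NULL)
        return -1;

    size_t n = mbstowcs(w, in, cap);
    if (n == (size_t) -1) {          /* invalid multibyte sequence */
        free(w);
        return -1;
    }

    /* Case-map whole characters, not bytes. */
    for (size_t i = 0; i < n; i++)
        w[i] = (wchar_t) towlower((wint_t) w[i]);

    /* wcstombs does not terminate the output when it runs out of room,
       so a result that fills the buffer completely is treated as an error. */
    size_t written = wcstombs(out, w, outsz);
    free(w);
    if (written == (size_t) -1 || written >= outsz)
        return -1;
    return 0;
}
```

The result depends on the locale's tables: in a UTF-8 locale the two bytes of Ç are decoded as one character and lowercased as a unit, which is what the byte-wise functions cannot do.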
> Looking at sed 4.0.7 source code, execute.c:
>
> /* Now do the required modifications. First \[lu]... */
> if (type & repl_uppercase_first)
> {
> *start = toupper(*start);
> start++;
> type &= ~repl_uppercase_first;
> }
Yes, that is what I expected.
It should instead be something like:
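if (type & repl_uppercase_first)
  {
    /* Decode to wide characters so that a multibyte first character is
       uppercased as a whole, not byte by byte. */
    size_t len = strlen(start);
    wchar_t *startw = malloc((len + 1) * sizeof(wchar_t));
    if (startw != NULL && mbstowcs(startw, start, len + 1) != (size_t) -1)
      startw[0] = towupper(startw[0]);
    /* startw must then be written back with wcstombs() and freed. */
    type &= ~repl_uppercase_first;
  }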
(well, improved and corrected, but you get the idea: towupper must be used)
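
For the \u case only the first character matters, so one possible standalone sketch decodes just that character with mbrtowc() and re-encodes it with wcrtomb(). upper_first_mb is a hypothetical function, not sed's code, and it assumes the locale was already set with setlocale(LC_CTYPE, "").

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Return a newly allocated copy of `s` with its first character uppercased
   according to the current LC_CTYPE locale. The caller frees the result.
   Returns NULL on invalid input or allocation failure.
   upper_first_mb is a hypothetical helper, not part of sed. */
char *upper_first_mb(const char *s)
{
    assert(s != NULL);

    size_t len = strlen(s);
    if (len == 0)
        return calloc(1, 1);         /* empty string: nothing to uppercase */

    mbstate_t st;
    wchar_t wc;
    memset(&st, 0, sizeof st);
    size_t used = mbrtowc(&wc, s, len, &st);
    if (used == (size_t) -1 || used == (size_t) -2)
        return NULL;                 /* invalid or truncated first character */

    char buf[MB_LEN_MAX];
    memset(&st, 0, sizeof st);
    size_t made = wcrtomb(buf, (wchar_t) towupper((wint_t) wc), &st);
    if (made == (size_t) -1)
        return NULL;

    /* The uppercase form may occupy a different number of bytes than the
       original, so build a new string instead of patching in place. */
    size_t rest = len - used;
    char *res = malloc(made + rest + 1);
    if (res == NULL)
        return NULL;
    memcpy(res, buf, made);
    memcpy(res + made, s + used, rest);
    res[made + rest] = '\0';
    return res;
}
```

Returning a fresh buffer is a deliberate choice here: *start = toupper(*start) can overwrite in place because one byte stays one byte, but with multibyte encodings the replacement character is not guaranteed to have the same length.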
> Ok, as I said above, I am NOT a Linux programmer and this could be
> nonsense.
You pointed out exactly what the problem is.
> Alex
> --
> Linux-UTF8: i18n of Linux on all levels
> Archive: http://mail.nl.linux.org/linux-utf8/
--
Ki ça vos våye bén,
Pablo Saratxaga
http://chanae.walon.org/pablo/ PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Italian or Portuguese]
