Hi Jim,
In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
case insensitive comparison of the input lines does not work in multibyte
locales. And indeed, in a UTF-8 locale, I see this:
$ cat > in1 <<EOF
müsste
EOF
$ cat > in2 <<EOF
MÜSSTE
EOF
$ join -i in1 in2
[empty result]
The expected result is:
$ join -i in1 in2
müsste
Similarly, with a German word in lower and upper case:
$ cat > in1 <<EOF
Ruß
EOF
$ cat > in2 <<EOF
RUSS
EOF
$ join -i in1 in2
[empty result]
The expected result is:
$ join -i in1 in2
Ruß
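To illustrate why the first pair fails, here is a minimal sketch of a
byte-wise case-insensitive comparison. The name bytewise_casecmp is
hypothetical; it only stands in for the kind of unibyte comparison that join
uses today and is not the gnulib code. It assumes that toupper leaves bytes
above 0x7F unchanged, as it does in the "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a unibyte comparison: fold each byte with
   toupper and compare byte by byte.  Returns 0 if the first N bytes of
   V1 and V2 are equal ignoring case, nonzero otherwise.  */
int
bytewise_casecmp (const void *v1, const void *v2, size_t n)
{
  const unsigned char *s1 = v1;
  const unsigned char *s2 = v2;

  assert (s1 != NULL && s2 != NULL);
  for (size_t i = 0; i < n; i++)
    {
      int c1 = toupper (s1[i]);
      int c2 = toupper (s2[i]);
      if (c1 != c2)
        return c1 - c2;
    }
  return 0;
}
```

In UTF-8, ü is the byte sequence C3 BC and Ü is C3 9C. The second bytes are
not letters from the point of view of toupper, so they never compare equal,
and the two lines do not join.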
Before going on, let me summarize the case comparison functions for strings
that we have available with gnulib:
                      | on NUL terminated    | on memory areas or
                      | strings              | strings with embedded NULs
----------------------+----------------------+---------------------------
For ASCII strings     | c_strcasecmp,        |
only                  | STRCASEEQ            |
----------------------+----------------------+---------------------------
For unibyte locales   | strcasecmp           | memcasecmp
only                  |                      |
----------------------+----------------------+---------------------------
Support for multibyte | mbscasecmp           | mbmemcasecmp
locales               |                      |
    ------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecmp
----------------------+----------------------+---------------------------
Support for multibyte |                      | mbmemcasecoll
locales and locale    |                      |
collation order       |                      |
    ------------------+----------------------+---------------------------
  + German, Greek etc.|                      | ulc_casecoll
----------------------+----------------------+---------------------------
Find attached a draft patch for the 'join' program, that fixes the bug
mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
is not ready to apply, because there are three big questions:
1) Which functions to use for case comparison in coreutils?
The difference between mbmemcasecmp and ulc_casecmp (or between
mbmemcasecoll and ulc_casecoll) is:
mbmemcasecmp treats only English and a few European languages correctly,
- Turkish i / I is halfway correct, but not fully,
whereas ulc_casecmp handles all known specialities of languages:
- Turkish i / I is fully correct,
- German ß is equivalent to ss,
- Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz, are
considered equivalent,
- Greek final sigma (lowercase) is considered equivalent to uppercase
sigma (there is no difference between final and non-final sigma in
upper case),
- Lithuanian soft-dot,
- etc.
I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
correct".
The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
have some assumptions built-in that are not valid in some languages:
- It assumes that there is only uppercase and lowercase - not true for
DZ dz Dz.
- It assumes that uppercasing of 1 character leads to 1 character - not
true for German ß.
- It assumes that there is 1:1 mapping between uppercase and lowercase
forms - not true for Greek sigma.
- It assumes that the upper/lowercase mappings are position independent -
not true for Greek sigma and Lithuanian i.
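The second assumption can be made concrete with a small sketch. The helper
per_char_casecmp below is hypothetical and is not how mbmemcasecmp is
implemented; it only shows the general shape of a comparison built on a
one-character-to-one-character mapping such as towlower.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: compare two wide-character arrays A (NA elements)
   and B (NB elements), folding each character individually with towlower.
   Returns a negative, zero, or positive value.  */
int
per_char_casecmp (const wchar_t *a, size_t na, const wchar_t *b, size_t nb)
{
  size_t n = na < nb ? na : nb;

  assert (a != NULL && b != NULL);
  for (size_t i = 0; i < n; i++)
    {
      wint_t ca = towlower (a[i]);
      wint_t cb = towlower (b[i]);
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }
  /* Equal up to the shorter length: the longer string sorts later.  */
  return (na > nb) - (na < nb);
}
```

Because each character maps to exactly one character, the 3-character
"Ruß" can never compare equal to the 4-character "RUSS", whatever the
locale tables contain.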
2) There is a problem with the case comparison in "sort -f": POSIX specifies
how this option should behave, in terms of the old POSIX terms
("all lowercase characters that have uppercase equivalents").
How to deal with that?
a) Use mbmemcasecmp for the option -f, and introduce a long option that
works with ulc_casecmp?
b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
and ulc_casecmp otherwise?
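Option b) could look roughly like the following sketch. The enum, the
function choose_casecmp and the helper casecmp_name are hypothetical names
made up for illustration; only the environment variable POSIXLY_CORRECT
comes from the proposal above. The caller is assumed to pass in the result
of getenv ("POSIXLY_CORRECT").

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two comparison strategies.  */
enum casecmp_kind
{
  CASECMP_POSIX,   /* per-character mapping, as with mbmemcasecmp */
  CASECMP_FULL     /* language-aware folding, as with ulc_casecmp */
};

/* Pick the strategy from the value of POSIXLY_CORRECT, where a null
   pointer means that the variable is not set.  */
enum casecmp_kind
choose_casecmp (const char *posixly_correct)
{
  return posixly_correct != NULL ? CASECMP_POSIX : CASECMP_FULL;
}

/* Name of the gnulib function each strategy would dispatch to.  */
const char *
casecmp_name (enum casecmp_kind kind)
{
  assert (kind == CASECMP_POSIX || kind == CASECMP_FULL);
  return kind == CASECMP_POSIX ? "mbmemcasecmp" : "ulc_casecmp";
}
```

A program would call choose_casecmp (getenv ("POSIXLY_CORRECT")) once at
startup and dispatch on the result in its comparison routine.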
3) There is also a problem with the executable size: the ulc_casecmp (and
ulc_casecoll) functions are implemented using a couple of tables. I
squeezed them already, while still guaranteeing O(1) time for each
access. Most of the tables are about 10 KB large, the largest one ca. 45 KB.
But it sums up:
                                      join executable size (decimal)
coreutils-7.1 unmodified                   35436
with mbmemcasecmp                          36473
with ulc_casecmp                          174336
with ulc_casecmp and mbmemcasecmp         176521
  (switched at runtime)
When an executable grows from 35 KB to 175 KB, just for correct string
comparisons, some people will certainly complain. Especially embedded
developers, like the busybox guys, try to reduce total executable size.
And that's not only about 'join', it's ultimately about every coreutils
program that has an option to perform case-insensitive comparisons on
user's data.
How to deal with that?
a) Add a configure option --disable-extra-i18n, that will refrain from
using the ulc_casecmp function?
b) Let coreutils build and install a shared library for these large
modules?
c) Should these Unicode string functions be packaged externally to
coreutils, and coreutils can link to it as an external dependency
(like it does for libiconv, libintl, libacl, etc.)?
d) any other idea?
Bruno
--- coreutils-7.1/src/join.c.bak 2008-11-10 14:17:52.000000000 +0100
+++ coreutils-7.1/src/join.c 2009-03-10 03:48:45.000000000 +0100
@@ -1,5 +1,5 @@
/* join - join lines of two files on a common field
- Copyright (C) 91, 1995-2006, 2008 Free Software Foundation, Inc.
+ Copyright (C) 91, 1995-2006, 2008-2009 Free Software Foundation, Inc.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -25,6 +25,9 @@
#include "system.h"
#include "error.h"
#include "linebuffer.h"
+#include "unicase.h"
+#include "uninorm.h"
+#include "mbmemcasecmp.h"
#include "memcasecmp.h"
#include "quote.h"
#include "stdio--.h"
@@ -92,6 +95,9 @@
want to overwrite the previous buffer before we check order. */
static struct line *spareline[2] = {NULL, NULL};
+/* True if the LC_CTYPE locale is hard. */
+static bool hard_LC_CTYPE;
+
/* True if the LC_COLLATE locale is hard. */
static bool hard_LC_COLLATE;
@@ -321,8 +327,23 @@
if (ignore_case)
{
- /* FIXME: ignore_case does not work with NLS (in particular,
- with multibyte chars). */
+ if (hard_LC_CTYPE)
+ {
+#if EXTRA_I18N
+ /* The ulc_casecmp function handles not only multibyte characters
+ correctly, but also the German sharp s, the Greek final sigma,
+ the Turkish dotless i, etc. */
+ if (ulc_casecmp (beg1, len1, beg2, len2, uc_locale_language (),
+ UNINORM_NFD, &diff) >= 0)
+ return diff;
+ if (errno == ENOMEM)
+ xalloc_die ();
+#endif
+ /* If ulc_casecmp failed due to some conversion error, fall back to
+ a comparison that at least handles multibyte characters and the
+ Turkish dotless i correctly. */
+ return mbmemcasecmp (beg1, len1, beg2, len2);
+ }
diff = memcasecmp (beg1, beg2, MIN (len1, len2));
}
else
@@ -942,6 +963,7 @@
setlocale (LC_ALL, "");
bindtextdomain (PACKAGE, LOCALEDIR);
textdomain (PACKAGE);
+ hard_LC_CTYPE = hard_locale (LC_CTYPE);
hard_LC_COLLATE = hard_locale (LC_COLLATE);
atexit (close_stdout);
--- coreutils-7.1/bootstrap.conf.bak 2009-02-16 14:35:18.000000000 +0100
+++ coreutils-7.1/bootstrap.conf 2009-03-10 03:52:46.000000000 +0100
@@ -67,6 +67,7 @@
inttostr inttypes isapipe
lchmod lchown lib-ignore linebuffer link-follow
long-options lstat malloc
+ mbmemcasecmp
mbrtowc
mbswidth
memcasecmp mempcpy
@@ -96,7 +97,9 @@
strdup
strftime
strpbrk strtoimax strtoumax strverscmp sys_stat timespec tzset
- unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
+ unicase/ulc-casecmp unicase/locale-language
+ unicodeio uninorm/nfd
+ unistd-safer unlink-busy unlinkdir unlocked-io
uptime
useless-if-before-free
userspec utimecmp utimens