Re: uniq i18n implementation

Paul Eggert Tue, 08 Aug 2006 00:21:32 -0700

Pádraig Brady <[EMAIL PROTECTED]> writes:

> memcoll does 2 errno accesses per call, which shows up significantly
> in profiles. Does strcoll even set errno?


<http://www.opengroup.org/susv3/functions/strcoll.html> says it's
allowed to.  I assume some platforms do.  I wouldn't be surprised if
errno were set to EILSEQ on some platforms, for example, if the
strings contain byte sequences that are not valid multibyte
characters.  Perhaps if you investigate glibc's source code you can
see what it does in this case; it might be worth making a special case
for glibc at any rate.  Or maybe we could even use an Autoconf-style
test.
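For reference, here is a minimal sketch of the clear-then-inspect pattern that POSIX prescribes for detecting a strcoll failure. Its two errno accesses per comparison are the cost under discussion. The helper name coll_checked is hypothetical and is not coreutils or gnulib code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper, for illustration only: compare two
   NUL-terminated strings with strcoll and report through *ERR
   whatever errno value strcoll left behind (0 if none).  POSIX
   reserves no return value for errors, so the caller has to clear
   errno before the call and inspect it afterwards.  */
static int
coll_checked (char const *a, char const *b, int *err)
{
  assert (a && b && err);

  int saved = errno;            /* preserve the caller's errno */
  errno = 0;                    /* first access: clear */
  int diff = strcoll (a, b);
  *err = errno;                 /* second access: inspect */
  errno = saved;
  return diff;
}
```

If a platform's strcoll is known never to set errno, both accesses could be compiled out, which is what a glibc special case or an Autoconf-style test would buy.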

> Using strcoll is inefficient anyway

Don't we know it!  If we can avoid it, we'd like to.

> I noticed coreutils doesn't shortcut the string comparisons
> by checking lengths before doing memcoll if !C locale,
> which is fair enough, but maybe a bit restrictive?
> Can't one just check lengths when MB_CUR_MAX == 1 ?

I don't know whether that would be portable.  I can easily imagine
locales where it wouldn't be.

> In general can someone give a non theoretical example
> of 2 different byte sequences (even of the same length),
> that compare equal with strcoll() and/or transform to the same
> wide character with mbstowcs() in any locale.

I'd expect that these two sequences:

U+006D LATIN SMALL LETTER M
U+00ED LATIN SMALL LETTER I WITH ACUTE

U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT 

would compare equal, at least on some platforms.  However, I haven't
tested this.  For lots more on this subject, please see
<http://www.unicode.org/unicode/reports/tr10/>.
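A small sketch for trying this out, assuming the two sequences are encoded as UTF-8. The helper try_locale is hypothetical, and whether strcoll returns 0 for the pair depends entirely on the platform and on the locale's collation data, so no particular result is claimed here.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* UTF-8 encodings of the two sequences above (an assumption: the
   original message names code points, not an encoding).  */
static char const precomposed[] = "m\xC3\xAD";   /* U+006D U+00ED */
static char const decomposed[] = "mi\xCC\x81";   /* U+006D U+0069 U+0301 */

/* Hypothetical helper: switch LC_COLLATE to the locale NAME and store
   strcoll's verdict for the pair in *RESULT.  Return 1 on success,
   0 if the locale is not available on this system.  */
static int
try_locale (char const *name, int *result)
{
  assert (name && result);

  if (!setlocale (LC_COLLATE, name))
    return 0;
  *result = strcoll (precomposed, decomposed);
  return 1;
}
```

Calling try_locale with the name of an installed UTF-8 locale and checking whether *result is 0 would show what a given platform does. In the "C" locale the two byte sequences must compare unequal, since they differ at the second byte.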

> I.E. how to get strcoll &/or wcscoll to only compare the primary weights.
> I don't think this functionality is in glibc

I think you're right.

> but it probably is possible in ICU?

Sorry, don't know.

> My test version of uniq treats the whole line as "C"
> if it isn't all a valid multibyte sequence,

I don't think we need to worry overmuch about performance for invalid
multibyte sequences.  I'd rather have correctness.

An obvious way to define "correctness" would be to break the sequence
of bytes into valid multibyte sequences separated by stray bytes, and
to sort lexicographically, where we use memcoll for the multibyte
sequences and memcmp for the stray bytes.  If we do this consistently
in 'sort', 'uniq', 'comm', 'join', etc., I think that would be a win
over the current situation, where the programs report an error when
strcoll fails.
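Here is a rough sketch of that definition, under several assumptions of my own: strcoll on NUL-terminated copies stands in for memcoll, a NUL byte is treated like a stray byte, the conversion state is reset at each run, and a run of valid text is ordered before a stray byte when the two meet at the same position. The names next_run, coll_run and segmented_compare are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return the length of the run starting at S, where N > 0 bytes
   remain.  Set *VALID to 1 for a maximal run of valid multibyte
   characters, or to 0 for a single stray (or NUL) byte.  */
static size_t
next_run (char const *s, size_t n, int *valid)
{
  mbstate_t st;
  size_t len = 0;

  memset (&st, 0, sizeof st);
  while (len < n)
    {
      size_t k = mbrlen (s + len, n - len, &st);
      if (k == 0 || k == (size_t) -1 || k == (size_t) -2)
        break;                  /* NUL, invalid, or truncated */
      len += k;
    }
  *valid = len != 0;
  return len ? len : 1;
}

/* Collate two valid runs.  Stand-in for memcoll: copy each run so
   that it is NUL-terminated, then call strcoll.  */
static int
coll_run (char const *a, size_t alen, char const *b, size_t blen)
{
  char *ca = malloc (alen + 1);
  char *cb = malloc (blen + 1);
  int diff;

  assert (ca && cb);
  memcpy (ca, a, alen);
  ca[alen] = '\0';
  memcpy (cb, b, blen);
  cb[blen] = '\0';
  diff = strcoll (ca, cb);
  free (ca);
  free (cb);
  return diff;
}

/* Compare two lines run by run: collation for valid runs, memcmp
   for stray bytes.  */
static int
segmented_compare (char const *a, size_t alen, char const *b, size_t blen)
{
  while (alen && blen)
    {
      int va, vb, diff;
      size_t ra = next_run (a, alen, &va);
      size_t rb = next_run (b, blen, &vb);

      if (va != vb)
        return va ? -1 : 1;     /* assumption: valid text sorts first */
      diff = va ? coll_run (a, ra, b, rb) : memcmp (a, b, 1);
      if (diff)
        return diff;
      a += ra, alen -= ra;
      b += rb, blen -= rb;
    }
  return (alen != 0) - (blen != 0);   /* the shorter line sorts first */
}
```

The point of the sketch is only that the comparison never fails: every byte lands either in a run that the locale can collate or in a stray byte that memcmp can order.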


