Pádraig Brady <[EMAIL PROTECTED]> writes:

> memcoll does 2 errno accesses per call, which shows up significantly
> in profiles. Does strcoll even set errno?

<http://www.opengroup.org/susv3/functions/strcoll.html> says it's
allowed to.  I assume some platforms do.  I wouldn't be surprised if
errno were set to EILSEQ on some platforms, for example, if the
strings contain byte sequences that are not valid multibyte
characters.  Perhaps if you investigate glibc's source code you can
see what it does in this case; it might be worth making a special
case for glibc at any rate.  Or maybe we could even use an
Autoconf-style test.

> Using strcoll is inefficient anyway

Don't we know it!  If we can avoid it, we'd like to.

> I noticed coreutils doesn't shortcut the string comparisons
> by checking lengths before doing memcoll if !C locale,
> which is fair enough, but maybe a bit restrictive?
> Can't one just check lengths when MB_CUR_MAX == 1 ?

I don't know whether that would be portable.  I can easily imagine
locales where it wouldn't be.

> In general can someone give a non theoretical example
> of 2 different byte sequences (even of the same length),
> that compare equal with strcoll() and/or transform to the same
> wide character with mbstowcs() in any locale.

I'd expect that these two sequences:

   U+006D LATIN SMALL LETTER M
   U+00ED LATIN SMALL LETTER I WITH ACUTE

   U+006D LATIN SMALL LETTER M
   U+0069 LATIN SMALL LETTER I
   U+0301 COMBINING ACUTE ACCENT

would compare equal, at least on some platforms.  However, I haven't
tested this.  For lots more on this subject, please see
<http://www.unicode.org/unicode/reports/tr10/>.

> I.E. how to get strcoll &/or wcscoll to only compare the primary weights.
> I don't think this functionality is in glibc

I think you're right.

> but it probably is possible in ICU?

Sorry, don't know.

> My test version of uniq treats the whole line as "C"
> if it isn't all a valid multibyte sequence,

I don't think we need to worry overmuch about performance for invalid
multibyte sequences.  I'd rather have correctness.

An obvious way to define "correctness" would be to break the sequence
of bytes into valid multibyte sequences separated by stray bytes, and
to sort lexicographically, where we use memcoll for the multibyte
sequences and memcmp for the stray bytes.  If we do this consistently
in 'sort', 'uniq', 'comm', 'join', etc., I think that would be a win
over the current situation, where the programs report an error when
strcoll fails.
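
The splitting scheme described above could be sketched in C roughly as
follows.  This is only an illustration under stated assumptions, not
coreutils code: the names valid_prefix_len, coll_run and robust_coll
are hypothetical; plain strcoll on NUL-terminated copies stands in for
memcoll; NUL bytes are treated like stray bytes so that the copies stay
well-formed; and a stray byte sorts before any valid run, which is an
arbitrary choice.  It also shows the usual way to notice a strcoll
failure, namely clearing errno before the call and testing it afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: length of the longest prefix of s[0..n) made of
   complete, valid, non-NUL multibyte characters in the current locale.  */
static size_t valid_prefix_len(const char *s, size_t n)
{
    mbstate_t st;
    size_t i = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (i < n) {
        size_t k = mbrlen(s + i, n - i, &st);
        if (k == (size_t)-1 || k == (size_t)-2 || k == 0)
            break;                      /* invalid, truncated, or NUL byte */
        i += k;
    }
    return i;
}

/* Hypothetical helper: collate two valid runs.  strcoll needs
   NUL-terminated strings, so each run is copied first.  */
static int coll_run(const char *a, size_t alen, const char *b, size_t blen)
{
    char *ca = malloc(alen + 1);
    char *cb = malloc(blen + 1);
    int r;

    if (ca == NULL || cb == NULL)
        abort();
    memcpy(ca, a, alen);
    ca[alen] = '\0';
    memcpy(cb, b, blen);
    cb[blen] = '\0';

    errno = 0;                          /* strcoll has no error return value */
    r = strcoll(ca, cb);
    if (errno != 0)
        r = strcmp(ca, cb);             /* assumption: fall back to byte order */

    free(ca);
    free(cb);
    return r;
}

/* Compare a[0..alen) with b[0..blen): valid runs are collated with
   strcoll, stray bytes are compared by value, and the first difference
   decides the result.  */
int robust_coll(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);

    while (alen > 0 && blen > 0) {
        size_t ra = valid_prefix_len(a, alen);
        size_t rb = valid_prefix_len(b, blen);

        if (ra > 0 || rb > 0) {
            /* At least one side starts with a valid run; the other
               side's run may be empty.  */
            int r = coll_run(a, ra, b, rb);
            if (r != 0)
                return r;
            a += ra; alen -= ra;
            b += rb; blen -= rb;
        } else {
            /* Both sides start with a stray (or NUL) byte.  */
            unsigned char x = (unsigned char)*a;
            unsigned char y = (unsigned char)*b;
            if (x != y)
                return x < y ? -1 : 1;
            a++; alen--;
            b++; blen--;
        }
    }
    /* Whichever input still has bytes left sorts later.  */
    return (alen > 0) - (blen > 0);
}
```

A caller would presumably run setlocale (LC_ALL, "") once at startup and
then pass each line with its length, e.g. robust_coll (a, strlen (a), b,
strlen (b)).  Resetting the shift state at every run boundary is a
simplification that may not suit stateful encodings.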