2015-03-31 06:36:54 -0600, Eric Blake:
> FYI: This thread on the Austin Group mailing list claims that coreutils
> has a bug in at least uniq (although Stephane has not yet filed formal
> bug reports against the standard, so we may instead be able to get the
> standard relaxed to allow our behavior of collating rather than
> comparing strings).
[...]

Well, the problem is that, as per POSIX and as clarified by Geoff in
that thread, uniq should report unique lines, not just the first of a
sequence of lines that sort the same.

However, that would mean that uniq can no longer be used on the
output of sort in those locales that have collating elements that
sort the same (all UTF-8 locales with glibc).

For instance, in an en_US.UTF-8 locale (the default in the US for
most modern GNU systems):

    $ printf '%b\n' '\u2461' '\u2460' '\u2461' | sort | recode ..dump
    UCS2   Mne   Description

    2461   2-o   circled digit two
    000A   LF    line feed (lf)
    2460   1-o   circled digit one
    000A   LF    line feed (lf)
    2461   2-o   circled digit two
    000A   LF    line feed (lf)

That's because the sorting order of those characters is not defined,
so they all sort the same (and in that case their order is not
modified, as GNU sort implements a stable sort).

A POSIX uniq is required to leave that output untouched, while a
POSIX sort -u is required to output only one of those lines (either
U+2460 or U+2461).

GNU uniq behaviour is a bit more consistent in that sort|uniq behaves
like sort -u:

    $ printf '%b\n' '\u2461' '\u2460' '\u2461' | sort | uniq | recode ..dump
    UCS2   Mne   Description

    2461   2-o   circled digit two
    000A   LF    line feed (lf)

Now, those would not be problems if all locales provided strict total
orders, where no two collating elements sort the same. That's really
what I have an issue with, as it breaks most people's assumptions.

On the other hand, we have GNU awk not being conformant because its
"==" operator checks for "equality" while POSIX requires it to check
for "sorting the same". POSIX requires U+2460 == U+2461 (in awk) to
return true in locales where those two characters sort the same.

I'm rather glad awk is not conformant here, even if that means that
none of U+2460 < U+2461, U+2461 > U+2460 or U+2460 == U+2461 is true
(note that GNU expr says yes to U+2460 = U+2461, as required by
POSIX).
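
For what it's worth, when byte-for-byte uniqueness is what is wanted, one way to sidestep the locale's collation is to run both sort and uniq in the C locale, where comparison is by byte value. This is only a minimal sketch: the function names sample and bytewise_uniq are made up for illustration, and the input is written as UTF-8 octal escapes (U+2461, U+2460, U+2461) so that it does not depend on the shell's printf supporting \u.

```shell
# Hypothetical helper: emit U+2461, U+2460, U+2461, one per line,
# encoded as UTF-8 octal escapes so any POSIX printf can produce them.
sample() {
    printf '\342\221\241\n\342\221\240\n\342\221\241\n'
}

# Hypothetical helper: force the C locale for both tools so that lines
# are ordered and compared by byte value, not by the locale's collation.
bytewise_uniq() {
    LC_ALL=C sort | LC_ALL=C uniq
}

# The two distinct characters survive: this counts 2 lines.
sample | bytewise_uniq | wc -l
```

The trade-off is that the output is then in byte order rather than in the locale's sorting order, so this only helps when the aim is to eliminate exact duplicates.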

Note that GNU comm and GNU join are potentially non-conformant here
as well (the discussion is not over on the Austin Group ML).

I'm not sure the GNU tools should be modified here (rather, POSIX
should be relaxed to allow the GNU behaviour), but I'd be in favour
of the locales shipped with the GNU libc being modified so that all
collating elements have a distinct order.

-- 
Stephane
