Bug#695489: sort -u and uniq "lose" non-identical lines with some locales

Edmund Grimley Evans Sun, 21 Jan 2018 01:37:13 -0800

Control: retitle -1 sort -u and uniq "lose" non-identical lines with some locales


I was hurt by this bug, too. I had a simple-minded script to check
files for dodgy characters before publishing them. How was I to know
that em-dash and en-dash would be considered identical in a standard
GB locale, as provided by Debian's installer? Spotting inconsistent
use of characters that look alike is exactly what my script was
supposed to achieve.

With LANG=en_GB.UTF-8:

$ printf "\xe2\x80\x93\n\xe2\x80\x94\n"
–
—
$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | od -An -tx1
 e2 80 93 0a e2 80 94 0a
$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | uniq | od -An -tx1
 e2 80 93 0a

It's true that the man page for "uniq" mentions LC_COLLATE, though I
don't consider that adequate warning.

However, it's also true that the official-looking spec at
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html
says:

> To remove duplicate lines based on whether they collate equally
> instead of whether they are identical, applications should use:
>
> sort -u
>
> instead of:
>
> sort | uniq

Also, the spec does not mention LC_COLLATE in the ENVIRONMENT
VARIABLES section.

Does coreutils attempt to follow that spec?



The work-around, of course, is to set LC_COLLATE to C when uniq is
invoked:

$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | uniq | od -An -tx1
 e2 80 93 0a
$ printf "\xe2\x80\x93\n\xe2\x80\x94\n" | LC_COLLATE=C uniq | od -An -tx1
 e2 80 93 0a e2 80 94 0a
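
One caveat, offered as a sketch rather than a definitive fix: if LC_ALL is set in the environment, it takes precedence over LC_COLLATE, so setting LC_COLLATE=C alone would have no effect there. Setting LC_ALL=C for the single invocation avoids that. The helper name uniq_bytes below is hypothetical, and octal escapes are used only because not every printf understands \x:

```shell
# Hypothetical helper (not part of coreutils): run uniq with byte-wise
# comparison regardless of the caller's locale settings.
# LC_ALL overrides LC_COLLATE and LANG, so this also works when LC_ALL
# is already set to something like en_GB.UTF-8 in the environment.
uniq_bytes() {
    LC_ALL=C uniq "$@"
}

# Same en-dash / em-dash input as above, written with octal escapes.
printf '\342\200\223\n\342\200\224\n' | uniq_bytes | od -An -tx1
```

The same prefix should work for "sort -u", since sort also compares lines byte-wise in the C locale.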
