tags 42340 notabug
close 42340
stop
Hello,
On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> In trying to use `join` with `sort` I discovered odd behavior: even after
> running a file through `sort` using the same delimiter, `join` would still
> complain that it was out of order.
[...]
> Here is a way to reproduce the problem:
> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> join -t, a.txt b.txt
> join: b.txt:2: is not sorted: 1.1.1,b
> The expected behavior would be that if a file has been sorted by "sort" it
> will also be considered sorted by join.
[...]
> I traced this back to what I believe to be a bug in sort.c
This is not a bug in sort or join, just a side effect of how the locale
on your system affects the sort order.
If you force the C locale with "LC_ALL=C" (meaning simple ASCII order),
the files are sorted the way 'join' expects them to be:
$ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
$ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
$ join -t, a.txt b.txt
1.1.1,2,b
1.1.12,2,a
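
If you want to check a file's order before handing it to join, "sort -c"
can do that. The following is only a sketch: the scratch directory, the
file names and the check() helper are made up for illustration, and
b_reported.txt is written by hand in the order implied by the error
message quoted above rather than produced by a UTF-8 locale. It checks
the first comma-separated field in byte order, which is what join
compares when it runs with LC_ALL=C.

```shell
# Hypothetical scratch directory so no existing files are touched.
tmpdir=$(mktemp -d)

# b.txt in the order implied by join's error message (line 2 is "1.1.1,b").
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' > "$tmpdir/b_reported.txt"

# The same data re-sorted in byte order on the first field.
LC_ALL=C sort -t, -k1,1 "$tmpdir/b_reported.txt" > "$tmpdir/b_c.txt"

# sort -c exits 0 if the input is already sorted, non-zero otherwise.
check() {
    if LC_ALL=C sort -c -t, -k1,1 "$1" 2>/dev/null; then
        echo sorted
    else
        echo unsorted
    fi
}

reported=$(check "$tmpdir/b_reported.txt")
resorted=$(check "$tmpdir/b_c.txt")
echo "reported order: $reported, C order: $resorted"
rm -r "$tmpdir"
```

This should print "reported order: unsorted, C order: sorted".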
---
More details:
I'm going to assume your system uses some locale based on UTF-8.
You can check it by running 'locale', e.g. on my system:
$ locale
LANG=en_CA.utf8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.utf8"
..
..
Under most UTF-8 locales, punctuation characters are *ignored* in the
compared input lines. This might be confusing and non-intuitive, but
that's the way most systems have been working for many years (locale
ordering is defined in the GNU C Library, and coreutils has no way to
change it).
Observe the following:
$ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
12,a
1,b
$ printf '12,a\n1,b\n' | LC_ALL=C sort
1,b
12,a
With a UTF-8 locale, the comma character is ignored, so "12a" sorts
before "1b" (since the character '2' comes before the character 'b').
With the "C" locale, which forces ASCII or "byte comparison", punctuation
characters are not ignored, and "1,b" sorts before "12,a" (because the
ASCII value of the comma ',' is 44, which is smaller than the ASCII value
of the digit '2', which is 50).
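
If you would like to see those byte values for yourself, here is a
small sketch. It relies on the POSIX printf rule that an argument
starting with a single quote is converted to the numeric value of the
character that follows; the variable names are only for illustration.

```shell
# A leading single quote makes printf print the value of the next character.
comma=$(printf '%d' "',")
two=$(printf '%d' "'2")

# In the C locale these values decide the order: 44 is less than 50.
echo "comma=$comma two=$two"
```

This should print "comma=44 two=50".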
---
Somewhat related:
Your sort command defines the delimiter ("-t,") but does not define
which columns to sort by, so sort compares the entire input line and
the "-t," option has no effect.
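
If the intent was to sort on the join field only, one way to do it is
sketched below. It uses the data from the original report; the scratch
directory and the "joined" variable are made up for illustration.
"-k1,1" restricts the sort key to the first field, and running join
under the same LC_ALL=C makes it expect the same order.

```shell
# Hypothetical scratch directory so no existing files are overwritten.
tmpdir=$(mktemp -d)

# Sort both inputs on the first comma-separated field only, in byte order.
printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n' | LC_ALL=C sort -t, -k1,1 > "$tmpdir/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | LC_ALL=C sort -t, -k1,1 > "$tmpdir/b.txt"

# Run join under the same locale so it expects the same ordering.
joined=$(LC_ALL=C join -t, "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$joined"
rm -r "$tmpdir"
```

For this input the result should be the same two lines as above:
"1.1.1,2,b" and "1.1.12,2,a".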
---
As such, I'm closing this as "not a bug", but discussion can continue by
replying to this thread.
regards,
- assaf