[coreutils] Bug (?) in sort -R

Jason Mon, 16 Aug 2010 12:23:20 -0700

I can't decide if this is a bug or not. Apologies if this has already been
discussed; I am pretty new to the list. I'm using the latest git version,
8.5.136-6d78c.


If you do

sort -R -k 4,4 a > b

the relative ordering of column 4 is different than if you do

sort -R -k 4,5 a > b.

(Obviously, the actual order in the output file is different on every run
unless you pass in the same random data to get the same ordering.)

It seems that the individual columns should be hashed and sorted
independently in order to maintain the normal ordering of the primary sort
column. It appears that the sort is on the hash of the concatenated key list,
so identical values of the primary sort column do not appear next to each
other when sorting on multiple columns. For example, if you have an input
file called "a":

a b c d e
a b c d f
a b c d g
a b c e e
a b c e f

The output file should always contain all the "a b c d" lines contiguously,
and all the "a b c e" lines contiguously. As it is, the output might be

~/coreutils/coreutils> src/sort -R -k 4,5 a
a b c d e
a b c d g
a b c e e
a b c e f
a b c d f

~/coreutils/coreutils> src/sort --version
sort (GNU coreutils) 8.5.136-6d78c

This is also true if you use the -s flag with only one field specified,
which is a slightly different flavor of the same bug.

~/coreutils/coreutils> src/sort -s -R -k 4 a
a b c d g
a b c e f
a b c d f
a b c d e
a b c e e

Whereas

src/sort -s -R -k 4,4 a
a b c e e
a b c e f
a b c d e
a b c d f
a b c d g

src/sort -s -R -k 4,4 a
a b c d e
a b c d f
a b c d g
a b c e e
a b c e f

yields expected results.
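
To make the difference concrete, here is a rough Python model of the two
behaviors described above: hashing the concatenated key versus hashing each
key column separately. It is only a sketch, not the code in sort.c. The
function names are made up for illustration, SHA-256 with a salt stands in
for whatever hash sort -R really uses, and fields are assumed to be
whitespace-separated.

```python
import hashlib
import os


def field_hash(salt: bytes, text: str) -> bytes:
    """Salted digest; an assumed stand-in for the random hash used by sort -R."""
    return hashlib.sha256(salt + text.encode()).digest()


def key_fields(line: str, first: int, last: int) -> list:
    """Fields first..last (1-based, inclusive), roughly like -k first,last."""
    return line.split()[first - 1:last]


def sort_concatenated(lines, first, last, salt):
    """Model of the observed behavior: one hash over the whole key."""
    return sorted(
        lines,
        key=lambda l: field_hash(salt, " ".join(key_fields(l, first, last))),
    )


def sort_per_field(lines, first, last, salt):
    """Model of the proposed behavior: one hash per column, compared in order."""
    return sorted(
        lines,
        key=lambda l: tuple(field_hash(salt, f) for f in key_fields(l, first, last)),
    )


def is_grouped(lines, field: int) -> bool:
    """True if lines with the same value in `field` are contiguous."""
    seen = set()
    prev = None
    for line in lines:
        value = line.split()[field - 1]
        if value != prev:
            if value in seen:
                return False  # value reappeared after a different one
            seen.add(value)
            prev = value
    return True


LINES = [
    "a b c d e",
    "a b c d f",
    "a b c d g",
    "a b c e e",
    "a b c e f",
]

if __name__ == "__main__":
    salt = os.urandom(16)  # fresh random data, so the order changes per run
    for name, fn in (("concatenated", sort_concatenated),
                     ("per-field", sort_per_field)):
        out = fn(LINES, 4, 5, salt)
        print(name, "grouped on column 4:", is_grouped(out, 4))
        print("\n".join(out))
```

In the per-field model, lines with the same column 4 share the first element
of the sort key, so they always end up adjacent, while their order within the
group still depends on the salt. In the concatenated model nothing ties them
together, which matches the scattered output shown above.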

The real-world use case is to prevent sequential scanning of sharded
databases by using the -R flag when grouping data from multiple sources.

Jason
