bug#7182: sort -R slow

2011-08-07 Thread Jim Meyering
Davide Brini wrote:
 On Sat, 9 Oct 2010 14:52:41 +0200 Ole Tange ta...@gnu.org wrote:

 I recently needed to randomize some lines. So I tried using 'sort -R'.
 I was astonished how slow that was. So I tested how slow competing
 strategies are. GNU sort is two orders of magnitude slower than unsort
 and more than one order of magnitude slower than perl:

 $ time unsort file
 real    0m1.388s

 $ unsort --version
 unsort 1.1.2

 $ time perl -e 'print sort { rand() <=> rand() } <>' file
 real    0m6.621s

 $ time sort -R file
 real    4m8.403s

 $ sort --version
 sort (GNU coreutils) 8.5

 What is even scarier: sort without -R is faster than sort -R:

 $ time sort file
 real    0m53.553s

 I would expect sort -R to be faster than sort and faster than Perl if
 not as fast as unsort.

 On my system, locale settings seem to impact the runtime significantly:

 $ wc -l bigfile
 100 bigfile

 $ time LC_ALL=en_US.utf8 sort -R bigfile > /dev/null

 real  1m29.302s
 user  1m21.009s
 sys   0m0.155s

 $ time LC_ALL=C sort -R bigfile > /dev/null

 real  0m38.881s
 user  0m35.276s
 sys   0m0.118s


 However, shuf is much faster, and seems mostly unaffected by the locale
 used:

 $ time shuf bigfile > /dev/null

 real  0m1.044s
 user  0m0.833s
 sys   0m0.042s

Thanks for the report.
I think the performance of sort -R will often be worse
than that of shuf (by design, since it accesses each byte of each line
once more, to compute the hash), except when the input size is larger
than available memory.
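
To make the extra per-line cost concrete, here is a minimal sketch of the two approaches. It assumes sort -R behaves as its documentation describes (order lines by a randomly keyed hash of each key, so identical lines end up adjacent), while a shuf-style shuffle only permutes the lines without reading their contents. The keyed BLAKE2 hash and the function names are assumptions chosen for illustration, not what coreutils actually uses.

```python
import hashlib
import os
import random


def hash_sort(lines, salt=None):
    """Order lines by a keyed hash of their content (the idea behind sort -R)."""
    # A fresh random salt per run gives a different order each time.
    salt = os.urandom(16) if salt is None else salt

    def key(line):
        # Every byte of every line is read once more to compute the digest.
        return hashlib.blake2b(line.encode(), key=salt, digest_size=16).digest()

    # Equal lines hash to equal digests, so they always sort next to each other.
    return sorted(lines, key=key)


def shuffle_lines(lines, rng=random):
    """Uniform random permutation (conceptually what shuf does)."""
    out = list(lines)
    rng.shuffle(out)  # never inspects the content of the lines
    return out


lines = ["b", "a", "b", "c", "a"]
hashed = hash_sort(lines)
shuffled = shuffle_lines(lines)
```

In the sketch, hash_sort pays for a hash over the full input plus a comparison sort, whereas shuffle_lines does a linear number of swaps, which is consistent with shuf being faster whenever the input fits in memory.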

The info documentation for sort -R does refer to shuf.

Any suggestions for improvements are welcome.
I'm closing this.

You're welcome to reopen or file a new report.





bug#7182: sort -R slow

2010-10-09 Thread Ole Tange
I recently needed to randomize some lines. So I tried using 'sort -R'.
I was astonished how slow that was. So I tested how slow competing
strategies are. GNU sort is two orders of magnitude slower than unsort
and more than one order of magnitude slower than perl:

$ time unsort file
real    0m1.388s

$ unsort --version
unsort 1.1.2

$ time perl -e 'print sort { rand() <=> rand() } <>' file
real    0m6.621s

$ time sort -R file
real    4m8.403s

$ sort --version
sort (GNU coreutils) 8.5

What is even scarier: sort without -R is faster than sort -R:

$ time sort file
real    0m53.553s

I would expect sort -R to be faster than sort and faster than Perl if
not as fast as unsort.


/Ole





bug#7182: sort -R slow

2010-10-09 Thread Alan Curry
Ole Tange writes:
 
 I recently needed to randomize some lines. So I tried using 'sort -R'.
 I was astonished how slow that was. So I tested how slow competing
 strategies are. GNU sort is two orders of magnitude slower than unsort
 and more than one order of magnitude slower than perl:

Never heard of unsort. Why didn't you try shuf(1)?

Also, your perl is not valid:

 
 $ time perl -e 'print sort { rand() <=> rand() } <>' file
 real    0m6.621s

That comparison function is not consistent (unless very lucky).
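
One way to get a consistent random ordering out of a sort is to draw a single random key per line up front and sort on that key, so that every comparison involving a given line sees the same value. The following is only a sketch of that idea; the function name random_order is made up for illustration.

```python
import random


def random_order(lines, rng=random):
    """Return the lines in random order using a consistent sort key."""
    # One random number per line, fixed before sorting starts. The index
    # breaks the (unlikely) ties so the tuples always compare cleanly.
    keyed = [(rng.random(), i, line) for i, line in enumerate(lines)]
    keyed.sort()
    return [line for _, _, line in keyed]


result = random_order(["one", "two", "three", "four"])
```

By contrast, a comparator that calls rand() on every invocation can claim a < b at one moment and b < a the next, which a sorting algorithm is not required to handle sensibly.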

 I would expect sort -R to be faster than sort and faster than Perl if
 not as fast as unsort.

How big is your test file? I expect sort(1) to be optimized for big jobs. I
bet it would win the contest if you are shuffling a file that's bigger than
available RAM.






bug#7182: sort -R slow

2010-10-09 Thread Davide Brini
On Sat, 9 Oct 2010 14:52:41 +0200 Ole Tange ta...@gnu.org wrote:

 I recently needed to randomize some lines. So I tried using 'sort -R'.
 I was astonished how slow that was. So I tested how slow competing
 strategies are. GNU sort is two orders of magnitude slower than unsort
 and more than one order of magnitude slower than perl:
 
 $ time unsort file
 real    0m1.388s

 $ unsort --version
 unsort 1.1.2

 $ time perl -e 'print sort { rand() <=> rand() } <>' file
 real    0m6.621s

 $ time sort -R file
 real    4m8.403s

 $ sort --version
 sort (GNU coreutils) 8.5

 What is even scarier: sort without -R is faster than sort -R:

 $ time sort file
 real    0m53.553s
 
 I would expect sort -R to be faster than sort and faster than Perl if
 not as fast as unsort.

On my system, locale settings seem to impact the runtime significantly:

$ wc -l bigfile 
100 bigfile

$ time LC_ALL=en_US.utf8 sort -R bigfile > /dev/null

real    1m29.302s
user    1m21.009s
sys     0m0.155s

$ time LC_ALL=C sort -R bigfile > /dev/null

real    0m38.881s
user    0m35.276s
sys     0m0.118s


However, shuf is much faster, and seems mostly unaffected by the locale
used:

$ time shuf bigfile > /dev/null

real    0m1.044s
user    0m0.833s
sys     0m0.042s

-- 
D.