On Sun, 15 Mar 2015 09:53:34 -0400
"sort problem" <[email protected]> wrote:

> Whoops. At least I thought it helped. The default sort with the "-H"
> worked for 132 minutes then said: no space left in /home (that had
> before the sort command: 111 GBytes FREE). 

That's not surprising. -H implements a merge sort, meaning the input is
split into lots and lots of temporary files, each of which may again be
split into lots of files, etc. It wouldn't surprise me to see a
60-megaline file consume a huge multiple of its own size on disk during
a merge sort.
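
To make the disk usage concrete, here's a rough sketch of an external
merge sort in Python. It shows the general technique only and is not
the code behind sort -H. The names external_sort and chunk_lines are
made up for illustration, and a single merge pass is an assumption.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(in_path, out_path, chunk_lines=100_000):
    """Sort the lines of in_path into out_path via sorted temporary runs.

    Returns the number of temporary run files that were created.
    """
    run_paths = []
    try:
        # Pass 1: read a chunk that fits in RAM, sort it, and write it
        # to its own temporary file (a "run").
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                # Make sure a final line without a newline still gets one.
                chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Pass 2: merge all sorted runs into the output in one sweep.
        runs = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*runs))
        finally:
            for r in runs:
                r.close()
    finally:
        # The runs are only deleted once the merge has finished.
        for p in run_paths:
            os.remove(p)
    return len(run_paths)
```

Even in this simple form, the runs together take up as much disk as the
whole input before a single line of output is written. An implementation
that merges in several passes and keeps intermediate files around would
need more.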

And of course, the algorithm might be swapping.

> And btw, df command said
> for free space: "-18 GByte", 104%.. what? Some kind of reserved space
> for root?
> 
> 
> Why does it takes more then 111 GBytes to "sort -u" ~600 MByte sized
> files? This in nonsense. 
> 
> 
> So the default "sort" command is a  big pile of shit when it comes to
> files bigger then 60 MByte? .. lol

That doesn't surprise me. You originally said you have 60 million
lines. Sorting 60 million items is a difficult task for any algorithm.
You didn't say how long each line is, what the lines contain, or
whether they're all the same length.

How would *you* sort so many items, and sort them in a fast yet generic
way? I mean, if RAM and disk space are at a premium, you could always
use a bubble sort, and in-place sort your array in a year or two.

If I were in your shoes, I'd write my own sort routine for the task.
Perhaps using qsort() (see
http://calmerthanyouare.org/2013/05/31/qsort-shootout.html). If there's
a way you can convert each line's contents into a number reflecting
alpha-order, you could even qsort() the (number, line number) pairs in
RAM if you have quite a bit of RAM. The last step is then to run through
the sorted list of pairs and write out the original lines in that order,
looking each one up by its line number.
There are probably a thousand other ways to do it.
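
Here's a rough sketch of the number-plus-line-number idea, in Python
rather than C to keep it short, so it uses Python's built-in sort where
a C version would call qsort(). The names numeric_key and sort_by_key
are made up. The key is an assumption too: it packs only the first 8
bytes of each line, so it gives byte order rather than true alpha-order,
and lines that agree on those 8 bytes stay in their original order
instead of being fully sorted.

```python
def numeric_key(line, width=8):
    """Pack the first `width` bytes of a line into an integer.

    Big-endian packing means comparing the integers gives the same
    result as comparing the byte prefixes.
    """
    prefix = line.encode("utf-8")[:width].ljust(width, b"\0")
    return int.from_bytes(prefix, "big")


def sort_by_key(lines):
    """Return the lines ordered by numeric_key.

    Only small (number, line number) pairs are sorted; the lines
    themselves are never moved until the final pass.
    """
    # Build the compact index: one (number, line number) pair per line.
    index = [(numeric_key(line), line_no)
             for line_no, line in enumerate(lines)]
    # Sort the pairs in RAM. Equal keys fall back to the line number.
    index.sort()
    # Last step: walk the sorted pairs and emit lines by line number.
    return [lines[line_no] for _, line_no in index]
```

A real version would need a tie-break on the full line contents
whenever two keys compare equal.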

But IMHO, sorting 60 megalines isn't something I would expect a generic
sort command to do easily and quickly out of the box.

SteveT

Steve Litt                *  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance
