On Sun, 15 Mar 2015 09:53:34 -0400 "sort problem" <[email protected]> wrote:
> Whoops. At least I thought it helped. The default sort with the "-H"
> worked for 132 minutes then said: no space left in /home (that had
> before the sort command: 111 GBytes FREE).

That's not surprising. -H implements a merge sort, meaning the input
gets split into lots and lots of files, each of which is again split
into lots of files, and so on. It wouldn't surprise me to see a
60-million-line file consume a large multiple of its own size during a
merge sort. And of course, the algorithm might be swapping.

> And btw, df command said for free space: "-18 GByte", 104%.. what?
> Some kind of reserved space for root?
>
> Why does it take more than 111 GBytes to "sort -u" ~600 MByte sized
> files? This is nonsense.
>
> So the default "sort" command is a big pile of shit when it comes to
> files bigger than 60 MByte? .. lol

That doesn't surprise me. You originally said you have 60 million
lines. Sorting 60 million items is a difficult task for any algorithm,
and you don't say how long the lines are, what they contain, or whether
they're all the same length.

How would *you* sort that many items, and sort them in a fast yet
generic way? I mean, if RAM and disk space are at a premium, you could
always use a bubble sort and sort your array in place in a year or two.

If I were in your shoes, I'd write my own sort routine for the task,
perhaps using qsort() (see
http://calmerthanyouare.org/2013/05/31/qsort-shootout.html). If there's
a way to convert each line's contents into a number that reflects its
alphabetical order, you could even qsort() in RAM, given enough RAM,
and then the last step is to run through the sorted list of numbers and
line numbers and write out the original lines in that order.

There are probably a thousand other ways to do it. But IMHO, sorting 60
million lines isn't something I'd expect a generic sort command to do
quickly and easily out of the box.
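Just to make the qsort() idea concrete, here's a rough sketch of the
simplest version: skip the convert-to-number step and just qsort() an
array of pointers to the lines, with strcmp() as the comparator. This
is untested, assumes the whole file fits in RAM, and skips the -u
deduplication.

#define _GNU_SOURCE           /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Comparator: qsort() hands us pointers to array elements, and the
 * elements themselves are char pointers. */
static int cmp_lines(const void *a, const void *b)
{
    const char *const *la = a;
    const char *const *lb = b;
    return strcmp(*la, *lb);
}

int main(void)
{
    char **lines = NULL;
    size_t nlines = 0, cap = 0;
    char *buf = NULL;
    size_t buflen = 0;

    /* Slurp stdin one line at a time into a growable pointer array. */
    while (getline(&buf, &buflen, stdin) != -1) {
        if (nlines == cap) {
            cap = cap ? cap * 2 : 1024;
            lines = realloc(lines, cap * sizeof *lines);
            if (!lines) { perror("realloc"); return 1; }
        }
        lines[nlines++] = strdup(buf);
    }
    free(buf);

    /* The actual sort: 60 million pointer swaps is no big deal,
     * provided the line data itself fits in RAM. */
    qsort(lines, nlines, sizeof *lines, cmp_lines);

    /* Write the lines back out in sorted order. */
    for (size_t i = 0; i < nlines; i++) {
        fputs(lines[i], stdout);
        free(lines[i]);
    }
    free(lines);
    return 0;
}

Compile it with something like "gcc -O2 sortlines.c -o sortlines" and
feed the file on stdin. Getting -u behavior would just be a matter of
skipping any line that strcmp()s equal to the previous one on the way
out.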
SteveT

Steve Litt                *  http://www.troubleshooters.com/
Troubleshooting Training  *  Human Performance
