Re: [coreutils] added ability in sort to skip n number of lines for each file

Jim Hester Tue, 23 Nov 2010 07:58:34 -0800

Below is an updated, proper patch. It is quite a bit larger than my
first, but should address all of the concerns from Assaf and Pádraig.

My main motivation here is not just to make this common operation less
annoying; it was mostly increased performance.  I made a test dataset of
10 files with 3 header lines each and 500,000 lines to sort, then ran sort
using head and tail as Pádraig suggests, and then again using my
header-skip implementation, both on an 8-core machine.  Wall-clock time
dropped from about 52s to about 32s (timings below).  Larger files seem to
show a similar speedup.  I believe this speedup comes from the fact that
the multithreaded sort is trying to read from the buffer faster than tail
can write to the buffer.

>time { (head -q -n 3 test[0-9] | head -n 3; tail -q -n+4 test[0-9] | ./sort -n ) > out2; }

real    0m51.660s
user    2m0.324s
sys     0m4.115s

>time ./sort -n -l 3 test[0-9] > out

real    0m31.834s
user    2m17.775s
sys     0m3.981s
>diff out out2
>
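
For anyone who wants to try the head/tail side of this comparison without
applying the patch, here is a minimal sketch.  The dataset is a much smaller
stand-in for mine, and the file contents, sizes, and temporary directory are
assumptions made for illustration.  It uses the stock `sort`, so the `-l 3`
run cannot be reproduced this way.

```shell
#!/bin/sh
# Sketch only: builds a small stand-in dataset (not the original one)
# and runs the head/tail pipeline with the stock sort.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# 10 files, each with 3 header lines followed by 1000 numeric lines
# written in descending order so that sort has real work to do.
for i in 0 1 2 3 4 5 6 7 8 9; do
    {
        printf 'header 1\nheader 2\nheader 3\n'
        seq 1000 -1 1 | awk -v f="$i" '{ print $1 * 10 + f }'
    } > "test$i"
done

# Keep the 3 header lines of the first file only, then sort the
# bodies of all files together (requires GNU head/tail for -q).
{ head -q -n 3 test[0-9] | head -n 3; tail -q -n +4 test[0-9] | sort -n; } > out2

# Sanity check: everything after the header must be in numeric order.
tail -n +4 out2 | sort -n -c && echo "body is sorted"
```

Wrapping the pipeline line in `time { ...; }` gives a baseline figure for the
stock tools on whatever dataset is substituted in.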

2010/11/22 Pádraig Brady <[email protected]>

> On 22/11/10 22:21, Pádraig Brady wrote:
> > Perhaps something like:
> >
> > (head --no-header -n1 file.* | head -n1; tail --no-header -n+2 file.* | sort)
> >
> > I.E. add the --no-header option to suppress the ==> file name <==
> > annotations which would allow using `head` and `tail` in general for this.
>
> Of course this being useful, it's already supported:
>
> (head -q -n1 file.* | head -n1; tail -q -n+2 file.* | sort)
>
> cheers,
> Pádraig
>

sort_skip_lines_2.diff
Description: Binary data
