On Mon, May 11 2015, Arun Seetharam <arns...@gmail.com> wrote:

> Hi all,
>
> I am trying to use parallel for regular Linux commands, as I have to
> deal with huge files on a daily basis. But the few times I have tried,
> I don't see any improvement. Is there a threshold for the file size
> after which parallel is beneficial? Or am I doing it wrong?
> E.g.,
>
> $ time head -n 1000000 huge.vcf | parallel --pipe "awk '{print $123}'" | wc -l
> 1000000
>
> Wall Time    0m29.326s
> User Mode    0m22.489s
> Kernel Mode  17m55.061s
> CPU Usage    3745.90%
>
> $ time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l
> 1000000
>
> Wall Time    0m10.329s
> User Mode    0m12.447s
> Kernel Mode  0m4.540s
> CPU Usage    164.46%
Two things spring to mind:

First, when comparing two runs like this, always ensure that (the relevant part of) the file is in the page cache before both runs; otherwise, what you see in the first test may be entirely due to reading the file from disk, and the second run then benefits greatly from reading the file from RAM. Could you try repeating the above, but start by doing 'time head -n 1000000 huge.vcf > /dev/null' before each?

Second, how long are the lines in huge.vcf? If the lines are _extremely_ long (say, 50k), each awk instance ends up getting passed only a few lines, which means that almost all the time is spent in overhead (spawning and reaping subprocesses and managing their output). See the --block option if this is an issue.

However, I'm not sure either of these could explain the huge CPU usage in the first case.

Rasmus
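
P.S. A rough, untested sketch of what I mean, warming the cache before each timed run and bumping --block up for good measure (the 10M is just a guess, not a recommendation, and note the \$ so your interactive shell doesn't expand $123 before parallel sees it):

  # Warm the page cache so neither run pays for disk reads.
  head -n 1000000 huge.vcf > /dev/null
  time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l

  # Warm it again, then the parallel run with a bigger block so each
  # awk instance gets a decent chunk of lines rather than just a few.
  head -n 1000000 huge.vcf > /dev/null
  time head -n 1000000 huge.vcf | parallel --pipe --block 10M "awk '{print \$123}'" | wc -l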