On Mon, May 11 2015, Arun Seetharam <arns...@gmail.com> wrote:

> Hi all,
>
> I am trying to use parallel for regular Linux commands, as I have to
> deal with huge files on a daily basis. But the few times I have tried,
> I don't see any improvement. Is there a threshold for the file size
> after which parallel is beneficial? Or am I doing it wrong?
> E.g.,
>
> $ time head -n 1000000 huge.vcf | parallel --pipe "awk '{print $123}'" | wc -l
> 1000000
>
> Wall Time    0m29.326s
> User Mode    0m22.489s
> Kernel Mode  17m55.061s
> CPU Usage    3745.90%
>
> $ time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l
> 1000000
>
> Wall Time    0m10.329s
> User Mode    0m12.447s
> Kernel Mode  0m4.540s
> CPU Usage    164.46%
Two things spring to mind:

First, when comparing two runs like this, always ensure that (the relevant part of) the file is in the page cache before both runs; otherwise, what you see in the first test may be entirely due to reading the file from disk, and the second run then benefits greatly from reading the file from RAM. Could you try repeating the above, but start by doing 'time head -n 1000000 huge.vcf > /dev/null' before each?

Second, how long are the lines in huge.vcf? If the lines are _extremely_ long (say, 50k), each awk instance ends up getting passed only a few lines, which means that almost all the time is spent in overhead (spawning and reaping subprocesses and managing their output). See the --block option if this is an issue.

However, I'm not sure either of these could explain the huge CPU usage in the first case.

Rasmus
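
P.S. A rough, untested sketch of what I mean, warming the cache before each timed run and bumping --block up for good measure (the 10M is just a guess, not a recommendation, and note the \$ so your interactive shell doesn't expand $123 before parallel sees it):

  # Warm the page cache so neither run pays for disk reads.
  head -n 1000000 huge.vcf > /dev/null
  time head -n 1000000 huge.vcf | awk '{print $123}' | wc -l

  # Warm it again, then the parallel run with a bigger block so each
  # awk instance gets a decent chunk of lines rather than just a few.
  head -n 1000000 huge.vcf > /dev/null
  time head -n 1000000 huge.vcf | parallel --pipe --block 10M "awk '{print \$123}'" | wc -l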