On Tue, Aug 23, 2011 at 8:38 AM, Nathan Watson-Haigh <[email protected]> wrote:
> Hi Ole,
>
> I'm in the middle of optimising the processing of such a file. While I'm
> at it I have a quick question:
>
> I'm processing my input file with -N 2500 as it seems to give me the best
> processing time. My CPU usage and disk IO are well below their maximum
> capacity
I often face this problem as well. Starting more processes than I have
processors normally solves it for me.

I believe the issue is that while the disk I/O is below maximum capacity on
average, there are spikes where it is over capacity (e.g. every time the
disk has to seek). During these spikes the CPU is waiting for data to
process. By running more processes than processors, the "extra" processes
can buffer up input, which can then be processed when there is idle CPU
time.

So: try starting twice as many processes (-j 200%); see the P.S. for an
example.

Low throughput on disks can also be due to disk seeks. Until recently I did
not know of a tool that could detect disk seeks, but I have now found
iostat. The '%util' column is very useful for seeing how busy the disk is:

  iostat -xd 1

Using -N indicates that your processing program can take more than one
argument. If the startup time of the processing program is large, try using
-X to distribute all the arguments among the processes (second example in
the P.S.).

> and wondered how GNU parallel processes/submits new jobs as others are
> completed? I'm thinking that GNU parallel is somehow stalling my pipeline.
> Could you provide some information on this aspect of GNU parallel?

GNU Parallel's main loop (drain_job_queue) is basically one big sleep. It
gets woken up by a child dying. When a child dies, GNU Parallel prints the
stdout and stderr of that child and spawns a new child. Then GNU Parallel
sleeps again. A sketch of this pattern is at the end of the P.S.

/Ole
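
P.S. A few sketches to make the above concrete.

First, running twice as many jobs as cores while keeping your -N 2500. This
is a hypothetical invocation; 'process' and 'input.txt' stand in for your
real command and data:

  cat input.txt | parallel -j 200% -N 2500 process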
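
Second, -N versus -X. Again 'process' is a stand-in:

  # -N 2500: exactly 2500 arguments per invocation of process
  cat input.txt | parallel -N 2500 process

  # -X: put as many arguments on each command line as the system allows,
  # spreading them evenly over the processes
  cat input.txt | parallel -X process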
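
Third, a minimal shell sketch of the drain_job_queue idea. GNU Parallel
itself is written in Perl and does far more bookkeeping than this; the
sketch only illustrates the block-until-a-child-dies pattern, and it needs
bash >= 4.3 for 'wait -n':

  #!/usr/bin/env bash
  jobs_left=20   # pretend 20 jobs are queued
  slots=4        # run 4 at a time

  run_one() {    # stand-in for a real job
    sleep $(( RANDOM % 3 + 1 ))
  }

  # Fill the job slots.
  for (( i = 0; i < slots && jobs_left > 0; i++ )); do
    run_one & (( jobs_left-- ))
  done

  # Main loop: sleep until any child exits, then start a replacement.
  while (( jobs_left > 0 )); do
    wait -n
    run_one & (( jobs_left-- ))
  done
  wait           # drain the remaining children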
