On Tue, Aug 23, 2011 at 8:38 AM, Nathan Watson-Haigh <[email protected]> wrote:
> Hi Ole,
>
> I'm in the middle of optimising the processing of such a file. While I'm
> at it I have a quick question:
>
> I'm processing my input file with -N 2500 as it seems to give me the best
> processing time. My CPU usage and disk IO are well below their maximum
> capacity
I often face this problem as well. Starting more processes than I have
processors normally solves it for me.

I believe the issue is that while the disk I/O is below maximum capacity on
average, there are spikes where it is over capacity (e.g. every time the
disk has to seek). During these spikes the CPU is waiting for data to
process. By running more processes than processors, the "extra" processes
can buffer up input, which can then be processed when there is idle CPU
time.

So: try starting twice as many processes (-j 200%); see the P.S. for an
example.

Low throughput on disks can also be due to disk seeks. Until recently I did
not know of a tool that could detect disk seeks, but I have now found
iostat. The '%util' column is very useful for seeing how busy the disk is:

  iostat -xd 1

Using -N indicates that your processing program can take more than one
argument. If the startup time of the processing program is large, try using
-X to distribute all the arguments among the processes (second example in
the P.S.).

> and wondered how GNU parallel processes/submits new jobs as others are
> completed? I'm thinking that GNU parallel is somehow stalling my pipeline.
> Could you provide some information on this aspect of GNU parallel?

GNU Parallel's main loop (drain_job_queue) is basically one big sleep. It
gets woken up by a child dying. When a child dies, GNU Parallel prints the
stdout and stderr of that child and spawns a new child. Then GNU Parallel
sleeps again. A sketch of this pattern is at the end of the P.S.

/Ole
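
P.S. A few sketches to make the above concrete.

First, running twice as many jobs as cores while keeping your -N 2500. This
is a hypothetical invocation; 'process' and 'input.txt' stand in for your
real command and data:

  cat input.txt | parallel -j 200% -N 2500 process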
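
Second, -N versus -X. Again 'process' is a stand-in:

  # -N 2500: exactly 2500 arguments per invocation of process
  cat input.txt | parallel -N 2500 process

  # -X: put as many arguments on each command line as the system allows,
  # spreading them evenly over the processes
  cat input.txt | parallel -X process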
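
Third, a minimal shell sketch of the drain_job_queue idea. GNU Parallel
itself is written in Perl and does far more bookkeeping than this; the
sketch only illustrates the block-until-a-child-dies pattern, and it needs
bash >= 4.3 for 'wait -n':

  #!/usr/bin/env bash
  jobs_left=20   # pretend 20 jobs are queued
  slots=4        # run 4 at a time

  run_one() {    # stand-in for a real job
    sleep $(( RANDOM % 3 + 1 ))
  }

  # Fill the job slots.
  for (( i = 0; i < slots && jobs_left > 0; i++ )); do
    run_one & (( jobs_left-- ))
  done

  # Main loop: sleep until any child exits, then start a replacement.
  while (( jobs_left > 0 )); do
    wait -n
    run_one & (( jobs_left-- ))
  done
  wait           # drain the remaining children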
