I have had similar issues with --load for a long time, which is a shame because it's a great feature I'd really like to use. I can reproduce it every time on 4-core and 16-core RedHat 5 64-bit boxes (v 20120322):
PARALLEL=--load 100% --verbose

cores on XXX: 16
creating 4 files:
touch 1
touch 2
touch 3
touch 4
deleting 4 files:
rm 1
rm 2
rm 3
rm 4
...hang.

Unfortunately if I try to strace the offending parallel, it works fine. :)

If I add --debug this is what I get:

deleting 4 files:
1 128 128 0
2 2048 2048 0
3 32768 32768 0
4 524288 524288 -1
Maxlen: 32768,524288,278528
5 278528 278528 -1
Maxlen: 32768,278528,155648
6 155648 155648 -1
Maxlen: 32768,155648,94208
7 94208 94208 0
Maxlen: 94208,155648,124928
8 124928 124928 0
Maxlen: 124928,155648,140288
9 140288 140288 -1
Maxlen: 124928,140288,132608
10 132608 132608 -1
Maxlen: 124928,132608,128768
11 128768 128768 0
Maxlen: 128768,132608,130688
12 130688 130688 0
Maxlen: 130688,132608,131648
13 131648 131648 -1
Maxlen: 130688,131648,131168
14 131168 131168 -1
Maxlen: 130688,131168,130928
15 130928 130928 0
Maxlen: 130928,131168,131048
16 131048 131048 0
Maxlen: 131048,131168,131108
17 131108 131108 -1
Maxlen: 131048,131108,131078
18 131078 131078 -1
Maxlen: 131048,131078,131063
19 131063 131063 0
Maxlen: 131063,131078,131070
20 131070 131070 0
Maxlen: 131070,131078,131074
21 131074 131074 -1
Maxlen: 131070,131074,131072
22 131072 131072 -1
Maxlen: 131070,131072,131071
23 131071 131071 0
Wanted procs: 16
MultifileQueue->empty RecordQueue->empty read 1 Time to fork 1 procs: 0 (processes so far: 1)
MultifileQueue->empty RecordQueue->empty read 2 Time to fork 2 procs: 0 (processes so far: 2)
MultifileQueue->empty RecordQueue->empty read 3 Time to fork 3 procs: 0 (processes so far: 3)
MultifileQueue->empty RecordQueue->empty read 4 Time to fork 4 procs: 0 (processes so far: 4)
MultifileQueue->empty 1 RecordQueue->empty 1
MultifileQueue->empty 1 RecordQueue->empty 1 CommandLineQueue->empty 1 JobQueue->empty 1
MultifileQueue->empty 1 RecordQueue->empty 1 CommandLineQueue->empty 1 JobQueue->empty 1
RecordQueue-unget 'ARRAY(0x8db4d30) ARRAY(0x8db9510) ARRAY(0x8db9740) ARRAY(0x8db95d0)'
Limited to procs: 4
Running jobs before on :: 0
No loadavg file: /home/XXX/.parallel/tmp/loadavg-11743-:Updating loadavg file/home/XXX/.parallel/tmp/loadavg-11743-:Reaper called 1 Reaper exit 1 Start draining RecordQueue->empty CommandLineQueue->empty JobQueue->empty Running jobs before on :: 0 New loadavg: 0.01Last update: 1333376942max_loadavg: : 16RecordQueue->empty CommandLineQueue->empty JobQueue->empty : has 0 out of 4 jobs running. Start another. RecordQueue->empty CommandLineQueue->empty JobQueue->empty RecordQueue->empty RecordQueue->empty RecordQueue->empty RecordQueue->empty MultifileQueue->empty 1 RecordQueue->empty 1 MultifileQueue->empty 1 RecordQueue->empty 1 RecordQueue-unget 'ARRAY(0x8db4d30) ARRAY(0x8db9510) ARRAY(0x8db9740) ARRAY(0x8db95d0)' cmd_line->number_of_args 1 Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 1' rm 1 1 processes. Starting (1): rm 1 Started as seq 1 Job started on : RecordQueue->empty CommandLineQueue->empty JobQueue->empty : has 1 out of 4 jobs running. Start another. RecordQueue->empty CommandLineQueue->empty JobQueue->empty RecordQueue->empty cmd_line->number_of_args 1 Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 2' rm 2 2 processes. Starting (2): rm 2 Reaper called 1 died (0): 1>>joboutput rm 1 ERR: OUT: <<joboutput rm 1 Running jobs before on :: 0 New loadavg: 0.01Last update: 1333376942max_loadavg: : 16RecordQueue->empty CommandLineQueue->empty JobQueue->empty : has 0 out of 4 jobs running. Start another. RecordQueue->empty CommandLineQueue->empty JobQueue->empty RecordQueue->empty cmd_line->number_of_args 1 Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 3' rm 3 2 processes. Starting (3): rm 3 Started as seq 3 Job started on : RecordQueue->empty CommandLineQueue->empty JobQueue->empty : has 1 out of 4 jobs running. Start another. RecordQueue->empty CommandLineQueue->empty JobQueue->empty RecordQueue->empty cmd_line->number_of_args 1 Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 4' rm 4 3 processes. 
Starting (4): rm 4 Started as seq 4 Job started on : MultifileQueue->empty 1 RecordQueue->empty 1 CommandLineQueue->empty 1 JobQueue->empty 1 Running jobs after on :: 2 of 4 died (0): 3>>joboutput rm 3 ERR: OUT: <<joboutput rm 3 Running jobs before on :: 1 New loadavg: 0.01Last update: 1333376942max_loadavg: : 16MultifileQueue->empty 1 RecordQueue->empty 1 CommandLineQueue->empty 1 JobQueue->empty 1 Running jobs after on :: 1 of 4 Reaper exit 1 Reaper called 1 Reaper exit 1 Started as seq 2 Job started on : MultifileQueue->empty 1 RecordQueue->empty 1 CommandLineQueue->empty 1 JobQueue->empty 1 Running jobs after on :: 2 of 4 Sleeping 0.22 millisecs jobs running: 2==2 slots: 4 Memory usage:98131968 Sleeping 0.242 millisecs Reaper called 1 died (0): 4>>joboutput rm 4 ERR: OUT: <<joboutput rm 4 Running jobs before on :: 1 New loadavg: 0.01Last update: 1333376942max_loadavg: : 16MultifileQueue->empty 1 RecordQueue->empty 1 CommandLineQueue->empty 1 JobQueue->empty 1 Running jobs after on :: 1 of 4 Reaper exit 1 Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.2662 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.29282 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.322102 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.3543122 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.38974342 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.428717762 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.4715895382 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.51874849202 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 
0.570623341222 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.6276856753442 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.69045424287862 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.759499667166483 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.835449633883131 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.918994597271444 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.01089405699859 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.11198346269845 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.22318180896829 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.34549998986512 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.48004998885163 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.6280549877368 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.79086048651048 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.96994653516153 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 2.16694118867768 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 2.38363530754545 millisecs Reaper called 1 Reaper exit 1 jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 2.62199883829999 millisecs ...and this just goes on forever. 
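
One way to read the debug output above is to compare the "Started as seq N" messages with the "died (...): N" messages and see which job was started but never reported as dead. The following is only a minimal sketch: it assumes those two message formats mark job start and job reaping, and that the number after the colon in "died" is the job sequence number (this is inferred from the log, not from the parallel source). The helper name `unreaped_jobs` is made up for illustration, and the log string is an abridged excerpt with "..." marking elided text.

```python
import re
from typing import List

# Abridged excerpt of the --debug output above; "..." marks elided text.
DEBUG_LOG = """
Starting (1): rm 1 Started as seq 1 ...
Starting (2): rm 2 Reaper called 1 died (0): 1>>joboutput rm 1 ...
Starting (3): rm 3 Started as seq 3 ...
Starting (4): rm 4 Started as seq 4 ...
died (0): 3>>joboutput rm 3 ...
Started as seq 2 ...
died (0): 4>>joboutput rm 4 ...
"""


def unreaped_jobs(log: str) -> List[int]:
    """Return sequence numbers that were started but never reported as died.

    Hypothetical helper: the two patterns below are assumptions based on
    the message formats visible in the posted log.
    """
    started = {int(n) for n in re.findall(r"Started as seq (\d+)", log)}
    reaped = {int(n) for n in re.findall(r"died \(-?\d+\): (\d+)", log)}
    return sorted(started - reaped)


if __name__ == "__main__":
    print(unreaped_jobs(DEBUG_LOG))
```

For the excerpt above this reports job 2, which would be consistent with the "jobs running: 1==1" line repeating while the sleep interval keeps growing. Note that in the log the reaper is called between "Starting (2): rm 2" and "Started as seq 2"; whether that ordering is actually the cause of the hang is a guess.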

On Mon, Apr 2, 2012 at 4:59 AM, Thomas Sattler <[email protected]> wrote:
>>> As you probably can imagine that is hard to reproduce. See if
>>> you can make smaller example fail - preferably something that
>>> can run on smaller machines.
>>
>> I wrote a small script that shows the problem. It completes
>> in less than 10 seconds on my desktop (two cores), but hangs
>> (read: "does not complete within hours") on two other
>> machines (8/32 cores).
>
> I left the script running and it did not complete within 3 days!
> A modified version of the trigger is attached. Having a look at
> the temporary directory, 'parallel' hangs _after_ all files
> have been created (or removed).
>
> I just tested the new script on all machines again: "2core" and
> "8core" successfully completed 10 consecutive runs, but "32core"
> still hungs _everytime_ a script is run.
>
> Could someone with 8-32 (or even more?) cores please try to
> reproduce the issue?
>
> Thomas
