On Tue, Apr 3, 2012 at 2:35 AM, Ole Tange <[email protected]> wrote:
> On Mon, Apr 2, 2012 at 10:59 AM, Thomas Sattler
> <[email protected]> wrote:
>>>> As you probably can imagine that is hard to reproduce. See if
>>>> you can make a smaller example fail - preferably something that
>>>> can run on smaller machines.
>>>
>>> I wrote a small script that shows the problem. It completes
>>> in less than 10 seconds on my desktop (two cores), but hangs
>>> (read: "does not complete within hours") on two other
>>> machines (8/32 cores).
>
> Great! I can reproduce this error.
>
> With the minor modification of -j100 I can even reproduce it with 90%
> certainty on my dual core:
>
> export PARALLEL="--load 100% --verbose"
> echo PARALLEL=$PARALLEL
>
> for i in $(seq 2 10); do
>   i2=$[i*i]
>   echo creating $i2 files:
>   seq $i2 | parallel -j100 -DX echo {}
>   echo
> done
>
> The problem seems to be that Parallel thinks there is a job running
> when there is not, so all jobs are executed but the exit handling of
> one of the jobs is never done. This blocks one job slot; given enough
> jobs it will clog up all job slots, and in any case it will never
> finish, as it is waiting for a job that is already done.
>
> Since the problem does not happen without --load, I thought it might
> have something to do with spawning 'uptime'. However, if I change the
> code to put "load average: 0.00, 0.00, 0.00" into the loadavg file
> (without running uptime) I can still provoke the error, so it is
> clearly not due to spawning uptime.
>
> I currently cannot see where the bug lies, but the good part is that
> I can now reproduce it.
>
> Workaround until the bug is fixed: use niceload.
>
> seq $i2 | niceload -L 100% parallel -j100% -X echo {}
I have now solved the issue. Unfortunately the current solution breaks
a lot of other stuff, but there is no need for further debugging.

/Ole
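
A sketch of the full reproduction loop with the workaround applied,
for reference. It assumes niceload (which ships with GNU Parallel) is
on PATH; the -DX debug flag from the original loop is dropped, and
--load is removed from $PARALLEL since niceload now does the
throttling:

export PARALLEL="--verbose"
echo PARALLEL=$PARALLEL

for i in $(seq 2 10); do
  i2=$[i*i]                 # number of files for this round
  echo creating $i2 files:
  # niceload suspends the pipeline whenever the load exceeds 100%,
  # so parallel itself no longer needs --load
  seq $i2 | niceload -L 100% parallel -j100 echo {}
  echo
done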
