On Tue, Apr 3, 2012 at 2:35 AM, Ole Tange <[email protected]> wrote:
> On Mon, Apr 2, 2012 at 10:59 AM, Thomas Sattler
> <[email protected]> wrote:
>>>> As you probably can imagine that is hard to reproduce. See if
>>>> you can make a smaller example fail - preferably something that
>>>> can run on smaller machines.
>>>
>>> I wrote a small script that shows the problem. It completes
>>> in less than 10 seconds on my desktop (two cores), but hangs
>>> (read: "does not complete within hours") on two other
>>> machines (8/32 cores).
>
> Great! I can reproduce this error.
>
> With the minor modification of -j100 I can even reproduce it with 90%
> certainty on my dual core:
>
> export PARALLEL="--load 100% --verbose"
> echo PARALLEL=$PARALLEL
>
> for i in $(seq 2 10); do
>   i2=$[i*i]
>   echo creating $i2 files:
>   seq $i2 | parallel -j100 -DX echo {}
>   echo
> done
>
> The problem seems to be that Parallel thinks there is a job running
> when there is not, so all jobs are executed but the exit handling of
> one of the jobs is never done. This blocks one job slot; given enough
> jobs it will clog up all job slots, and in any case it will never
> finish, as it is waiting for a job that is already done.
>
> Since the problem does not happen without --load, I thought it might
> have something to do with spawning 'uptime'. However, if I change the
> code to put "load average: 0.00, 0.00, 0.00" into the loadavg file
> (without running uptime) I can still provoke the error, so it is
> clearly not due to spawning uptime.
>
> I currently cannot see where the bug lies, but the good part is that
> I can now reproduce it.
>
> Workaround until the bug is fixed: use niceload.
>
> seq $i2 | niceload -L 100% parallel -j100% -X echo {}
I have now solved the issue. Unfortunately the current solution breaks
a lot of other stuff, but there is no need for further debugging.

/Ole
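
A sketch of the full reproduction loop with the workaround applied,
for reference. It assumes niceload (which ships with GNU Parallel) is
on PATH; the -DX debug flag from the original loop is dropped, and
--load is removed from $PARALLEL since niceload now does the
throttling:

export PARALLEL="--verbose"
echo PARALLEL=$PARALLEL

for i in $(seq 2 10); do
  i2=$[i*i]                 # number of files for this round
  echo creating $i2 files:
  # niceload suspends the pipeline whenever the load exceeds 100%,
  # so parallel itself no longer needs --load
  seq $i2 | niceload -L 100% parallel -j100 echo {}
  echo
done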
