Re: Revision of GNU Parallel's processing of SIGTERM

Ole Tange Sun, 12 Apr 2015 04:15:38 -0700

On Sat, Apr 11, 2015 at 12:56 AM, Martin d'Anjou
<[email protected]> wrote:
> Hello Ole,
>
> I worked on the SIGTERM propagation feature today. I have questions, the
> questions are also in the code in the form of comments, if you prefer to
> read them there (search for "Question"):
> https://github.com/martinda/gnu-parallel/compare/sigterm-1?expand=1#diff-5379ba718ef5b0a2feb45981e768a9fd
>
> Q1:
> Inside sub wait_and_exit, job->kill("TERM") is called twice. As I am trying
> to update the documentation, I find this complex to explain.
> Do you know why the call is made twice?
> Should I write my own "wait_and_exit" for the SIGTERM propagation feature?


I think it is a leftover from when $job->kill() did not send 2 TERMs.

The idea behind sending 2 TERMs is to cover programs like GNU Parallel
(which needs 2 TERMs to exit) when they are started from GNU Parallel.
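
To illustrate why a single TERM is not enough for such a program, here
is a minimal sketch of a process that only stops on its second TERM.
The names ($state, handle_term) and the three states are made up for
this example; it is not GNU Parallel's actual code, and what a real
program does while 'draining' is left open.

```perl
use strict;
use warnings;

# Hypothetical state machine: 'running' -> 'draining' -> 'stopping'.
our $state = 'running';

# First TERM: stop taking on new work. Second TERM: stop for real.
# The handler only records the request; the main loop is assumed to
# check $state and act on it.
sub handle_term {
    $state = $state eq 'running' ? 'draining' : 'stopping';
    return $state;
}

$SIG{TERM} = \&handle_term;
```

A parent that wants this kind of child to go away quickly therefore has
to send TERM twice.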

> Q2:
> I have added a [--wait-for-children [GRACE_PERIOD]] option for the user to
> extend the grace period of $sleepsum in case the user is dealing with
> processes that are long to "put to rest".
> My question: should this option be available in general, or just for the
> propagation feature?

Do we really need an option for this? I would like to see at least 2
real life scenarios, where this makes sense and for which a hard coded
value will not work.

> Q3:
> Still in the wait_and_exit subroutine, the grace period is "ANDed" with the
> family_pids[0].
> Why just the 0'th element? Why not the entire array?

You mean in sub Job::kill():

            # Wait up to 200 ms between TERMs - but only if any pids are alive
            my $sleep = 1;
            for (my $sleepsum = 0; kill 0, $family_pids[0] and $sleepsum < 200;
                 $sleepsum += $sleep) {
                $sleep = ::reap_usleep($sleep);
            }

'kill 0, pid' returns true if the process is still running.
$family_pids[0] is the immediate child (i.e. the parent of any
(grand)*children).
There is no need to see if any (grand)*children are running: it is the
job of $family_pids[0] to kill those.
The for loop runs up to 200 ms, but if the pid dies earlier, then the
loop exits.
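
The same polling idea as a standalone sketch, in case it helps when
writing tests. wait_while_alive is a hypothetical name, the 20 ms cap
is an arbitrary choice, and waitpid with WNOHANG is assumed here to
stand in for the reaping that ::reap_usleep takes care of. It only
makes sense for a direct child of the calling process.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Time::HiRes qw(usleep);

# Hypothetical helper (not part of GNU Parallel): poll a direct child
# until it is gone or until roughly $max_ms milliseconds have been slept.
# Returns the nominal number of milliseconds slept.
sub wait_while_alive {
    my ($pid, $max_ms) = @_;
    my $step_ms  = 1;
    my $slept_ms = 0;
    while ($slept_ms < $max_ms) {
        # Reap the child if it has exited: 'kill 0' still succeeds on an
        # unreaped zombie, so without this the loop would run to the limit.
        waitpid($pid, WNOHANG);
        last unless kill 0, $pid;
        usleep($step_ms * 1000);
        $slept_ms += $step_ms;
        # Gentle backoff, capped at 20 ms per step
        $step_ms = $step_ms * 2 > 20 ? 20 : $step_ms * 2;
    }
    return $slept_ms;
}
```

For a child that exits right away the function returns almost
immediately; for one that keeps running it returns once the limit is
reached.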

But maybe this should be revised:

When a job times out (--timeout) we want to kill it. It is OK to give
it 200 - 1000 ms to clean up, so 'kill TERM', wait, 'kill TERM', wait,
'kill KILL'.
When GNU Parallel receives 2 TERMs, it should for all jobs 'kill
TERM', wait, 'kill TERM', wait, 'kill KILL'.
The wait should always be an upper limit: Do not wait a full second,
if the job finishes faster.
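
A rough sketch of that sequence, assuming the job is a direct child of
the calling process. gone_within and terminate are hypothetical names
and the grace period is whatever the caller passes in; this is not how
Job::kill() is written today.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Time::HiRes qw(usleep);

# Hypothetical helper: true as soon as the direct child $pid has been
# reaped, false if it is still running after about $max_ms milliseconds.
sub gone_within {
    my ($pid, $max_ms) = @_;
    for (my $slept_ms = 0; ; $slept_ms++) {
        # waitpid returns 0 only while the child is still running
        return 1 if waitpid($pid, WNOHANG) != 0;
        return 0 if $slept_ms >= $max_ms;
        usleep(1000);
    }
}

# Send TERM, TERM, KILL, stopping as soon as the child is gone.
# The wait after each signal is an upper limit, not a fixed delay.
# Returns the list of signals that were actually sent.
sub terminate {
    my ($pid, $grace_ms) = @_;
    my @sent;
    for my $sig (qw(TERM TERM KILL)) {
        kill $sig, $pid;
        push @sent, $sig;
        last if gone_within($pid, $grace_ms);
    }
    return @sent;
}
```

A job that dies on the first TERM only ever sees that one signal; a job
that ignores TERM gets all three.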

I am not sure whether GNU Parallel should also kill the
(grand)*children, and if so how that should be done to work well for
most cases. Maybe:

'kill TERM', wait, 'kill TERM', wait, 'kill KILL', 'kill KILL @grandchildren_pid'

This way the parent is given a chance to clean up, but if it did not
manage, then GNU Parallel does the cleaning. It would be good to have
test cases for this kind of scenario.
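
One detail to keep in mind: the list of (grand)*children has to be
collected while the parent is still alive, because orphans are
re-parented once it dies and can no longer be found through their
ppid. A sketch of the last two steps, assuming a POSIX-style ps;
ppid_table, descendants and kill_family are hypothetical names, and
pid reuse between the snapshot and the kill is ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: read a pid => ppid table from ps.
# Assumes a POSIX-style ps that accepts -e and -o.
sub ppid_table {
    my %ppid_of;
    for my $line (`ps -e -o pid= -o ppid=`) {
        my ($pid, $ppid) = $line =~ /(\d+)\s+(\d+)/ or next;
        $ppid_of{$pid} = $ppid;
    }
    return \%ppid_of;
}

# Return all (grand)*children of $root, breadth first.
sub descendants {
    my ($root, $ppid_of) = @_;
    my %kids_of;
    push @{ $kids_of{ $ppid_of->{$_} } }, $_ for keys %$ppid_of;
    my (@found, %seen);
    my @queue = ($root);
    $seen{$root} = 1;
    while (@queue) {
        my $parent = shift @queue;
        for my $kid (sort { $a <=> $b } @{ $kids_of{$parent} || [] }) {
            next if $seen{$kid}++;
            push @found, $kid;
            push @queue, $kid;
        }
    }
    return @found;
}

# Snapshot the family first, then KILL the parent, then whatever
# (grand)*children were seen in the snapshot.
sub kill_family {
    my ($pid) = @_;
    my @grandchildren = descendants($pid, ppid_table());
    kill 'KILL', $pid;
    kill 'KILL', @grandchildren if @grandchildren;
}
```

In the full sequence the snapshot would probably be taken before the
first TERM, so that children the parent fails to clean up are still
known at the end.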

> Q4:
> My other questions are about how to integrate my test with the existing
> suite of tests and how to run them all.

Let's look at that when the rest is in place.


/Ole
