On Sat, Apr 11, 2015 at 12:56 AM, Martin d'Anjou <martin.danjo...@gmail.com> wrote:
> Hello Ole,
>
> I worked on the SIGTERM propagation feature today. I have questions, the
> questions are also in the code in the form of comments, if you prefer to
> read them there (search for "Question"):
> https://github.com/martinda/gnu-parallel/compare/sigterm-1?expand=1#diff-5379ba718ef5b0a2feb45981e768a9fd
>
> Q1:
> Inside sub wait_and_exit, job->kill("TERM") is called twice. As I am trying
> to update the documentation, I find this complex to explain.
> Do you know why the call is made twice?
> Should I write my own "wait_and_exit" for the SIGTERM propagation feature?

I think it is a leftover from when $job->kill() did not send 2 TERMs. The
idea is to handle programs like GNU Parallel (which need 2 TERMs to exit)
that are started from GNU Parallel.

> Q2:
> I have added a [--wait-for-children [GRACE_PERIOD]] option for the user to
> extend the grace period of $sleepsum in case the user is dealing with
> processes that are long to "put to rest".
> My question: should this option be available in general, or just for the
> propagation feature?

Do we really need an option for this? I would like to see at least 2
real-life scenarios where this makes sense and for which a hard-coded value
will not work.

> Q3:
> Still in the wait_and_exit subroutine, the grace period is "ANDed" with the
> family_pids[0].
> Why just the 0'th element? Why not the entire array?

You mean in sub Job::kill():

    # Wait up to 200 ms between TERMs - but only if any pids are alive
    my $sleep = 1;
    for (my $sleepsum = 0; kill 0, $family_pids[0] and $sleepsum < 200;
         $sleepsum += $sleep) {
        $sleep = ::reap_usleep($sleep);
    }

'kill 0, pid' returns true if the process is still running.
$family_pids[0] is the immediate child (i.e. the parent of any
(grand)*children). There is no need to see if any (grand)*children are
running: it is the job of $family_pids[0] to kill those.

The for loop runs up to 200 ms, but if the pid dies earlier, then the loop
exits.

But maybe this should be revised:

When a job times out (--timeout) we want to kill it. It is OK to give it
200 - 1000 ms to clean up, so 'kill TERM', wait, 'kill TERM', wait,
'kill KILL'.

When GNU Parallel receives 2 TERMs, it should for all jobs 'kill TERM',
wait, 'kill TERM', wait, 'kill KILL'.

The wait should always be an upper limit: Do not wait a full second if the
job finishes faster.

I am not sure whether GNU Parallel should also kill the (grand*)children,
and if so how that should be done to work well for most cases.

Maybe:

  'kill TERM', wait, 'kill TERM', wait, 'kill KILL',
  'kill KILL @grandchildren_pid'

This way the parent is given a chance to clean up, but if it does not
manage, then GNU Parallel does the cleaning.

It would be good to have testcases for this kind of scenario.

> Q4:
> My other questions are about how to integrate my test with the existing
> suite of tests and how to run them all.

Let's look at that when the rest is in place.

/Ole
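
To make the proposed sequence concrete ('kill TERM', wait, 'kill TERM',
wait, 'kill KILL', then 'kill KILL' on the descendants), here is a rough
standalone sketch. It is not GNU Parallel's code: terminate(),
wait_for_exit(), the 10 ms poll step and the @descendants list are names
and values made up for illustration, and finding the (grand)*children pids
is assumed to happen elsewhere. Each wait is only an upper limit: the loop
returns as soon as the child has been reaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Time::HiRes qw(usleep);

# Hypothetical helper: poll for up to $limit_ms until child $pid has exited.
# Returns 1 as soon as the child is gone (reaped or not our child), else 0.
sub wait_for_exit {
    my ($pid, $limit_ms) = @_;
    my $step_ms = 10;    # assumed poll interval
    for (my $waited = 0; $waited < $limit_ms; $waited += $step_ms) {
        # waitpid returns 0 only while the child is still running
        return 1 if waitpid($pid, WNOHANG) != 0;
        usleep($step_ms * 1000);
    }
    return waitpid($pid, WNOHANG) != 0 ? 1 : 0;
}

# Hypothetical helper: TERM, wait, TERM, wait, KILL, then KILL descendants.
# Returns the name of the last signal sent to $pid.
sub terminate {
    my ($pid, $limit_ms, @descendants) = @_;
    my $last;
    for my $sig (qw(TERM TERM KILL)) {
        $last = $sig;
        kill $sig, $pid;
        last if wait_for_exit($pid, $limit_ms);
    }
    # Clean up whatever the parent did not manage to kill itself.
    kill 'KILL', @descendants if @descendants;
    return $last;
}

# Demo: a child with default TERM handling that would otherwise sleep 60 s.
defined(my $pid = fork()) or die "fork failed: $!";
if ($pid == 0) {
    sleep 60;
    exit 0;
}
my $sig = terminate($pid, 200);
print "child $pid gone after SIG$sig\n";
```

A child that ignores TERM would go through both waits in full and only
disappear after the final KILL, so the worst case here is roughly three
times the limit per job.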