Hi Greg, Ed, gang,

I've integrated this into our development version of gump. There were a few kinks to hammer out (for example catching OSError rather than IOError, exiting the "while not timed out" loop as early as possible, and passing no "-" to killpg), but it seems to be basically working at this point. For now I'm just using a single process group for all children and not object-orienting things as much as I could, but it does what we need.
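To give a feel for those changes, the killpg/waitpid handling now looks roughly like this (a simplified sketch with illustrative names, not the actual executor.py code):

import errno
import os
import time

def kill_group(pgrp, sig):
    # os.killpg() already takes a process group id, so no "-" here,
    # and a missing group shows up as OSError (not IOError) on Linux.
    try:
        os.killpg(pgrp, sig)
    except OSError, e:
        if e.errno != errno.ESRCH:
            raise
        return False    # nothing left in this group
    return True

def reap_group(pgrp, timeout=300):
    # reap exited children in the group, bailing out of the loop as
    # soon as there is nothing left to wait for instead of sleeping
    # away the rest of the timeout.
    end_time = time.time() + timeout
    while time.time() < end_time:
        try:
            pid, status = os.waitpid(-pgrp, os.WNOHANG)
        except OSError, e:
            if e.errno == errno.ECHILD:
                return True     # no more children in this group
            raise
        if pid == 0:
            time.sleep(1)       # still shutting down; pause and retry
    return False                # timed out with children remaining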
The code is at https://svn.apache.org/repos/asf/gump/branches/Gump3/pygump/python/gump/util/executor.py

It's kinda cool how easy it was to combine this with the new "subprocess" module in python2.4, which is kind enough to provide a hook for inserting the "set child group" functionality.

It's still a little ugly in some ways, since we want to be sure that a process group stays around even if it is mostly empty. That makes stuff like introspection a little easier. I just created a child that runs for the length of the application to visibly keep the group around. Ugly, but the only thing I could think of atm.

Thanks for your help!

Cheers,

Leo


On 21-03-2005 03:46, "Greg Stein" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Was talking with Leo here at the infrastructure gathering, and he
> mentioned that Gump was having issues cleaning up zombie processes. He
> asked me how to make that happen in Linux. The general reply is "use
> os.waitpid()" (in Python; waitpid() is the general POSIX thing).
>
> Ed Korthof and I explored this a bit more to come up with a general
> algorithm for cleaning up "everything" after a Gump run.
>
> To start with, Gump should put all fork'd children into their own process
> groups, and then remember those groups' ids. This will enable you to kill
> any grandchild process or other things that get spawned. Even if the
> process gets re-parented to the init process, you can give it the
> smackdown via the process group. Of course, if somebody else monkeys with
> process groups, you'll lose track of them. There are limits to cleanup :-)
>
> When you want to clean up, you can send every process group SIGTERM. If
> any killpg() call throws an exception with ESRCH (no processes in that
> group), then remove it from the saved list of groups. Next, you would
> start looping to wait for all processes to exit, or to reach a timer on
> that wait. You want to quickly loop on everything that exits, terminate
> the loop when there is nothing more, and then pause a second if stuff is
> still busy shutting down. If you time out and some are left, then SIGKILL
> them and go reap again. The algorithm would look like:
>
>     def clean_up_processes(pgrp_list):
>         # send SIGTERM to everything, and update pgrp_list to just those
>         # process groups which have processes in them.
>         kill_groups(pgrp_list, signal.SIGTERM)
>
>         # pass a copy of the process groups. we want to remember every
>         # group that we SIGTERM'd so that we can SIGKILL them later. it
>         # is possible that a process in the pgrp was reparented to the
>         # init process. those will be invisible to wait(), so we don't
>         # want to mistakenly think we've killed all processes in the
>         # group. thus, we preserve the list and SIGKILL it later.
>         reap_children(pgrp_list[:])
>
>         # SIGKILL everything, editing pgrp_list again.
>         kill_groups(pgrp_list, signal.SIGKILL)
>
>         # reap everything left, but don't really bother waiting on them.
>         # if we exit, then init will reap them.
>         reap_children(pgrp_list, 60)
>
>     def kill_groups(pgrp_list, sig):
>         # NOTE: this function edits pgrp_list
>
>         for pgrp in pgrp_list[:]:
>             try:
>                 os.killpg(-pgrp, sig)
>             except IOError, e:
>                 if e.errno == errno.ESRCH:
>                     pgrp_list.remove(pgrp)
>
>     def reap_children(pgrp_list, timeout=300):
>         # NOTE: this function edits pgrp_list
>
>         # keep reaping until the timeout expires, or we finish
>         end_time = time.time() + timeout
>
>         # keep reaping until all pgrps are done, or we run out of time
>         while pgrp_list and time.time() < end_time:
>             # pause for a bit while processes work on exiting. this pause is
>             # at the top, so we can also pause right after the killpg()
>             time.sleep(1)
>
>             # go through all pgrps to reap them
>             for pgrp in pgrp_list[:]:
>                 # loop quickly to clean everything in this pgrp
>                 while 1:
>                     try:
>                         pid, status = os.waitpid(-pgrp, os.WNOHANG)
>                     except IOError, e:
>                         if e.errno == errno.ECHILD:
>                             # no more children in this pgrp.
>                             pgrp_list.remove(pgrp)
>                             break
>                         raise
>                     if pid == 0:
>                         # some stuff has not exited yet, and WNOHANG avoided
>                         # blocking. go ahead and move to the next pgrp.
>                         break
>
> That should clean up everything. If stuff *still* hasn't exited, then
> there isn't much you can do. But you will have tried :-)
>
> Hope that helps! EdK and I haven't built test cases for the above, but it
> has been doubly-reviewed, so we think the algorithm/code should work.
>
> Cheers,
> -g
>
> p.s. note that we aren't on [EMAIL PROTECTED], so CC: if you reply...
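P.S. for anyone curious how the "set child group" hook fits into subprocess, the basic shape is roughly this (just a sketch with made-up names, not the actual executor.py code):

import os
import subprocess

def spawn_in_own_group(cmdline):
    # preexec_fn runs in the child between fork() and exec();
    # os.setpgrp() there makes the child the leader of a fresh
    # process group, so everything it spawns lands in that group.
    proc = subprocess.Popen(cmdline, preexec_fn=os.setpgrp)
    # the child is the group leader, so its pid doubles as the pgid
    return proc, proc.pid

e.g. proc, pgrp = spawn_in_own_group(["make", "all"]), and later os.killpg(pgrp, signal.SIGTERM) followed by the reaping loop above.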
