Hi Greg, Ed, gang,

I've integrated this into our development version of gump. There were a few kinks to hammer out (for example catching OSError rather than IOError, exiting the "while not timed out" loop as early as possible, and passing no "-" to killpg), but it seems to be basically working at this point. For now I'm just using a single process group for all children and not object-orienting things as much as I could, but it does what we need.
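To give a feel for those changes, the killpg/waitpid handling now looks roughly like this (a simplified sketch with illustrative names, not the actual executor.py code):

import errno
import os
import time

def kill_group(pgrp, sig):
    # os.killpg() already takes a process group id, so no "-" here,
    # and a missing group shows up as OSError (not IOError) on Linux.
    try:
        os.killpg(pgrp, sig)
    except OSError, e:
        if e.errno != errno.ESRCH:
            raise
        return False    # nothing left in this group
    return True

def reap_group(pgrp, timeout=300):
    # reap exited children in the group, bailing out of the loop as
    # soon as there is nothing left to wait for instead of sleeping
    # away the rest of the timeout.
    end_time = time.time() + timeout
    while time.time() < end_time:
        try:
            pid, status = os.waitpid(-pgrp, os.WNOHANG)
        except OSError, e:
            if e.errno == errno.ECHILD:
                return True     # no more children in this group
            raise
        if pid == 0:
            time.sleep(1)       # still shutting down; pause and retry
    return False                # timed out with children remaining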
The code is at https://svn.apache.org/repos/asf/gump/branches/Gump3/pygump/python/gump/util/executor.py

It's kinda cool how easy it was to combine this with the new "subprocess" module in python2.4, which is kind enough to provide a hook for inserting the "set child group" functionality.

It's still a little ugly in some ways, since we want to be sure that a process group stays around even if it is mostly empty. That makes stuff like introspection a little easier. I just created a child that runs for the length of the application to visibly keep the group around. Ugly, but the only thing I could think of atm.

Thanks for your help!

Cheers,

Leo


On 21-03-2005 03:46, "Greg Stein" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Was talking with Leo here at the infrastructure gathering, and he
> mentioned that Gump was having issues cleaning up zombie processes. He
> asked me how to make that happen in Linux. The general reply is "use
> os.waitpid()" (in Python; waitpid() is the general POSIX thing).
>
> Ed Korthof and I explored this a bit more to come up with a general
> algorithm for cleaning up "everything" after a Gump run.
>
> To start with, Gump should put all fork'd children into their own process
> groups, and then remember those groups' ids. This will enable you to kill
> any grandchild process or other things that get spawned. Even if the
> process gets re-parented to the init process, you can give it the
> smackdown via the process group. Of course, if somebody else monkeys with
> process groups, you'll lose track of them. There are limits to cleanup :-)
>
> When you want to clean up, you can send every process group SIGTERM. If
> any killpg() call throws an exception with ESRCH (no processes in that
> group), then remove it from the saved list of groups. Next, you would
> start looping to wait for all processes to exit, or to reach a timer on
> that wait. You want to quickly loop on everything that exits, terminate
> the loop when there is nothing more, and then pause a second if stuff is
> still busy shutting down. If you time out and some are left, then SIGKILL
> them and go reap again. The algorithm would look like:
>
>     def clean_up_processes(pgrp_list):
>         # send SIGTERM to everything, and update pgrp_list to just those
>         # process groups which have processes in them.
>         kill_groups(pgrp_list, signal.SIGTERM)
>
>         # pass a copy of the process groups. we want to remember every
>         # group that we SIGTERM'd so that we can SIGKILL them later. it
>         # is possible that a process in the pgrp was reparented to the
>         # init process. those will be invisible to wait(), so we don't
>         # want to mistakenly think we've killed all processes in the
>         # group. thus, we preserve the list and SIGKILL it later.
>         reap_children(pgrp_list[:])
>
>         # SIGKILL everything, editing pgrp_list again.
>         kill_groups(pgrp_list, signal.SIGKILL)
>
>         # reap everything left, but don't really bother waiting on them.
>         # if we exit, then init will reap them.
>         reap_children(pgrp_list, 60)
>
>     def kill_groups(pgrp_list, sig):
>         # NOTE: this function edits pgrp_list
>
>         for pgrp in pgrp_list[:]:
>             try:
>                 os.killpg(-pgrp, sig)
>             except IOError, e:
>                 if e.errno == errno.ESRCH:
>                     pgrp_list.remove(pgrp)
>
>     def reap_children(pgrp_list, timeout=300):
>         # NOTE: this function edits pgrp_list
>
>         # keep reaping until the timeout expires, or we finish
>         end_time = time.time() + timeout
>
>         # keep reaping until all pgrps are done, or we run out of time
>         while pgrp_list and time.time() < end_time:
>             # pause for a bit while processes work on exiting. this pause is
>             # at the top, so we can also pause right after the killpg()
>             time.sleep(1)
>
>             # go through all pgrps to reap them
>             for pgrp in pgrp_list[:]:
>                 # loop quickly to clean everything in this pgrp
>                 while 1:
>                     try:
>                         pid, status = os.waitpid(-pgrp, os.WNOHANG)
>                     except IOError, e:
>                         if e.errno == errno.ECHILD:
>                             # no more children in this pgrp.
>                             pgrp_list.remove(pgrp)
>                             break
>                         raise
>                     if pid == 0:
>                         # some stuff has not exited yet, and WNOHANG avoided
>                         # blocking. go ahead and move to the next pgrp.
>                         break
>
> That should clean up everything. If stuff *still* hasn't exited, then
> there isn't much you can do. But you will have tried :-)
>
> Hope that helps! EdK and I haven't built test cases for the above, but it
> has been doubly-reviewed, so we think the algorithm/code should work.
>
> Cheers,
> -g
>
> p.s. note that we aren't on [EMAIL PROTECTED], so CC: if you reply...
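P.S. for anyone curious how the "set child group" hook fits into subprocess, the basic shape is roughly this (just a sketch with made-up names, not the actual executor.py code):

import os
import subprocess

def spawn_in_own_group(cmdline):
    # preexec_fn runs in the child between fork() and exec();
    # os.setpgrp() there makes the child the leader of a fresh
    # process group, so everything it spawns lands in that group.
    proc = subprocess.Popen(cmdline, preexec_fn=os.setpgrp)
    # the child is the group leader, so its pid doubles as the pgid
    return proc, proc.pid

e.g. proc, pgrp = spawn_in_own_group(["make", "all"]), and later os.killpg(pgrp, signal.SIGTERM) followed by the reaping loop above.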
