after patch, it killed child processes but kept mpirun ... itself. before that patch - all processes were killed (and you are right, "mpirun died right at the end of the timeout" was reported) but at least it left the cluster in the clean state w/o leftovers. now many "orphan" launchers are alive from previous invocations.
On Mon, Jun 23, 2014 at 2:18 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com > wrote: > There was actually quite a bit of testing before this was committed. This > commit resolved a lot of hangs across multiple organizations. > > Can you be more specific as to what is happening? > > The prior code was killing child processes before mpirun itself, for > example, which has led MTT to erroneously report that mpirun died right at > the end of the timeout without being killed. This has been ongoing for > many months, at a minimum. > > > > > On Jun 23, 2014, at 4:37 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: > > > this commit does more harm then good. > > we experience following: > > > > - some child processes still running after timeout and mtt killed the > job. > > > > before this commit - it worked fine. > > please revert and test more. > > > > > > > > On Sat, Jun 21, 2014 at 3:30 PM, MPI Team <mpit...@crest.iu.edu> wrote: > > The branch, master has been updated > > via 016088f2a0831b32ab5fd6f60f4cabe67e92e594 (commit) > > via 7fb4c6a4c9d71be127ea53bd463178510577f71f (commit) > > via 381ba177d835a54c3197d846f5a4edfc314efe27 (commit) > > via cfdd29de2012eeb7592706f00dd07a52dd48cf6b (commit) > > via 940030ca20eb1eaf256e898b83866c1cb83aca5c (commit) > > from c99ed7c7b159a2cab58a251bd7c0dad8972ff901 (commit) > > > > Those revisions listed above that are new to this repository have > > not appeared on any other notification email; so we list those > > revisions in full, below. > > > > - Log ----------------------------------------------------------------- > > > https://github.com/open-mpi/mtt/commit/016088f2a0831b32ab5fd6f60f4cabe67e92e594 > > > > commit 016088f2a0831b32ab5fd6f60f4cabe67e92e594 > > Author: Jeff Squyres <jsquy...@cisco.com> > > Date: Sat Jun 21 04:58:45 2014 -0700 > > > > DoCommand: several fixes to kill_proc logic > > > > 1. Fix the kill(0, $pid) test to see if the process was still alive. > > > > 2. Rename _kill_proc() to _kill_proc_tree() to indicate that it's > > really killing not only the PID in question, but also all of its > > descendants. > > > > 3. In _kill_proc_tree(), change the order to kill the main PID first, > > and ''then'' kill all the descendants. > > > > The main use case is when killing mpirun: if we kill mpirun's > > descendants ''first'', mpirun will detect its childrens' deaths and > > then cleanup and exit. Later, when MTT finally gets around to > killing > > mpirun, MTT will detect that mpirun is already dead and therefore > emit > > a confusing "mpirun died right at end of timeout" message. This is > > misleading at best; it doesn't indicate what actually happened. > > > > However, if we kill mpirun first, it will take care of killing all of > > its descendants. MTT will therefore emit the right messages about > > killing mpirun. MTT will then redundantly try to kill a bunch of > > now-nonexistent descendant processes of mpirun, but that's ok/safe. > > We actually ''want'' this try-to-kill-mpirun's-descendants behavior > to > > handle the case when mpirun is misbehaving / not cleaning up its > > descendants. > > > > 4. DoCommand() is used for more than launching mpirun, so pass down > > $argv0 so that we can print the actual command name that is being > > killed in various Verbose/Debug messages, not the hard-coded "mpirun" > > string (which, in practice, was probably almost always correct, but > > still...). > > --- > > lib/MTT/DoCommand.pm | 78 > ++++++++++++++++++++++++++++++++++++---------------- > > 1 file changed, 55 insertions(+), 23 deletions(-) > > > > diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm > > index 02cdb94..646ca31 100644 > > --- a/lib/MTT/DoCommand.pm > > +++ b/lib/MTT/DoCommand.pm > > @@ -2,7 +2,7 @@ > > # > > # Copyright (c) 2005-2006 The Trustees of Indiana University. > > # All rights reserved. > > -# Copyright (c) 2006-2013 Cisco Systems, Inc. All rights reserved. > > +# Copyright (c) 2006-2014 Cisco Systems, Inc. All rights reserved. > > # Copyright (c) 2007-2008 Sun Microsystems, Inc. All rights reserved. > > # Copyright (c) 2007-2012 High Performance Computing Center Stuttgart, > > # University of Stuttgart. All rights reserved. > > @@ -57,23 +57,27 @@ sub DoCommand { > > ($time_arg, $no_execute) = @_; > > } > > > > +# This function is called for killing both mpirun and each of its > > +# descendants. We really only want to see verbose output for when we > > +# kill mpirun itself, so only show output when the caller provides us > > +# with a $argv0 value. > > sub _kill_proc_one { > > - my ($pid) = @_; > > + my ($pid, $argv0) = @_; > > > > # How long to wait after each kill() > > my $wait_time = 5; > > > > # See if the proc is alive first > > - my $kid; > > - kill(0, $pid); > > - $kid = waitpid($pid, WNOHANG); > > - return "mpirun died right at end of timeout (MTT did not have to > kill it)" > > - if (-1 == $kid); > > + my $num_alive = kill(0, $pid); > > + return "$argv0 died right at end of timeout (MTT did not have to > kill it)" > > + if (0 == $num_alive); > > > > # Try an easy kill > > kill("TERM", $pid); > > - Verbose("*** Killing mpirun with SIGTERM\n"); > > + Verbose("*** Killing $argv0 with SIGTERM\n") > > + if (defined($argv0)); > > # Give mpirun some time to cleanup before we try to reap it. > > + my $kid; > > my $i = $wait_time; > > while ($i > 0) { > > sleep(1); > > @@ -85,45 +89,74 @@ sub _kill_proc_one { > > # process no longer exists (i.e., we get -1 back from > > # waitpid), or we successfully killed it (i.e., we got the PID > > # back from waitpid). > > - return "MTT killed mpirun via SIGTERM" if (0 != $kid); > > + return "MTT killed $argv0 via SIGTERM" if (0 != $kid); > > > > --$i; > > } > > - Verbose("** Kill TERM (after $wait_time seconds) didn't work!\n"); > > + Verbose("** Kill TERM (after $wait_time seconds) didn't work!\n") > > + if (defined($argv0)); > > > > # That didn't work either. Try SIGINT; > > - Verbose("*** Killing mpirun with SIGINT\n"); > > + Verbose("*** Killing $argv0 with SIGINT\n") > > + if (defined($argv0)); > > kill("INT", $pid); > > my $i = $wait_time; > > while ($i > 0) { > > sleep(1); > > $kid = waitpid($pid, WNOHANG); > > - return "MTT killed mpirun via SIGINT" if (0 != $kid); > > + return "MTT killed $argv0 via SIGINT" if (0 != $kid); > > --$i; > > } > > - Verbose("** Kill INT (after $wait_time seconds) didn't work!\n"); > > + Verbose("** Kill INT (after $wait_time seconds) didn't work!\n") > > + if (defined($argv0)); > > > > # Ok, now we're mad. Be violent. > > - Verbose("*** Now I'm mad. Killing mpirun with SIGKILL\n"); > > + Verbose("*** Now I'm mad. Killing $argv0 with SIGKILL\n") > > + if (defined($argv0)); > > my $count = 0; > > while (1) { > > kill("KILL", $pid); > > ++$count; > > $kid = waitpid($pid, WNOHANG); > > - return "MTT killed mpirun via $count SIGKILLs" if (0 != $kid); > > - Verbose("** Kill KILL didn't work! Sleeping and trying > again...\n"); > > + return "MTT killed $argv0 via $count SIGKILLs" if (0 != $kid); > > + Verbose("** Kill KILL didn't work! Sleeping and trying > again...\n") > > + if (defined($argv0)); > > sleep(1); > > } > > } > > > > > > -sub _kill_proc { > > - my ($pid) = @_; > > - # kill the group, take the names later > > - foreach my $process (descendant_processes($pid)) { > > +sub _kill_proc_tree { > > + my ($pid, $argv0) = @_; > > + > > + # Find all descendent processes of the main PID > > + my @children = descendant_processes($pid); > > + > > + # Kill the main PID. Note that _kill_proc_one() will give the > > + # process time to react/cleanup, so there's no need for an > > + # additional delay after it returns. > > + my $ret = _kill_proc_one($pid, $argv0); > > + > > + # Now kill all the processes that descended from the base PID. > > + # > > + # Note: when mpirun is working properly (which is one of the > > + # biggest use cases for DoCommand), this is redundant -- all the > > + # children will already be dead (because killing mpirun will > > + # ensure that all descendant processes are also killed). > > + # > > + # That being said, a) DoCommand() is used to launch more than just > > + # mpirun, b) mpirun breaks sometimes and doesn't clean up its > > + # descendants, and c) it's safe to call _kill_proc_one() on a PID > > + # that is already dead. > > + foreach my $process (@children) { > > + Debug("=== killing child proc: $process->{pid}, > $process->{argv0}\n"); > > + > > + # Ignore the return. For example, we don't care if the child > > + # is already dead > > _kill_proc_one($process->{pid}); > > } > > - return _kill_proc_one($pid); > > + > > + return $ret; > > } > > > > > #-------------------------------------------------------------------------- > > @@ -338,7 +371,6 @@ sub Cmd { > > > > my $tokens = _quote_escape($cmd); > > > > - > > my $pid; > > if (! $no_execute) { > > > > @@ -568,7 +600,7 @@ sub Cmd { > > > > $done = 0; > > } > > - $timeout_message = _kill_proc($pid); > > + $timeout_message = _kill_proc_tree($pid, ${$tokens}[0]); > > > > # We don't care about the exit status if we timed out > > # -- fill it with a bogus value. > > > > > https://github.com/open-mpi/mtt/commit/7fb4c6a4c9d71be127ea53bd463178510577f71f > > > > commit 7fb4c6a4c9d71be127ea53bd463178510577f71f > > Author: Jeff Squyres <jsquy...@cisco.com> > > Date: Sat Jun 21 04:42:19 2014 -0700 > > > > DoCommand: replace obscure perl code with easier-to-grok version > > > > The descendant_processes() and flatten() codes were written in a > > nicely efficient yet very difficult to understand manner. This > commit > > replaces these two subs with admittedly less efficient but much > easier > > to understand code. > > > > The new code also saves the first token of the command for each PID > so > > that it can be used in Debug/Verbose statements (e.g., identify the > > command that is being killed, etc.). > > --- > > lib/MTT/DoCommand.pm | 63 > ++++++++++++++++++++++++++++++++++++++++------------ > > 1 file changed, 49 insertions(+), 14 deletions(-) > > > > diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm > > index c3be8a8..02cdb94 100644 > > --- a/lib/MTT/DoCommand.pm > > +++ b/lib/MTT/DoCommand.pm > > @@ -120,8 +120,8 @@ sub _kill_proc_one { > > sub _kill_proc { > > my ($pid) = @_; > > # kill the group, take the names later > > - foreach my $p (descendant_processes($pid)) { > > - _kill_proc_one($p); > > + foreach my $process (descendant_processes($pid)) { > > + _kill_proc_one($process->{pid}); > > } > > return _kill_proc_one($pid); > > } > > @@ -527,7 +527,8 @@ sub Cmd { > > if (defined($timeout_backtrace_program) and > !$got_backtrace) { > > $backtrace = ""; > > if ( $timeout_before_backtrace_program ) { > > - foreach my $p (descendant_processes($pid)) { > > + foreach my $process > (descendant_processes($pid)) { > > + my $p = $process->{pid}; > > my $c = $timeout_before_backtrace_program; > > $c =~ s/%PID%/$p/g; > > Debug("_pre_backtrace cmd: $c\n"); > > @@ -538,7 +539,8 @@ sub Cmd { > > $backtrace .= > _get_backtrace($timeout_backtrace_program, $pid, $pre_pernode); > > > > if ( $timeout_after_backtrace_program ) { > > - foreach my $p (descendant_processes($pid)) { > > + foreach my $process > (descendant_processes($pid)) { > > + my $p = $process->{pid}; > > my $c = $timeout_after_backtrace_program; > > $c =~ s/%PID%/$p/g; > > Debug("_post_backtrace cmd: $c\n"); > > @@ -738,19 +740,50 @@ sub _do_email_timeout_notification { > > close(TIMEOUT_SENTINEL_FILE); > > } > > > > -sub flatten { > > - map{ (ref($_) eq "ARRAY") ? map{flatten($_)}@$_ : $_ } @_; > > +sub find_children { > > + my ($pids, $start) = @_; > > + > > + my @ret; > > + # Scan the PID list and see if any other PIDs list $start as their > > + # PPID. If so, save that PID and then recursively look for *that* > > + # PID's children. > > + foreach my $pid (sort(keys(%{$pids}))) { > > + if ($pids->{$pid}->{ppid} == $start) { > > + push(@ret, { > > + pid => $pid, > > + argv0 => $pids->{$pid}->{argv0}, > > + }); > > + > > + # Add this PID's descendants to the return array > > + push(@ret, find_children($pids, $pid)); > > + } > > + } > > + > > + return @ret; > > } > > > > sub descendant_processes { > > my ($base) = (@_, $$); > > - my %parentage = (`ps -eo pid,ppid` =~ /\d+/g); > > - my %reverse = map { ($_, [$_]) } %parentage; > > - while (my ($pid,$ppid) = each %parentage){ > > - push @{$reverse{$ppid}}, $reverse{$pid}; > > + open(PS, "ps -eo pid,ppid,cmd|") || die "Can't open ps"; > > + > > + # Skip the title line > > + <PS>; > > + > > + # Read all the data lines > > + my $pids; > > + while (<PS>) { > > + # Grab the PID, PPID, and first token of the command > > + $_ =~ m/(\d+)\s+(\d+)\s+(\S+)/; > > + $pids->{$1} = { > > + pid => $1, > > + ppid => $2, > > + argv0 => $3, > > + }; > > } > > - shift @{$reverse{$base}}; > > - flatten($reverse{$base}); > > + close(PS); > > + > > + # Find all the descendants of the $base PID > > + return find_children($pids, $base); > > } > > > > sub _get_backtrace { > > @@ -774,7 +807,8 @@ sub _get_backtrace { > > > > # Use ps -Af output to fetch the child pids, > > # and grab a stack trace from each one > > - foreach my $p (descendant_processes($pid)) { > > + foreach my $process (descendant_processes($pid)) { > > + my $p = $process->{pid}; > > $gdb_cmd = "gdb - $p --command=$gdb_command_filename > --batch"; > > $ret .= "\n $gdb_cmd"; > > $ret .= "\n" . `$gdb_cmd`; > > @@ -814,8 +848,9 @@ sub _get_backtrace { > > } > > } > > Debug("Stacktrace: base name $return_basename\n"); > > - #foreach my $p (descendant_processes($pid)) > > + #foreach my $process (descendant_processes($pid)) > > #{ > > + # my $p = $process->{pid}; > > # my $gstack_cmd = "gstack $p"; > > # $ret .= "\n $gstack_cmd"; > > # $ret .= "\n" . `$gstack_cmd`; > > > > > https://github.com/open-mpi/mtt/commit/381ba177d835a54c3197d846f5a4edfc314efe27 > > > > commit 381ba177d835a54c3197d846f5a4edfc314efe27 > > Author: Jeff Squyres <jsquy...@cisco.com> > > Date: Sat Jun 21 03:48:55 2014 -0700 > > > > DoCommand: ensure to only kill the process once > > > > When the DoCommand timeout fires, MTT will kill the entire process > > tree of the launched command. However, the stdout/stderr sockets may > > still be draining, and therefore the main DoCommand loop may still > > execute more iterations. In cases like these, ensure that we don't > > try to backtrace/kill the process tree again. > > > > Also add a few more Debug statements for detailed output. > > --- > > lib/MTT/DoCommand.pm | 23 ++++++++++++++++------- > > 1 file changed, 16 insertions(+), 7 deletions(-) > > > > diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm > > index 0e253f0..c3be8a8 100644 > > --- a/lib/MTT/DoCommand.pm > > +++ b/lib/MTT/DoCommand.pm > > @@ -471,6 +471,7 @@ sub Cmd { > > my $len = sysread(OUTread, $data, 99999); > > if (0 == $len) { > > vec($rin, fileno(OUTread), 1) = 0; > > + Debug("*** Child process stdout closed\n"); > > --$done; > > } else { > > _append($data, $print_timestamp ? localtime() : "", > > @@ -485,6 +486,7 @@ sub Cmd { > > my $len = sysread(ERRread, $data, 99999); > > if (0 == $len) { > > vec($rin, fileno(ERRread), 1) = 0; > > + Debug("*** Child process stderr closed\n"); > > --$done; > > } else { > > _append($data, $print_timestamp ? localtime() : "", > > @@ -493,11 +495,19 @@ sub Cmd { > > } > > } > > > > - # If we're running with a timeout, bail if we're past the end > > - # time > > - if (defined($end_time) && time() > $end_time) { > > + # If we're running with a timeout, check to see if a) the > > + # process is still running (i.e., stdout/stderr is still > > + # open), and b) we're past the end time. > > + if ($done > 0 && defined($end_time) && time() > $end_time) { > > my $over = time() - $end_time; > > - if ($over > $last_over) { > > + > > + # Note that we only want to backtrace/kill the process > > + # *once*. Consider: it is possible that we kill the > > + # process tree, but then stdout/stderr are still draining, > > + # and therefore we loop around here again *even though the > > + # process tree is already dead*. So put a little > > + # do-this-only-once protection in here. > > + if (!defined($timeout_message) && $over > $last_over) { > > Verbose("*** Past timeout of $timeout seconds by $over > seconds\n"); > > > > # Handle timeout file > > @@ -555,10 +565,8 @@ sub Cmd { > > ); > > > > $done = 0; > > - $timeout_message = _kill_proc($pid); > > - } else { > > - $timeout_message = _kill_proc($pid); > > } > > + $timeout_message = _kill_proc($pid); > > > > # We don't care about the exit status if we timed out > > # -- fill it with a bogus value. > > @@ -576,6 +584,7 @@ sub Cmd { > > } > > } > > } > > + Debug("*** Child process now dead\n"); > > close OUTerr; > > close OUTread > > if (!$merge_output); > > > > > https://github.com/open-mpi/mtt/commit/cfdd29de2012eeb7592706f00dd07a52dd48cf6b > > > > commit cfdd29de2012eeb7592706f00dd07a52dd48cf6b > > Author: Jeff Squyres <jsquy...@cisco.com> > > Date: Sat Jun 21 03:36:08 2014 -0700 > > > > DoCommand: remove unused variables > > --- > > lib/MTT/DoCommand.pm | 2 -- > > 1 file changed, 2 deletions(-) > > > > diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm > > index 0740b28..0e253f0 100644 > > --- a/lib/MTT/DoCommand.pm > > +++ b/lib/MTT/DoCommand.pm > > @@ -445,7 +445,6 @@ sub Cmd { > > $end_time = time() + $timeout; > > Debug("Timeout: $timeout - $end_time (vs. now: " . time() . > ")\n"); > > } > > - my $killed_status = undef; > > my $last_over = 0; > > my $partial_out; > > my $partial_err; > > @@ -900,7 +899,6 @@ sub Win_Cmd { > > $end_time = time() + $timeout; > > Debug("Timeout: $timeout - $end_time (vs. now: " . time() . > ")\n"); > > } > > - my $killed_status = undef; > > my $last_over = 0; > > > > while(<OUTread>) { > > > > > https://github.com/open-mpi/mtt/commit/940030ca20eb1eaf256e898b83866c1cb83aca5c > > > > commit 940030ca20eb1eaf256e898b83866c1cb83aca5c > > Author: Jeff Squyres <jsquy...@cisco.com> > > Date: Fri Jun 20 17:13:08 2014 -0700 > > > > Diskfree.pm: fix df handle leak > > > > This left a zombied "df" process in the process table. The Perl > > runtime eventually cleaned it up, but it was weird to see a zombied > df > > in ps listings. > > > > There's no point in reporting this bug upstream; according to the > > Changes file, v0.06 was published in 1998. There's been no activity > > on http://search.cpan.org/dist/Filesys-DiskFree/DiskFree.pm since > > v0.06. > > --- > > lib/Filesys/DiskFree.pm | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/lib/Filesys/DiskFree.pm b/lib/Filesys/DiskFree.pm > > index 1fa655d..41a27d7 100644 > > --- a/lib/Filesys/DiskFree.pm > > +++ b/lib/Filesys/DiskFree.pm > > @@ -124,8 +124,9 @@ sub df(){ > > $cmd=$self->command() or > > croak "No df command known for format ".$self->{'FORMAT'}; > > open(HANDLE,"$cmd|") or croak("Cannot fork \"$cmd\": $!"); > > - return $self->load(\*HANDLE); > > + my $ret = $self->load(\*HANDLE); > > close(HANDLE) or croak("Cannot df $!"); > > + return $ret; > > } > > > > sub load() { > > > > ----------------------------------------------------------------------- > > > > Summary of changes: > > lib/Filesys/DiskFree.pm | 3 +- > > lib/MTT/DoCommand.pm | 162 > +++++++++++++++++++++++++++++++++++------------- > > 2 files changed, 120 insertions(+), 45 deletions(-) > > > > > > hooks/post-receive > > -- > > MTT > > _______________________________________________ > > mtt-svn mailing list > > mtt-...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-svn > > > > _______________________________________________ > > mtt-devel mailing list > > mtt-de...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel > > Searchable archives: > http://www.open-mpi.org/community/lists/mtt-devel/2014/06/index.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > _______________________________________________ > mtt-devel mailing list > mtt-de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel > Link to this post: > http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0627.php >