Further to the mail linked below: padb can perform diagnostics on hung jobs, including collecting backtraces, and it integrates well into automated testing environments.
The attached patch is a minimal change that should enable this functionality. However, I don't have access to a working MTT installation to test it.

http://www.open-mpi.org/community/lists/mtt-devel/2009/06/0415.php

This requires a HEAD version of padb, at least r273, so that it accepts the pid of mpirun rather than a jobid assigned by the underlying resource manager.

Yours,
Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Index: lib/MTT/DoCommand.pm
===================================================================
--- lib/MTT/DoCommand.pm (revision 1322)
+++ lib/MTT/DoCommand.pm (working copy)
@@ -359,6 +359,7 @@
 }
 my $killed_status = undef;
 my $last_over = 0;
+my $padb_output;
 while ($done > 0) {
 my $nfound = select($rout = $rin, undef, undef, $t);
 if (vec($rout, fileno(OUTread), 1) == 1) {
@@ -410,6 +411,8 @@
 my $timeout_email_recipient = $MTT::Globals::Values->{docommand_timeout_notify_email};
 my $timeout_notify_timeout = $MTT::Globals::Values->{docommand_timeout_notify_timeout};
+$padb_output = `padb --config-option rmgr=mpirun --full-report=$pid`;
+
 if (defined($timeout_sentinel_file)) {
 # Email someone, if an email address has been specified
@@ -493,6 +496,9 @@
 # Return an anonymous hash containing the relevant data
 $ret->{result_stdout} = join('', @out);
+if ( defined $padb_output ) {
+    $ret->{result_stdout} .= "\n$padb_output";
+}
 $ret->{result_stderr} = join('', @err),
 if (!$merge_output);
 return $ret;
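
For anyone who wants to try the idea outside MTT first, here is a minimal standalone sketch of what the patch does: run padb against the pid of mpirun when a timeout fires, and append the report to the captured stdout only if one came back. The helper names capture_hang_report and append_report are hypothetical and not part of MTT or padb. The padb options are the ones used in the patch. The list-form pipe open (instead of backticks) is my own substitution so that the pid is not passed through a shell.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MTT: run a diagnostic command against a
# hung job and return its output, or nothing if no report was produced.
sub capture_hang_report {
    my ($pid, @cmd) = @_;

    # Default to the padb invocation used in the patch (assumes padb HEAD,
    # r273 or later, is on the PATH).
    @cmd = ('padb', '--config-option', 'rmgr=mpirun') unless @cmd;
    push @cmd, "--full-report=$pid";

    # List-form pipe open: the command is executed directly, without a shell.
    open(my $fh, '-|', @cmd) or return;

    local $/;              # slurp the whole report in one read
    my $output = <$fh>;
    close($fh);

    return (defined $output && length $output) ? $output : undef;
}

# Mirrors the last hunk of the patch: append the report to stdout only
# when one was actually captured.
sub append_report {
    my ($stdout, $report) = @_;
    return defined $report ? "$stdout\n$report" : $stdout;
}
```

Passing an alternative command after the pid, for example capture_hang_report($pid, 'echo'), makes it possible to exercise the plumbing on a machine where padb is not installed.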