Further to the mail linked below, padb is able to perform diagnostics,
including backtraces on hung jobs, and integrates well into automated
testing environments.

The attached patch is a minimal change which should enable the
functionality.  However, I don't have access to a working MTT
installation to test it.

http://www.open-mpi.org/community/lists/mtt-devel/2009/06/0415.php

This will require a HEAD version of padb, at least r273, to allow it to
accept the pid of mpirun rather than a jobid assigned by the underlying
resource manager.
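
To illustrate what the patch does, here is a minimal Perl sketch of the
same idea outside of DoCommand.pm: build the padb command for the pid of
mpirun, capture its output, and append it to the captured stdout only if
something was returned.  The padb options are copied from the patch; the
helper names (padb_command, collect_padb_report, append_report) and the
optional $runner hook are hypothetical and exist only so that the sketch
can be run without padb installed.  The pid check is an extra precaution
that is not in the patch.

```perl
use strict;
use warnings;

# Build the padb command line used in the patch for a given mpirun pid.
# The options are taken verbatim from the patch; the pid validation is an
# added precaution and is not part of the patch itself.
sub padb_command {
    my ($pid) = @_;
    die "pid must be a positive integer\n"
        unless defined $pid && $pid =~ /^[1-9]\d*$/;
    return "padb --config-option rmgr=mpirun --full-report=$pid";
}

# Run padb and return its output, or undef if nothing was captured.
# $runner is a hypothetical hook for testing; by default the command is
# run with backticks, as in the patch.
sub collect_padb_report {
    my ($pid, $runner) = @_;
    $runner ||= sub { my ($cmd) = @_; return scalar `$cmd`; };
    my $report = $runner->(padb_command($pid));
    return (defined $report && length $report) ? $report : undef;
}

# Append the report to the captured stdout, mirroring the last hunk of
# the patch.  An undefined report leaves the output unchanged.
sub append_report {
    my ($stdout, $report) = @_;
    $stdout .= "\n$report" if defined $report;
    return $stdout;
}

# Example with a stand-in runner, so no real padb or mpirun is needed.
my $fake_runner = sub { return "report for: $_[0]\n" };
print append_report("test output\n", collect_padb_report(1234, $fake_runner));
```

In the patch itself the report is collected once the timeout branch is
reached and is only appended when $padb_output is defined, so runs that
finish in time are unaffected.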

Yours,

Ashley

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Index: lib/MTT/DoCommand.pm
===================================================================
--- lib/MTT/DoCommand.pm	(revision 1322)
+++ lib/MTT/DoCommand.pm	(working copy)
@@ -359,6 +359,7 @@
     }
     my $killed_status = undef;
     my $last_over = 0;
+    my $padb_output;
     while ($done > 0) {
         my $nfound = select($rout = $rin, undef, undef, $t);
         if (vec($rout, fileno(OUTread), 1) == 1) {
@@ -410,6 +411,8 @@
                 my $timeout_email_recipient = $MTT::Globals::Values->{docommand_timeout_notify_email};
                 my $timeout_notify_timeout  = $MTT::Globals::Values->{docommand_timeout_notify_timeout};

+                $padb_output = `padb --config-option rmgr=mpirun --full-report=$pid`;
+
                 if (defined($timeout_sentinel_file)) {

                     # Email someone, if an email address has been specified
@@ -493,6 +496,9 @@
     # Return an anonymous hash containing the relevant data

     $ret->{result_stdout} = join('', @out);
+    if ( defined $padb_output ) {
+      $ret->{result_stdout} .= "\n$padb_output";
+    }
     $ret->{result_stderr} = join('', @err),
         if (!$merge_output);
     return $ret;
