Actually, I think this would be fine for the trunk.  Some random notes:

1. It might be nice to move this logic out of the docommand sub itself and into its own sub.
2. It would also be good to generalize the ps and gdb commands for systems where those variants are not relevant.
3. It might even be good to generally develop the backtrace functionality overall -- backtraces would be really good to capture in the database...
4. How about having a[n optional] timeout with the sentinel file? That is, it'll send a mail, then wait another timeout (e.g., 1 hour) and if the sentinel file still exists, mtt will remove the file and keep going.
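
To make item 4 concrete, here is a rough sketch of the wait-then-give-up behavior. It is Python rather than MTT's Perl, and the function name, return values, and polling interval are all made up for illustration; nothing here is taken from the patch.

```python
import os
import time


def wait_for_sentinel(path, timeout_s, poll_s=1.0):
    """Block until `path` is removed or `timeout_s` seconds elapse.

    Hypothetical helper, not MTT code. Returns "removed" if someone
    deleted the sentinel file, or "expired" if the timeout passed, in
    which case the file is removed here so testing can continue.
    """
    deadline = time.monotonic() + timeout_s
    while os.path.exists(path):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # someone deleted it between the check and the remove
            return "expired"
        # Sleep in short steps so a manual removal is noticed promptly.
        time.sleep(min(poll_s, remaining))
    return "removed"
```

With a one-hour timeout this would be called as `wait_for_sentinel(sentinel_path, 3600)` right after the mail goes out; making the timeout optional could be as simple as skipping the deadline check when no value is configured.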


On Jun 19, 2009, at 2:47 PM, Ethan Mallove wrote:

Folks,

I came up with a feature, which does not seem quite appropriate to go
into the MTT trunk, but is still possibly useful for someone other
than me. I have posted a note about it on the MTT wiki:

  http://svn.open-mpi.org/trac/mtt/wiki/EmailTimeoutNotification

Here's the text of the Wiki page:

We (Sun) were trying to track down a hang in an MPI test that we were
seeing in our MTT runs which was difficult to reproduce manually. The
problem is that MTT kills the hanging process before a developer has a
chance to investigate the issue. To address this, I patched an MTT
client (see attached patch file) to send out a notification email
containing an mpirun command line and GDB back trace for the hanging
test. A predefined sentinel file is touched, which can later be
removed to force MTT to move on and continue testing. Here are the INI
parameters to activate the timeout email notification:

 * {{{docommand_timeout_sentinel_file}}}
 * {{{docommand_timeout_email_recipient}}}

Example usage:

{{{
$ client/mtt --scratch /foo/bar --file foo.ini \
    docommand_timeout_sentinel_file=/tmp/mtt-timeout-sentinel-file-\&random_string\(10\) \
    docommand_timeout_email_recipient=fred.flints...@sun.com,barney.rub...@sun.com
}}}
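
The notification step described above (touch the sentinel file, then mail the command line and backtrace) might look roughly like the following. This is a Python sketch and not the attached patch: the function names, subject line, and message layout are assumptions, and the gdb invocation is just one common way to get a backtrace from a running process.

```python
import os
from email.message import EmailMessage


def gdb_backtrace_cmd(pid):
    # Batch-mode gdb: attach to the process, print a backtrace, then exit.
    return ["gdb", "-batch", "-ex", "bt", "-p", str(pid)]


def prepare_timeout_notice(sentinel, recipients, cmdline, backtrace):
    """Touch the sentinel file and build (but do not send) the email.

    Hypothetical helper; `recipients` is a comma-separated string, as in
    the docommand_timeout_email_recipient example above.
    """
    # Equivalent of `touch`: create the file if needed, update its mtime.
    with open(sentinel, "a"):
        os.utime(sentinel, None)

    msg = EmailMessage()
    msg["To"] = ", ".join(r.strip() for r in recipients.split(","))
    msg["Subject"] = "MTT timeout: test appears to be hung"
    msg.set_content(
        "Command line:\n  " + cmdline + "\n\n"
        "Backtrace:\n" + backtrace + "\n\n"
        "Remove " + sentinel + " to let MTT continue.\n"
    )
    return msg
```

The output of running `gdb_backtrace_cmd(pid)` (for example via `subprocess.run`) would be passed in as `backtrace`, and the resulting message handed to whatever mailer is available on the system.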

-Ethan
_______________________________________________
mtt-devel mailing list
mtt-de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel



--
Jeff Squyres
Cisco Systems
