In r1300, I incorporated your notes, except for #2 and #3. I searched CPAN for a stack-trace module, but couldn't find anything suitable. Maybe MTT could bundle a stripped-down version of GDB for the sole purpose of gathering stack traces? Or is there some other open-source "stack trace grabber" tool out there?

-Ethan

On Mon, Jun/22/2009 07:37:16AM, Jeff Squyres wrote:
> Actually, I think this would be fine for the trunk. Some random notes:
>
> 1. It might be nice to move this logic out of the docommand sub itself and
> into its own sub.
> 2. it would also be good to generalize the ps and gdb commands for systems
> where those variants are not relevant
> 3. it might even be good to generally develop the backtrace functionality
> overall -- backtraces would be really good to capture in the database...
> 4. how about having a[n optional] timeout with the sentinel file? that is,
> it'll send a mail, then wait another timeout (e.g., 1 hour) and if the
> sentinel file still exists, mtt will remove the file and keep going
>
>
> On Jun 19, 2009, at 2:47 PM, Ethan Mallove wrote:
>
>> Folks,
>>
>> I came up with a feature, which does not seem quite appropriate to go
>> into the MTT trunk, but is still possibly useful for someone other
>> than me. I have posted a note about it on the MTT wiki:
>>
>> http://svn.open-mpi.org/trac/mtt/wiki/EmailTimeoutNotification
>>
>> Here's the text of the Wiki page:
>>
>> We (Sun) were trying to track down a hang in an MPI test that we were
>> seeing in our MTT runs which was difficult to reproduce manually. The
>> problem is that MTT kills the hanging process before a developer has a
>> chance to investigate the issue. To address this, I patched an MTT
>> client (see attached patch file) to send out a notification email
>> containing an mpirun command line and GDB back trace for the hanging
>> test. A predefined sentinel file is touched, which can later be
>> removed to force MTT to move on and continue testing. Here are the INI
>> parameters to activate the timeout email notification:
>>
>> * {{{docommand_timeout_sentinel_file}}}
>> * {{{docommand_timeout_email_recipient}}}
>>
>> Example usage:
>>
>> {{{
>> $ client/mtt --scratch /foo/bar --file foo.ini
>>
>> docommand_timeout_sentinel_file=/tmp/mtt-timeout-sentinel-file-\&random_string\(10\)
>>
>> docommand_timeout_email_recipient=fred.flints...@sun.com,barney.rub...@sun.com
>> }}}
>>
>> -Ethan
>> _______________________________________________
>> mtt-devel mailing list
>> mtt-de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
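
For illustration only, the optional sentinel-file timeout suggested in note 4 of the quoted message could be sketched roughly as below. This is Python rather than MTT's Perl, and it is not taken from the r1300 patch: the function name wait_for_sentinel, its parameters, and its return values are all hypothetical. The idea is simply to poll for the sentinel file and, if it still exists once the timeout has passed, remove it and keep going.

```python
import os
import time


def wait_for_sentinel(path, timeout=None, poll_interval=1.0):
    """Wait until the sentinel file at `path` disappears.

    Hypothetical helper, not part of MTT. Returns "removed" if the file
    was deleted by someone else (e.g., a developer who has finished
    debugging), or "expired" if `timeout` seconds passed and this
    function deleted the file itself. With timeout=None it waits forever.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while os.path.exists(path):
        if deadline is not None and time.monotonic() >= deadline:
            # Timeout elapsed: remove the sentinel ourselves and move on.
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # Deleted by someone else between the check and the remove.
            return "expired"
        time.sleep(poll_interval)
    return "removed"
```

A caller would touch the sentinel file, send the notification mail, and then call something like wait_for_sentinel(sentinel_path, timeout=3600) before continuing with the next test.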