I'm not 100% sure, but this looks like the changeset that caused all of IU's trunk MTT runs last night to segfault... yes, all. :-(
Here's the magnitude of the problem:
http://www.open-mpi.org/mtt/index.php?do_redir=883

Note how pretty much everything was passing for 1.4a1r19979, and
everything failing for 1.4a1r19991. I am not sure why there are only
results from absoft and IU. Maybe the sun MTT runs just haven't
finished yet from last night.

Take a look at these MTT results for a manageable sample where you
can click on the "details" button to see the various segfaults:
http://www.open-mpi.org/mtt/index.php?do_redir=884

Most of the segfaults look something like this one, which involves
mca_iof_hnp.so:
======================
[odin093:06882] *** Process received signal ***
[odin093:06882] Signal: Segmentation fault (11)
[odin093:06882] Signal code: Address not mapped (1)
[odin093:06882] Failing at address: 0x8
[odin093:06882] [ 0] /lib64/libpthread.so.0 [0x2aaaaba4ee70]
[odin093:06882] [ 1] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_3/installs/TqMo/install/lib/openmpi/mca_iof_hnp.so [0x2aaaadc1c3fd]
[odin093:06882] [ 2] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_3/installs/TqMo/install/lib/libopen-pal.so.0 [0x2aaaaaf29b0b]
[odin093:06882] [ 3] mpirun [0x4033e3]
[odin093:06882] [ 4] mpirun [0x402b13]
[odin093:06882] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabc788b4]
[odin093:06882] [ 6] mpirun [0x402a49]
[odin093:06882] *** End of error message ***
======================

But there are a few that don't have mca_iof_hnp.so in the stacktrace,
so I could be wrong about which changeset caused this:
======================
[odin090:12437] *** Process received signal ***
[odin090:12437] Signal: Segmentation fault (11)
[odin090:12437] Signal code: Address not mapped (1)
[odin090:12437] Failing at address: 0x4
[odin090:12437] [ 0] [0xffffe600]
[odin090:12437] [ 1] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0 [0xf7f5b118]
[odin090:12437] [ 2] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xf7f5b367]
[odin090:12437] [ 3] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xf7f5b38e]
[odin090:12437] [ 4] mpirun [0x804a8f8]
[odin090:12437] [ 5] mpirun [0x8049f36]
[odin090:12437] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0xf7d8ddec]
[odin090:12437] [ 7] mpirun [0x8049e61]
[odin090:12437] *** End of error message ***
======================

On Wed, Nov 12, 2008 at 6:32 PM, <r...@osl.iu.edu> wrote:
> Author: rhc
> Date: 2008-11-12 18:32:01 EST (Wed, 12 Nov 2008)
> New Revision: 19991
> URL: https://svn.open-mpi.org/trac/ompi/changeset/19991
>
> Log:
> Fix the iof race conditions wrt proc termination. This is comprised of two
> sections:
>
> 1. modify the iof to track when a proc actually closes all of its open iof
> output pipes. When this occurs, notify the odls that the proc's iof is
> complete. This is done via a zero-time event so that we can step out of the
> read event before processing the notification.
>
> 2. in the odls, modify the waitpid callback so it only flags that it was
> called. Add a function to receive the iof-complete notification, and a
> function that checks for both iof complete and waitpid callback before
> declaring a proc fully terminated. This ensures that we read and deliver
> -all- of the IO prior to declaring the job complete.
>
> Also modified the odls call to orte_iof.close (and the component's
> implementation) so it only closes stdin, leaving the other io channels alone.
> This fixes the other half of the known problem.
>
> This should fix the ticket on this subject, but I'll wait to close it pending
> further testing in the trunk.
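
For anyone reading along, the "both events" check that item 2 of the log describes could be sketched roughly as below. This is only an illustration of the idea, not the code from r19991: child_t, on_waitpid, on_iof_complete, and check_proc_complete are hypothetical names made up for this sketch, and the real odls code surely carries more state than two flags.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-process record; NOT the actual odls child structure. */
typedef struct {
    int  pid;
    bool waitpid_fired;  /* set when the waitpid callback has run */
    bool iof_complete;   /* set when all of the proc's output pipes are closed */
    bool terminated;     /* set once both conditions above hold */
} child_t;

/* Declare the proc fully terminated only after both events were seen. */
static bool check_proc_complete(child_t *c)
{
    if (!c->terminated && c->waitpid_fired && c->iof_complete) {
        c->terminated = true;
        printf("proc %d fully terminated\n", c->pid);
    }
    return c->terminated;
}

/* The waitpid callback only records that it was called, then re-checks. */
static bool on_waitpid(child_t *c)
{
    c->waitpid_fired = true;
    return check_proc_complete(c);
}

/* The iof-complete notification does the same for its own flag. */
static bool on_iof_complete(child_t *c)
{
    c->iof_complete = true;
    return check_proc_complete(c);
}
```

The point of the sketch is that the two notifications can arrive in either order, and neither one alone marks the proc as done. Whichever arrives second triggers the termination.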
>
> Text files modified:
>    trunk/orte/mca/iof/base/base.h                   |    30 +++-
>    trunk/orte/mca/iof/base/iof_base_open.c          |    32 ++++
>    trunk/orte/mca/iof/hnp/iof_hnp.c                 |    98 +++++++------
>    trunk/orte/mca/iof/hnp/iof_hnp.h                 |     2
>    trunk/orte/mca/iof/hnp/iof_hnp_component.c       |    14 -
>    trunk/orte/mca/iof/hnp/iof_hnp_read.c            |    62 +++++++-
>    trunk/orte/mca/iof/orted/iof_orted.c             |    85 +++++++----
>    trunk/orte/mca/iof/orted/iof_orted.h             |     2
>    trunk/orte/mca/iof/orted/iof_orted_component.c   |     6
>    trunk/orte/mca/iof/orted/iof_orted_read.c        |    39 +++++
>    trunk/orte/mca/odls/base/base.h                  |     5
>    trunk/orte/mca/odls/base/odls_base_default_fns.c |   280 +++++++++++++++++++++++++---------------
>    trunk/orte/mca/odls/base/odls_base_open.c        |     2
>    trunk/orte/mca/odls/base/odls_private.h          |     2
>    trunk/orte/mca/odls/odls_types.h                 |     3
>    trunk/orte/runtime/orte_wait.c                   |    17 ++
>    trunk/orte/runtime/orte_wait.h                   |    33 ++++
>    17 files changed, 491 insertions(+), 221 deletions(-)
>
> Modified: trunk/orte/mca/iof/base/base.h
>

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/