I'm not 100% sure, but this looks like the changeset that caused all
of IU's trunk MTT runs last night to segfault... yes, all. :-(

Here's the magnitude of the problem:
http://www.open-mpi.org/mtt/index.php?do_redir=883

Note how pretty much everything was passing for 1.4a1r19979,
and everything is failing for 1.4a1r19991.

I am not sure why there are only results from Absoft and IU.  Maybe the
Sun MTT runs from last night just haven't finished yet.

Take a look at these MTT results for a manageable sample where you could
click on the "details" button to see the various segfaults:
http://www.open-mpi.org/mtt/index.php?do_redir=884

Most of the segfaults involve mca_iof_hnp.so and look something like this:

======================
[odin093:06882] *** Process received signal ***
[odin093:06882] Signal: Segmentation fault (11)
[odin093:06882] Signal code: Address not mapped (1)
[odin093:06882] Failing at address: 0x8
[odin093:06882] [ 0] /lib64/libpthread.so.0 [0x2aaaaba4ee70]
[odin093:06882] [ 1] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_3/installs/TqMo/install/lib/openmpi/mca_iof_hnp.so [0x2aaaadc1c3fd]
[odin093:06882] [ 2] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_3/installs/TqMo/install/lib/libopen-pal.so.0 [0x2aaaaaf29b0b]
[odin093:06882] [ 3] mpirun [0x4033e3]
[odin093:06882] [ 4] mpirun [0x402b13]
[odin093:06882] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabc788b4]
[odin093:06882] [ 6] mpirun [0x402a49]
[odin093:06882] *** End of error message ***
======================

But a few don't have mca_iof_hnp.so in the stack trace, so
I could be wrong about which changeset caused this:

======================
[odin090:12437] *** Process received signal ***
[odin090:12437] Signal: Segmentation fault (11)
[odin090:12437] Signal code: Address not mapped (1)
[odin090:12437] Failing at address: 0x4
[odin090:12437] [ 0] [0xffffe600]
[odin090:12437] [ 1] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0 [0xf7f5b118]
[odin090:12437] [ 2] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0(opal_event_loop+0x27) [0xf7f5b367]
[odin090:12437] [ 3] /nfs/rinfs/san/homedirs/mpiteam/mtt-runs/odin/20081112-Nightly/pb_2/installs/U_ro/install/lib/libopen-pal.so.0(opal_event_dispatch+0x1e) [0xf7f5b38e]
[odin090:12437] [ 4] mpirun [0x804a8f8]
[odin090:12437] [ 5] mpirun [0x8049f36]
[odin090:12437] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0xf7d8ddec]
[odin090:12437] [ 7] mpirun [0x8049e61]
[odin090:12437] *** End of error message ***
======================

On Wed, Nov 12, 2008 at 6:32 PM,  <r...@osl.iu.edu> wrote:
> Author: rhc
> Date: 2008-11-12 18:32:01 EST (Wed, 12 Nov 2008)
> New Revision: 19991
> URL: https://svn.open-mpi.org/trac/ompi/changeset/19991
>
> Log:
> Fix the iof race conditions wrt proc termination. This is comprised of two 
> sections:
>
> 1. modify the iof to track when a proc actually closes all of its open iof 
> output pipes. When this occurs, notify the odls that the proc's iof is 
> complete. This is done via a zero-time event so that we can step out of the 
> read event before processing the notification.
>
> 2. in the odls, modify the waitpid callback so it only flags that it was 
> called. Add a function to receive the iof-complete notification, and a 
> function that checks for both iof complete and waitpid callback before 
> declaring a proc fully terminated. This ensures that we read and deliver 
> -all- of the IO prior to declaring the job complete.
>
> Also modified the odls call to orte_iof.close (and the component's 
> implementation) so it only closes stdin, leaving the other io channels alone. 
> This fixes the other half of the known problem.
>
> This should fix the ticket on this subject, but I'll wait to close it pending 
> further testing in the trunk.
>
> Text files modified:
>   trunk/orte/mca/iof/base/base.h                   |    30 +++-
>   trunk/orte/mca/iof/base/iof_base_open.c          |    32 ++++
>   trunk/orte/mca/iof/hnp/iof_hnp.c                 |    98 +++++++------
>   trunk/orte/mca/iof/hnp/iof_hnp.h                 |     2
>   trunk/orte/mca/iof/hnp/iof_hnp_component.c       |    14 -
>   trunk/orte/mca/iof/hnp/iof_hnp_read.c            |    62 +++++++-
>   trunk/orte/mca/iof/orted/iof_orted.c             |    85 +++++++----
>   trunk/orte/mca/iof/orted/iof_orted.h             |     2
>   trunk/orte/mca/iof/orted/iof_orted_component.c   |     6
>   trunk/orte/mca/iof/orted/iof_orted_read.c        |    39 +++++
>   trunk/orte/mca/odls/base/base.h                  |     5
>   trunk/orte/mca/odls/base/odls_base_default_fns.c |   280 +++++++++++++++++++++++++--------------
>   trunk/orte/mca/odls/base/odls_base_open.c        |     2
>   trunk/orte/mca/odls/base/odls_private.h          |     2
>   trunk/orte/mca/odls/odls_types.h                 |     3
>   trunk/orte/runtime/orte_wait.c                   |    17 ++
>   trunk/orte/runtime/orte_wait.h                   |    33 ++++
>   17 files changed, 491 insertions(+), 221 deletions(-)
>
> Modified: trunk/orte/mca/iof/base/base.h
>
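
For anyone reading along without the diff, here is roughly how I read
item 2 of the log: a proc is only declared terminated once both the
waitpid callback and the iof-complete notification have been seen, in
whichever order they arrive. This is only a minimal sketch of that idea.
The struct and function names (child_t, on_waitpid, on_iof_complete,
check_proc_complete) are made up for illustration and are not the ones
used in odls_base_default_fns.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process record; NOT the real ORTE child structure. */
typedef struct {
    int  pid;
    bool waitpid_fired;  /* set when the waitpid callback has run */
    bool iof_complete;   /* set when all of the proc's output pipes closed */
    bool terminated;     /* set once, when both conditions hold */
} child_t;

/*
 * Declare the child terminated only when both events have been recorded.
 * Returns true exactly once: on the call that completes the pair.
 */
static bool check_proc_complete(child_t *c)
{
    assert(c != NULL);
    if (c->terminated) {
        return false;            /* already reported, do not report twice */
    }
    if (c->waitpid_fired && c->iof_complete) {
        c->terminated = true;
        return true;
    }
    return false;                /* still waiting on the other event */
}

/* The waitpid callback only flags that it was called, then checks. */
static bool on_waitpid(child_t *c)
{
    c->waitpid_fired = true;
    return check_proc_complete(c);
}

/* Receiver for the iof-complete notification; flags, then checks. */
static bool on_iof_complete(child_t *c)
{
    c->iof_complete = true;
    return check_proc_complete(c);
}
```

With a zero-initialized child_t, on_waitpid() followed by
on_iof_complete() returns false and then true, and the same holds with
the two calls swapped, so neither event alone ends the job. The sketch
leaves out the zero-time event from item 1 entirely.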

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/
