Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-( I can't reproduce them manually, and they seem to happen in only a very small fraction of overall MTT runs. I'm seeing at least 3 classes of errors:

1. btl_sm_add_procs.c:529 which is this:

if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {

j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank] appears to have a valid value in it (i.e., .fifo[3][0] = x, .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset, .fifo[3][3] = x+3*offset). But gdb says:

(gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
Cannot access memory at address 0x2a96b73050

I see a fair number of these errors. This is unbelievable to me; if we have a problem in the startup of the sm btl, how on earth has it escaped for so long?

2. btl_sm_component.c:430 which is this:

                reg->cbfunc(&mca_btl_sm.super, hdr->tag, &(Frag.base),
                            reg->cbdata);

reg->cbfunc == NULL in this case.  I only see a few of these.
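
As a debugging aid, a guard along the following lines could turn this kind of segv into a diagnosable message. This is only a sketch under assumptions: cb_registration_t, recv_cb_fn_t, count_cb, and dispatch_fragment are hypothetical stand-ins, not the actual Open MPI types or functions, and it would not explain why the callback is NULL in the first place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-tag receive callback signature. */
typedef void (*recv_cb_fn_t)(int tag, void *frag, void *cbdata);

/* Hypothetical stand-in for a per-tag callback registration entry. */
typedef struct {
    recv_cb_fn_t cbfunc;
    void *cbdata;
} cb_registration_t;

/* Example callback: counts how many fragments were delivered. */
static void count_cb(int tag, void *frag, void *cbdata)
{
    (void)tag;
    (void)frag;
    ++*(int *)cbdata;
}

/*
 * Look up the registration for `tag` and invoke its callback.
 * Returns 0 if the callback ran, -1 if the tag is out of range or
 * no callback is registered (instead of jumping through NULL).
 */
static int dispatch_fragment(const cb_registration_t *table,
                             size_t table_len, int tag, void *frag)
{
    const cb_registration_t *reg;

    assert(table != NULL);

    if (tag < 0 || (size_t)tag >= table_len) {
        fprintf(stderr, "dispatch: tag %d out of range\n", tag);
        return -1;
    }

    reg = &table[tag];
    if (NULL == reg->cbfunc) {
        fprintf(stderr, "dispatch: no callback registered for tag %d\n", tag);
        return -1;
    }

    reg->cbfunc(tag, frag, reg->cbdata);
    return 0;
}
```

With a check like this in place, the failing tag value would at least show up in the output of the rare MTT runs that hit the problem.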

3. ompi_fifo.h:422 which is this:

    return_value = ompi_cb_fifo_read_from_tail(&fifo->tail->cb_fifo,
            fifo->tail->cb_overflow, &queue_empty);

fifo->tail points to memory that gdb says we cannot access. I only see a few of these.

I'm running on RHEL4U6 with a variety of different classes of Xeon machines. In one particular run, they were slightly older Xeon machines, 2 core/2 socket machines.

I also found a segv in ibm/environment/finalize where a strlen() was segv'ing, but I'm unable to diagnose that any further because the char* argument passed to an asprintf() is the return value of a function that should *never* be NULL. :-\

The one thing that these failures have in common is that they all appear to be compiled by icc. Here's the configure line:

CC=icc CXX=icpc F77=ifort FC=ifort "CFLAGS=-g -wd188" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --disable-dlopen

Here's a run line, but the MCA parameters appear to vary wildly in terms of which tests are failing (remember that I run 20+ variants of each test at Cisco):

mpirun -np 8 --mca btl_openib_device_type ib --mca btl sm,openib,self pt2pt/allocmem

Here's a slice of an MTT report that shows the problem:

    http://www.open-mpi.org/mtt/index.php?do_redir=970

(ignore any "svbu-mpiXXX - daemon did not report back when launched" errors; that's SLURM mucking up)

I'm digging further...  But help on this would be appreciated...

--
Jeff Squyres
Cisco Systems
