Ralph, Rolf, and I talked about this issue on the phone this morning. We're pretty sure it's an overflow caused by the large number of procs being run. LANL is going to try running with -DLARGE_CLUSTER and see what happens. Rolf thinks he's run the Intel C tests up to 1k procs, so hopefully that should be sufficient.
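
FWIW, the "Actual" values in Ralph's logs are consistently the "Expected" values plus 256 (59 vs. -197, 57 vs. -199, and so on), which is what you'd get if a value computed as an int ends up stored in a signed 8-bit buffer element somewhere along the way. A minimal sketch of that kind of wraparound (purely illustrative -- this is not the actual MPITEST code, and the real mechanism is just a guess):

#include <stdio.h>

int main(void)
{
    /* "Expected" values copied from the error output quoted below */
    int expected[] = { -197, -199, -195, -205 };
    int i;

    for (i = 0; i < (int)(sizeof(expected) / sizeof(expected[0])); i++) {
        /* Converting an out-of-range int to signed char wraps modulo 256 on
           the usual two's-complement platforms (implementation-defined in C):
           -197 -> 59, -199 -> 57, -195 -> 61, -205 -> 51, matching the
           "Actual" values in the logs. */
        signed char stored = (signed char) expected[i];
        printf("expected %4d -> stored in an 8-bit element: %4d\n",
               expected[i], (int) stored);
    }
    return 0;
}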

On Oct 30, 2008, at 7:01 PM, Ralph Castain wrote:

Hi folks

We aren't running a full MTT here (which is why I'm reporting these results to the list instead of into the MTT database), but we are running a subset of tests against the 1.3 beta and hitting a consistent set of errors in five tests. For reference, all of these tests pass on 1.2.6 but fail in exactly the same way on 1.2.8, so it appears that something systematic may have been introduced and made its way into the 1.2 series as well.

The tests are:
MPI_Pack_user_type
MPI_Type_hindexed_blklen
MPI_Type_vector_stride
MPI_Cart_get_c
MPI_Graph_neighbors_c

The tests are running under slurm on RHEL5, with 16 Opteron cores per node plus IB. The results below are from 40 nodes at 16 ppn.

Any thoughts would be appreciated. In the meantime, we are trying different ppn counts to see if that has an impact.
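
For example, one way we can vary the ppn without changing the 40-node allocation is mpirun's per-node mapping option (assuming the mpirun in this 1.3 beta supports -npernode; our actual runs go through the Intel test harness, so this is just illustrative):

mpirun -npernode 8 ./MPI_Cart_get_c
mpirun -npernode 4 ./MPI_Cart_get_c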

Thanks
Ralph

Here is what we see:

MPITEST error (585): 1 errors in buffer (17, 5) len 1024 commsize 214 commtype -14 extent 64 root 194
MPITEST error (591): Received buffer overflow, Expected buffer[65536]: -197, Actual buffer[65536]: 59
MPITEST error (591): 1 errors in buffer (17, 5) len 1024 commsize 214 commtype -14 extent 64 root 196
MPITEST_results: MPI_Pack_user_type 60480 tests FAILED (of 21076704)

MPITEST error (597): Received buffer overflow, Expected buffer[16384]: -199, Actual buffer[16384]: 57
MPITEST error (597): 1 errors in buffer (17, 5) len 16 commsize 214 commtype -14 extent 64 root 198
MPITEST error (585): Received buffer overflow, Expected buffer[16384]: -195, Actual buffer[16384]: 61
MPITEST error (585): 1 errors in buffer (17, 5) len 16 commsize 214 commtype -14 extent 64 root 194
MPITEST_results: MPI_Type_hindexed_blklen 60480 tests FAILED (of 21076704)


MPITEST error (597): Received buffer overflow, Expected buffer[65536]: -199, Actual buffer[65536]: 57
MPITEST error (597): 1 errors in buffer (17, 5) len 512 commsize 214 commtype -14 extent 64 root 198
MPITEST error (615): Received buffer overflow, Expected buffer[65536]: -205, Actual buffer[65536]: 51
MPITEST error (615): 1 errors in buffer (17, 5) len 512 commsize 214 commtype -14 extent 64 root 204
MPITEST_results: MPI_Type_vector_stride 60480 tests FAILED (of 21076704)

[lob097:32556] *** Process received signal ***
mpirun noticed that job rank 0 with PID 32556 on node lob097 exited on signal 11 (Segmentation fault).
639 additional processes aborted (not shown)
make[1]: *** [MPI_Cart_get_c] Error 139


MPITEST fatal error (568): MPI_ERR_COMM: invalid communicator
MPITEST fatal error (572): MPI_ERR_COMM: invalid communicator
MPITEST fatal error (574): MPI_ERR_COMM: invalid communicator
mpirun noticed that job rank 37 with PID 32074 on node lob099 exited on signal 1 (Hangup).
18 additional processes aborted (not shown)
make[1]: *** [MPI_Graph_neighbors_c] Error 1




Here is how the different versions are built:

1.2.6 and 1.2.8
oob_tcp_connect_timeout=600
pml_ob1_use_early_completion=0
mca_component_show_load_errors=0
btl_openib_ib_retry_count=7
btl_openib_ib_timeout=31
mpi_keep_peer_hostnames=1


RPMBUILD parameters
setenv CPPFLAGS -I/opt/panfs/include
setenv CFLAGS -I/opt/panfs/include

rpmbuild -bb ./SPECS/loboopenmpi128.spec \
--with gcc \
--with root=/opt/OpenMPI \
--with shared \
--with openib \
--with slurm \
--without pty_support \
--without dlopen \
--with io_romio_flags=--with-file-system=ufs+nfs+panfs






1.3beta

# Basic behavior to smooth startup
mca_component_show_load_errors = 0
orte_abort_timeout = 10
opal_set_max_sys_limits = 1

## Protect the shared file systems
orte_no_session_dirs = /panfs,/scratch,/users,/usr/projects
orte_tmpdir_base = /tmp

## Require an allocation to run - protects the frontend
## from inadvertent job executions
orte_allocation_required = 1

## Add the interface for out-of-band communication
## and set it up
oob_tcp_if_include=ib0
oob_tcp_peer_retries = 10
oob_tcp_disable_family = IPv6
oob_tcp_listen_mode = listen_thread
oob_tcp_sndbuf = 32768
oob_tcp_rcvbuf = 32768

## Define the MPI interconnects
btl = sm,openib,self

## Setup OpenIB
btl_openib_want_fork_support = 0
btl_openib_cpc_include = oob
#btl_openib_receive_queues = P,128,256,64,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32

## Enable cpu affinity
mpi_paffinity_alone = 1

## Setup MPI options
mpi_show_handle_leaks = 0
mpi_warn_on_fork = 1

enable_dlopen=no
with_openib=/opt/ofed
with_openib_libdir=/opt/ofed/lib64
enable_mem_debug=no
enable_mem_profile=no
enable_debug_symbols=no
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=yes
enable_debug=no
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=yes
enable_mpi_f90=yes
enable_mpi_cxx=yes
enable_mpi_cxx_seek=yes
enable_cxx_exceptions=yes
enable_mca_no_build=pml-dr,pml-crcp2,crcp,filem
with_io_romio_flags=--with-file-system=ufs+nfs+panfs
with_threads=posix



--
Jeff Squyres
Cisco Systems
