Hi!
I'm still having some trouble using the newly released 1.3 with
Myricom's MX. I had meant to send a message earlier, but the release
candidates went by so fast that I didn't have time to catch up and test.
General details:
Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
Torque 2.1.10 (but this shouldn't make a difference)
MX 1.2.7 with a tiny patch from Myricom
OpenMPI 1.3
IMB 3.1
OpenMPI was configured with '--enable-shared --enable-static
--with-mx=... --with-tm=...'
In all cases, there were no options specified at runtime (either in
files or on the command line) except for the PML and BTL selection.
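For reference, the PML/BTL selection was the only thing passed on the
mpirun command line - roughly as below (quoting from memory, so the
exact component lists may not be verbatim):
mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self ./IMB-MPI1   (Problem 1)
mpirun -np 128 --mca pml cm --mca mtl mx ./IMB-MPI1            (Problem 2)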
Problem 1:
I still see hangs of collective functions when running on a large
number of nodes (or maybe ranks) with the default OB1+BTL. For
example, with 128 ranks distributed as nodes=32:ppn=4 or
nodes=64:ppn=2, IMB hangs in Gather.
strace reports for each rank a stream of:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3,
0) = 0
and once in a while a:
futex(0x55f650, FUTEX_WAKE, 1) = 1
A gdb backtrace on one rank shows:
#0 0x00002acf2615d090 in pthread_mutex_unlock () from /lib/libpthread.so.0
#1 0x00002acf25a53858 in mx_ipeek (endpoint=0x565150, request=0x7fff857215c0,
result=0x7fff857215cc) at ./../mx_ipeek.c:45
#2 0x00002acf2551ec87 in mca_btl_mx_component_progress () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#3 0x00002acf258ea1a2 in opal_progress () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libopen-pal.so.0
#4 0x00002acf2558436b in mca_pml_ob1_recv () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#5 0x00002acf25535da6 in ompi_coll_tuned_gather_intra_linear_sync () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#6 0x00002acf255280e6 in ompi_coll_tuned_gather_intra_dec_fixed () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#7 0x00002acf254faa93 in PMPI_Gather () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#8 0x0000000000409e7b in IMB_gather ()
#9 0x0000000000403838 in main ()
and on another rank:
#0 0x00002b71acb4ac04 in mca_btl_mx_component_progress () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#1 0x00002b71acf161a2 in opal_progress () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libopen-pal.so.0
#2 0x00002b71acbb036b in mca_pml_ob1_recv () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#3 0x00002b71acb61da6 in ompi_coll_tuned_gather_intra_linear_sync () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#4 0x00002b71acb540e6 in ompi_coll_tuned_gather_intra_dec_fixed () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#5 0x00002b71acb26a93 in PMPI_Gather () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#6 0x0000000000409e7b in IMB_gather ()
#7 0x0000000000403838 in main ()
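If it would help in narrowing this down, I could also replace IMB with
a minimal standalone gather loop, to rule out the benchmark itself -
something along the lines of this untested sketch (message size and
iteration count are just guesses at the region where IMB hangs):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 16384;   /* bytes per rank - a guess at the hanging size */
    const int iters = 1000;    /* also a guess */
    char *sendbuf, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(count);
    if (rank == 0)
        recvbuf = malloc((size_t)count * size);

    for (i = 0; i < iters; i++) {
        /* gather everything to rank 0 */
        MPI_Gather(sendbuf, count, MPI_BYTE,
                   recvbuf, count, MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == 0 && i % 100 == 0)
            printf("iteration %d done\n", i);
    }

    free(sendbuf);
    if (rank == 0)
        free(recvbuf);
    MPI_Finalize();
    return 0;
}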
Problem 2:
When using the CM+MTL with 128 ranks, the run finishes fine on
nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors that I
haven't seen before:
Max retransmit retries reached (1000) for message
Max retransmit retries reached (1000) for message
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=510000ms)
dest: 00:60:dd:47:89:40 (opt029:0)
partner: peer_index=146, endpoint=3, seqnum=0x2944
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=510000ms)
dest: 00:60:dd:47:89:40 (opt029:0)
partner: peer_index=146, endpoint=3, seqnum=0x2f9a
matched_val: 0x0068002a_fffffff2
slength=32768, xfer_length=32768
matched_val: 0x0068002b_fffffff2
slength=32768, xfer_length=32768
seg: 0x2aaacc30f010,32768
caller: 0x5b
Was trying to contact
00:60:dd:47:89:40 (opt029:0)/3
Aborted 2 send requests due to remote peer seg: 0x2aaacc30f010,32768
caller: 0x1b
Was trying to contact
00:60:dd:47:89:40 (opt029:0)00:60:dd:47:89:40 (opt029:0) disconnected
/3
Aborted 2 send requests due to remote peer 00:60:dd:47:89:40 (opt029:0)
disconnected
...
(The output comes interleaved from the nodes, so some information may
be missing or garbled.) These errors seem to come from libmyriexpress,
not OpenMPI. However, earlier OpenMPI versions did not show them, and
neither does MPICH-MX, so I wonder whether some new behaviour in
OpenMPI 1.3 triggers them. I would also appreciate help from the MX
experts out there in understanding the source of these messages - I
can only see opt029 mentioned, so is it trying to communicate
intra-node? (IOW the equivalent of the "self" BTL in OpenMPI.) That
would be somewhat consistent with the failing job running more ranks
per node (4) than the successful one (2).
At this point, the job hangs in Alltoallv. The strace output is the
same as for OB1+BTL above.
The gdb backtrace shows:
#0 0x00002b001d01e318 in mx__luigi (ep=0x55f650) at ./../mx__lib.c:2373
#1 0x00002b001d01283d in mx_ipeek (endpoint=0x55f650, request=0x7fff8e160090,
result=0x7fff8e16009c) at ./../mx_ipeek.c:40
#2 0x00002b001cb29252 in ompi_mtl_mx_progress () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#3 0x00002b001cea91a2 in opal_progress () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libopen-pal.so.0
#4 0x00002b001ca9f77d in ompi_request_default_wait_all () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#5 0x00002b001caec8c4 in ompi_coll_tuned_alltoallv_intra_basic_linear () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#6 0x00002b001cab36c0 in PMPI_Alltoallv () from
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#7 0x000000000040a5be in IMB_alltoallv ()
#8 0x0000000000403838 in main ()
Can anyone suggest some ways forward? I'd be happy to help with
debugging if given some instructions.
Thanks in advance!
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de