Hi!

I'm still having some trouble using the newly released 1.3 with Myricom's MX. I had meant to send a message earlier, but the release candidates went by so fast that I didn't have time to catch up and test.

General details:
        Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
        Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
        Torque 2.1.10 (but this shouldn't make a difference)
        MX 1.2.7 with a tiny patch from Myricom
        OpenMPI 1.3
        IMB 3.1

OpenMPI was configured with '--enable-shared --enable-static --with-mx=... --with-tm=...'. In all cases, no options were specified at runtime (either in files or on the command line) except for the PML and BTL selection.

Problem 1:

I still see hangs of collective functions when running on a large number of nodes (or maybe ranks) with the default OB1+BTL. For example, with 128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, IMB hangs in Gather.

strace reports for each rank a stream of:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 3, 0) = 0

and once in a while a:

futex(0x55f650, FUTEX_WAKE, 1)          = 1

A gdb stack shows:

#0  0x00002acf2615d090 in pthread_mutex_unlock () from /lib/libpthread.so.0
#1  0x00002acf25a53858 in mx_ipeek (endpoint=0x565150, request=0x7fff857215c0, 
result=0x7fff857215cc) at ./../mx_ipeek.c:45
#2  0x00002acf2551ec87 in mca_btl_mx_component_progress () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#3  0x00002acf258ea1a2 in opal_progress () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libopen-pal.so.0
#4  0x00002acf2558436b in mca_pml_ob1_recv () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#5  0x00002acf25535da6 in ompi_coll_tuned_gather_intra_linear_sync () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#6  0x00002acf255280e6 in ompi_coll_tuned_gather_intra_dec_fixed () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#7  0x00002acf254faa93 in PMPI_Gather () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#8  0x0000000000409e7b in IMB_gather ()
#9  0x0000000000403838 in main ()

and on another rank:

#0  0x00002b71acb4ac04 in mca_btl_mx_component_progress () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#1  0x00002b71acf161a2 in opal_progress () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libopen-pal.so.0
#2  0x00002b71acbb036b in mca_pml_ob1_recv () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#3  0x00002b71acb61da6 in ompi_coll_tuned_gather_intra_linear_sync () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#4  0x00002b71acb540e6 in ompi_coll_tuned_gather_intra_dec_fixed () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#5  0x00002b71acb26a93 in PMPI_Gather () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#6  0x0000000000409e7b in IMB_gather ()
#7  0x0000000000403838 in main ()
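
In case a standalone reproducer turns out to be useful, the loop below is roughly what I would try first to isolate the Gather hang outside of IMB (the message size and iteration count are guesses of mine, not IMB's actual parameters, and I have not yet verified that this alone triggers the hang):

/* gather_test.c - minimal Gather loop in the spirit of the IMB Gather test.
 * Build with: mpicc -o gather_test gather_test.c
 * The message size and iteration count are arbitrary choices of mine. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 4096;        /* ints contributed by each rank */
    const int iterations = 1000;
    int rank, size, i, iter;
    int *sendbuf, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(count * sizeof(int));
    for (i = 0; i < count; i++)
        sendbuf[i] = rank;
    if (rank == 0)                 /* only the root needs a receive buffer */
        recvbuf = malloc((size_t)count * size * sizeof(int));

    for (iter = 0; iter < iterations; iter++) {
        MPI_Gather(sendbuf, count, MPI_INT,
                   recvbuf, count, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0 && iter % 100 == 0) {
            printf("iteration %d done\n", iter);
            fflush(stdout);
        }
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}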

Problem 2:

When using the CM+MTL with 128 ranks, IMB finishes fine when running on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors that I haven't seen before:

Max retransmit retries reached (1000) for message
Max retransmit retries reached (1000) for message
        type (2): send_medium
        state (0x14): buffered dead
        requeued: 1000 (timeout=510000ms)
        dest: 00:60:dd:47:89:40 (opt029:0)
        partner: peer_index=146, endpoint=3, seqnum=0x2944
        type (2): send_medium
        state (0x14): buffered dead
        requeued: 1000 (timeout=510000ms)
        dest: 00:60:dd:47:89:40 (opt029:0)
        partner: peer_index=146, endpoint=3, seqnum=0x2f9a
        matched_val: 0x0068002a_fffffff2
        slength=32768, xfer_length=32768
        matched_val: 0x0068002b_fffffff2
        slength=32768, xfer_length=32768
        seg: 0x2aaacc30f010,32768
        caller: 0x5b

Was trying to contact
        00:60:dd:47:89:40 (opt029:0)/3
Aborted 2 send requests due to remote peer      seg: 0x2aaacc30f010,32768
        caller: 0x1b

Was trying to contact
        00:60:dd:47:89:40 (opt029:0)00:60:dd:47:89:40 (opt029:0) disconnected
/3
Aborted 2 send requests due to remote peer 00:60:dd:47:89:40 (opt029:0) 
disconnected
...

(The output comes interleaved from the nodes, so some information might be missing or garbled.) These errors seem to come from libmyriexpress, not OpenMPI. However, earlier OpenMPI versions have not shown such errors, and neither has MPICH-MX, so I wonder if some new behaviour in OpenMPI 1.3 triggers them. I would also need some help from the MX experts out there to understand the source of these messages - I can only see opt029 mentioned, so is it trying to communicate intra-node (IOW the equivalent of the "self" BTL in OpenMPI)? That would be somewhat consistent with this job running more ranks per node (4) than the successful one (2 ranks per node); a small single-node test sketch follows the gdb stack below.

At this point, the job hangs in Alltoallv. The strace output is the same as for OB1+BTL above.

The gdb stack shows:
#0  0x00002b001d01e318 in mx__luigi (ep=0x55f650) at ./../mx__lib.c:2373
#1  0x00002b001d01283d in mx_ipeek (endpoint=0x55f650, request=0x7fff8e160090, 
result=0x7fff8e16009c) at ./../mx_ipeek.c:40
#2  0x00002b001cb29252 in ompi_mtl_mx_progress () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#3  0x00002b001cea91a2 in opal_progress () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libopen-pal.so.0
#4  0x00002b001ca9f77d in ompi_request_default_wait_all () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#5  0x00002b001caec8c4 in ompi_coll_tuned_alltoallv_intra_basic_linear () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#6  0x00002b001cab36c0 in PMPI_Alltoallv () from 
/opt/openmpi/1.3/gcc-4.1.2/lib/libmpi.so.0
#7  0x000000000040a5be in IMB_alltoallv ()
#8  0x0000000000403838 in main ()
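
To test the intra-node hypothesis mentioned under Problem 2, I'm thinking of running a small Alltoallv loop like the one below entirely on a single node (e.g. 4 ranks on one host) with the CM+MTL, to see whether purely intra-node MX traffic is already enough to trigger the retransmit messages or the hang. The per-peer count and iteration count are arbitrary choices of mine, not IMB's parameters:

/* alltoallv_test.c - small Alltoallv loop meant to be run on a single node.
 * Build with: mpicc -o alltoallv_test alltoallv_test.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int per_peer = 8192;     /* ints sent to each peer (arbitrary) */
    const int iterations = 500;
    int rank, size, i, iter;
    int *sendcounts, *recvcounts, *sdispls, *rdispls;
    int *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every rank sends the same amount to every peer */
    sendcounts = malloc(size * sizeof(int));
    recvcounts = malloc(size * sizeof(int));
    sdispls    = malloc(size * sizeof(int));
    rdispls    = malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        sendcounts[i] = recvcounts[i] = per_peer;
        sdispls[i] = rdispls[i] = i * per_peer;
    }

    sendbuf = malloc((size_t)per_peer * size * sizeof(int));
    recvbuf = malloc((size_t)per_peer * size * sizeof(int));
    for (i = 0; i < per_peer * size; i++)
        sendbuf[i] = rank;

    for (iter = 0; iter < iterations; iter++) {
        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                      recvbuf, recvcounts, rdispls, MPI_INT,
                      MPI_COMM_WORLD);
        if (rank == 0 && iter % 100 == 0) {
            printf("iteration %d done\n", iter);
            fflush(stdout);
        }
    }

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(recvcounts);
    free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}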

Can anyone suggest some ways forward? I'd be happy to help with the debugging if given some instructions.

Thanks in advance!

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
