On Jan 22, 2009, at 9:18 AM, Bogdan Costescu wrote:

I'm still having some trouble using the newly released 1.3 with Myricom's MX. I had meant to send a message earlier, but the release candidates went by so fast that I didn't have time to catch up and test.

General details:
        Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
        Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
        Torque 2.1.10 (but this shouldn't make a difference)
        MX 1.2.7 with a tiny patch from Myricom
        OpenMPI 1.3
        IMB 3.1

OpenMPI was configured with '--enable-shared --enable-static --with-mx=... --with-tm=...'. In all cases, there were no options specified at runtime (either in files or on the command line) except for the PML and BTL selection.
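
For concreteness, those selections amount to command lines roughly like the following (illustrative only; the rank count and IMB binary path are placeholders for the actual job script):

  # OB1 PML with the MX BTL (plus shared-memory and self BTLs)
  mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self ./IMB-MPI1

  # CM PML with the MX MTL
  mpirun -np 128 --mca pml cm --mca mtl mx ./IMB-MPI1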

Problem 1:

I still see hangs of collective functions when running on a large number of nodes (or maybe ranks) with the default OB1+BTL. For example, with 128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, IMB hangs in Gather.

Bogdan, this sounds similar to the issue you experienced in December, which we believed had been fixed. I do not remember whether it was tied to the default collective or to free list management.

Can you try a run with:

  -mca btl_mx_free_list_max 1000000

added to the command line?
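
For example, a complete invocation might look like this (the rank count and IMB binary path are placeholders for your setup):

  mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self \
      --mca btl_mx_free_list_max 1000000 ./IMB-MPI1 Gather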

After that, try additional runs without the above but with:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N

where N is 0, 1, 2, then 3 (one run for each value).
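
If it is easier to script those four runs, something along these lines should work (binary path again a placeholder):

  for N in 0 1 2 3; do
      mpirun -np 128 --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_gather_algorithm $N ./IMB-MPI1 Gather
  done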

Problem 2:

When using the CM+MTL with 128 ranks, the run finishes fine on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors that I haven't seen before:

Max retransmit retries reached (1000) for message
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=510000ms)
       dest: 00:60:dd:47:89:40 (opt029:0)
       partner: peer_index=146, endpoint=3, seqnum=0x2944
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=510000ms)
       dest: 00:60:dd:47:89:40 (opt029:0)
       partner: peer_index=146, endpoint=3, seqnum=0x2f9a
       matched_val: 0x0068002a_fffffff2
       slength=32768, xfer_length=32768
       matched_val: 0x0068002b_fffffff2
       slength=32768, xfer_length=32768
       seg: 0x2aaacc30f010,32768
       caller: 0x5b

These are two overlapped messages from the MX library. It is unable to send to opt029 (i.e., opt029 is not consuming messages).

I would also need some help from the MX experts out there to understand the source of these messages - I can only see opt029 mentioned,

Anyone: does 1.3 support rank labeling of stdout? If so, Bogdan should rerun with --display-map and the labeling option.

so does it try to communicate intra-node? (IOW, the equivalent of the "self" BTL in OpenMPI.) This would be somewhat consistent with the failing job running more ranks per node (4) than the successful one (2).

I am under the impression that the MTLs pass all messages to the interconnect. If so, then MX is handling self, shared memory (shmem), and host-to-host traffic. Self, by the way, means a single rank (process) communicating with itself. In your case, the intra-node traffic goes over shmem.
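
If 1.3 does support labeling, a rerun along these lines would tag each output line by rank (I believe the mpirun flag is --tag-output, but please verify; the binary path is a placeholder):

  mpirun -np 128 --display-map --tag-output \
      --mca pml cm --mca mtl mx ./IMB-MPI1

If the intra-node (shmem) path is the suspect, it may also be worth one run with MX's shared-memory support disabled - assuming I am remembering the MX environment variable name (MX_DISABLE_SHMEM) correctly:

  mpirun -np 128 -x MX_DISABLE_SHMEM=1 \
      --mca pml cm --mca mtl mx ./IMB-MPI1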

At this point, the job hangs in Alltoallv. The strace output is the same as for OB1+BTL above.

Can anyone suggest some ways forward? I'd be happy to help with debugging if given some instructions.

I would suggest the same test as above with:

  -mca btl_mx_free_list_max 1000000

Additionally, try the tuned collective algorithms for Alltoallv:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N

where N is 0, 1, then 2 (one run for each value).
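
As with the Gather runs above, these can be scripted (binary path again a placeholder):

  for N in 0 1 2; do
      mpirun -np 128 --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_alltoallv_algorithm $N ./IMB-MPI1 Alltoallv
  done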

Scott
