On Jan 22, 2009, at 9:18 AM, Bogdan Costescu wrote:

I'm still having some trouble using the newly released 1.3 with Myricom's MX. I had meant to send a message earlier, but the release candidates went by so fast that I didn't have time to catch up and test.

General details:
        Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
        Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
        Torque 2.1.10 (but this shouldn't make a difference)
        MX 1.2.7 with a tiny patch from Myricom
        OpenMPI 1.3
        IMB 3.1

OpenMPI was configured with '--enable-shared --enable-static --with-mx=... --with-tm=...'. In all cases, there were no options specified at runtime (either in files or on the command line) except for the PML and BTL selection.
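
For concreteness, those selections amount to command lines roughly like the following (illustrative only; the rank count and IMB binary path are placeholders for the actual job script):

  # OB1 PML with the MX BTL (plus shared-memory and self BTLs)
  mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self ./IMB-MPI1

  # CM PML with the MX MTL
  mpirun -np 128 --mca pml cm --mca mtl mx ./IMB-MPI1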

Problem 1:

I still see hangs of collective functions when running on a large number of nodes (or maybe ranks) with the default OB1+BTL. For example, with 128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, IMB hangs in Gather.

Bogdan, this sounds similar to the issue you experienced in December, which we believed had been fixed. I do not remember whether it was tied to the default collective or to free list management.

Can you try a run with:

  -mca btl_mx_free_list_max 1000000

added to the command line?
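
For example, a complete invocation might look like this (the rank count and IMB binary path are placeholders for your setup):

  mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self \
      --mca btl_mx_free_list_max 1000000 ./IMB-MPI1 Gather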

After that, try additional runs without the above but with:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N

where N is 0, 1, 2, then 3 (one run for each value).
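
If it is easier to script those four runs, something along these lines should work (binary path again a placeholder):

  for N in 0 1 2 3; do
      mpirun -np 128 --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_gather_algorithm $N ./IMB-MPI1 Gather
  done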

Problem 2:

When using the CM+MTL with 128 ranks, the run finishes fine on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors that I haven't seen before:

Max retransmit retries reached (1000) for message
Max retransmit retries reached (1000) for message
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=510000ms)
       dest: 00:60:dd:47:89:40 (opt029:0)
       partner: peer_index=146, endpoint=3, seqnum=0x2944
       type (2): send_medium
       state (0x14): buffered dead
       requeued: 1000 (timeout=510000ms)
       dest: 00:60:dd:47:89:40 (opt029:0)
       partner: peer_index=146, endpoint=3, seqnum=0x2f9a
       matched_val: 0x0068002a_fffffff2
       slength=32768, xfer_length=32768
       matched_val: 0x0068002b_fffffff2
       slength=32768, xfer_length=32768
       seg: 0x2aaacc30f010,32768
       caller: 0x5b

These are two overlapped messages from the MX library. It is unable to send to opt029 (i.e., opt029 is not consuming messages).

I would also need some help from the MX experts out there to understand the source of these messages - I can only see opt029 mentioned,

Anyone: does 1.3 support rank labeling of stdout? If so, Bogdan should rerun with --display-map and the labeling option.

so does it try to communicate intra-node? (IOW, the equivalent of the "self" BTL in OpenMPI.) This would be somewhat consistent with the failing job running more ranks per node (4) than the successful one (2).

I am under the impression that the MTLs pass all messages to the interconnect. If so, then MX is handling self, shared memory (shmem), and host-to-host traffic. Self, by the way, means a single rank (process) communicating with itself. In your case, the intra-node traffic goes over shmem.
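
If 1.3 does support labeling, a rerun along these lines would tag each output line by rank (I believe the mpirun flag is --tag-output, but please verify; the binary path is a placeholder):

  mpirun -np 128 --display-map --tag-output \
      --mca pml cm --mca mtl mx ./IMB-MPI1

If the intra-node (shmem) path is the suspect, it may also be worth one run with MX's shared-memory support disabled - assuming I am remembering the MX environment variable name (MX_DISABLE_SHMEM) correctly:

  mpirun -np 128 -x MX_DISABLE_SHMEM=1 \
      --mca pml cm --mca mtl mx ./IMB-MPI1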

At this point, the job hangs in Alltoallv. The strace output is the same as for OB1+BTL above.

Can anyone suggest some ways forward? I'd be happy to help with debugging if given some instructions.

I would suggest the same test as above with:

  -mca btl_mx_free_list_max 1000000

Additionally, try the tuned collective algorithms for Alltoallv:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N

where N is 0, 1, then 2 (one run for each value).
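
As with the Gather runs above, these can be scripted (binary path again a placeholder):

  for N in 0 1 2; do
      mpirun -np 128 --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_alltoallv_algorithm $N ./IMB-MPI1 Alltoallv
  done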

Scott
