On Jan 22, 2009, at 9:18 AM, Bogdan Costescu wrote:
I'm still having some trouble using the newly released 1.3 with
Myricom's MX. I meant to send a message earlier, but the release
candidates went by so fast that I didn't have time to catch up and test.
General details:
Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
Torque 2.1.10 (but this shouldn't make a difference)
MX 1.2.7 with a tiny patch from Myricom
OpenMPI 1.3
IMB 3.1
OpenMPI was configured with '--enable-shared --enable-static --with-mx=... --with-tm=...'
In all cases, there were no options specified at runtime (either in
files or on the command line) except for the PML and BTL selection.
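For reference, the OB1+BTL and CM+MTL selections referred to below typically look like this on the mpirun command line (a sketch; the exact BTL list is illustrative rather than the literal options used):

  # OB1 PML with the MX, shared-memory and self BTLs
  mpirun --mca pml ob1 --mca btl mx,sm,self ...

  # CM PML with the MX MTL
  mpirun --mca pml cm --mca mtl mx ...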
Problem 1:
I still see hangs of collective functions when running on a large
number of nodes (or maybe ranks) with the default OB1+BTL. E.g. with
128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, IMB
hangs in Gather.
Bogdan, this sounds similar to the issue you experienced in
December that had since been fixed. I do not remember whether it was
tied to the default collective or to free list management.
Can you try a run with:
-mca btl_mx_free_list_max 1000000
added to the command line?
After that, try additional runs without the above option but with:
--mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_gather_algorithm N
where N is 0, 1, 2, then 3 (one run for each value).
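Concretely, those runs could look something like this (a sketch assuming the IMB-MPI1 binary, 128 ranks and the OB1/MX selection above; adjust to your actual command line):

  # run with a much larger MX BTL free list
  mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self \
      --mca btl_mx_free_list_max 1000000 ./IMB-MPI1 Gather

  # then, without the free list setting, force each tuned gather algorithm in turn
  for N in 0 1 2 3; do
      mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self \
          --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_gather_algorithm $N ./IMB-MPI1 Gather
  done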
Problem 2:
When using CM+MTL with 128 ranks, the run finishes fine
on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors
that I haven't seen before:
Max retransmit retries reached (1000) for message
Max retransmit retries reached (1000) for message
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=510000ms)
dest: 00:60:dd:47:89:40 (opt029:0)
partner: peer_index=146, endpoint=3, seqnum=0x2944
type (2): send_medium
state (0x14): buffered dead
requeued: 1000 (timeout=510000ms)
dest: 00:60:dd:47:89:40 (opt029:0)
partner: peer_index=146, endpoint=3, seqnum=0x2f9a
matched_val: 0x0068002a_fffffff2
slength=32768, xfer_length=32768
matched_val: 0x0068002b_fffffff2
slength=32768, xfer_length=32768
seg: 0x2aaacc30f010,32768
caller: 0x5b
These are two overlapping messages from the MX library. It is unable
to send to opt029 (i.e., opt029 is not consuming messages).
I would also need some help from the MX experts out there to
understand the source of these messages - I can only see
opt029 mentioned,
Anyone, does 1.3 support rank labeling of stdout? If so, Bogdan should
rerun with --display-map and the option that enables the labeling.
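If it does, something along these lines should print the process map and tag each line of stdout with the rank that produced it (a sketch; I believe the labeling option is --tag-output, but check mpirun --help for the exact name):

  mpirun -np 128 --display-map --tag-output \
      --mca pml cm --mca mtl mx ./IMB-MPI1 Alltoallv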
so does it try to communicate intra-node? (IOW, the equivalent of the
"self" BTL in OpenMPI.) This would be somewhat consistent with running
more ranks per node (4) than in the successful job (2 ranks per
node).
I am under the impression that the MTLs pass all messages to the
interconnect. If so, then MX is handling self, shared memory (shmem),
and host-to-host. Self, by the way, is a single rank (process)
communicating with itself. In your case, you are using shmem.
At this point, the job hangs in Alltoallv. The strace output is the
same as for OB1+BTL above.
Can anyone suggest some ways forward? I'd be happy to help with
debugging if given some instructions.
I would suggest the same test as above with:
-mca btl_mx_free_list_max 1000000
Additionally, try the following tuned collectives for alltoallv:
--mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm N
where N is 0, 1, then 2 (one run for each value).
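As with the gather runs above, a sketch of what these could look like (again assuming IMB-MPI1, 128 ranks and the CM/MX MTL selection; adjust as needed):

  for N in 0 1 2; do
      mpirun -np 128 --mca pml cm --mca mtl mx \
          --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_alltoallv_algorithm $N ./IMB-MPI1 Alltoallv
  done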
Scott