On Thu, 22 Jan 2009, Scott Atchley wrote:
Can you try a run with:
-mca btl_mx_free_list_max 1000000
Still hangs in Gather on 128 ranks.
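(For reference, a minimal sketch of such a run, with the executable
name as a placeholder:

  mpirun -np 128 --mca btl_mx_free_list_max 1000000 ./my_mpi_test
)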
After that, try additional runs without the above but with:
--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N
where N is 0, 1, 2, then 3 (one run for each value).
0 hangs in Gather on 64 ranks (out of 128)
1 passes Gather, hangs in Alltoall on 64 ranks (out of 128)
2 passes Gather, hangs in Alltoall on 32 ranks (out of 128)
3 passes Gather, hangs in Alltoall on 64 ranks (out of 128)
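(A sketch of how such a sweep can be scripted; the executable name is
a placeholder:

  for N in 0 1 2 3; do
    mpirun -np 128 --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_gather_algorithm $N ./my_mpi_test
  done
)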
I've also done a run with nodes=128:ppn=1 (so the sm BTL was not
involved) with default options, and it also hung in Gather on 64
ranks (out of 128).
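(Assuming the usual Torque/PBS notation, nodes=128:ppn=1 corresponds
to a resource request like:

  #PBS -l nodes=128:ppn=1

which places a single rank per node, so the sm BTL has no local peers
to talk to.)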
These are two overlapped messages from the MX library. It is unable
to send to opt029 (i.e. opt029 is not consuming messages).
Immediately after my test job, another job ran on this node and
finished successfully; that job was certainly not using Open MPI
1.3 (because I had only just installed it...), but it was certainly
using MX. This leads me to believe that there was nothing wrong with
the node.
Anyone, does 1.3 support rank labeling of stdout? If so, Bogdan
should rerun it with --display-map and the option to support
labeling.
I think that this is only in the trunk at the moment; there were some
messages on this subject in the past few days...
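(If the trunk option is the one I think it is, a labeled run would
look something like the following; --tag-output is my guess at the
option meant here, and the executable name is a placeholder:

  mpirun -np 128 --display-map --tag-output ./my_mpi_test
)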
I am under the impression that the MTLs pass all messages to the
interconnect. If so, then MX is handling self, shared memory
(shmem), and host-to-host. Self, by the way, is a single rank
(process) communicating with itself. In your case, you are using
shmem.
Indeed, that was my mistake: I thought "sm" but wrote "self".
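(To make the distinction explicit when launching, both stacks can be
forced by hand; a sketch, with the executable name as a placeholder:

  # ob1 PML: MX BTL for host-to-host, sm on-node, self within a rank
  mpirun -np 128 --mca pml ob1 --mca btl mx,sm,self ./my_mpi_test

  # cm PML: the MX MTL handles self, on-node and host-to-host itself
  mpirun -np 128 --mca pml cm --mca mtl mx ./my_mpi_test
)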
I would suggest the same test as above with:
-mca btl_mx_free_list_max 1000000
Finishes successfully.
Additionally, try the following tuned collectives for alltoallv:
--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N
where N is 0, 1, then 2 (one run for each value).
0 finishes successfully
1 finishes successfully
2 finishes successfully
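(The available values for these algorithm parameters, with short
descriptions, should be listable via ompi_info, if I remember its
syntax correctly:

  ompi_info --param coll tuned | grep alltoallv
)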
I've also run using CM+MTL on nodes=128:ppn=1 and nodes=128:ppn=2 with
default options and they finished successfully. So I guess that the
error I saw was transient... I'll do some more runs under the same
conditions and will write back if this problem appears again.
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de