On Thu, 22 Jan 2009, Scott Atchley wrote:

> Can you try a run with:
>
> -mca btl_mx_free_list_max 1000000

Still hangs in Gather on 128 ranks.
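
(For the record, the flag was passed on the mpirun command line; a representative invocation, with the rank count and the binary name as placeholders rather than my actual job script, would be:

mpirun -np 128 --mca btl mx,sm,self -mca btl_mx_free_list_max 1000000 ./mpi_test

)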

> After that, try additional runs without the above but with:
>
> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N
>
> where N is 0, 1, 2, then 3 (one run for each value).

N=0: hangs in Gather on 64 ranks (out of 128)
N=1: passes Gather, hangs in Alltoall on 64 ranks (out of 128)
N=2: passes Gather, hangs in Alltoall on 32 ranks (out of 128)
N=3: passes Gather, hangs in Alltoall on 64 ranks (out of 128)

I've also done a run with nodes=128:ppn=1 (so the sm BTL was not involved) with default options, and it also hung in Gather on 64 ranks (out of 128).
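
For reference, the communication pattern that hangs is nothing exotic. A stripped-down reproducer, only a sketch with assumed parameters (1 MiB of chars per rank, root 0, 100 iterations) rather than my actual test code, would look like:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, iter;
    const int count = 1 << 20;  /* assumed payload: 1 MiB of chars per rank */
    char *sendbuf, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(count);
    if (rank == 0)
        recvbuf = malloc((size_t)count * size);  /* root collects from all ranks */

    for (iter = 0; iter < 100; iter++) {
        MPI_Gather(sendbuf, count, MPI_CHAR,
                   recvbuf, count, MPI_CHAR, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("iteration %d done\n", iter);
            fflush(stdout);  /* so the last completed iteration is visible on a hang */
        }
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}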

> These are two overlapped messages from the MX library. It is unable to send to opt029 (i.e. opt029 is not consuming messages).

Immediately after my test job, another job ran on this node and finished successfully; that job was certainly not using Open MPI 1.3 (because I had just installed it...), but it was certainly using MX. This leads me to believe that there was nothing wrong with the node.

> Anyone, does 1.3 support rank labeling of stdout? If so, Bogdan should rerun it with --display-map and the option to support labeling.

I think that this is only in the trunk at the moment; there were some messages on this subject in the past few days...
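
Until mpirun can do the labeling itself, it can be approximated from inside the application. A minimal sketch (hypothetical code, not something from the Open MPI code base) that prefixes each line of output with the rank:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Prefix every line with the rank, so interleaved stdout from
       different ranks can still be attributed after the fact. */
    printf("[rank %3d] before the collective\n", rank);
    fflush(stdout);

    MPI_Barrier(MPI_COMM_WORLD);  /* stand-in for the collective under test */

    printf("[rank %3d] after the collective\n", rank);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}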

> I am under the impression that the MTLs pass all messages to the interconnect. If so, then MX is handling self, shared memory (shmem), and host-to-host. Self, by the way, is a single rank (process) communicating with itself. In your case, you are using shmem.

Indeed, that was my mistake: I was thinking "sm" but wrote "self".

> I would suggest the same test as above with:
>
> -mca btl_mx_free_list_max 1000000

Finishes successfully.

> Additionally, try the following tuned collectives for alltoallv:
>
> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N
>
> where N is 0, 1, then 2 (one run for each value).

N=0: finishes successfully
N=1: finishes successfully
N=2: finishes successfully

I've also run using CM+MTL on nodes=128:ppn=1 and nodes=128:ppn=2 with default options, and both finished successfully. So I guess that the error I saw was transient... I'll do some more runs under the same conditions and will write back if this problem appears again.
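
For completeness, the alltoallv pattern being exercised is the usual one. A self-contained sketch, with an assumed uniform 4 KiB per peer rather than the counts of my actual test:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int per_peer = 4096;  /* assumed uniform 4 KiB per peer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *scounts = malloc(size * sizeof(int));
    int *sdispls = malloc(size * sizeof(int));
    int *rcounts = malloc(size * sizeof(int));
    int *rdispls = malloc(size * sizeof(int));
    char *sendbuf = malloc((size_t)size * per_peer);
    char *recvbuf = malloc((size_t)size * per_peer);

    /* Every rank sends per_peer bytes to every other rank; with
       uniform counts the displacements are simply contiguous. */
    for (i = 0; i < size; i++) {
        scounts[i] = rcounts[i] = per_peer;
        sdispls[i] = rdispls[i] = i * per_peer;
    }

    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_CHAR,
                  recvbuf, rcounts, rdispls, MPI_CHAR, MPI_COMM_WORLD);

    if (rank == 0)
        printf("alltoallv completed\n");

    free(scounts); free(sdispls); free(rcounts); free(rdispls);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}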

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
