I've attached a simple test program that should demonstrate the limitations I'm seeing when joining multiple multicast groups, so that others can reproduce the odd behavior and we can make some progress.

An MPI implementation is needed to compile and run the test. It takes no arguments: the test repeatedly joins groups (without leaving them) until an error occurs, then intentionally hangs.
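For anyone who doesn't want to unpack the tarball, the loop structure described above can be sketched roughly as follows. This is not the attached code: the real test is an MPI program that performs actual IB multicast joins, while `make_stub_join` here is a hypothetical stand-in that just fails once a made-up limit is reached, so that the counting logic can be run anywhere.

```python
from typing import Callable


def count_joins(try_join: Callable[[int], bool], max_attempts: int = 100_000) -> int:
    """Join groups 0, 1, 2, ... without ever leaving one.

    Returns the number of joins that succeeded before the first failure.
    """
    joined = 0
    for index in range(max_attempts):
        if not try_join(index):
            break  # first error: stop and report how far we got
        joined += 1
    return joined


def make_stub_join(limit: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for a real multicast join.

    Succeeds until `limit` groups are held, then fails. The limit is an
    assumption used only to exercise the loop; it is not a real IB limit.
    """
    held = set()

    def try_join(index: int) -> bool:
        if len(held) >= limit:
            return False
        held.add(index)
        return True

    return try_join


if __name__ == "__main__":
    print(count_joins(make_stub_join(891)))
```

The join is passed in as a callable so the stub could be swapped for a real join routine without touching the counting loop.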

Here are some of the different behaviors I see with this test (OFED v1.2 is used throughout):

mpirun -np 1 ./jointest

On my 128-node machine 'odin' running OpenSM, I was able to join 891 groups quite a few times in a row. Then, suddenly, running the same test again let me join only 5 groups, and this behavior persists on that node. I can go to another node on the same machine and again join 891 groups. If I run the test separately on two different nodes (each of which can still join 891 on its own), I am able to join a total of 891 groups between both nodes before both tests error. If I run on one node that errors after 5 groups and another that errors at 891 groups, the first node joins 5 groups and the second joins 886 groups.
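One way to read the odin numbers is that the nodes are drawing from a single fabric-wide pool of 891 groups, with the "bad" node additionally capped at 5. That is only a hypothesis, not something I have confirmed in OpenSM. The sketch below models it with a hypothetical `run_shared_pool` helper; the pool size and the per-node cap are taken from the observations above, and the round-robin ordering is an assumption.

```python
from typing import List, Optional


def run_shared_pool(total: int, node_caps: List[Optional[int]]) -> List[int]:
    """Model nodes joining groups from one shared pool of `total` groups.

    node_caps[i] is an extra per-node limit, or None for no per-node limit.
    Nodes take turns joining one group each (assumed ordering) until the
    pool is empty or every node has hit its own cap.
    """
    joined = [0] * len(node_caps)
    remaining = total
    active = set(range(len(node_caps)))
    while remaining > 0 and active:
        for node in sorted(active):
            cap = node_caps[node]
            if cap is not None and joined[node] >= cap:
                active.discard(node)  # this node errors out early
                continue
            if remaining == 0:
                break  # pool exhausted: everyone still running errors
            joined[node] += 1
            remaining -= 1
    return joined


if __name__ == "__main__":
    # Two healthy nodes: the counts always sum to the pool size.
    print(run_shared_pool(891, [None, None]))
    # One node capped at 5, one healthy node.
    print(run_shared_pool(891, [5, None]))
```

Under these assumptions the capped case gives 5 and 886, which matches what I measured, and two healthy nodes together never exceed 891.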

On a separate 8-node machine 'thor' running Cisco's SM on a Topspin switch, I can join 14 groups.

mpirun -np 2 ./jointest   (one node)

On odin I can join 892 groups; thor is able to join 5 groups.

mpirun -np 2 ./jointest   (two nodes)

Odin was able to join 4 groups for the first 3 runs, then was able to join 14 groups repeatedly. Thor is able to join 5 groups consistently.


None of these results seem to match any of the hardcoded limits people have mentioned to me. I really need to figure out the cause of this strange behavior, since in most of these cases the limit severely restricts the usability of IB multicast in MPI. Is my test code correct? Does anybody know what is causing this, or where I could look or what I could test to narrow it down? I've gotten suggestions that the problem lies in the SM, though I haven't found anything blatantly wrong when reading the relevant parts of the OpenSM code.

Andrew

Attachment: jointest.tar.gz
Description: GNU Zip compressed data

