Some updates on this problem.

The code I'm using to test/produce this behavior is an MPI program. MPI is used for convenience of job startup and collection of results; the actual test/benchmark uses straight RDMA CM & ibverbs. What I'm doing is timing how long it takes to join and bring up a multicast group, with a varying number of processes and existing groups. One rank joins with a '0' address to get a real address and MPI_Bcast's that address to the other ranks, which then join the group. Meanwhile the root rank repeatedly sends a small ping message to the group. Every other rank measures the time from its rdma_join_multicast() call to the arrival of the join event, and from that call to when it first receives a message on the group. Once a group is up, the process repeats N times, leaving all the groups joined.
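
Per group, each non-root rank does roughly the following (a simplified C sketch, not the actual benchmark code -- cm_id/channel setup, the MPI_Bcast of the address, and most error handling are omitted, and wait_for_first_ping() is a placeholder for polling the CQ until the root's ping arrives):

    #include <stdio.h>
    #include <mpi.h>
    #include <rdma/rdma_cma.h>

    /* placeholder: poll the UD QP's CQ until the root's ping shows up */
    extern void wait_for_first_ping(struct rdma_cm_id *cm_id);

    static void time_one_join(struct rdma_cm_id *cm_id,
                              struct rdma_event_channel *channel,
                              struct sockaddr *mc_addr)
    {
        struct rdma_cm_event *event;
        double t_call, t_join, t_msg;

        t_call = MPI_Wtime();
        if (rdma_join_multicast(cm_id, mc_addr, NULL)) {
            perror("rdma_join_multicast");
            return;
        }

        /* block until the CM reports that the join completed */
        rdma_get_cm_event(channel, &event);
        if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
            fprintf(stderr, "unexpected CM event %d\n", event->event);
            rdma_ack_cm_event(event);
            return;
        }
        t_join = MPI_Wtime();
        rdma_ack_cm_event(event);

        wait_for_first_ping(cm_id);
        t_msg = MPI_Wtime();

        printf("join: %f s  first message: %f s\n",
               t_join - t_call, t_msg - t_call);
    }

The two times printed there are the ones I report below.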

I'm now running OFED v1.2; the behavior has not changed, though I've noticed some other cases. First, if nothing on the network has used multicast for a while, my benchmark can only join a total of 4 groups. After that first run, no matter how many more times I run it, I can join 14 groups as described below.

Now the more interesting part. I can now run on a 128-node machine using OpenSM running on a node (before, I was running on an 8-node machine which I'm told uses the Cisco SM on a Topspin switch). On this machine, if I run my benchmark with two processes per node instead of one (e.g. mpirun -np 16 on 8 nodes), I'm able to join more than 750 groups simultaneously from one QP in each process. Stranger still, running the same thing on the 8-node machine I can join only 4 groups.

While doing this I noticed that the time from calling rdma_join_multicast() to the arrival of the join event stayed fairly constant (around 0.001 sec), while the time from the join call to actually receiving messages on the group steadily increased from around 0.1 sec to around 2.7 sec with 750+ groups joined. Furthermore, this time does not drop back to 0.1 sec if I stop the benchmark and run it (or any of my other multicast code) again. That would be understandable within a single program run, but the fact that the behavior persists across runs concerns me -- it feels like a bug, though I don't have anything concrete here.

Sorry for the long email -- I'm trying to provide as much detail as possible so this can get fixed. I'm really not sure where to start looking on my own, so even some hints on where the problem(s) might lie would be useful.

Andrew

Andrew Friedley wrote:
I've run into a problem where it appears that I cannot join more than 14 multicast groups from a single HCA. I'm using the RDMA CM UD/multicast interface from an OFED v1.2 nightly build, and using a '0' address when joining so that the SM allocates an unused address. The first 14 rdma_join_multicast() calls succeed, a MULTICAST_JOIN event comes through for each of them, and everything works. But the 15th call to rdma_join_multicast() returns -1 and sets errno to 99 (EADDRNOTAVAIL, 'Cannot assign requested address').
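
The join loop looks roughly like this (a simplified sketch, not the exact code; expressing the '0' address as an all-zero AF_INET6 sockaddr here is just one way to do it):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <rdma/rdma_cma.h>

    /* Join 'ngroups' groups on one cm_id/QP, letting the SM pick each
     * address by passing an all-zero sockaddr.  In the runs described
     * above, the 15th call fails with errno 99 (EADDRNOTAVAIL). */
    static int join_groups(struct rdma_cm_id *cm_id, int ngroups)
    {
        struct sockaddr_in6 mc_addr;
        int i;

        memset(&mc_addr, 0, sizeof(mc_addr));
        mc_addr.sin6_family = AF_INET6;   /* address itself left zero */

        for (i = 0; i < ngroups; i++) {
            if (rdma_join_multicast(cm_id, (struct sockaddr *) &mc_addr,
                                    NULL)) {
                fprintf(stderr, "join %d: %s (errno %d)\n",
                        i + 1, strerror(errno), errno);
                return -1;
            }
            /* still need to wait for RDMA_CM_EVENT_MULTICAST_JOIN on
             * the id's event channel before the group is usable */
        }
        return 0;
    }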

Note that I'm using a single QP per process to do all the joins. Things get weirder if I run two instances of my program on the same node -- as soon as the total number of groups joined between the two instances reaches 14, neither instance can join any more groups. Also, my code currently hangs when this happens -- if I kill one of the two instances and start a third (while leaving the other hung, still holding some number of groups), the third instance cannot join ANY groups. The behavior resets when I kill all instances.

Two instances running on separate nodes (on the same network) do not appear to interfere with each other as described above; they still error out on the 15th join.

This feels like a bug to me; regardless, a limit of 14 groups is WAY too low. Any ideas what might be going on, or how I can work around it?

Andrew
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
