Now the more interesting part. I'm now able to run on a 128-node machine using OpenSM running on a node (before, I was running on an 8-node machine, which I'm told runs the Cisco SM on a Topspin switch). On this machine, if I run my benchmark with two processes per node (instead of one, i.e., mpirun -np 16 with 8 nodes), I'm able to join more than 750 groups simultaneously from one QP on each process. Stranger still, running the same thing on the 8-node machine, I can join only 4 groups.
Are the switches and HCAs in the two setups the same? If you run the same SM on both clusters, do you see the same results?
While doing so, I noticed that the time from calling rdma_join_multicast() to the arrival of the join event stayed fairly constant (around 0.001 sec), while the time from the join call to actually receiving messages on the group steadily increased from around 0.1 sec to around 2.7 sec with 750+ groups. Furthermore, this time does not drop back to 0.1 sec if I stop the benchmark and run it (or any of my other multicast code) again. The increase is understandable within a single program run, but the fact that the behavior persists across runs concerns me -- it feels like a bug, but I don't have much concrete here.
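For reference, the two intervals are measured roughly as in the sketch below. This is a simplified stand-in for the benchmark's instrumentation, not the actual code: it assumes 'id' is a UD rdma_cm_id created on event channel 'ch' with a QP and posted receive buffers, and that some other node is already sending to the group; the busy-poll on id->recv_cq stands in for the real receive path.

#include <stdio.h>
#include <time.h>
#include <rdma/rdma_cma.h>

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Time a single join: join call -> join event, and join call -> first
 * message received on the group.  Assumes receives are already posted
 * and a sender is active on the group. */
static int timed_join(struct rdma_cm_id *id, struct rdma_event_channel *ch,
                      struct sockaddr *mcast_addr)
{
    struct timespec t_join, t_event, t_first;
    struct rdma_cm_event *event;
    struct ibv_wc wc;
    int ret;

    clock_gettime(CLOCK_MONOTONIC, &t_join);
    ret = rdma_join_multicast(id, mcast_addr, NULL);
    if (ret)
        return ret;

    /* interval 1: join call -> RDMA_CM_EVENT_MULTICAST_JOIN (~0.001 sec) */
    ret = rdma_get_cm_event(ch, &event);
    if (ret)
        return ret;
    if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
        rdma_ack_cm_event(event);
        return -1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t_event);
    rdma_ack_cm_event(event);

    /* interval 2: join call -> first message actually received on the
     * group (the interval that grows from ~0.1 sec to ~2.7 sec here) */
    do {
        ret = ibv_poll_cq(id->recv_cq, 1, &wc);
    } while (ret == 0);
    if (ret < 0)
        return ret;
    clock_gettime(CLOCK_MONOTONIC, &t_first);

    printf("join->event  %.4f sec\n", elapsed(t_join, t_event));
    printf("join->first  %.4f sec\n", elapsed(t_join, t_first));
    return 0;
}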
Even after all nodes leave all multicast groups, I don't believe there's a requirement for the SA to reprogram the switches immediately. So if the switches, or the configuration of the switches, are part of the problem, I can imagine seeing issues between runs.
When rdma_join_multicast() reports the join event, it means either that the SA has been notified of the join request or, if the port has already joined the group, that a reference count on the group has been incremented. The SA may still require time to program the switch forwarding tables.
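As a rough illustration (this is not code from librdmacm itself; the handler name and messages are made up), a consumer-side handler for that event might look like the following, with the caveat above in mind:

#include <stdio.h>
#include <rdma/rdma_cma.h>

/* Illustrative handler for multicast CM events.  Getting
 * RDMA_CM_EVENT_MULTICAST_JOIN means the SA accepted the join request
 * (or an existing join on this port was reference counted); it does not
 * mean the switch forwarding tables have been programmed yet, so traffic
 * on the group may still lag behind the event. */
static int handle_mcast_event(struct rdma_cm_event *event)
{
    switch (event->event) {
    case RDMA_CM_EVENT_MULTICAST_JOIN:
        printf("joined group: qp_num 0x%x qkey 0x%x\n",
               event->param.ud.qp_num, event->param.ud.qkey);
        return 0;
    case RDMA_CM_EVENT_MULTICAST_ERROR:
        fprintf(stderr, "multicast error, status %d\n", event->status);
        return -1;
    default:
        return 0;    /* not a multicast event */
    }
}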
- Sean
