Finally was able to get the SM switched over from the Cisco SM on the switch to OpenSM running on a node. Responses inline below.

Sean Hefty wrote:
Now for the more interesting part. I'm now able to run on a 128-node machine using OpenSM running on a node (before, I was running on an 8-node machine which I'm told runs the Cisco SM on a Topspin switch). On this machine, if I run my benchmark with two processes per node instead of one (i.e. mpirun -np 16 with 8 nodes), I'm able to join more than 750 groups simultaneously from one QP on each process. To make this stranger, I can join only 4 groups running the same thing on the 8-node machine.

Are the switches and HCAs in the two setups the same? If you run the same SM on both clusters, do you see the same results?

The switches are different: the 8-node machine uses a Topspin switch, the 128-node machine a Mellanox switch. Looking at `ibstat`, the HCAs appear to be the same (MT23108), though the HCAs on the 128-node machine have firmware 3.2.0 while the 8-node machine has 3.5.0. Does this matter?

Running OpenSM on both, I still do not see the same results. Behavior on the 8-node machine now matches the 128-node machine except when running two processes per node: in that case I can join as many groups as I like on the 128-node machine, but on the 8-node machine I am still limited to 4 groups. This makes me think the switch is involved; is that correct?


While doing so, I noticed that the time from calling rdma_join_multicast() to the arrival of the join event stayed fairly constant (in the 0.001 s range), while the time from the join call to actually receiving messages on the group steadily increased from around 0.1 s to around 2.7 s with 750+ groups. Furthermore, this time does not drop back to 0.1 s if I stop the benchmark and run it (or any of my other multicast code) again. This is understandable within a single program run, but the fact that the behavior persists across runs concerns me -- it feels like a bug, but I don't have anything concrete here.

Even after all nodes leave all multicast groups, I don't believe that there's a requirement for the SA to reprogram the switches immediately. So if the switches, or the configuration of the switches, are part of the problem, I can imagine seeing issues between runs.

When rdma_join_multicast() reports the join event, it means either that the SA has been notified of the join request or, if the port has already joined the group, that a reference count on the group has been incremented. The SA may still require time to program the switch forwarding tables.

OK, this makes sense, but I still don't see where all the time is going. Should the fact that the switches haven't been reprogrammed since leaving the groups really affect how long it takes to do a subsequent join? I'm not convinced. (A rough sketch of what I'm timing is at the end of this mail.)

Is this time being consumed by the switches when they are asked to reprogram their tables (I assume some sort of routing table is used internally)? What could they be doing that takes so long? Is it something that a firmware change on the switch could alleviate?
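
For reference, the pattern I'm timing for each group is roughly the following. This is an illustrative sketch only, not the benchmark code itself; it assumes an rdma_cm_id created on an event channel with RDMA_PS_UDP, its address already resolved, and a UD QP attached:

/*
 * Illustrative sketch: join one group and time how long the join
 * event takes to arrive.  Assumes 'id' was created on 'channel' with
 * RDMA_PS_UDP, its address already resolved, and a UD QP attached;
 * 'mcast_addr' is the group address being joined.
 */
#include <stdio.h>
#include <time.h>
#include <rdma/rdma_cma.h>

static int join_and_time(struct rdma_event_channel *channel,
                         struct rdma_cm_id *id,
                         struct sockaddr *mcast_addr)
{
    struct rdma_cm_event *event;
    struct timespec t0, t1;
    int ret;

    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Asynchronously ask the SA to join this port to the group. */
    ret = rdma_join_multicast(id, mcast_addr, NULL);
    if (ret)
        return ret;

    /* Block until the CM reports the outcome of the join. */
    ret = rdma_get_cm_event(channel, &event);
    if (ret)
        return ret;
    if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
        fprintf(stderr, "join failed: event %d, status %d\n",
                event->event, event->status);
        rdma_ack_cm_event(event);
        return -1;
    }
    rdma_ack_cm_event(event);

    clock_gettime(CLOCK_MONOTONIC, &t1);

    /*
     * This is the ~0.001 s interval.  Per the explanation above, the
     * join event only means the SA accepted the request (or bumped a
     * reference count if the port was already a member); the switch
     * forwarding tables may not be programmed yet, which is why the
     * first packet on the group can still take much longer to show up.
     */
    printf("join event after %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

The second number I quoted (0.1 s growing to 2.7 s) is measured from the same starting point to the first receive completion polled on the QP's CQ after the join.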

Andrew