Finally was able to get the SM switched over from the Cisco SM on the switch to OpenSM running on a node. Responses inline below.

Sean Hefty wrote:
Now for the more interesting part. I'm now able to run on a 128-node machine using OpenSM running on a node (before, I was running on an 8-node machine which I'm told runs the Cisco SM on a Topspin switch). On this machine, if I run my benchmark with two processes per node instead of one (i.e. mpirun -np 16 with 8 nodes), I'm able to join more than 750 groups simultaneously from one QP on each process. To make this stranger, I can join only 4 groups running the same thing on the 8-node machine.

Are the switches and HCAs in the two setups the same? If you run the same SM on both clusters, do you see the same results?

The switches are different: the 8-node machine uses a Topspin switch, the 128-node machine a Mellanox switch. Looking at `ibstat`, the HCAs appear to be the same (MT23108), though the HCAs on the 128-node machine have firmware 3.2.0 while the 8-node machine has 3.5.0. Does this matter?

Running OpenSM on both, I still do not see the same results. Behavior on the 8-node machine now matches the 128-node machine except when running two processes per node: in that case I can join as many groups as I like on the 128-node machine, but on the 8-node machine I am still limited to 4 groups. This makes me think the switch is involved; is that correct?


While doing so, I noticed that the time from calling rdma_join_multicast() to the arrival of the join event stayed fairly constant (in the 0.001 s range), while the time from the join call to actually receiving messages on the group steadily increased from around 0.1 s to around 2.7 s with 750+ groups. Furthermore, this time does not drop back to 0.1 s if I stop the benchmark and run it (or any of my other multicast code) again. This is understandable within a single program run, but the fact that the behavior persists across runs concerns me -- it feels like a bug, but I don't have anything concrete here.

Even after all nodes leave all multicast groups, I don't believe that there's a requirement for the SA to reprogram the switches immediately. So if the switches, or the configuration of the switches, are part of the problem, I can imagine seeing issues between runs.

When rdma_join_multicast() reports the join event, it means either that the SA has been notified of the join request or, if the port has already joined the group, that a reference count on the group has been incremented. The SA may still require time to program the switch forwarding tables.

OK, this makes sense, but I still don't see where all the time is going. Should the fact that the switches haven't been reprogrammed since leaving the groups really affect how long it takes to do a subsequent join? I'm not convinced. (A rough sketch of what I'm timing is at the end of this mail.)

Is this time being consumed by the switches when they are asked to reprogram their tables (I assume some sort of routing table is used internally)? What could they be doing that takes so long? Is it something that a firmware change on the switch could alleviate?
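
For reference, the pattern I'm timing for each group is roughly the following. This is an illustrative sketch only, not the benchmark code itself; it assumes an rdma_cm_id created on an event channel with RDMA_PS_UDP, its address already resolved, and a UD QP attached:

/*
 * Illustrative sketch: join one group and time how long the join
 * event takes to arrive.  Assumes 'id' was created on 'channel' with
 * RDMA_PS_UDP, its address already resolved, and a UD QP attached;
 * 'mcast_addr' is the group address being joined.
 */
#include <stdio.h>
#include <time.h>
#include <rdma/rdma_cma.h>

static int join_and_time(struct rdma_event_channel *channel,
                         struct rdma_cm_id *id,
                         struct sockaddr *mcast_addr)
{
    struct rdma_cm_event *event;
    struct timespec t0, t1;
    int ret;

    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Asynchronously ask the SA to join this port to the group. */
    ret = rdma_join_multicast(id, mcast_addr, NULL);
    if (ret)
        return ret;

    /* Block until the CM reports the outcome of the join. */
    ret = rdma_get_cm_event(channel, &event);
    if (ret)
        return ret;
    if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
        fprintf(stderr, "join failed: event %d, status %d\n",
                event->event, event->status);
        rdma_ack_cm_event(event);
        return -1;
    }
    rdma_ack_cm_event(event);

    clock_gettime(CLOCK_MONOTONIC, &t1);

    /*
     * This is the ~0.001 s interval.  Per the explanation above, the
     * join event only means the SA accepted the request (or bumped a
     * reference count if the port was already a member); the switch
     * forwarding tables may not be programmed yet, which is why the
     * first packet on the group can still take much longer to show up.
     */
    printf("join event after %.6f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

The second number I quoted (0.1 s growing to 2.7 s) is measured from the same starting point to the first receive completion polled on the QP's CQ after the join.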

Andrew