Andrew,

On 7/19/07, Andrew Friedley <[EMAIL PROTECTED]> wrote:
> Finally was able to have the SM switched over from Cisco on the switch
> to OpenSM on a node.  Responses inline below..
>
> Sean Hefty wrote:
> >> Now the more interesting part.  I'm now able to run on a 128 node
> >> machine using open SM running on a node (before, I was running on an
> >> 8 node machine which I'm told is running the Cisco SM on a Topspin
> >> switch).  On this machine, if I run my benchmark with two processes
> >> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able
> >> to join > 750 groups simultaneously from one QP on each process.  To
> >> make this stranger, I can join only 4 groups running the same thing
> >> on the 8-node machine.
> >
> > Are the switches and HCAs in the two setups the same?  If you run the
> > same SM on both clusters, do you see the same results?
>
> The switches are different.  The 8 node machine uses a Topspin switch,
> the 128 node machine uses a Mellanox switch.  Looking at `ibstat` the
> HCAs appear to be the same (MT23108), though the HCAs on the 128 node
> machine have firmware 3.2.0, whereas 3.5.0 is on the 8 node machine.
> Does this matter?
>
> Running OpenSM now, I still do not see the same results.  Behavior is
> now the same as on the 128 node machine, except when running two
> processes per node (in which case I can join as many groups as I like
> on the 128 node machine).  On the 8 node machine I am still limited to
> 4 groups in this case.
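For concreteness, the kind of join loop being described might look
roughly like the sketch below.  This is not the actual benchmark; it
assumes librdmacm's rdma_join_multicast() on an rdma_cm id that has
already been created with RDMA_PS_UDP, had its source address resolved,
and had a UD QP created on it, and the group addressing is a made-up
placeholder:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

/* Join 'ngroups' multicast groups from a single rdma_cm id (one UD QP). */
static int join_groups(struct rdma_cm_id *id, struct rdma_event_channel *ch,
                       int ngroups)
{
    struct rdma_cm_event *event;
    struct sockaddr_in6 mcaddr;
    int i;

    for (i = 0; i < ngroups; i++) {
        /* Placeholder addressing: one IPv6 multicast address per group,
         * varied in the low bytes. */
        memset(&mcaddr, 0, sizeof mcaddr);
        mcaddr.sin6_family = AF_INET6;
        mcaddr.sin6_addr.s6_addr[0] = 0xff;
        mcaddr.sin6_addr.s6_addr[1] = 0x0e;
        mcaddr.sin6_addr.s6_addr[14] = (i >> 8) & 0xff;
        mcaddr.sin6_addr.s6_addr[15] = i & 0xff;

        if (rdma_join_multicast(id, (struct sockaddr *) &mcaddr, NULL)) {
            perror("rdma_join_multicast");
            return -1;
        }

        /* Wait for the join to be reported.  As discussed below, this only
         * means the SA accepted the join (or a local reference count was
         * incremented); the switch forwarding tables may not be programmed
         * yet. */
        if (rdma_get_cm_event(ch, &event))
            return -1;
        if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
            fprintf(stderr, "unexpected event %d\n", event->event);
            rdma_ack_cm_event(event);
            return -1;
        }
        rdma_ack_cm_event(event);
    }
    return 0;
}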
I'm not quite parsing what is the same and what is different in the
results (and I presume the only variable is the SM).

> This makes me think the switch is involved, is this correct?

I doubt it.  It is either the end station, the SM, or a combination of
the two.
> >> While doing so I noticed that the time from calling
> >> rdma_join_multicast() to the event arrival stayed fairly constant
> >> (in the .001 sec range), while the time from the join call to
> >> actually receiving messages on the group steadily increased from
> >> around .1 secs to around 2.7 secs with 750+ groups.  Furthermore,
> >> this time does not drop back to .1 secs if I stop the benchmark and
> >> run it (or any of my other multicast code) again.  This is
> >> understandable within a single program run, but the fact that the
> >> behavior persists across runs concerns me -- feels like a bug, but I
> >> don't have much concrete here.
> >
> > Even after all nodes leave all multicast groups, I don't believe that
> > there's a requirement for the SA to reprogram the switches
> > immediately.  So if the switches or the configuration of the switches
> > are part of the problem, I can imagine seeing issues between runs.
> >
> > When rdma_join_multicast() reports the join event, it means either:
> > the SA has been notified of the join request, or, if the port has
> > already joined the group, that a reference count on the group has
> > been incremented.  The SA may still require time to program the
> > switch forwarding tables.
>
> OK, this makes sense, but I still don't see where all the time is
> going.  Should the fact that the switches haven't been reprogrammed
> since leaving the groups really affect how long it takes to do a
> subsequent join?  I'm not convinced.
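To make the two intervals above concrete, a measurement along these
lines would do it.  This is a rough sketch, not Andrew's code; it
assumes a UD QP on the rdma_cm id with receive buffers already posted
and a peer that starts sending to the group once it has joined:

#include <stdio.h>
#include <sys/time.h>
#include <infiniband/verbs.h>
#include <rdma/rdma_cma.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Time (1) join call -> MULTICAST_JOIN event and (2) join call -> first
 * message actually received on the group. */
static void time_one_join(struct rdma_cm_id *id, struct rdma_event_channel *ch,
                          struct sockaddr *mcaddr, struct ibv_cq *recv_cq)
{
    struct rdma_cm_event *event;
    struct ibv_wc wc;
    double t0, t_event, t_data;

    t0 = now_sec();
    if (rdma_join_multicast(id, mcaddr, NULL)) {
        perror("rdma_join_multicast");
        return;
    }

    /* Interval 1: the SA has seen the join (or a local reference count
     * was incremented). */
    if (rdma_get_cm_event(ch, &event))
        return;
    if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
        rdma_ack_cm_event(event);
        return;
    }
    rdma_ack_cm_event(event);
    t_event = now_sec();

    /* Interval 2: traffic only starts arriving once the SM/SA has
     * programmed the switch forwarding tables, which is presumably where
     * the extra time goes. */
    while (ibv_poll_cq(recv_cq, 1, &wc) == 0)
        ;   /* busy-poll; a real benchmark would keep receives posted */
    t_data = now_sec();

    printf("join event after %.6f s, first message after %.6f s\n",
           t_event - t0, t_data - t0);
}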
It takes time for the SM to recalculate the multicast tree.  While
leaves can be lazy, I forget whether joins are synchronous or not.

> Is this time being consumed by the switches when they are asked to
> reprogram their tables (I assume some sort of routing table is used
> internally)?
This is relatively quick compared to the SM's policy for rerouting
multicast based on joins/leaves/group creation/deletion.

-- Hal

> What could they be doing that takes so long to do that?
> Is it something that a firmware change on the switch could alleviate?
>
> Andrew
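For completeness, the leave side mentioned above is just
rdma_leave_multicast() per group; whether and how quickly the SM then
prunes the multicast tree and reprograms the switch tables is up to SM
policy, which is one way behavior could persist between runs.  A minimal
sketch (it assumes the joined group addresses were recorded at join
time):

#include <netinet/in.h>
#include <rdma/rdma_cma.h>

/* Leave every group previously joined on this id.  A successful leave does
 * not imply the SM has already pruned the multicast tree or reprogrammed
 * the switch forwarding tables; per the discussion above, that can happen
 * lazily. */
static void leave_groups(struct rdma_cm_id *id, struct sockaddr_in6 *addrs,
                         int ngroups)
{
    int i;

    for (i = 0; i < ngroups; i++)
        rdma_leave_multicast(id, (struct sockaddr *) &addrs[i]);
}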
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
