Sean Hefty wrote:
>> Linux has quite a sophisticated mechanism to maintain / cache / probe / invalidate / update the network stack's L2 neighbour info.

> Path records are not just L2 info. They contain L4, L3, and L2 info together.

Maybe I was not clear enough: the neighbour cache keeps the stack's link-level (=L2) info. The "IPoIB L2 info" (the neighbour HW address) contains IB L3 (GID) and L4 (QPN) info, and points to the IB L2 (AH) info.

So bottom line: the stack treats the <flags|gid|qpn> creature as L2 info, whereas in IB terms it contains L4/L3/L2 info.
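To make it concrete, this is roughly the 20-octet hardware address IPoIB hands the stack (RFC 4391 layout; the struct name is mine, and the QPN extraction mirrors the kernel's IPOIB_QPN() macro):

#include <stdint.h>

/* The 20-octet IPoIB "L2" address the neighbour cache stores (RFC 4391).
 * In IB terms it carries L4 and L3 info; the real IB L2 info (the AH)
 * is resolved from a path record and is only pointed to by the neighbour. */
struct ipoib_hw_addr {
	uint8_t flags;    /* flag bits (e.g. connected mode)      */
	uint8_t qpn[3];   /* 24-bit QP number -- IB L4            */
	uint8_t gid[16];  /* port GID -- IB L3                    */
} __attribute__((packed)); /* 20 octets == INFINIBAND_ALEN */

/* Extract the QPN, as the kernel's IPOIB_QPN() does */
static inline uint32_t ipoib_qpn(const uint8_t *ha)
{
	return ((uint32_t) ha[1] << 16) | ((uint32_t) ha[2] << 8) | ha[3];
}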

For example, in the Voltaire gen1 stack we had an ib_arp module which was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc.). This module managed some sort of path cache, where IPoIB always asked for a non-cached path while the other ULPs were willing to get a cached path.
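Roughly, the interface looked like the sketch below (names are illustrative, not the actual gen1 code); the point is the per-consumer choice between a fresh and a cached answer:

#include <rdma/ib_sa.h>

/* Hypothetical sketch of the gen1 ib_arp path-cache interface */
typedef void (*path_comp_fn)(int status, struct ib_sa_path_rec *rec,
			     void *context);

int ib_arp_resolve_path(union ib_gid *dgid, int allow_cached,
			path_comp_fn callback, void *context);

/* IPoIB: always go to the SA for a fresh record:
 *	ib_arp_resolve_path(&dgid, 0, ipoib_path_done, neigh);
 * SDP / iSER / Lustre: a cached record is fine:
 *	ib_arp_resolve_path(&dgid, 1, sdp_path_done, conn);
 */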

> IMO, using a cached AH is no different than using a cached path. You're simply mapping the PR data into another structure.
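On the mechanics I agree; the mapping is just a couple of helper calls, something like this sketch (error handling omitted):

#include <rdma/ib_verbs.h>
#include <rdma/ib_sa.h>

/* PR -> AH: the cached-AH-vs-cached-path equivalence in code */
static struct ib_ah *ah_from_path(struct ib_pd *pd, struct ib_device *dev,
				  u8 port_num, struct ib_sa_path_rec *rec)
{
	struct ib_ah_attr ah_attr;

	/* copy the L2-relevant PR fields (dlid, sl, GRH, rate, ...) */
	if (ib_init_ah_from_path(dev, port_num, rec, &ah_attr))
		return NULL;

	return ib_create_ah(pd, &ah_attr);
}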

Still: on the one hand the stack can't allow itself to do an L3 --> L2 (ARP) resolution for each packet xmit, but on the other hand it has this mechanism to probe / invalidate / etc. its L2 cache. So my basic claim is that if the stack decided to renew its L2 info, it would be an incorrect design to keep using cached IB L2 info.
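For instance (a sketch; ib_path_cache_flush() is a hypothetical helper), the IB side could hook the stack's netevent notifications so that a neighbour renewal also invalidates the IB info derived from it:

#include <linux/notifier.h>
#include <net/netevent.h>
#include <net/neighbour.h>

/* hypothetical: drop our cached path record / AH for this neighbour */
extern void ib_path_cache_flush(struct neighbour *neigh);

static int ib_path_netevent(struct notifier_block *self,
			    unsigned long event, void *ctx)
{
	/* the stack renewed its L2 info, so the IB L2 info (AH)
	 * derived from it must not outlive the stack's own view */
	if (event == NETEVENT_NEIGH_UPDATE)
		ib_path_cache_flush((struct neighbour *) ctx);
	return NOTIFY_DONE;
}

static struct notifier_block ib_path_nb = {
	.notifier_call = ib_path_netevent,
};

/* register_netevent_notifier(&ib_path_nb) at module load time */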

> We're ignoring the problem here, and that is that a centralized SA doesn't scale. MPI stacks have largely ignored this problem by simply not doing path record queries. Path information is often hard-coded, with QPN data exchanged out of band over sockets (often over Ethernet).
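(For reference, that out-of-band exchange is typically just a fixed struct swapped over an existing socket before the QPs are moved to RTR; the sketch below shows the idea, the struct is illustrative.)

#include <stdint.h>
#include <unistd.h>

/* hypothetical wire format: the IB addressing info MPI ranks swap
 * over an existing (e.g. Ethernet/TCP) socket instead of querying the SA */
struct ib_conn_info {
	uint32_t qpn;	/* QP number of the local QP */
	uint32_t psn;	/* initial packet sequence number */
	uint16_t lid;	/* local LID, from ibv_query_port() */
} __attribute__((packed)); /* packed for a stable wire format */

/* swap connection info with a peer over an already-connected socket */
static int exchange_conn_info(int sockfd, const struct ib_conn_info *local,
			      struct ib_conn_info *remote)
{
	if (write(sockfd, local, sizeof(*local)) != sizeof(*local))
		return -1;
	if (read(sockfd, remote, sizeof(*remote)) != sizeof(*remote))
		return -1;
	return 0; /* remote->{qpn,psn,lid} now feed the QP transition to RTR */
}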

I don't think that trying to separate the IPoIB flow from the MPI flow is ignoring the problem. These are different settings: IPoIB is a network device working under the net stack, which has its own design philosophy. Native MPI implementations over IB are not tied to the stack; it's different.

> We've seen problems running large MPI jobs without PR caching. I know that Silverstorm/QLogic did as well. And apparently Voltaire hit the same type of problem, since you added a caching module. (Did Mellanox and Topspin/Cisco create PR caches as well?) At least three companies working on IB came up with the same solution. What is the objection to the current patch set?

Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not-- using cached IB L2 info, whereas MPI, Lustre, etc. did.

I am willing to go with the local SA module coming to serve large MPI jobs, so you load it as a prerequisite to spawning a large all-to-all job.

But I think the default for IPoIB needs to be the use of non-cached PRs.

If you want to support the uncommon case of a huge-MPI-job-over-IPoIB, I am fine with adding a param to IPoIB telling it to request cached PRs from the ib_sa module.
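Something like this (the parameter name is illustrative only):

#include <linux/module.h>

/* hypothetical IPoIB module parameter: when set, path queries are issued
 * with the "accept cached PR" flag so the local SA cache may answer them;
 * default is off, i.e. IPoIB keeps requesting fresh path records */
static int use_cached_pr;
module_param(use_cached_pr, int, 0444);
MODULE_PARM_DESC(use_cached_pr,
		 "Allow path record queries to be served from the SA cache (default 0)");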

Or.
