Sean Hefty wrote:
>> Linux has a quite sophisticated mechanism to maintain / cache / probe
>> / invalidate / update the network stack L2 neighbour info.
> Path records are not just L2 info. They contain L4, L3, and L2 info
> together.
Maybe I was not clear enough: the neighbour cache keeps the stack's Link
(=L2) level info. The "IPoIB L2 info" (the neighbour HW address)
contains IB L3 (GID) & L4 (QPN) info and points to the IB L2 (AH) info.
So, bottom line, the stack considers the <flags|gid|qpn> creature to be
L2 info, whereas in IB terms it contains L4/L3/L2 info.
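To make this concrete, here is a sketch of the 20-byte IPoIB hardware
address as the stack sees it (the struct name is mine, for illustration
only; in the kernel it is just a byte array of INFINIBAND_ALEN == 20):

#include <linux/types.h>

struct ipoib_hw_addr {		/* illustrative name, not a kernel type */
	u8 flags;		/* reserved / flag bits			 */
	u8 qpn[3];		/* IB L4: 24-bit queue pair number	 */
	u8 gid[16];		/* IB L3: port GID			 */
} __attribute__((packed));	/* 4 + 16 = 20 bytes total		 */

Note that the AH, the actual IB L2 object, is not in the address at
all; it has to be resolved from the GID through a path record query.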
For example, in the Voltaire gen1 stack we had an ib arp module which
was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc).
This module managed some sort of path cache, where IPoIB always asked
for a non-cached path while the other ULPs were willing to accept a
cached path.
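Roughly, the interface looked like the sketch below (names are
illustrative, from memory, and not the actual gen1 code):

#include <rdma/ib_sa.h>

/* ULPs (SDP, iSER, Lustre) pass cached = 1 and tolerate a possibly
 * stale path; IPoIB passes cached = 0 to force a fresh SA query.
 */
int ibarp_resolve_path(union ib_gid *dgid, int cached,
		       void (*done)(int status,
				    struct ib_sa_path_rec *rec,
				    void *ctx),
		       void *ctx);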
> IMO, using a cached AH is no different than using a cached path. You're
> simply mapping the PR data into another structure.
On the one hand the stack can't allow itself to do L3 --> L2 (ARP)
resolution for each packet xmit, but on the other hand it has this
mechanism to probe / invalidate / etc its L2 cache. So my basic claim is
that if the stack decided to renew its L2 info, it would be an incorrect
design to use cached IB L2 info.
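In code terms, when the stack renews a neighbour I'd expect IPoIB to go
back to the SA, along the lines of the sketch below. The call itself is
the existing ib_sa_path_rec_get(); the surrounding driver state (priv
fields, the query pointer) is simplified here for illustration:

#include <rdma/ib_sa.h>

extern struct ib_sa_client ipoib_sa_client;

static void path_done(int status, struct ib_sa_path_rec *resp, void *ctx)
{
	/* on success: build a new AH from resp and update the neighbour */
}

/* issue a fresh (non-cached) SA query instead of reusing an old AH */
static int renew_path(struct ipoib_dev_priv *priv, union ib_gid *dgid)
{
	struct ib_sa_path_rec rec = {
		.dgid      = *dgid,
		.sgid      = priv->local_gid,
		.numb_path = 1,
		.pkey      = cpu_to_be16(priv->pkey),
	};

	return ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port,
				  &rec,
				  IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
				  IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY,
				  1000, GFP_ATOMIC,
				  path_done, priv, &priv->path_query);
}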
> We're ignoring the problem here, and that is that a centralized SA
> doesn't scale. MPI stacks have largely ignored this problem by simply
> not doing path record queries. Path information is often hard-coded,
> with QPN data exchanged out of band over sockets (often over Ethernet).
I don't think that trying to separate the IPoIB flow from the MPI flow
is ignoring the problem. These are different settings: IPoIB is a
network device working under the net stack, which has its own design
philosophy; native MPI implementations over IB are not tied to the
stack, it's different.
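For reference, the out-of-band exchange Sean mentions typically looks
like the sketch below (the wire format and names are illustrative):

#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

struct oob_conn_info {		/* illustrative wire format */
	uint32_t qpn;		/* queue pair number        */
	uint16_t lid;		/* local identifier         */
};

/* swap QPN/LID with the peer over an existing TCP socket, no SA query */
static int exchange_conn_info(int sockfd, const struct oob_conn_info *local,
			      struct oob_conn_info *remote)
{
	struct oob_conn_info wire = {
		.qpn = htonl(local->qpn),
		.lid = htons(local->lid),
	};

	if (write(sockfd, &wire, sizeof(wire)) != sizeof(wire))
		return -1;
	if (read(sockfd, &wire, sizeof(wire)) != sizeof(wire))
		return -1;

	remote->qpn = ntohl(wire.qpn);
	remote->lid = ntohs(wire.lid);
	return 0;	/* each side can now create an AH and move its QP
			   to RTR/RTS without touching the SA */
}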
> We've seen problems running large MPI jobs without PR caching. I know
> that Silverstorm/QLogic did as well. And apparently Voltaire hit the
> same type of problem, since you added a caching module. (Did Mellanox
> and Topspin/Cisco create PR caches as well?) At least three companies
> working on IB came up with the same solution. What is the objection to
> the current patch set?
Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not--
using cached IB L2 info, whereas MPI, Lustre, etc did.
I am willing to go with the local SA module coming to serve large MPI
jobs, so you load it as a prerequisite to spawning a large all-to-all
job. But I think the default for IPoIB needs to be usage of non-cached
PRs. If you want to support the uncommon case of huge-mpi-job-over-ipoib,
I am fine with adding a param to IPoIB telling it to request cached PRs
from the ib_sa module.
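Something as simple as the following would do; the parameter name and
default are just my suggestion:

#include <linux/module.h>

/* hypothetical knob -- default 0 keeps IPoIB always querying the SA */
static int ipoib_use_cached_pr;
module_param_named(use_cached_pr, ipoib_use_cached_pr, int, 0644);
MODULE_PARM_DESC(use_cached_pr,
		 "Accept cached path records from ib_sa (default: 0)");

IPoIB's path resolution code would then pass this down when issuing its
queries to ib_sa.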
Or.