Sean Hefty wrote:
>> Linux has a quite sophisticated mechanism to maintain / cache / probe
>> / invalidate / update the network stack L2 neighbour info.
> Path records are not just L2 info. They contain L4, L3, and L2 info
> together.
Maybe I was not clear enough: the neighbour cache keeps the stack's Link
(=L2) level info. The "IPoIB L2 info" (the neighbour HW address)
contains IB L3 (GID) & L4 (QPN) info and points to the IB L2 (AH) info.
So, bottom line, the stack considers the <flags|gid|qpn> creature to be
L2 info, whereas in IB terms it contains L4/L3/L2 info.
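To make this concrete, here is a sketch of the 20-byte IPoIB hardware
address as the stack sees it (the struct name is mine, for illustration
only; in the kernel it is just a byte array of INFINIBAND_ALEN == 20):

#include <linux/types.h>

struct ipoib_hw_addr {		/* illustrative name, not a kernel type */
	u8 flags;		/* reserved / flag bits			 */
	u8 qpn[3];		/* IB L4: 24-bit queue pair number	 */
	u8 gid[16];		/* IB L3: port GID			 */
} __attribute__((packed));	/* 4 + 16 = 20 bytes total		 */

Note that the AH, the actual IB L2 object, is not in the address at
all; it has to be resolved from the GID through a path record query.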
For example, in the Voltaire gen1 stack we had an ib arp module which
was used by both IPoIB and native IB ULPs (SDP, iSER, Lustre, etc).
This module managed some sort of path cache, where IPoIB always asked
for a non-cached path while the other ULPs were willing to accept a
cached path.
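Roughly, the interface looked like the sketch below (names are
illustrative, from memory, and not the actual gen1 code):

#include <rdma/ib_sa.h>

/* ULPs (SDP, iSER, Lustre) pass cached = 1 and tolerate a possibly
 * stale path; IPoIB passes cached = 0 to force a fresh SA query.
 */
int ibarp_resolve_path(union ib_gid *dgid, int cached,
		       void (*done)(int status,
				    struct ib_sa_path_rec *rec,
				    void *ctx),
		       void *ctx);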
> IMO, using a cached AH is no different than using a cached path. You're
> simply mapping the PR data into another structure.
On the one hand the stack can't allow itself to do L3 --> L2 (ARP)
resolution for each packet xmit, but on the other hand it has this
mechanism to probe / invalidate / etc its L2 cache. So my basic claim is
that if the stack decided to renew its L2 info, it would be an incorrect
design to use cached IB L2 info.
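In code terms, when the stack renews a neighbour I'd expect IPoIB to go
back to the SA, along the lines of the sketch below. The call itself is
the existing ib_sa_path_rec_get(); the surrounding driver state (priv
fields, the query pointer) is simplified here for illustration:

#include <rdma/ib_sa.h>

extern struct ib_sa_client ipoib_sa_client;

static void path_done(int status, struct ib_sa_path_rec *resp, void *ctx)
{
	/* on success: build a new AH from resp and update the neighbour */
}

/* issue a fresh (non-cached) SA query instead of reusing an old AH */
static int renew_path(struct ipoib_dev_priv *priv, union ib_gid *dgid)
{
	struct ib_sa_path_rec rec = {
		.dgid      = *dgid,
		.sgid      = priv->local_gid,
		.numb_path = 1,
		.pkey      = cpu_to_be16(priv->pkey),
	};

	return ib_sa_path_rec_get(&ipoib_sa_client, priv->ca, priv->port,
				  &rec,
				  IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
				  IB_SA_PATH_REC_NUMB_PATH | IB_SA_PATH_REC_PKEY,
				  1000, GFP_ATOMIC,
				  path_done, priv, &priv->path_query);
}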
> We're ignoring the problem here, and that is that a centralized SA
> doesn't scale. MPI stacks have largely ignored this problem by simply
> not doing path record queries. Path information is often hard-coded,
> with QPN data exchanged out of band over sockets (often over Ethernet).
I don't think that trying to separate the IPoIB flow from the MPI flow
is ignoring the problem. These are different settings: IPoIB is a
network device working under the net stack, which has its own design
philosophy; native MPI implementations over IB are not tied to the
stack, it's different.
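For reference, the out-of-band exchange Sean mentions typically looks
like the sketch below (the wire format and names are illustrative):

#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

struct oob_conn_info {		/* illustrative wire format */
	uint32_t qpn;		/* queue pair number        */
	uint16_t lid;		/* local identifier         */
};

/* swap QPN/LID with the peer over an existing TCP socket, no SA query */
static int exchange_conn_info(int sockfd, const struct oob_conn_info *local,
			      struct oob_conn_info *remote)
{
	struct oob_conn_info wire = {
		.qpn = htonl(local->qpn),
		.lid = htons(local->lid),
	};

	if (write(sockfd, &wire, sizeof(wire)) != sizeof(wire))
		return -1;
	if (read(sockfd, &wire, sizeof(wire)) != sizeof(wire))
		return -1;

	remote->qpn = ntohl(wire.qpn);
	remote->lid = ntohs(wire.lid);
	return 0;	/* each side can now create an AH and move its QP
			   to RTR/RTS without touching the SA */
}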
> We've seen problems running large MPI jobs without PR caching. I know
> that Silverstorm/QLogic did as well. And apparently Voltaire hit the
> same type of problem, since you added a caching module. (Did Mellanox
> and Topspin/Cisco create PR caches as well?) At least three companies
> working on IB came up with the same solution. What is the objection to
> the current patch set?
Again, as I stated above, in the Voltaire gen1 stack IPoIB was --not--
using cached IB L2 info, whereas MPI, Lustre, etc did.
I am willing to go with the local SA module coming to serve large MPI
jobs, so you load it as a prerequisite to spawning a large all-to-all
job. But I think the default for IPoIB needs to be usage of non-cached
PRs. If you want to support the uncommon case of huge-mpi-job-over-ipoib,
I am fine with adding a param to IPoIB telling it to request cached PRs
from the ib_sa module.
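Something as simple as the following would do; the parameter name and
default are just my suggestion:

#include <linux/module.h>

/* hypothetical knob -- default 0 keeps IPoIB always querying the SA */
static int ipoib_use_cached_pr;
module_param_named(use_cached_pr, ipoib_use_cached_pr, int, 0644);
MODULE_PARM_DESC(use_cached_pr,
		 "Accept cached path records from ib_sa (default: 0)");

IPoIB's path resolution code would then pass this down when issuing its
queries to ib_sa.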
Or.