Hi Sean,
This is a great initiative - tackling an important issue.
I am glad you took this on.
Please see below.
Sean Hefty wrote:
I've been given the task of trying to come up with an implementation for
an SA cache. The intent is to increase the scalability and performance
of the openib stack. My current thoughts on the implementation are
below. Any feedback is welcome.
To keep the design as flexible as possible, my plan is to implement the
cache in userspace. The interface to the cache would be via MADs.
Clients would send their queries to the sa_cache instead of the SA
itself. The format of the MADs would be essentially identical to those
used to query the SA itself. Response MADs would contain any requested
information. If the cache could not satisfy a request, the sa_cache
would query the SA, update its cache, then return a reply.
* I think the idea of using MADs to interface with the cache is very good.
* User space implementation:
This might be a good tradeoff between coding and debugging effort versus the
impact on the number of connections per second. I hope the performance impact
will not be too big. Maybe we can take the path of implementing in user space
first, and if the performance penalty turns out to be too high we can port it
to the kernel.
* Regarding the sentence: "Clients would send their queries to the sa_cache
instead of the SA":
I would propose that an "SA MAD send switch" be implemented in the core. Such a
switch would enable plugging in the SA cache (I would prefer calling it an "SA
local agent" due to its extended functionality). Once plugged in, this SA local
agent would be forwarded all outgoing SA queries. Once it handles a MAD, it
should be able to inject the response back through the core "SA MAD send
switch" as if it had arrived from the wire.
The benefits that I see with this approach are:
+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages. Requests that it cannot
handle would just be forwarded to the SA.
+ The sa_cache could be implemented on each host, or a select number of
hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.
Some drawbacks specific to this method are:
- The MAD interface will result in additional data copies and userspace
to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make
assumptions about its location.
The proposal for an "SA MAD send switch" in the core would resolve this issue:
no client change would be required, since all MADs are sent through the core,
which would redirect them to the SA local agent ...
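To make the intent a bit more concrete, here is a rough sketch of what such a
hook could look like. All of the names below (ib_sa_agent_ops,
ib_register_sa_agent, ib_sa_agent_inject_response) are made up for
illustration; they are not existing ib_core interfaces:

/*
 * Hypothetical "SA MAD send switch" hook - illustration only; these are
 * not existing ib_core interfaces.
 */
struct ib_device;                       /* existing core type */
struct ib_mad;                          /* existing core type */

struct ib_sa_agent_ops {
    /*
     * Called by the core for every outgoing SA MAD.  Return 0 if the
     * agent consumed the query (it will inject a reply later), or a
     * negative value to let the core forward the query to the real SA.
     */
    int (*handle_outgoing)(struct ib_device *device, unsigned int port_num,
                           struct ib_mad *mad, void *context);
    void *context;
};

/* Plug an "SA local agent" into the core SA send path, one per port. */
int ib_register_sa_agent(struct ib_device *device, unsigned int port_num,
                         struct ib_sa_agent_ops *ops);
void ib_unregister_sa_agent(struct ib_device *device, unsigned int port_num);

/*
 * Inject a reply MAD back through the switch as if it had arrived from the
 * wire, so the requester sees a normal SA response.
 */
int ib_sa_agent_inject_response(struct ib_device *device, unsigned int port_num,
                                struct ib_mad *reply);

The point is only that the agent sits on the existing SA send path, so clients
keep using the normal SA query interface unchanged.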
Functional requirements:
* It is clear that the first SA query to cache is PathRecord. If a new client
wants to connect to another node, a new PathRecord query will not need to be
sent to the SA. However, recent work on QoS has pointed out that under some QoS
schemes a PathRecord should not be shared by different clients, or even by
different connections. There are several ways to make such QoS schemes scale.
Since this is a different discussion topic, I only bring it up so that we take
into account that caching might also need to be done by a complex key (not just
SRC/DST ...)
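Just to illustrate what I mean by a complex key, a minimal sketch could look
like the following (the exact set of fields is only an assumption; the real
key depends on the QoS scheme chosen):

#include <stdint.h>
#include <string.h>

/* Illustrative composite cache key - the exact fields are an assumption. */
struct pr_cache_key {
    uint8_t  sgid[16];      /* source GID */
    uint8_t  dgid[16];      /* destination GID */
    uint16_t pkey;          /* partition */
    uint8_t  sl;            /* service level selected by the QoS scheme */
    uint64_t service_id;    /* or any other per-client/per-connection input */
};

/* Compare on all QoS-relevant fields, not just SRC/DST. */
static int pr_cache_key_cmp(const struct pr_cache_key *a,
                            const struct pr_cache_key *b)
{
    int r = memcmp(a->sgid, b->sgid, sizeof(a->sgid));
    if (r)
        return r;
    r = memcmp(a->dgid, b->dgid, sizeof(a->dgid));
    if (r)
        return r;
    if (a->pkey != b->pkey)
        return a->pkey < b->pkey ? -1 : 1;
    if (a->sl != b->sl)
        return a->sl < b->sl ? -1 : 1;
    if (a->service_id != b->service_id)
        return a->service_id < b->service_id ? -1 : 1;
    return 0;
}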
* Forgive me for bringing the following issue up to the group over and over:
Multicast Join/Leave should be reference counted. The "SA local agent" could be
the right place for doing this kind of reference counting (although if it does
that, it probably needs to be located in the kernel, to enable cleanup after
killed processes).
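A minimal sketch of the reference counting I have in mind (the names are made
up; a kernel implementation would also have to track the owning process so it
can clean up after a kill):

#include <stdint.h>

/* Per multicast group state kept by the SA local agent (illustration only). */
struct mcast_group_ref {
    uint8_t  mgid[16];     /* multicast GID */
    unsigned refcount;     /* number of local joiners */
    int      joined;       /* set once the SA join has actually completed */
};

/* First local joiner triggers the real SA join; later joiners just count. */
static void mcast_ref_join(struct mcast_group_ref *g)
{
    if (g->refcount++ == 0)
        g->joined = 1;     /* here: send MCMemberRecord Set() to the SA */
}

/* Last local leaver triggers the real SA leave. */
static void mcast_ref_leave(struct mcast_group_ref *g)
{
    if (g->refcount > 0 && --g->refcount == 0)
        g->joined = 0;     /* here: send MCMemberRecord Delete() to the SA */
}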
* Similarly, "Client re-registration" could be made transparent to clients.
Cache Invalidation:
Several discussions about PathRecord invalidation have been spawned in the
past. IMO, it is enough to be notified about the death of local processes,
remote port availability (traps 64/65) and multicast group availability
(traps 66/67) in order to invalidate SA cache information. So each SA agent
could register to obtain this data. But that solution does not scale nicely,
as the SA needs to send a notification to all nodes (although it is reliable -
it could resend until Repressed). Moreover, the current IBTA definition of
InformInfo (the event forwarding mechanism) does not allow for multicast of
Report(Notice). The reason is that registration for event forwarding is done
with Set(InformInfo), which uses the requester's QP and LID as the address for
sending the matching report. A simple way around that limitation could be to
let the SM "pre-register" a well-known multicast group as a target for event
forwarding. One issue, though, would be that UD multicast is not reliable and
some notifications could get lost. A notification sequence number could be
used to catch these missed notifications eventually.
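To illustrate the last point, here is a rough sketch of how an SA local agent
could consume such Report(Notice) traps, using an (assumed, not currently
IBTA-defined) notification sequence number to detect lost multicast notices;
the sa_cache_* hooks are placeholders:

#include <stdint.h>

#define TRAP_PORT_IN_SERVICE        64
#define TRAP_PORT_OUT_OF_SERVICE    65
#define TRAP_MCAST_GROUP_CREATED    66
#define TRAP_MCAST_GROUP_DELETED    67

/* Placeholder cache maintenance hooks (not implemented here). */
void sa_cache_flush_all(void);
void sa_cache_invalidate_paths_to(const uint8_t gid[16]);
void sa_cache_invalidate_mcast(const uint8_t gid[16]);

static uint32_t last_seq_seen;

static void handle_sa_notice(uint16_t trap_num, uint32_t seq,
                             const uint8_t gid[16])
{
    /*
     * A gap in the sequence numbers means some notices were lost over UD
     * multicast; fall back to revalidating (or flushing) the whole cache.
     */
    if (seq != last_seq_seen + 1)
        sa_cache_flush_all();
    last_seq_seen = seq;

    switch (trap_num) {
    case TRAP_PORT_OUT_OF_SERVICE:
        sa_cache_invalidate_paths_to(gid);  /* drop PathRecords toward that port */
        break;
    case TRAP_MCAST_GROUP_DELETED:
        sa_cache_invalidate_mcast(gid);     /* drop cached MCMemberRecords */
        break;
    case TRAP_PORT_IN_SERVICE:
    case TRAP_MCAST_GROUP_CREATED:
        /* nothing cached becomes stale; new records are fetched on demand */
        break;
    }
}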
Eitan