Hi Sean,

This is a great initiative - tackling an important issue.
I am glad you took this on.

Please see below.

Sean Hefty wrote:
I've been given the task of trying to come up with an implementation for an SA cache. The intent is to increase the scalability and performance of the openib stack. My current thoughts on the implementation are below. Any feedback is welcome.

To keep the design as flexible as possible, my plan is to implement the cache in userspace. The interface to the cache would be via MADs. Clients would send their queries to the sa_cache instead of the SA itself. The format of the MADs would be essentially identical to those used to query the SA itself. Response MADs would contain any requested information. If the cache could not satisfy a request, the sa_cache would query the SA, update its cache, then return a reply.
* I think the idea of using MADs to interface with the cache is very good.
* User space implementation:
  This also might be a good tradeoff between coding and debugging effort
  versus the impact on the number of connections per second. I hope the
  impact on performance will not be too big. Maybe we can take the path of
  implementing in user space and, if the performance penalty turns out to be
  too high, port to kernel.
* Regarding the sentence "Clients would send their queries to the sa_cache
  instead of the SA":
  I would propose that an "SA MAD send switch" be implemented in the core.
  Such a switch would enable plugging in the SA cache (I would prefer calling
  it the "SA local agent" due to its extended functionality). Once plugged
  in, this SA local agent would be forwarded all outgoing SA queries. Once it
  handles a MAD, it would be able to inject the response through the core
  "SA MAD send switch" as if it had arrived from the wire. A minimal sketch
  of this plug-in point follows the drawbacks list below.

The benefits that I see with this approach are:

+ Clients would only need to send requests to the sa_cache.
+ The sa_cache can be implemented in stages. Requests that it cannot handle would just be forwarded to the SA.
+ The sa_cache could be implemented on each host, or a select number of hosts.
+ The interface to the sa_cache is similar to that used by the SA.
+ The cache would use virtual memory and could be saved to disk.

Some drawbacks specific to this method are:

- The MAD interface will result in additional data copies and userspace to kernel transitions for clients residing on the local system.
- Clients require a mechanism to locate the sa_cache, or need to make assumptions about its location.
The proposal for an "SA MAD send switch" in the core would resolve this issue:
no client change would be required, as all MADs are sent through the core,
which would redirect them to the SA local agent.
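
To make this plug-in point more concrete, here is a minimal C sketch of what
such an "SA MAD send switch" could look like. This is only an illustration of
the idea: none of these names (sa_mad_switch_send, sa_local_agent, ...) exist
in the core today, and the MAD type is a stand-in so the example is
self-contained.

#include <stddef.h>

struct sa_mad { unsigned char data[256]; };   /* stand-in for an SA-class MAD */

enum sa_agent_verdict {
    SA_AGENT_CONSUMED,   /* agent answered (or will answer) the query itself */
    SA_AGENT_PASS        /* forward the MAD to the real SA unchanged */
};

/* The plug-in interface for the proposed "SA local agent". */
struct sa_local_agent {
    enum sa_agent_verdict (*outgoing_sa_mad)(struct sa_local_agent *agent,
                                             const struct sa_mad *query);
    /* Called by the agent to hand a reply back to the core, which then
     * delivers it to the original requester as if it arrived off the wire. */
    void (*inject_response)(const struct sa_mad *response, size_t len);
};

static struct sa_local_agent *g_agent;        /* NULL when nothing is plugged in */

/* Stand-in for the normal wire path to the SA. */
static int send_to_real_sa(const struct sa_mad *query)
{
    (void)query;
    return 0;
}

/* Every outgoing SA query funnels through here ("SA MAD send switch"). */
int sa_mad_switch_send(const struct sa_mad *query)
{
    if (g_agent &&
        g_agent->outgoing_sa_mad(g_agent, query) == SA_AGENT_CONSUMED)
        return 0;                             /* reply will be injected later */
    return send_to_real_sa(query);            /* cache miss or no agent */
}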

Functional requirements:
* It is clear that the first SA query to cache is PathRecord, so that when a
  new client wants to connect to another node, a new PathRecord query will
  not need to be sent to the SA. However, recent work on QoS has pointed out
  that under some QoS schemes a PathRecord should not be shared by different
  clients, or even by different connections. There are several ways to make
  such a QoS scheme scale. Since this is a different discussion topic, I only
  bring it up so that we take into account that caching might also need to be
  keyed by a more complex key (not just SRC/DST ...); see the sketch after
  this list.
* Forgive me for bringing the following issue to the group over and over:
  Multicast Join/Leave should be reference counted. The "SA local agent"
  could be the right place for doing this kind of reference counting
  (although if it does that, it probably needs to be located in the kernel,
  to enable cleanup after killed processes).
* Similarly, "Client re-registration" could be made transparent to clients.
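
To illustrate the two points above, here is a rough sketch in C. The field
and function names are mine, for illustration only, and are not part of any
existing OpenIB API: a PathRecord cache key that is wider than plain SRC/DST,
and a reference-counted multicast membership entry where only the first join
and the last leave are turned into real SA transactions.

#include <stdint.h>
#include <string.h>

/* A cache key wider than SRC/DST, so a QoS-aware SM can hand different
 * clients (or even different connections) different paths without the
 * cache collapsing them into one entry. The QoS discriminator fields
 * here are assumptions, for illustration only. */
struct pr_cache_key {
    uint8_t  src_gid[16];
    uint8_t  dst_gid[16];
    uint16_t pkey;
    uint64_t service_id;      /* example QoS discriminator */
    uint32_t flow_label;
};

/* Reference-counted multicast membership. */
struct mcast_entry {
    uint8_t mgid[16];
    int     refcount;
};

void mcast_join(struct mcast_entry *e, const uint8_t mgid[16])
{
    if (e->refcount++ == 0) {
        memcpy(e->mgid, mgid, 16);
        /* 0 -> 1: send the real MCMemberRecord join to the SA here */
    }
}

void mcast_leave(struct mcast_entry *e)
{
    if (--e->refcount == 0) {
        /* 1 -> 0: send the real MCMemberRecord delete to the SA here */
    }
}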

Cache Invalidation:
Several discussions about PathRecord invalidation have been spawned in the
past. IMO, it is enough to be notified about the death of local processes,
remote port availability (traps 64/65), and multicast group availability
(traps 66/67) in order to invalidate SA cache information. So each SA agent
could register to obtain this data. But that solution does not scale nicely,
as the SA needs to send a notification to every node (although it is
reliable: it could resend until repressed).
However, the current IBTA definition of InformInfo (the event forwarding
mechanism) does not allow for multicast of Report(Notice). The reason is
that registration for event forwarding is done with Set(InformInfo), which
uses the requester's QP and LID as the address for sending the matching
report. A simple way around that limitation could be to enable the SM to
"pre-register" a well-known multicast group as the target for event
forwarding. One issue, though, would be that UD multicast is not reliable
and some notifications could get lost. A notification sequence number could
be used to catch these missed notifications eventually.
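
As a rough illustration of the sequence-number idea (the sequence number
field itself is part of this proposal, not of the current Notice format),
the receiving side could be as simple as:

#include <stdint.h>
#include <stdbool.h>

struct notice_rx_state {
    uint32_t last_seq;    /* last sequence number processed */
    bool     synced;      /* false until the first notice is seen */
};

/* Returns true when one or more notices were missed on the unreliable
 * UD multicast path, i.e. the cache should resynchronize (re-query the
 * SA or flush the entries covered by traps 64/65/66/67). */
bool notice_gap_detected(struct notice_rx_state *st, uint32_t seq)
{
    bool gap = st->synced && seq != st->last_seq + 1;

    st->last_seq = seq;
    st->synced = true;
    return gap;
}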

Eitan