Hi Eitan,

On Thu, 2006-01-05 at 07:27, Eitan Zahavi wrote:
> Hi Sean,
>
> This is a great initiative - tackling an important issue.
> I am glad you took this on.
>
> Please see below.
>
> Sean Hefty wrote:
> > I've been given the task of trying to come up with an implementation for
> > an SA cache. The intent is to increase the scalability and performance
> > of the openib stack. My current thoughts on the implementation are
> > below. Any feedback is welcome.
> >
> > To keep the design as flexible as possible, my plan is to implement the
> > cache in userspace. The interface to the cache would be via MADs.
> > Clients would send their queries to the sa_cache instead of the SA
> > itself. The format of the MADs would be essentially identical to those
> > used to query the SA itself. Response MADs would contain any requested
> > information. If the cache could not satisfy a request, the sa_cache
> > would query the SA, update its cache, then return a reply.
>
> * I think the idea of using MADs to interface with the cache is very good.
> * User space implementation:
>   This might also be a good tradeoff between coding and debugging effort
>   versus the impact on the number of connections per second. I hope the
>   performance impact will not be too big. Maybe we can take the path of
>   implementing in user space and, if the performance penalty turns out to
>   be too high, port to the kernel.
> * Regarding the sentence "Clients would send their queries to the sa_cache
>   instead of the SA":
>   I would propose that an "SA MAD send switch" be implemented in the core.
>   Such a switch would enable plugging in the SA cache (I would prefer
>   calling it the "SA local agent" due to its extended functionality). Once
>   plugged in, this "SA local agent" would be forwarded all outgoing SA
>   queries. Once it handles a MAD, it should be able to inject the response
>   through the core "SA MAD send switch" as if it had arrived from the wire.
>
> > The benefits that I see with this approach are:
> >
> > + Clients would only need to send requests to the sa_cache.
> > + The sa_cache can be implemented in stages. Requests that it cannot
> >   handle would just be forwarded to the SA.
> > + The sa_cache could be implemented on each host, or on a select number
> >   of hosts.
> > + The interface to the sa_cache is similar to that used by the SA.
> > + The cache would use virtual memory and could be saved to disk.
> >
> > Some drawbacks specific to this method are:
> >
> > - The MAD interface will result in additional data copies and userspace
> >   to kernel transitions for clients residing on the local system.
> > - Clients require a mechanism to locate the sa_cache, or need to make
> >   assumptions about its location.
>
> The proposal for the "SA MAD send switch" in the core will resolve this
> issue. No client change will be required, as all MADs are sent through the
> core, which will redirect them to the SA agent ...

I see this as more granular than a complete switch for the entire class -
more like on a per attribute basis.
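Something along the following lines is what I am picturing. The structure and
function names below are invented just to illustrate per attribute
registration (this is not a patch against the current mad.c/sa_query.c, and
the real interface would no doubt end up looking different):

#include <rdma/ib_mad.h>
#include <rdma/ib_sa.h>

/*
 * Hypothetical "SA MAD send switch" hook: a local SA agent registers per
 * SA attribute.  Before the core puts an outgoing SA query on the wire it
 * checks for an agent registered for that attribute; an agent that can
 * satisfy the query consumes it and later injects its reply through the
 * normal receive path, so the client never sees the difference.
 */
struct ib_sa_local_agent {
        u16     attr_id;        /* host order, e.g. IB_SA_ATTR_PATH_REC */
        /* return 0 if consumed (agent will inject a response),
         * non-zero to let the query go out to the real SA */
        int     (*handle_query)(struct ib_sa_local_agent *agent,
                                struct ib_mad_send_buf *send_buf);
        void    *context;
        struct list_head list;
};

int ib_sa_local_agent_register(struct ib_device *device, u8 port_num,
                               struct ib_sa_local_agent *agent);
void ib_sa_local_agent_unregister(struct ib_sa_local_agent *agent);

A PathRecord cache would then register only for IB_SA_ATTR_PATH_REC and leave
every other attribute going straight to the SA, which also fits Sean's point
about implementing the cache in stages.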
> Functional requirements:
> * It is clear that the first SA query to cache is PathRecord.
>   So if a new client wants to connect to another node, a new PathRecord
>   query will not need to be sent to the SA. However, recent work on QoS
>   has pointed out that under some QoS schemes PathRecords should not be
>   shared by different clients, or even connections. There are several ways
>   to make such a QoS scheme scale. Since this is a different discussion
>   topic, I only bring it up so that we take into account that caching might
>   also need to be done by a complex key (not just SRC/DST ...).

Per the QoS direction, this complex key is indeed part of the enhanced
PathRecord, right?

> * Forgive me for bringing the following issue to the group over and over:
>   Multicast Join/Leave should be reference counted. The "SA local agent"
>   could be the right place for doing this kind of reference counting
>   (actually, if it does that, it probably needs to be located in the
>   kernel - to enable cleanup after killed processes).

The cache itself may need another level of reference counting (even if
invalidation is broadcast).

> * Similarly - "Client re-registration" could be made transparent to
>   clients.
>
> Cache Invalidation:
> Several discussions about PathRecord invalidation were spawned in the past.
> IMO, it is enough to be notified about the death of local processes, remote
> port availability (traps 64/65), and multicast group availability (traps
> 66/67) in order to invalidate SA cache information.

I think that it's more complicated than this. As an example, how does the SA
cache know whether a cached path record needs to be changed based on traps
64/65? It seems to me this needs to be tightly tied to the SM/SA.

> So each SA agent could register to obtain this data. But that solution does
> not scale nicely, as the SA needs to send notifications to all nodes
> (though it is reliable - it could resend until Repressed).
> However, the current IBTA definition of InformInfo (the event forwarding
> mechanism) does not allow for multicast of Report(Notice). The reason is
> that registration for event forwarding is done with Set(InformInfo), which
> uses the requester's QP and LID as the address for sending the matching
> report. A simple way around that limitation could be to enable the SM to
> "pre-register" a well-known multicast group target for event forwarding.
> One issue, though, would be that UD multicast is not reliable and some
> notifications could get lost. A notification sequence number could be used
> to catch these missed notifications eventually.

A multicast group could be defined for SA caching. The reliable aspects are
another matter, although the Represses could be unicast back to the cache.

-- Hal
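P.S. Here is a rough illustration of the sequence number idea for catching
reports lost on the unreliable UD multicast. The report layout and the helper
functions are invented for the example (the real traps 64/65 carry a GID and
66/67 an MGID in their data details), so treat this strictly as a sketch of
the gap detection:

#include <stdint.h>

/* Hypothetical per-report payload sent to the "SA cache" multicast group.
 * The seq field does not exist in the IBTA Notice today - it is the
 * proposed addition. */
struct sa_cache_report {
        uint64_t seq;       /* monotonically increasing at the SA */
        uint16_t trap_num;  /* 64/65 port in/out of service, 66/67 mcast */
        uint8_t  gid[16];   /* GID/MGID taken from the trap data details */
};

/* hypothetical cache internals */
void sa_cache_invalidate_paths_to(const uint8_t *gid);
void sa_cache_invalidate_group(const uint8_t *mgid);
void sa_cache_revalidate_all(void);  /* full resync against the SA */

static uint64_t expected_seq;

static void sa_cache_handle_report(const struct sa_cache_report *rep)
{
        if (rep->seq != expected_seq) {
                /* one or more reports were lost - stop trusting the
                 * cached data until it has been resynced with the SA */
                sa_cache_revalidate_all();
        }
        expected_seq = rep->seq + 1;

        switch (rep->trap_num) {
        case 64:
        case 65:
                sa_cache_invalidate_paths_to(rep->gid);
                break;
        case 66:
        case 67:
                sa_cache_invalidate_group(rep->gid);
                break;
        }
}

Whether a gap forces a full resync or only revalidation of the affected
records is a policy decision for the cache; either way the sequence number
only tells the cache that it missed something, not what it missed.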
