[Doug writes] So, if Lustre creates only one endpoint (QP) to another node and fires a high rate of concurrent messages (high thread count) over that endpoint, will libfabrics/kFabrics intelligently use CPU cores, IRQ balancing, NUMA, etc? Or will it be the responsibility of the application writers to find a way to manipulate the use of endpoints to get the best performance?
OK - I grok where you are coming from...

Thread & core allocation/scheduling/binding w.r.t. endpoints are all aspects outside the scope of libfabric/kFabric today.

From a libfabric/kFabric provider POV, what would 'intelligently use CPU cores, IRQ balancing, NUMA' actually imply? The transport layer (aka the libfabric/kFabric provider), existing at a layer below the client, could have a difficult time guessing at the thread/core behavior a higher-level client layer would expect.

That said, perhaps the client could provide hints as to the desired/expected behavior, which the provider could choose to implement if possible.

Getting this design discussion on the OFIWG things-to-think-about list would be a good 1st step.

Stan.

> On Feb 12, 2016, at 8:52 AM, Smith, Stan <[email protected]> wrote:
>
> Hi Doug,
> I may have misled you into believing that clients of libfabric and/or kFabric
> are responsible for transport locking issues; they are 'not'.
>
> Libfabric/kFabric providers 'are' responsible for access serialization to
> hardware.
>
> s.
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Oucharek, Doug S
> Sent: Wednesday, February 10, 2016 3:37 PM
> To: Paul Grun <[email protected]>
> Cc: [email protected]
> Subject: [ofiwg] DS/DA Runtime Model Discussion
>
> This email is a followup to my comment in a previous DS/DA call about the
> runtime model being an important part of the DS/DA definition.
>
> MPI seems to be the dominant user of fabrics in HPC. As such, they have a
> huge impact on the design of the runtime model being followed by fabric
> developers and corresponding middleware (what I consider OFED/verbs,
> libfabrics, and DS/DA). Currently, they seem to be pushing for bare metal
> access from the providers, leaving the work of serialization/locking to the
> middleware or the applications themselves.
>
> If DS/DA follows libfabrics in its development, I am concerned that the bare
> metal mindset will dominate here as well and that will leave “application
> anarchy” with regard to how serialization/locking is being done. Mitigating
> the strategy of fabric users is something I would expect from the providers
> (the one common access point regardless of middleware). The MPI push was to
> get this common point to back off and leave serialization/locking to the
> upper layers, but we now do not have a common point to coordinate competing
> access to the fabric.
>
> Should it not be a part of the middleware (libfabrics and DS/DA) to, at the
> very least, put demands upon the providers so a common strategy for
> serialization/locking can be enforced for a specific fabric, so the apps, like
> Lustre, don’t have to make significant code changes to get reasonable
> performance out of the fabric? If we have to make significant changes for
> each new fabric released, the value of the middleware (be it OFED,
> libfabrics, or DS/DA) is severely diminished and we might as well just access
> the fabric drivers directly.
>
> Discussion?
>
> Doug
> _______________________________________________
> ofiwg mailing list
> [email protected]
> http://lists.openfabrics.org/mailman/listinfo/ofiwg
