You can see where I am coming from.  As an application writer using this 
middleware, if I write my code one way and am able to get good performance 
from fabric A (provider A), I expect a consistent performance profile when I 
start to support fabric B (provider B).  If I have to put a bunch of “if 
this provider, do this; if that provider, do something different” conditions 
in my application to get consistent performance out of the fabric, I 
consider that a failure of the middleware.  The middleware should minimize 
the changes applications must make to adopt new fabrics, and that needs to 
include, as much as possible, whatever is required for best performance.

I appreciate that the application may need to provide hints, message profiles, 
etc. to make the job easier.  But good middleware should be a negotiator 
between the application and the provider so I don’t have to learn all the 
gritty details of how the provider works just to use it reasonably well.  

Doug

> On Feb 12, 2016, at 10:52 AM, Smith, Stan <[email protected]> wrote:
> 
> [Doug writes] 
> So, if Lustre creates only one endpoint (QP) to another node and fires a high 
> rate of concurrent messages (high thread count) over that endpoint, will 
> libfabric/kFabric intelligently use CPU cores, IRQ balancing, NUMA, etc?  
> Or will it be the responsibility of the application writers to find a way to 
> manipulate the use of endpoints to get the best performance?
> 
> 
> OK - I grok where you are coming from...
> 
> Thread & core allocation/scheduling/binding w.r.t. endpoints are all aspects 
> outside the current scope of libfabric/kFabric today.
> 
> From a libfabric/kFabric provider POV what would 'intelligently use CPU 
> cores, IRQ balancing, NUMA'  actually imply?
> 
> The transport layer (aka the libfabric/kFabric provider), sitting at a layer 
> below the client, could have a difficult time guessing the thread/core 
> behavior a higher-level client would expect.
> 
> That said, perhaps the client could provide hints as to the desired/expected 
> behavior which the provider could choose to implement if possible.
> 
> Getting this design discussion on the OFIWIG things-to-think-about list would 
> be a good 1st step.
> 
> Stan.
> 
> 
> 
>> On Feb 12, 2016, at 8:52 AM, Smith, Stan <[email protected]> wrote:
>> 
>> Hi Doug,
>> I may have misled you into believing that clients of libfabric and/or 
>> kFabric are responsible for transport locking issues; they are 'not'.
>> 
>> Libfabric/kFabric providers 'are' responsible for access serialization to 
>> hardware.
>> 
>> s.
>> 
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Oucharek, Doug S
>> Sent: Wednesday, February 10, 2016 3:37 PM
>> To: Paul Grun <[email protected]>
>> Cc: [email protected]
>> Subject: [ofiwg] DS/DA Runtime Model Discussion
>> 
>> This email is a followup to my comment in a previous DS/DA call about the 
>> runtime model being an important part of the DS/DA definition.
>> 
>> MPI seems to be the dominant user of fabrics in HPC.  As such, it has a 
>> huge impact on the design of the runtime model being followed by fabric 
>> developers and the corresponding middleware (what I consider OFED/verbs, 
>> libfabric, and DS/DA).  Currently, the MPI community seems to be pushing 
>> for bare-metal access from the providers, leaving the work of 
>> serialization/locking to the middleware or the applications themselves.
>> 
>> If DS/DA follows libfabric in its development, I am concerned that the 
>> bare-metal mindset will dominate here as well, and that will leave 
>> “application anarchy” with regard to how serialization/locking is done.  
>> Mediating among the strategies of competing fabric users is something I 
>> would expect from the providers (the one common access point regardless of 
>> middleware).  The MPI push was to get this common point to back off and 
>> leave serialization/locking to the upper layers, but now we do not have a 
>> common point to coordinate competing access to the fabric.
>> 
>> Should it not be part of the middleware (libfabric and DS/DA) to, at the 
>> very least, put demands upon the providers so that a common 
>> serialization/locking strategy can be enforced for a specific fabric, and 
>> so that apps like Lustre don’t have to make significant code changes to 
>> get reasonable performance out of it?  If we have to make significant 
>> changes for each new fabric released, the value of the middleware (be it 
>> OFED, libfabric, or DS/DA) is severely diminished and we might as well 
>> access the fabric drivers directly.
>> 
>> Discussion?  
>> 
>> Doug
>> _______________________________________________
>> ofiwg mailing list
>> [email protected]
>> http://lists.openfabrics.org/mailman/listinfo/ofiwg
> 
