Re: [ofiwg] DS/DA Runtime Model Discussion

Smith, Stan Fri, 12 Feb 2016 10:52:41 -0800

[Doug writes] 
So, if Lustre creates only one endpoint (QP) to another node and fires a high 
rate of concurrent messages (high thread count) over that endpoint, will 
libfabric/kFabric intelligently use CPU cores, IRQ balancing, NUMA, etc.?  Or 
will it be the responsibility of the application writers to find a way to 
structure their use of endpoints to get the best performance?



OK, I grok where you are coming from.

Thread and core allocation, scheduling, and binding with respect to endpoints 
are all outside the current scope of libfabric/kFabric.

From a libfabric/kFabric provider POV, what would 'intelligently use CPU cores, 
IRQ balancing, NUMA' actually imply?

The transport layer (i.e., the libfabric/kFabric provider) sits below the 
client, so it would have a difficult time guessing the thread/core behavior a 
higher-level client layer expects.

That said, perhaps the client could provide hints as to the desired/expected 
behavior which the provider could choose to implement if possible.

Getting this design discussion onto the OFIWG things-to-think-about list would 
be a good first step.

Stan.



> On Feb 12, 2016, at 8:52 AM, Smith, Stan <[email protected]> wrote:
> 
> Hi Doug,
>  I may have misled you into believing that clients of libfabric and/or kFabric 
> are responsible for transport locking; they are 'not'.
> 
> Libfabric/kFabric providers 'are' responsible for serializing access to the 
> hardware.
> 
> s.
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Oucharek, Doug S
> Sent: Wednesday, February 10, 2016 3:37 PM
> To: Paul Grun <[email protected]>
> Cc: [email protected]
> Subject: [ofiwg] DS/DA Runtime Model Discussion
> 
> This email is a follow-up to my comment in a previous DS/DA call about the 
> runtime model being an important part of the DS/DA definition.
> 
> MPI seems to be the dominant user of fabrics in HPC.  As such, it has a 
> huge impact on the design of the runtime model followed by fabric 
> developers and the corresponding middleware (what I consider OFED/verbs, 
> libfabric, and DS/DA).  Currently, the MPI community seems to be pushing for 
> bare-metal access from the providers, leaving the work of 
> serialization/locking to the middleware or the applications themselves.
> 
> If DS/DA follows libfabric in its development, I am concerned that the 
> bare-metal mindset will dominate here as well, leaving “application anarchy” 
> with regard to how serialization/locking is done.  Coordinating the 
> strategies of different fabric users is something I would expect from the 
> providers (the one common access point regardless of middleware).  The MPI 
> push was to get this common point to back off and leave serialization/locking 
> to the upper layers, but as a result we no longer have a common point to 
> coordinate competing access to the fabric.
> 
> Should it not be the role of the middleware (libfabric and DS/DA) to, at the 
> very least, place demands on the providers so that a common 
> serialization/locking strategy can be enforced for a given fabric, and apps 
> like Lustre don’t have to make significant code changes to get reasonable 
> performance out of it?  If we have to make significant changes for each new 
> fabric released, the value of the middleware (be it OFED, libfabric, or 
> DS/DA) is severely diminished, and we might as well access the fabric 
> drivers directly.
> 
> Discussion?  
> 
> Doug
> _______________________________________________
> ofiwg mailing list
> [email protected]
> http://lists.openfabrics.org/mailman/listinfo/ofiwg
