As you can see from the colours in the first diagram, a vendor of a new fabric
has to provide the low-level driver and a provider (if they want to use OFED as
the API). If they don't want to use OFED, they also have to create an LND for
Lustre, as Cray did for GNI (see diagram). The Lustre community will NOT do
this for you. A vendor is expected to hire/train staff to ensure Lustre runs
well on their hardware (both new and updated), or to pay a Lustre support group
like mine to do it for them. Either way, it is an expensive proposition.
Now, the assumption that just writing a provider for OFED will save them from
having to do any Lustre-specific work is not necessarily true. Depending on how
well the OFED interface works for their specific design, some adaptations may
still need to be made in the LND. That, again, is not something the Lustre
community is going to do. The vendor is responsible for that.
To the point: if we just swap kFabrics for OFED in the first diagram, that will
not change the responsibility of who does what. There will be a new LND that
sits on top of kFabrics and the Lustre community will “eventually” come to
support it. But any custom changes to that LND for different vendor hardware
is the responsibility of the vendor. The best thing kFabrics can do to protect
all the vendors from having to spend time and money on LND optimizations is to
isolate, as much as possible, the different providers and their performance
characteristics from the LND layer. If this is done well, we can evolve to the
second diagram where the LND layer disappears (or is adopted by kFabrics) and
all the Lustre community has to do is maintain LNet and its use of kFabrics.
That should also make vendors' lives easier if/when they choose to support
Lustre.
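To ground that: the boundary I keep referring to is the LND dispatch table that
LNet calls into. A much-simplified sketch of Lustre's lnd_t follows (field
names are approximate; the real structure carries more callbacks and state):

#include <linux/types.h>
#include <linux/uio.h>

/* Much-simplified sketch of the LNet <-> LND boundary, approximating
 * Lustre's lnd_t. Everything behind these callbacks -- connection
 * management, completions, credits, locking -- is the vendor-specific
 * work discussed above. */
struct lnet_ni;                 /* network interface, owned by LNet */
struct lnet_msg;                /* message descriptor from LNet */

typedef struct lnet_lnd {
        __u32 lnd_type;                          /* e.g. O2IBLND, GNILND */
        int  (*lnd_startup)(struct lnet_ni *ni); /* bring the fabric up */
        void (*lnd_shutdown)(struct lnet_ni *ni);
        int  (*lnd_send)(struct lnet_ni *ni, void *private,
                         struct lnet_msg *msg);
        int  (*lnd_recv)(struct lnet_ni *ni, void *private,
                         struct lnet_msg *msg, int delayed,
                         unsigned int niov, struct kvec *iov,
                         unsigned int offset, unsigned int mlen,
                         unsigned int rlen);
} lnd_t;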
I suspect that GPFS and NVMe will find themselves in a similar boat. Solving
this in kFabrics solves it for all of us, rather than each of us having to
tackle it on a per-application basis. User space is not having the same issues
with libfabrics because there are a bunch of other layers in the networking
onion taking a role in smoothing the road over. We don't have those layers in
kernel space.
Just a point about LNDs: the o2iblnd (interface to OFED) is over 6,000 lines of
some very complex code. The gnilnd is significantly bigger. Writing and
testing a new LND is a very significant effort and is not something any vendor
should expect to do in under 6 months. The more of this kFabrics takes on, the
more it will save vendors and app developers alike.
Doug
On Feb 18, 2016, at 4:47 PM, Paul Grun <[email protected]>
wrote:
Meanwhile, are we anywhere near to addressing your original question? I think
we may have wandered afield...
-----Original Message-----
From: Paul Grun
Sent: Thursday, February 18, 2016 4:47 PM
To: 'Oucharek, Doug S' <[email protected]>
Cc: Smith, Stan <[email protected]>; [email protected]
Subject: RE: DS/DA Runtime Model Discussion
Let's keep this useful discussion alive...
I can see your point that LND looks to the LNET layer like a provider, inasmuch
as it insulates the network layer (LNET) from the vagaries of a specific
network. My understanding of the details of the layers is rusty, at best, but
as far as I know there are two main LND layers available today - one for RDMA
networks, and one for non-RDMA networks, like TCP/IP. The former is o2iblnd,
and I can't remember the name for the latter. My assumption is (please correct
me if I'm wrong) that o2iblnd only runs over a single network, that being IB
(including its RoCE variant).
So I have a couple of thoughts:
1. Does it make sense to write the existing o2iblnd layer to the (proposed)
kfabric API? Keep in mind that IB is one of the networks supported by kfabric
via a verbs provider layer. Doing this would truly insulate the LNET layer
from substitutions of the underlying network. It would also place LND squarely
in the realm of being a consumer of network services. Or...
2. Does it make sense to write a new LND which is natively coded to kfabric,
leaving us with (at least) three possible LND layers?
I'd love to hear a discussion about this.
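To make option 2 a bit more concrete, a kfabric-native LND's startup path would
own a sequence roughly like the following. I'm using the user-space libfabric
calls as a stand-in for their kernel analogs since kfabric is still being
proposed, so treat every name here as an assumption:

#include <errno.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Sketch only: user-space libfabric names standing in for a future
 * kfabric equivalent. */
static int kfablnd_startup(struct fid_fabric **fabric,
                           struct fid_domain **domain, struct fid_ep **ep)
{
        struct fi_info *hints, *info;
        int ret;

        hints = fi_allocinfo();
        if (!hints)
                return -ENOMEM;
        hints->caps = FI_MSG | FI_RMA;        /* what LNET traffic needs */
        hints->ep_attr->type = FI_EP_RDM;     /* reliable datagram endpoint */

        ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);
        fi_freeinfo(hints);
        if (ret)
                return ret;

        ret = fi_fabric(info->fabric_attr, fabric, NULL);
        if (!ret)
                ret = fi_domain(*fabric, info, domain, NULL);
        if (!ret)
                ret = fi_endpoint(*domain, info, ep, NULL);
        /* Next: bind CQs, enable the endpoint, post receive buffers --
         * omitted here for brevity. */
        fi_freeinfo(info);
        return ret;
}

The point being: only this bottom layer would change between fabrics; whatever
sits above it (LNET, or a thinned LND) would not.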
As far as equating LND with MPI, I would describe MPI as communications
middleware - it provides a communication service which I would equate with
LNET, not with LND. Obviously the analogies are far from perfect. As you
point out, in today's world, the kernel treats LND as a network service
provider. I guess my suggestion is that we try to push it up the stack
slightly and lump it together with LNET as the communications service.
Your thoughts?
-Paul
-----Original Message-----
From: Oucharek, Doug S [mailto:[email protected]]
Sent: Monday, February 15, 2016 11:32 AM
To: Paul Grun <[email protected]>
Cc: Smith, Stan <[email protected]>; [email protected]
Subject: Re: DS/DA Runtime Model Discussion
In a way, I view the Lustre LND layer as a provider layer (specific code for a
specific fabric API) and the LNet layer which is above the LNDs as the network
services layer. Guess it comes down to perspective :^).
As a former user space developer, my example of a network services layer is
something like ZeroMQ, which provides a complete end-to-end communications
system and handles such things as the threading model when running
asynchronously. If the network stack being used requires a different approach
to the runtime model, the ZeroMQ developers deal with that, thereby protecting
the applications from having to change.
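To show what I mean concretely, here is a minimal libzmq sketch (the endpoint
string is a placeholder): the application creates a context and a socket and
just sends; the context owns the I/O threads and the whole runtime model,
whatever transport is underneath.

#include <string.h>
#include <zmq.h>

int main(void)
{
        /* The context owns ZeroMQ's I/O threads; the application never
         * manages them, regardless of the underlying transport. */
        void *ctx = zmq_ctx_new();
        void *sock = zmq_socket(ctx, ZMQ_PUSH);

        zmq_connect(sock, "tcp://localhost:5555");   /* placeholder endpoint */
        zmq_send(sock, "hello", strlen("hello"), 0); /* async under the hood */

        zmq_close(sock);
        zmq_ctx_term(ctx);
        return 0;
}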
I guess MPI is the replacement for ZeroMQ in the HPC world. However, kernel
space has nothing like ZeroMQ or MPI that file systems like Lustre or GPFS can
use, so we have to have layers like Lustre's LND to do that work for us. Using
OFED/verbs from one of our LNDs was supposed to help protect us from changes in
vendor hardware/firmware. It doesn’t. Recently, Mellanox changed their
firmware from mlx4 to mlx5. In theory, Lustre should never have cared about
that as OFED should be a standard which shields us from such changes (i.e. if a
change to the usage model is needed, that should be made to the OFED code base
and not what lies above). I have just spent the last two months firefighting
the effects on customers who upgraded one or more IB cards in a cluster from
mlx4 to mlx5.
In my perfect dream world, the work our LNDs do would be absorbed by kFabrics;
all Lustre would have to do is change LNet to use kFabrics directly, and we
could toss away all the LNDs and run equally well on current and future
fabrics.
Doug
On Feb 12, 2016, at 5:08 PM, Paul Grun <[email protected]>
wrote:
In general, I agree with your basic assertion...one of the expected values of
the OFI project is 'application transportability', meaning that a given
consumer of the services offered via the API should be easily ported from one
provider to another (assuming that both providers offer equivalent
functionality).
That being said, one of the expectations of the OFI project is that a given
provider vendor may target his provider at a particular market and thus may
optimize his implementation for that market resulting in a higher
quality/higher performing provider, but potentially at higher cost. None of
which negates your basic point.
One point I do want to raise is the expression 'middleware'. The convention
we've adopted in OFI is to refer to everything above the API as a consumer of
network services, and everything below the API as comprising the network stack.
Thus MPI, which is referred to as communications middleware, is a consumer of
network services.
I am looking (in vain, I'm afraid) for my canonical LNET stack diagram, but if
memory serves I think of the LND layer, which is written to a particular
network API (e.g. o2iblnd), as a consumer and thus roughly equivalent to MPI as
middleware. But I would not think of the provider as being middleware.
All that aside, to help me better visualize your point, can you give an example
of a specific way that an LNET consumer (LND?) would behave that might differ
between providers in order to maximize performance?
Thanks,
-Paul
-----Original Message-----
From: Oucharek, Doug S [mailto:[email protected]]
Sent: Friday, February 12, 2016 11:25 AM
To: Smith, Stan <[email protected]>
Cc: Paul Grun <[email protected]>; [email protected]
Subject: Re: DS/DA Runtime Model Discussion
You can see where I am coming from. As an application writer using this
middleware, if I write my code one way and am able to get good performance from
fabric A (provider A), I am expecting to get a consistent performance profile
when I start to support fabric B (provider B). If I have to put a bunch of “if
this provider, do this, if that provider, do something different” conditions in
my application to get consistent performance out of the fabric, I consider that
a fail of the middleware. The middleware should minimize the changes
applications must make to adopt new fabrics, and that needs to include, as much
as possible, what is required for best performance.
I appreciate that the application may need to provide hints, message profiles,
etc. to make the job easier. But good middleware should be a negotiator
between the application and the provider so I don’t have to learn all the
gritty details of how the provider works just to use it reasonably well.
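Libfabric's fi_getinfo() hints mechanism is roughly the negotiation model I
have in mind: the application states what it needs, and only providers that can
honor it are matched, so the application code itself stays provider-agnostic.
A rough sketch (the specific values are illustrative, not a recommendation):

#include <rdma/fabric.h>

static struct fi_info *negotiate(void)
{
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL;

        hints->caps = FI_MSG | FI_RMA;       /* operations we will use */
        hints->ep_attr->type = FI_EP_RDM;    /* reliable, unconnected */

        /* Only providers that can satisfy the hints are returned. */
        if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info))
                info = NULL;                 /* no provider can comply */
        fi_freeinfo(hints);
        return info;
}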
Doug
On Feb 12, 2016, at 10:52 AM, Smith, Stan <[email protected]> wrote:
[Doug writes]
So, if Lustre creates only one endpoint (QP) to another node and fires a high
rate of concurrent messages (high thread count) over that endpoint, will
libfabrics/kFabrics intelligently use CPU cores, IRQ balancing, NUMA, etc? Or
will it be the responsibility of the application writers to find a way to
manipulate the use of endpoints to get the best performance?
OK - I grok where you are coming from...
Thread & core allocation/scheduling/binding w.r.t. endpoints are all aspects
outside the current scope of libfabric/kFabric today.
From a libfabric/kFabric provider POV what would 'intelligently use CPU cores,
IRQ balancing, NUMA' actually imply?
The transport layer (aka the libfabric/kFabric provider), existing at a layer
below the client, could have a difficult time guessing at the thread/core
behavior a higher-level client layer would expect.
That said, perhaps the client could provide hints as to the desired/expected
behavior which the provider could choose to implement if possible.
Getting this design discussion on the OFIWIG things-to-think-about list would
be a good 1st step.
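As a thought experiment on what such a hint might buy: the client could state
its thread count, and the provider (or a shim above it) could stripe endpoints
and completion queues across cores, keeping completion work core-local.
Sketched below in user-space libfabric terms; the striping policy itself is
invented, and today the client would have to do all of this by hand:

#include <unistd.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int stripe_endpoints(struct fid_domain *domain, struct fi_info *info,
                            struct fid_ep **eps, struct fid_cq **cqs)
{
        struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        int ret;

        for (long i = 0; i < ncpu; i++) {
                /* One endpoint + completion queue per core. */
                ret = fi_cq_open(domain, &cq_attr, &cqs[i], NULL);
                if (ret)
                        return ret;
                ret = fi_endpoint(domain, info, &eps[i], NULL);
                if (ret)
                        return ret;
                ret = fi_ep_bind(eps[i], &cqs[i]->fid, FI_TRANSMIT | FI_RECV);
                if (ret)
                        return ret;
                /* Pinning the thread that reaps cqs[i] to core i, and
                 * steering IRQs, is outside libfabric's scope today. */
        }
        return 0;
}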
Stan.
On Feb 12, 2016, at 8:52 AM, Smith, Stan <[email protected]> wrote:
Hi Doug,
I may have misled you into believing that clients of libfabric and/or kFabric
are responsible for transport locking issues; they are not. Libfabric/kFabric
providers are responsible for serializing access to hardware.
s.
-----Original Message-----
From:[email protected]<mailto:[email protected]>
[mailto:[email protected]] On Behalf Of Oucharek, Doug S
Sent: Wednesday, February 10, 2016 3:37 PM
To: Paul Grun <[email protected]<mailto:[email protected]>>
Cc:[email protected]<mailto:[email protected]>
Subject: [ofiwg] DS/DA Runtime Model Discussion
This email is a followup to my comment in a previous DS/DA call about the
runtime model being an important part of the DS/DA definition.
MPI seems to be the dominant user of fabrics in HPC. As such, it has a huge
impact on the design of the runtime model being followed by fabric developers
and the corresponding middleware (what I consider OFED/verbs, libfabrics, and
DS/DA to be). Currently, they seem to be pushing for bare-metal access from the
providers, leaving the work of serialization/locking to the middleware or the
applications themselves.
If DS/DA follows libfabrics in its development, I am concerned that the
bare-metal mindset will dominate here as well, and that will leave “application
anarchy” with regards to how serialization/locking is done. Mediating the
strategies of fabric users is something I would expect from the providers (the
one common access point regardless of middleware). The MPI push was to get this
common point to back off and leave serialization/locking to the upper layers,
but now we do not have a common point to coordinate competing access to the
fabric.
Should it not be part of the middleware (libfabrics and DS/DA) to, at the very
least, put demands upon the providers so that a common strategy for
serialization/locking can be enforced for a specific fabric, and apps like
Lustre don't have to make significant code changes to get reasonable
performance out of the fabric? If we have to make significant changes for each
new fabric released, the value of the middleware (be it OFED, libfabrics, or
DS/DA) is severely diminished and we might as well just access the fabric
drivers directly.
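For reference, user-space libfabric already expresses this contract explicitly
through the domain threading attribute: the consumer declares who does the
locking, and fi_getinfo() only matches providers that can comply. A sketch of
that mechanism (which I would want DS/DA to require of its providers at a
minimum):

#include <rdma/fabric.h>

/* The threading attribute is the serialization contract:
 *   FI_THREAD_SAFE   -> the provider does all locking; application
 *                       threads may call in freely.
 *   FI_THREAD_DOMAIN -> the application promises to serialize access
 *                       within a domain; the provider may drop locks. */
static struct fi_info *locking_contract(void)
{
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL;

        hints->domain_attr->threading = FI_THREAD_SAFE;
        if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info))
                info = NULL;
        fi_freeinfo(hints);
        return info;
}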
Discussion?
Doug
_______________________________________________
ofiwg mailing list
[email protected]<mailto:[email protected]>
http://lists.openfabrics.org/mailman/listinfo/ofiwg