From: "Eric Barton" <[EMAIL PROTECTED]>
    Date: Mon, 4 Dec 2006 16:11:25 -0000
    
    
    > I think this may boil down to a question of tastefulness.
    > 
    > In the case where A does a GET at B, the nature of our transport
    > dictates that B must figure out what he's going to do to satisfy the
    > GET, get it staged up, then send something back to A saying "here's
    > the thing you're receiving from".  Both the GET request and the
    > response describing where A should pull from will go over the
    > short-message channel.
    
    Alternatively A can send over the RDMA descriptor for its sink buffer
    and B can just launch the transfer after it has called lnet_parse()
    and had its recv() callback called to "receive" the GET.  The elan and
    openib(gen1)/cisco LNDs do this.

Yeah, I see that, but our transport doesn't work that way.  I need to have
already initialized the transmit side before I can kick the receive side into
gear.  After that it all happens in hardware, but the cost is that I'm not
allowed to start the receive side first.
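
In hardware terms the required ordering looks roughly like this (the
mylnd_/mydma_ names and types are all invented, just to illustrate the
sequence):

static void
mylnd_start_push(mylnd_peer_t *peer, mydma_frag_t *src_frags, int nfrags)
{
        __u64 cookie;

        /* 1: the DMA engine requires the data source to be armed first */
        cookie = mydma_arm_tx(src_frags, nfrags);

        /* 2: short control message: "my transmit is staged, now start
         *    your receive" */
        mylnd_send_ctl(peer, MYLND_CTL_START_RECV, cookie);

        /* 3: the peer programs its receive side; from here the transfer
         *    runs entirely in hardware, with completion events raised at
         *    both ends */
}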
    
    Where it gets a bit yeuchy is that the last arg of lnet_parse() is 1
    (TRUE) in this case, meaning that the LND wants to do RDMA using
    whatever other info (e.g. RDMA descriptors) it just received, so LNET
    should call its recv() callback passing the matched buffers (which the
    recv() will actually send from!!!).  If the LND passed 0 (FALSE), its
    recv() callback would still occur, but only to let it repost the
    buffer - the send() callback might not be invoked with the matched
    buffers until some time later, and it hugely complicates the LND code
    to make it hang on to the receive buffer (including the peer's sink
    buffer RDMA descriptor) until then.

Ouch.  You're making my brain hurt.
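
Let me play that back, though, to check I've got the shape of it.  Very
roughly, with everything prefixed mylnd_/mydma_ invented and the LNET
prototypes abbreviated from memory:

/* 1) The GET header arrives on the short-message channel; hand it to
 *    LNET with the last arg (rdma_req) set to 1, i.e. "I'll do the RDMA
 *    myself using the descriptor I just received". */
static void
mylnd_rx_get_hdr(lnet_ni_t *ni, mylnd_rx_t *rx)
{
        (void)lnet_parse(ni, &rx->hdr, rx->from_nid, rx, 1);
}

/* 2) LNET matches the GET and calls our recv() callback with the matched
 *    buffers -- which, for a GET, we actually transmit *from*.  Parameter
 *    list simplified from the real lnd_recv prototype. */
static int
mylnd_recv_get(mylnd_rx_t *rx, lnet_msg_t *lntmsg, lnet_kiov_t *kiov,
               unsigned int niov, unsigned int offset, unsigned int len)
{
        /* same two steps as the sketch above, driven from recv():
         * arm the transmit side from the matched buffers first... */
        __u64 cookie = mydma_arm_tx_kiov(kiov, niov, offset, len);

        /* ...then the short control message telling A to start its pull;
         * lnet_finalize() gets called from the DMA completion handler */
        mylnd_send_ctl(rx->peer, MYLND_CTL_START_RECV, cookie);
        return 0;
}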
    
    > Looking at existing LNDs that do similar things, it looks like the
    > response message is often handled more "under the hood".  But our
    > transport is asymmetrical in that respect, i.e. when a receiver
    > starts a pull, the signalling is handled by the hardware, but when a
    > transmitter wants to push, he must send a short message to tell the
    > receiver to pull.
    > 
    > So the question is, is it a "Bad Thing" (tm) to just invent a new
    > opcode analogous to LND_GET_REQUEST, and let it percolate through
    > the same flow of control as everything else?  Or do you think I'm
    > better off having basically two layers of messaging, one for these
    > out-of-band ones and the other for the usual short-message stuff?
    
    I used RDMA READ on the first RDMA-capable networks, but stopped using
    it for iiblnd (and other LNDs just followed suit) because I was
    worried about the additional resources required in the underlying
    network to support it.  If RDMA READ is a natural choice on Scicortex,
    it's a no-brainer to eliminate yet another network latency.
    
So it sounds like the answer is that I *should* implement a separate layer of
messaging, where the first dispatch decides whether a message is a low-level
DMA control message or an LND message that needs to go through the usual
mechanism.  That way the control messages (of which I believe "I've set up my
transmit, now you start your receive" is the only one) will be handled with a
minimum of overhead.
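
Concretely, I'm picturing the top of the receive path looking something like
this (all the MYLND_/mylnd_ names are invented; it's just a sketch of the
dispatch, not real code):

enum {
        MYLND_CTL_START_RECV = 0xf0,   /* "my transmit is staged, pull now" */
        /* ...followed by the ordinary LND message types (immediate
         *    payloads, PUT/GET headers, ...) */
};

static void
mylnd_rx_dispatch(mylnd_rx_t *rx)
{
        switch (rx->msg_type) {
        case MYLND_CTL_START_RECV:
                /* low-level DMA control: the peer has armed its transmit,
                 * so kick our receive side immediately -- this never goes
                 * anywhere near lnet_parse() */
                mydma_start_rx(rx->cookie);
                break;

        default:
                /* everything else is an ordinary LND message and takes
                 * the usual route through lnet_parse() and friends */
                mylnd_handle_msg(rx);
                break;
        }
}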

    However, note the restrictions on when you can use it to implement
    GET/REPLY.  When LNET routers are involved, you cannot optimize the
    GET in this way - the LNET GET request must make it all the way to the
    destination before the source buffers can be matched, and the LNET
    REPLY is forwarded back just like PUTs are.
    
Yes, understood.  There are a number of things that get more complicated when
I start to deal with routing, but I believe I can push them all off until rev
2.  For starters, the only thing that will use this LND is the case where the
cluster is directly connected (via FC or whatever) to the storage array, and
therefore all I care about is the simple case, no routing.  For external
cases, I expect we'll just be using socklnd.  I'm trying to leave breadcrumbs
in the code for that future date.

There are other issues that will need to be addressed before I can really
build a proper router anyhow.  For instance, can I interface IB's DMA stuff
directly to our proprietary technology to make bits stream all the way through
from end to end?  If the answer is yes, then the world is a very different
place than if things have to land in memory and be re-launched by software
along the way.  It might turn out that the right architecture is simply
(hah!) to implement an IB-like semantic layer, then use that to extend
something like openib throughout the system.  Until we can think through those
issues, I'm not going to push too hard on the routing question.
