[Lustre-devel] RE: More LND

Eric Barton Mon, 04 Dec 2006 08:10:47 -0800

> I think this may boil down to a question of tastefulness.
> 
> In the case where A does a GET at B, the nature of our transport
> dictates that B must figure out what he's going to do to satisfy the
> GET, get it staged up, then send something back to A saying "here's
> the thing you're receiving from".  Both the get request and the
> response describing where A should pull from will go over the
> short-message channel.


Alternatively, A can send over the RDMA descriptor for its sink buffer
and B can just launch the transfer after it has called lnet_parse()
and had its recv() callback called to "receive" the GET.  The elan and
openib(gen1)/cisco LNDs do this.
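
To make the flow above concrete, here is a minimal in-process sketch of
it.  Everything in it is hypothetical: the toy_* names and struct
layouts are invented for illustration and are not taken from any real
LND, and memcpy() merely stands in for posting an RDMA write.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a transport-specific RDMA descriptor. */
struct toy_rdma_desc {
    void  *addr;   /* start of the registered region */
    size_t nob;    /* number of bytes available there */
};

/* Hypothetical short message: the GET request carries A's sink descriptor. */
struct toy_get_req {
    unsigned long long   match_bits;  /* what A is asking for */
    struct toy_rdma_desc sink;        /* where B should put the data */
};

/* B's side, run once the GET has been matched to a source buffer.
 * memcpy() stands in for the RDMA into A's sink buffer.
 * Returns the number of bytes transferred. */
size_t toy_reply_to_get(const struct toy_get_req *req,
                        const void *src, size_t src_nob)
{
    size_t nob = src_nob < req->sink.nob ? src_nob : req->sink.nob;

    memcpy(req->sink.addr, src, nob);
    return nob;
}
```

The point of the sketch is only that B needs no extra short message
back to A: the sink descriptor arrived with the GET, so B can start the
transfer as soon as the source buffer is known.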

Where it gets a bit yeuchy is that the last arg of lnet_parse() is 1
(TRUE) in this case, meaning that the LND wants to do RDMA using
whatever other info (e.g. RDMA descriptors) it just received, so lnet
should call its recv() callback passing the matched buffers (which the
recv() will actually send from!!!).  If it passed 0 (FALSE), the
recv() callback would still occur, but only to allow the LND to repost
its buffer.  The LND's send() callback might not be invoked with the
matched buffers until some time later, and making the LND hang on to
the receive buffer (including the peer's sink buffer RDMA descriptor)
until then hugely complicates the LND code.
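
A toy model of the two cases may help.  This is a sketch only:
toy_parse() and toy_recv() are hypothetical and do not have the
signatures of the real lnet_parse() or of any LND's recv() callback;
they just mimic the control flow described above, with memcpy() standing
in for the RDMA.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical receive buffer holding a GET plus the peer's sink info. */
struct toy_rx {
    void  *sink;       /* peer sink "descriptor" received with the GET */
    size_t sink_nob;   /* size of the peer's sink buffer */
    int    posted;     /* 1 while the buffer is on the receive queue */
};

/* Models the LND recv() callback.  src is NULL when no matched
 * buffers are passed in.  Returns the number of bytes "RDMAed". */
static size_t toy_recv(struct toy_rx *rx, const void *src, size_t nob)
{
    size_t sent = 0;

    if (src != NULL) {
        /* Matched buffers are available: send from them right now,
         * while the sink descriptor in rx is still at hand. */
        sent = nob < rx->sink_nob ? nob : rx->sink_nob;
        memcpy(rx->sink, src, sent);
    }
    rx->posted = 1;    /* either way the receive buffer is reposted */
    return sent;
}

/* Models the choice made by the last argument of lnet_parse(). */
size_t toy_parse(struct toy_rx *rx, int rdma_req,
                 const void *matched, size_t nob)
{
    rx->posted = 0;    /* buffer is in use while the GET is handled */
    if (rdma_req)
        return toy_recv(rx, matched, nob);
    /* Without the flag, recv() only reposts; the data would have to
     * go out later via send(), after the sink info has been recycled. */
    return toy_recv(rx, NULL, 0);
}
```

With rdma_req set, the transfer and the repost happen in one place, so
nothing has to be kept alive between two callbacks.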

> Looking at existing LNDs that do similar things, it looks like the
> response message is often handled more "under the hood".  But our
> transport is asymmetrical in that way, ie when a receiver starts a
> pull, the signalling is handled by the hardware, but when a
> transmitter wants to push, he must send a short message to tell the
> receiver to pull.
> 
> So the question is, is it a "Bad Thing" (tm) to just invent a new
> opcode analogous to LND_GET_REQUEST, and let it percolate through
> the same flow of control as everything else?  Or do you think I'm
> better off having basically two layers of message, one for these out
> of band ones, and the other for the usual short-message stuff?

I used RDMA READ on the first RDMA capable networks, but stopped using
it for iiblnd (and other LNDs just followed suit) because I was
worried about the additional resources required in the underlying
network to support it.  If RDMA READ is a natural choice on SiCortex,
it's a no-brainer to eliminate yet another network latency.

However, note the restrictions on when you can use it to implement
GET/REPLY.  When LNET routers are involved, you cannot optimize the
GET in this way: the LNET GET request must make it all the way to the
destination before the source buffers can be matched, and the LNET
REPLY is then forwarded back just like a PUT.
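
As a rough illustration of that restriction, the check amounts to
asking whether the peer you are about to send to is the final
destination.  The names below (toy_nid_t, toy_choose_get_path) are made
up for this sketch and are not LNET's.

```c
#include <stdint.h>

/* Hypothetical node id type; LNET's own NID type is not modelled here. */
typedef uint64_t toy_nid_t;

enum toy_get_path {
    TOY_GET_RDMA_DIRECT,   /* peer is the destination: optimized GET is OK */
    TOY_GET_FORWARDED      /* peer is a router: send GET, expect a REPLY */
};

/* The optimization only applies when the next hop is the node that
 * actually owns the source buffers. */
enum toy_get_path toy_choose_get_path(toy_nid_t next_hop,
                                      toy_nid_t final_dest)
{
    return next_hop == final_dest ? TOY_GET_RDMA_DIRECT
                                  : TOY_GET_FORWARDED;
}
```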

-- 

                Cheers,
                        Eric

---------------------------------------------------
|Eric Barton        Barton Software               |
|9 York Gardens     Tel:    +44 (117) 330 1575    |
|Clifton            Mobile: +44 (7909) 680 356    |
|Bristol BS8 4LL    Fax:    call first            |
|United Kingdom     E-Mail: [EMAIL PROTECTED]|
---------------------------------------------------


_______________________________________________
Lustre-devel mailing list
Lustre-devel@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
