[Lustre-devel] RE: More LND

John R. Dunning Mon, 04 Dec 2006 10:37:51 -0800

    From: "Eric Barton" <[EMAIL PROTECTED]>
    Date: Mon, 4 Dec 2006 17:57:05 -0000
    
    > Yeah, I see that, but our transport doesn't work that way.  I need
    > to have already initialized the transmit side before I can kick the
    > receive side into gear.  After that it all happens in hardware, but
    > the cost is that I'm not allowed to start the receive side first.
    John,
    
    > So it sounds like the answer is that I *should* implement a
    > different layer of messaging, where the first dispatch is whether or
    > not it's a low-level dma control message vs an lnd message that
    > needs to go through the usual mechanism.  That way the control
    > messages (of which I believe "I've set up my transmit, now you start
    > your receive" is the only one) will be handled with a minimum of
    > overhead.
    
    Do you have a special low-latency way of handling messages that the
    short-message channel can't use?


The hardware (actually it's microcode), totally below the direct control of
software, does the signalling necessary to start the companion transmit on a
peer node when a receive is started.  I believe that's what you're asking
about?

There is nothing analogous for going the other direction.  IOW, initiating the
actual transfer of a bag of bits can only be done from the receive side, after
the transmit side is ready for it.  What this means in practice is that it's
cheaper to initiate a transfer from the transmit side, because the transmitter
can set it up, then send a short message to the receiver, which then starts
the receive operation.  To initiate a transfer from the receive side, the
receiver has to basically ask the transmitter to set up his side, get
something back that says that's done and passes the transmit descriptor, then
start up the receive.
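
A minimal sketch of that asymmetry, as a toy model only: every name in it
(struct xfer, tx_setup, rx_start, initiate_from_tx, initiate_from_rx) is
hypothetical and nothing here touches the real hardware or driver.  It simply
counts the short software messages each direction needs, under the rule that
the receive cannot start until the transmit side is ready.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one transfer between two nodes (all names hypothetical). */
struct xfer {
    bool tx_ready;    /* transmit side has been set up */
    bool rx_started;  /* receive side has been started (this triggers the DMA) */
    int  short_msgs;  /* software-visible short messages exchanged so far */
};

/* Transmit setup is a local operation and needs no message. */
static void tx_setup(struct xfer *x)
{
    x->tx_ready = true;
}

/* The receive may only be started once the transmit side is ready. */
static bool rx_start(struct xfer *x)
{
    if (!x->tx_ready)
        return false;
    x->rx_started = true;
    return true;
}

/* Transmitter-initiated: set up locally, then tell the receiver to go.
 * Returns the number of short messages used, or -1 on failure. */
static int initiate_from_tx(struct xfer *x)
{
    tx_setup(x);
    x->short_msgs++;   /* "I've set up my transmit, now you start your receive" */
    return rx_start(x) ? x->short_msgs : -1;
}

/* Receiver-initiated: ask, wait for the descriptor to come back, then start.
 * Returns the number of short messages used, or -1 on failure. */
static int initiate_from_rx(struct xfer *x)
{
    x->short_msgs++;   /* receiver -> transmitter: "please set up your side" */
    tx_setup(x);
    x->short_msgs++;   /* transmitter -> receiver: "done", plus the descriptor */
    return rx_start(x) ? x->short_msgs : -1;
}
```

In this model the transmit-initiated case costs one short message and the
receive-initiated case costs two, which is the extra round trip described
above.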

                                      If not, I don't see what this will
    really buy you.  If the "natural" RDMA on scicortex is RDMA READ, then
    LNET PUT would go something like....
    
     1. Lustre on node A calls LNetPut(), which calls sc_send() to send an
        LNET PUT header + payload.  sc_send() sets up the source buffers
        ready for transmission and sends an SC_PUT_REQ which includes the
        LNET header and the RDMA descriptor for the source buffers.
    
     2. The LND on node B receives the SC_PUT_REQ and passes the LNET
        header to lnet_parse().  When sc_recv() is called with the matched
        buffer iovs, it initiates the RDMA to fetch the PUT payload.
    
     3. Both sides receive notification somehow when the RDMA completes
        and call lnet_finalize().
    
yes...
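
The PUT steps quoted above can be sketched as follows.  This is an
illustration under loud assumptions, not LND code: struct rdma_desc,
struct sc_put_req, toy_send_put() and toy_recv_put() are hypothetical names,
the "LNET header" is reduced to an integer tag, and memcpy() stands in for the
receiver-initiated RDMA.  Completion notification (step 3) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for an RDMA descriptor: here just a pointer and a length. */
struct rdma_desc {
    const void *addr;
    size_t      len;
};

/* Toy SC_PUT_REQ: a header tag plus the descriptor for the source buffers. */
struct sc_put_req {
    int              hdr_tag;  /* stands in for the LNET header */
    struct rdma_desc src;
};

/* Step 1 (node A): describe the source buffer and build the request. */
static struct sc_put_req toy_send_put(int hdr_tag, const void *buf, size_t len)
{
    struct sc_put_req req = { hdr_tag, { buf, len } };
    return req;
}

/* Step 2 (node B): once the matched sink buffer is known, pull the payload.
 * memcpy() stands in for the RDMA started from the receive side.
 * Returns the number of bytes fetched, truncated to the sink size. */
static size_t toy_recv_put(const struct sc_put_req *req,
                           void *sink, size_t sink_len)
{
    assert(req != NULL && sink != NULL);
    size_t n = req->src.len < sink_len ? req->src.len : sink_len;
    memcpy(sink, req->src.addr, n);
    return n;
}
```

The point of the shape is that node A only describes its buffers; the data
moves when node B, holding the descriptor, starts the fetch.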

    ...and an "optimized" LNET GET might go something like...
    
     1. Lustre on node A calls LNetGet(), which calls sc_send().  This
        sends an SC_GET_REQ which includes the LNET header.
    
     2. The LND on node B receives the SC_GET_REQ and passes the LNET
        header to lnet_parse().  When sc_recv() is called with the matched
        buffer iovs, it sets up these buffers for transmitting and sends
        back an SC_GET_ACK containing the RDMA descriptor for them.
    
     3. The LND on node A receives the SC_GET_ACK and initiates the RDMA
        to fetch the GET payload.
    
     4. Both sides receive notification somehow when the RDMA completes
        and call lnet_finalize().

yes.  Step 2, what you're calling get-ack, is what I was originally asking
about; I was supposing it ought to happen "under the hood", i.e. without
benefit of going through lnet_parse etc.
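
For reference, node A's side of that "optimized" GET can be written as a small
state machine.  This is only a sketch of the sequence as quoted; the enum
names and get_step() are hypothetical, and error handling is collapsed into a
single GET_ERROR state.

```c
#include <assert.h>

/* Node A's view of one optimized GET (all names hypothetical). */
enum get_state { GET_IDLE, GET_REQ_SENT, GET_RDMA_ACTIVE, GET_DONE, GET_ERROR };
enum get_event { EV_SEND_REQ, EV_ACK_ARRIVED, EV_RDMA_COMPLETE };

/* Advance the state machine by one event; any out-of-order event is an error. */
static enum get_state get_step(enum get_state s, enum get_event e)
{
    switch (s) {
    case GET_IDLE:         /* step 1: SC_GET_REQ goes out */
        return e == EV_SEND_REQ ? GET_REQ_SENT : GET_ERROR;
    case GET_REQ_SENT:     /* step 3: SC_GET_ACK brings the descriptor, start RDMA */
        return e == EV_ACK_ARRIVED ? GET_RDMA_ACTIVE : GET_ERROR;
    case GET_RDMA_ACTIVE:  /* step 4: completion, time to finalize */
        return e == EV_RDMA_COMPLETE ? GET_DONE : GET_ERROR;
    default:               /* GET_DONE and GET_ERROR accept no further events */
        return GET_ERROR;
    }
}
```

The GET_REQ_SENT to GET_RDMA_ACTIVE transition is the get-ack handling in
question: the only thing node A needs from the ack is the descriptor.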
    
    ...but actually, this is more code for no reward.  You could just as
    easily send the GET immediately and handle RDMA in the LNET REPLY
    message i.e....
    
     1. Lustre on node A calls LNetGet(), which calls sc_send().  This
        sends an SC_IMMEDIATE message with the LNET header and 0 payload.
    
     2. The LND on node B receives the SC_IMMEDIATE message and passes it
        to lnet_parse().  sc_recv() is called back with NULL payload
        buffers in response to the LNET GET just parsed (just because LNET
        always calls the recv() callback if lnet_parse() returns success).
    
        LNET on node B also calls sc_send() to send an LNET REPLY header +
        payload.  sc_send() sets up the source buffers ready for
        transmission and sends an SC_PUT_REQ which includes the LNET
        header and the RDMA descriptor for the source buffers.
    
     3. The LND on node A receives the SC_PUT_REQ and passes the LNET
        header to lnet_parse().  When sc_recv() is called with the matched
        buffer iovs, it initiates the RDMA to fetch the REPLY payload.
    
     4. Both sides receive notification somehow when the RDMA completes
        and call lnet_finalize().
    
Ok.  I think I see.  You're relying on B's LNET to call into lnd_recv with the
header (which is interesting) and zero payload (which is not).  B then does
what he would have done if he was the initiator in the first place, namely
issue a PUT and let the rest of the machinery do its thing.

I actually thought about doing that a bit, but wasn't sure it was safe for B
to be doing a PUT involving the lnet_hdr_t on which A was already doing a GET.
If the answer is that that is safe, I think my problem's solved.
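
A minimal sketch of what the simpler scheme implies for the send path,
assuming only the three cases discussed in this thread: a GET carries no
payload and goes out as an immediate message, while PUT and REPLY both take
the same descriptor-carrying request.  The toy_* names are hypothetical and
stand in for whatever the LND would actually use; the sketch says nothing
about whether reusing the header this way is safe.

```c
#include <assert.h>

/* LNET message kinds discussed in this thread (toy names). */
enum toy_lnet_type { TOY_LNET_PUT, TOY_LNET_GET, TOY_LNET_REPLY };

/* Wire message kinds from the simpler scheme (toy names). */
enum toy_wire_type { TOY_SC_IMMEDIATE, TOY_SC_PUT_REQ };

/* Pick the wire message for an outgoing LNET message.
 * GET: header only, zero payload, so no RDMA setup is needed.
 * PUT and REPLY: header plus the RDMA descriptor for the source buffers. */
static enum toy_wire_type toy_pick_wire(enum toy_lnet_type t)
{
    return t == TOY_LNET_GET ? TOY_SC_IMMEDIATE : TOY_SC_PUT_REQ;
}
```

With this mapping there is no separate get-ack message: the REPLY from B
travels the same path as a PUT that B had initiated itself.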

_______________________________________________
Lustre-devel mailing list
Lustre-devel@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
