From: "Eric Barton" <[EMAIL PROTECTED]>
Date: Mon, 4 Dec 2006 17:57:05 -0000

> > Yeah, I see that, but our transport doesn't work that way. I need
> > to have already initialized the transmit side before I can kick the
> > receive side into gear. After that it all happens in hardware, but
> > the cost is that I'm not allowed to start the receive side first.
>
> John,
>
> > So it sounds like the answer is that I *should* implement a
> > different layer of messaging, where the first dispatch is whether or
> > not it's a low-level dma control message vs an lnd message that
> > needs to go through the usual mechanism. That way the control
> > messages (of which I believe "I've set up my transmit, now you start
> > your receive" is the only one) will be handled with a minimum of
> > overhead.
>
> Do you have a special low-latency way of handling messages that the
> short-message channel can't use?

The hardware (actually it's microcode), totally below the direct
control of software, does the signalling necessary to start the
companion transmit on a peer node when a receive is started. I believe
that's what you're asking about?

There is nothing analogous for going the other direction. IOW,
initiating the actual transfer of a bag of bits can only be done from
the receive side, after the transmit side is ready for it.

What this means in practice is that it's cheaper to initiate a transfer
from the transmit side, because the transmitter can set it up, then
send a short message to the receiver, which then starts the receive
operation. To initiate a transfer from the receive side, the receiver
has to basically ask the transmitter to set up his side, get something
back that says that's done and passes the transmit descriptor, then
start up the receive.

> If not, I don't see what this will really buy you.
>
> If the "natural" RDMA on scicortex is RDMA READ, then LNET PUT would
> go something like....
>
> 1. Lustre on node A calls LNetPut(), which calls sc_send() to send an
>    LNET PUT header + payload. sc_send() sets up the source buffers
>    ready for transmission and sends an SC_PUT_REQ which includes the
>    LNET header and the RDMA descriptor for the source buffers.
>
> 2. The LND on node B receives the SC_PUT_REQ and passes the LNET
>    header to lnet_parse(). When sc_recv() is called with the matched
>    buffer iovs, it initiates the RDMA to fetch the PUT payload.
>
> 3. Both sides receive notification somehow when the RDMA completes
>    and call lnet_finalize().

yes...

> ...and an "optimized" LNET GET might go something like...
>
> 1. Lustre on node A calls LNetGet(), which calls sc_send(). This
>    sends an SC_GET_REQ which includes the LNET header.
>
> 2. The LND on node B receives the SC_GET_REQ and passes the LNET
>    header to lnet_parse(). When sc_recv() is called with the matched
>    buffer iovs, it sets up these buffers for transmitting and sends
>    back an SC_GET_ACK containing the RDMA descriptor for them.
>
> 3. The LND on node A receives the SC_GET_ACK and initiates the RDMA
>    to fetch the GET payload.
>
> 4. Both sides receive notification somehow when the RDMA completes
>    and call lnet_finalize().

yes. It was step 2, what you're calling get-ack, which I was originally
asking about, and supposing ought to happen "under the hood", i.e.
without benefit of going through lnet_parse etc.

> ...but actually, this is more code for no reward. You could just as
> easily send the GET immediately and handle RDMA in the LNET REPLY
> message i.e....
>
> 1. Lustre on node A calls LNetGet(), which calls sc_send(). This
>    sends an SC_IMMEDIATE message with the LNET header and 0 payload.
>
> 2. The LND on node B receives the SC_IMMEDIATE message and passes it
>    to lnet_parse(). sc_recv() is called back with NULL payload
>    buffers in response to the LNET GET just parsed (just because LNET
>    always calls the recv() callback if lnet_parse() returns success).
>    LNET on node B also calls sc_send() to send an LNET REPLY header +
>    payload. sc_send() sets up the source buffers ready for
>    transmission and sends an SC_PUT_REQ which includes the LNET
>    header and the RDMA descriptor for the source buffers.
>
> 3. The LND on node A receives the SC_PUT_REQ and passes the LNET
>    header to lnet_parse(). When sc_recv() is called with the matched
>    buffer iovs, it initiates the RDMA to fetch the PUT payload.
>
> 4. Both sides receive notification somehow when the RDMA completes
>    and call lnet_finalize().

Ok. I think I see. You're relying on B's LNET to call into lnd_recv
with the header (which is interesting) and zero payload (which is not).
B then does what he would have done if he was the initiator in the
first place, namely issue a PUT and let the rest of the machinery do
its thing.

I actually thought about doing that a bit, but wasn't sure it was safe
for B to be doing a PUT involving the lnet_hdr_t on which A was already
doing a GET. If the answer is that that is safe, I think my problem's
solved.
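For readers following the last flow (GET sent as an immediate message, data returned as an ordinary PUT request), here is a rough user-space sketch in C. It is only an illustration under stated assumptions: the struct layouts, the `lnet_kind` enum, and the functions `sc_make_put_req()`, `sc_recv_sim()` and `sc_demo_get()` are all hypothetical, the RDMA read is simulated with `memcpy()`, and none of this is the real LND or LNET interface. Only the message type names SC_IMMEDIATE and SC_PUT_REQ come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Wire message types, named after the ones in the thread. */
enum sc_msg_type { SC_IMMEDIATE, SC_PUT_REQ };

/* Hypothetical stand-in for the LNET message kind in the header. */
enum lnet_kind { KIND_GET, KIND_REPLY };

/* Hypothetical RDMA descriptor: just a pointer and a length into the
 * sender's memory, standing in for whatever the hardware really uses. */
struct sc_rdma_desc {
    const void *src;
    size_t      len;
};

struct sc_msg {
    enum sc_msg_type    type;
    enum lnet_kind      kind;  /* stand-in for the LNET header */
    struct sc_rdma_desc desc;  /* meaningful only for SC_PUT_REQ */
};

/* Transmit side: the source buffer is set up first, then the short
 * message that lets the peer start its receive is built. */
struct sc_msg sc_make_put_req(enum lnet_kind kind,
                              const void *buf, size_t len)
{
    struct sc_msg m = { SC_PUT_REQ, kind, { buf, len } };
    return m;
}

/* Receive side: an immediate message carries no bulk data; a PUT
 * request triggers a receiver-initiated transfer (simulated here).
 * Returns the number of bytes placed in dst. */
size_t sc_recv_sim(const struct sc_msg *m, void *dst, size_t dst_len)
{
    size_t n;

    if (m->type == SC_IMMEDIATE)
        return 0;
    n = m->desc.len < dst_len ? m->desc.len : dst_len;
    memcpy(dst, m->desc.src, n);  /* stands in for the RDMA read */
    return n;
}

/* Walk the GET-via-REPLY flow once; returns 0 on success. */
int sc_demo_get(void)
{
    static const char b_data[] = "payload on node B";
    char a_sink[32] = { 0 };
    struct sc_msg get = { SC_IMMEDIATE, KIND_GET, { NULL, 0 } };
    struct sc_msg reply;
    size_t n;

    /* Step 2: B parses the GET; its recv sees zero payload. */
    n = sc_recv_sim(&get, NULL, 0);
    assert(n == 0);

    /* Step 2, continued: B sends the REPLY as a normal PUT request. */
    reply = sc_make_put_req(KIND_REPLY, b_data, sizeof(b_data));

    /* Step 3: A receives the PUT request and pulls the payload. */
    n = sc_recv_sim(&reply, a_sink, sizeof(a_sink));

    return (n == sizeof(b_data) && strcmp(a_sink, b_data) == 0) ? 0 : -1;
}
```

Calling `sc_demo_get()` from a small `main()` exercises the whole exchange. The point of the sketch is that the receive path needs only two cases, immediate and PUT request, so a GET adds no message type of its own.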
_______________________________________________
Lustre-devel mailing list
Lustre-devel@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-devel