> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Chuck Lever
> Sent: Thursday, July 16, 2015 9:46 AM
> To: Jason Gunthorpe
> Cc: Sagi Grimberg; Christoph Hellwig; [email protected]; Steve Wise;
> Or Gerlitz; Oren Duer; Bart Van Assche; Liran Liss; Hefty, Sean;
> Doug Ledford; Tom Talpey
> Subject: Re: Kernel fast memory registration API proposal [RFC]
>
> On Jul 15, 2015, at 6:49 PM, Jason Gunthorpe
> <[email protected]> wrote:
>
> > On Wed, Jul 15, 2015 at 05:25:11PM -0400, Chuck Lever wrote:
> >
> >> NFS READ and WRITE data payloads are mapped with ib_map_phys_mr()
> >> just before the RPC is sent, and those payloads are unmapped
> >> with ib_unmap_fmr() as soon as the client sees the server's RPC
> >> reply.
> >
> > Okay.. but.. ib_unmap_fmr is the thing that sleeps, so you must
> > already have a sleepable context when you call it?
>
> The RPC scheduler operates on the assumption that the processing
> during each step does not sleep.
>
> We're not holding a lock, so a short sleep here works. In general
> this kind of thing can deadlock pretty easily, but right at this
> step I think it's avoiding deadlock "by accident."
>
> For some time, I've been considering deferring ib_unmap_fmr() to
> a work queue, but FMR is operational and is a bit of an antique,
> so I haven't put much effort into bettering it.
>
> The point is, this is not something that should be perpetuated
> into a new API, and certainly the other initiators have a hard
> intolerance for a sleep.
>
> > I was poking around to see how NFS is working (to see how we might fit
> > a different API under here), I didn't find the call to ro_unmap I'd
> > expect? xprt_rdma_free is presumably the place, but how it relates to
> > rpcrdma_reply_handler I could not obviously see. Does the upper layer
> > call back to xprt_rdma_free before any of the RDMA buffers are
> > touched? Can you clear up the call chain for me?
>
> The server performs RDMA READ and WRITE operations, then SENDs the
> RPC reply.
>
> On the client, rpcrdma_recvcq_upcall() is invoked when the RPC
> reply arrives and the RECV completes.
>
> rpcrdma_schedule_tasklet() queues the incoming RPC reply on a
> global list and spanks our reply tasklet.
>
> The tasklet invokes rpcrdma_reply_handler() for each reply on the
> list.
>
> The reply handler parses the incoming reply, looks up the XID and
> matches it to a waiting RPC request (xprt_lookup_rqst). It then
> wakes that request (xprt_complete_rqst). The tasklet goes to the
> next reply on the global list.
>
> The RPC scheduler sees the awoken RPC request and steps the
> finished request through to completion, at which point
> xprt_release() is invoked to retire the request slot.
>
> Here the resources allocated to the RPC request are freed. For
> RPC/RDMA transports, xprt->ops->buf_free is xprt_rdma_free().
> xprt_rdma_free() invokes the ro_unmap method to unmap/invalidate
> the MRs involved with the RPC request.
>
> > Second, the FRWR stuff looks deeply suspicious, it is posting an
> > IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
> > triggers nothing. Handoff to the kernel must be done only after seeing
> > IB_WC_LOCAL_INV, never before.
>
> I don't understand. Our LOCAL_INV is typically unsignalled because
> there's nothing to do in the normal case. frwr_sendcompletion is
> there to handle only flushed sends.
>
> Send queue ordering and the mw_list prevent each MR from being
> reused before it is truly invalidated.
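
A side note on the work-queue idea mentioned above: deferring ib_unmap_fmr()
so that the non-sleeping reply path never blocks might look roughly like the
sketch below. The rpcrdma_fmr_unmap_ctx structure and the
rpcrdma_defer_fmr_unmap() helper are invented for illustration; they are not
part of xprtrdma, and error handling is simplified.

  #include <linux/slab.h>
  #include <linux/list.h>
  #include <linux/workqueue.h>
  #include <rdma/ib_verbs.h>

  struct rpcrdma_fmr_unmap_ctx {
          struct work_struct      work;
          struct list_head        fmr_list;   /* struct ib_fmr list entries */
  };

  /* Runs in process context, where ib_unmap_fmr() is allowed to sleep */
  static void rpcrdma_fmr_unmap_worker(struct work_struct *work)
  {
          struct rpcrdma_fmr_unmap_ctx *ctx =
                  container_of(work, struct rpcrdma_fmr_unmap_ctx, work);

          ib_unmap_fmr(&ctx->fmr_list);
          kfree(ctx);
  }

  /* Called from the (non-sleeping) reply path instead of ib_unmap_fmr() */
  static int rpcrdma_defer_fmr_unmap(struct list_head *fmr_list)
  {
          struct rpcrdma_fmr_unmap_ctx *ctx;

          ctx = kmalloc(sizeof(*ctx), GFP_ATOMIC);
          if (!ctx)
                  return -ENOMEM;

          INIT_WORK(&ctx->work, rpcrdma_fmr_unmap_worker);
          INIT_LIST_HEAD(&ctx->fmr_list);
          list_splice_init(fmr_list, &ctx->fmr_list);
          schedule_work(&ctx->work);
          return 0;
  }

The trade-off, of course, is that pages stay mapped (and the rkey stays
valid) a little longer after the reply arrives, which is part of why a new
API should not bake in a sleeping unmap.
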
>
> > Third, all the unmaps do something like this:
> >
> >  frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
> >  {
> >      invalidate_wr.opcode = IB_WR_LOCAL_INV;
> >      [..]
> >      while (seg1->mr_nsegs--)
> >              rpcrdma_unmap_one(ia->ri_device, seg++);
> >      read_lock(&ia->ri_qplock);
> >      rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
> >
> > That is the wrong order, the DMA unmap of rpcrdma_unmap_one must only
> > be done once the invalidate is complete. For FMR this is ib_unmap_fmr
> > returning, for FRWR it is when you see IB_WC_LOCAL_INV.
>
> I'm assuming you mean the DMA unmap has to be done after LINV
> completes.
>
> I'm not sure it matters here, because when the RPC reply shows
> up at the client, it already means the server isn't going to
> access that MR/rkey again. (If the server does access that MR
> again, it would be a protocol violation.)
>
> Can you provide an example in another kernel ULP?
>
> > Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
> > the SQ? I'm guessing there is some kind of implicit flow control here
> > where the SEND buffer is recycled during RECV of the response, and
> > this limits the SQ usage, then there are guaranteed 3x as many SQEs as
> > SEND buffers to accommodate the REG_MR and INVALIDATE WRs??
>
> RPC/RDMA provides flow control via credits. The peers agree on
> a maximum number of concurrent outstanding RPC requests.
> Typically that is 32, though implementations are increasing that
> default to 128.
>
> There's a comment in frwr_op_open that explains how we calculate
> the maximum number of send queue entries for each credit.
>
> >> These memory regions require an rkey, which is sent in the RPC
> >> call to the server. The server performs RDMA READ or WRITE on
> >> these regions.
> >>
> >> I don't think the server ever uses FMR to register the target
> >> memory regions for RDMA READ and WRITE.
> >
> > What happens if you hit the SGE limit when constructing the RDMA
> > READ/WRITE? Upper layer forbids that? What about iWARP, how do you
> > avoid the 1 SGE limit on RDMA READ?
>
> I'm much less familiar with the server side. Maybe Steve knows,
> but I suspect the RPC/RDMA code is careful to construct more
> READ Work Requests if it runs out of SGEs.
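
To make the ordering Jason describes concrete: a minimal FRWR sketch that
posts the LOCAL_INV signalled and defers the DMA unmap to the send
completion might look like this. The rpcrdma_frwr_ctx structure, the
single-segment simplification, and the handler name are assumptions for the
example, not the actual xprtrdma code.

  #include <linux/string.h>
  #include <linux/dma-mapping.h>
  #include <rdma/ib_verbs.h>

  struct rpcrdma_frwr_ctx {
          struct ib_device        *device;
          u64                     dma_addr;   /* one mapped segment, for brevity */
          u32                     length;
  };

  static int frwr_post_local_inv(struct ib_qp *qp,
                                 struct rpcrdma_frwr_ctx *ctx, u32 rkey)
  {
          struct ib_send_wr invalidate_wr, *bad_wr;

          memset(&invalidate_wr, 0, sizeof(invalidate_wr));
          invalidate_wr.wr_id = (u64)(unsigned long)ctx;
          invalidate_wr.opcode = IB_WR_LOCAL_INV;
          invalidate_wr.ex.invalidate_rkey = rkey;
          invalidate_wr.send_flags = IB_SEND_SIGNALED;  /* we want the CQE */

          /* No DMA unmap here: the HCA may still be touching these pages */
          return ib_post_send(qp, &invalidate_wr, &bad_wr);
  }

  /* Send CQ handler: unmap only once IB_WC_LOCAL_INV has completed */
  static void frwr_local_inv_done(struct ib_wc *wc)
  {
          struct rpcrdma_frwr_ctx *ctx =
                  (struct rpcrdma_frwr_ctx *)(unsigned long)wc->wr_id;

          if (wc->status == IB_WC_SUCCESS && wc->opcode == IB_WC_LOCAL_INV)
                  ib_dma_unmap_page(ctx->device, ctx->dma_addr, ctx->length,
                                    DMA_BIDIRECTIONAL);
          /* flushed or failed invalidates need recovery before MR reuse */
  }
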
The server chunks it up based on the device limits and issues a series of
RDMA READs as required. See rdma_read_chunk_frmr() and rdma_read_chunks(),
which calls rdma_read_chunk_frmr() via xprt->sc_reader.
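
A rough sketch of that chunking loop, using the old-style struct ib_send_wr
rdma union of this era and assuming a plain (non-FRMR) READ path bounded by
the device's max_sge; the function name and the always-signalled policy are
illustrative, not svcrdma's actual code:

  #include <linux/kernel.h>
  #include <linux/string.h>
  #include <rdma/ib_verbs.h>

  static int post_chunked_rdma_reads(struct ib_qp *qp, u32 rkey,
                                     u64 remote_addr, struct ib_sge *sges,
                                     int nsges, int max_sge)
  {
          struct ib_send_wr wr, *bad_wr;
          int posted = 0;

          while (posted < nsges) {
                  int this_nsge = min(max_sge, nsges - posted);
                  u32 len = 0;
                  int i, rc;

                  memset(&wr, 0, sizeof(wr));
                  wr.opcode = IB_WR_RDMA_READ;
                  wr.send_flags = IB_SEND_SIGNALED;
                  wr.sg_list = &sges[posted];
                  wr.num_sge = this_nsge;
                  wr.wr.rdma.rkey = rkey;
                  wr.wr.rdma.remote_addr = remote_addr;

                  rc = ib_post_send(qp, &wr, &bad_wr);
                  if (rc)
                          return rc;

                  /* advance the remote offset past what this READ covers */
                  for (i = 0; i < this_nsge; i++)
                          len += sges[posted + i].length;
                  remote_addr += len;
                  posted += this_nsge;
          }
          return 0;
  }

On iWARP the same loop would be limited to one SGE per READ, with each READ
targeting an FRMR-registered sink buffer, which is what the
rdma_read_chunk_frmr() reader handles.
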
