Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Mon, 2008-05-26 at 08:07 -0500, Steve Wise wrote:
> Roland Dreier wrote:
> > >  - device-specific alloc/free of physical buffer lists for use in
> > >    fast register work requests. This allows devices to allocate this
> > >    memory as needed (like via dma_alloc_coherent).
> >
> > I'm looking at how one would implement the MM extensions for mlx4, and
> > it turns out that in addition to needing to allocate these fastreg
> > page lists in coherent memory, mlx4 is even going to need to write to
> > the memory (basically set the lsb of each address for internal device
> > reasons).
> >
> > So I think we just need to update the documentation of the interface
> > so that not only does the page list belong to the device driver
> > between posting the fastreg work request and completing the request,
> > but also the device driver is allowed to change the page list as part
> > of the work request processing.
> >
> > I don't see any real reason why this would cause problems for
> > consumers; does this seem OK to other people?
>
> Tom, does this affect how you plan to implement NFSRDMA
> MEM_MGT_EXTENSIONS support?

I think this is ok.

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
At 11:33 AM 5/27/2008, Tom Tucker wrote:
> So I think from an NFSRDMA coding perspective it's a wash...

Just to be clear, you're talking about the NFS/RDMA server. However,
it's pretty much a wash on the client, for different reasons.

> When posting the WR, we check the fastreg capabilities bit + transport
> type bit:
>         If fastreg is true -- post FastReg
>                 If iWARP (or with a cap bit read-with-inv-flag)
>                         post rdma read w/ invalidate
>         ...
>
> For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv,
> etc...

Maybe I'm confused, but I don't understand this. iWARP RDMA Read
requests don't support remote invalidate. At least, the table in
RFC 5040 (p. 22) doesn't:

   -------+------------+-------+------+-------+-----------+-------------
    RDMA  | Message    | Tagged| STag | Queue | Invalidate| Message
   Message| Type       | Flag  | and  | Number| STag      | Length
   OpCode |            |       | TO   |       |           | Communicated
          |            |       |      |       |           | between DDP
          |            |       |      |       |           | and RDMAP
   -------+------------+-------+------+-------+-----------+-------------
    0000b | RDMA Write | 1     | Valid| N/A   | N/A       | Yes
   -------+------------+-------+------+-------+-----------+-------------
    0001b | RDMA Read  | 0     | N/A  | 1     | N/A       | Yes
          | Request    |       |      |       |           |
   -------+------------+-------+------+-------+-----------+-------------
    0010b | RDMA Read  | 1     | Valid| N/A   | N/A       | Yes
          | Response   |       |      |       |           |
   -------+------------+-------+------+-------+-----------+-------------
    0011b | Send       | 0     | N/A  | 0     | N/A       | Yes
   -------+------------+-------+------+-------+-----------+-------------
    0100b | Send with  | 0     | N/A  | 0     | Valid     | Yes
          | Invalidate |       |      |       |           |
   -------+------------+-------+------+-------+-----------+-------------
    0101b | Send with  | 0     | N/A  | 0     | N/A       | Yes
          | SE         |       |      |       |           |
   -------+------------+-------+------+-------+-----------+-------------
    0110b | Send with  | 0     | N/A  | 0     | Valid     | Yes
          | SE and     |       |      |       |           |
          | Invalidate |       |      |       |           |
   -------+------------+-------+------+-------+-----------+-------------
    0111b | Terminate  | 0     | N/A  | 2     | N/A       | Yes
   -------+------------+-------+------+-------+-----------+-------------
    1000b |            |
     to   | Reserved   | Not Specified
    1111b |            |
   -------+------------+---------------------------------

I want to take this opportunity to also mention that the RPC/RDMA
client-server exchange does not support remote-invalidate currently.
Because of the multiple stags supported by the rpcrdma chunking header,
and because the client needs to verify that the stags were in fact
invalidated, there is significant overhead, and the jury is out on that
benefit. In fact, I suspect it's a loss at the client.

Tom (Talpey).
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote:
> On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote:
> > > The invalidate local stag part of a read is just a local sink side
> > > operation (ie no wire protocol change from a read). It's not like
> > > processing an ingress send-with-inv. It is really functionally like
> > > a read followed immediately by a fenced invalidate-local, but it
> > > doesn't stall the pipe. So the device has to remember the read is a
> > > with-inv-local-stag and invalidate the stag after the read response
> > > is placed and before the WCE is reaped by the application.
> >
> > Yes, understood. My point was just that in IB, at least in theory, one
> > could just use an L_Key that doesn't have any remote permissions in
> > the scatter list of an RDMA read, while in iWARP, the STag used to
> > place an RDMA read response has to have remote write permission. So
> > RDMA read with invalidate makes sense for iWARP, because it gives a
> > race-free way to allow an STag to be invalidated immediately after an
> > RDMA read response is placed, while in IB it's simpler just to never
> > give remote access at all.
>
> So I think from an NFSRDMA coding perspective it's a wash...
>
> When creating the local data sink, we need to check the transport type.
> If it's IB -- only local access; if it's iWARP -- local + remote
> access.
>
> When posting the WR, we check the fastreg capabilities bit + transport
> type bit:
>
>         If fastreg is true
>                 Post FastReg
>                 If iWARP (or with a cap bit read-with-inv-flag)
>                         post rdma read w/ invalidate
>                 else /* IB */
>                         post rdma read

Steve pointed out a good optimization here. Instead of fencing the RDMA
READ here in advance of the INVALIDATE, we should post the INVALIDATE
when the READ WR completes. This will avoid stalling the SQ. Since IB
doesn't put the LKEY on the wire, there's no security issue to close. We
need to keep a bunch of fastreg MRs around anyway for concurrent RPCs.
Thoughts?

Tom

>                         post invalidate
>                 fi
>         else
>                 ... today's logic
>         fi
>
> I make the observation, however, that the transport type is now
> overloaded with a set of required verbs. For iWARP's case, this means
> rdma-read-w-inv, plus rdma-send-w-inv, etc... This also means that new
> transport types will inherit one or the other set of verbs (IB or
> iWARP).
>
> Tom

> > - R.
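The branching sketched in Tom's pseudocode above can be modeled as a
small, self-contained decision function. This is only an illustrative
sketch: the enum, function name, and WR labels below are invented for
the example and are not the proposed kernel API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the WR-posting decision discussed above. */
enum transport { TRANSPORT_IB, TRANSPORT_IWARP };

/* Fill 'seq' with the sequence of work requests the ULP would post for
 * an RDMA read of a client chunk; return the number of WRs. */
static int wr_sequence(enum transport t, int have_fastreg,
                       const char *seq[4])
{
        int n = 0;

        if (!have_fastreg) {
                seq[n++] = "dma_mr read";       /* today's logic */
                return n;
        }
        seq[n++] = "fastreg";
        if (t == TRANSPORT_IWARP) {
                /* read-with-invalidate: the device invalidates the stag
                 * race-free after the read response is placed */
                seq[n++] = "rdma read w/ invalidate";
        } else {
                seq[n++] = "rdma read";
                /* per the optimization above, post the invalidate when
                 * the read completes rather than fencing it */
                seq[n++] = "invalidate";
        }
        return n;
}
```

Note how the iWARP path needs one fewer work request on the SQ, which
is exactly why the thread treats read-with-invalidate as an iWARP win.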
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
> At 11:33 AM 5/27/2008, Tom Tucker wrote:
> > So I think from an NFSRDMA coding perspective it's a wash...
>
> Just to be clear, you're talking about the NFS/RDMA server. However,
> it's pretty much a wash on the client, for different reasons.

Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg? On the performance side we are
limited by the min size of the read/write-chunk element. If the client
still gives the server a 4k chunk, the performance benefit (fewer PDUs
on the wire) goes away.

Tom

> [remainder of quoted message, including the RFC 5040 table, trimmed]
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Tom Tucker wrote:
> On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
> > At 11:33 AM 5/27/2008, Tom Tucker wrote:
> > > So I think from an NFSRDMA coding perspective it's a wash...
> >
> > Just to be clear, you're talking about the NFS/RDMA server. However,
> > it's pretty much a wash on the client, for different reasons.
>
> Tom:
>
> What client side memory registration strategy do you recommend if the
> default on the server side is fastreg? On the performance side we are
> limited by the min size of the read/write-chunk element. If the client
> still gives the server a 4k chunk, the performance benefit (fewer PDUs
> on the wire) goes away.
>
> Tom

I would hope that dma_mr usage will be replaced with fast_reg on both
the client and the server.

> [remainder of quoted message, including the RFC 5040 table, trimmed]
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
At 02:58 PM 5/27/2008, Tom Tucker wrote:
> What client side memory registration strategy do you recommend if the
> default on the server side is fastreg?

Whatever is fastest and safest. Given that the client and server won't
necessarily be using the same hardware, nor the same kernel for that
matter, I don't think we can or should legislate it.

That said, I am hopeful that fastreg does turn out to be fast and
therefore will become the only logical choice for the NFS/RDMA Linux
client. But the future Linux client is only one such system. I cannot
speak for others.

Tom.

> On the performance side we are limited by the min size of the
> read/write-chunk element. If the client still gives the server a 4k
> chunk, the performance benefit (fewer PDUs on the wire) goes away.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote:
> After discussing the rkey renew and fencing with send/rdma ops, I am
> quite clear on how all this plugs well into ULPs such as SCSI or FS
> low-level (interconnect) initiator/target drivers, specifically those
> that use a transactional protocol. A few more points to clarify (sorry
> if it became somewhat long):
>
> * Do we want it to be a must for a consumer to invalidate a fast-reg
>   mr before reusing it? If yes, how?

The verbs specs mandate that the mr be in the invalid state when the
fast-reg work request is processed. So I think that means yes. And the
consumer invalidates it via the INVALIDATE_MR work request.

> * If remote invalidation is supported, when the peer is done with the
>   mr, it sends the response in send-with-invalidate fashion and saves
>   the mapper side from doing a local invalidate. For the case of the
>   mapping produced by a SCSI initiator or FS client, when remote
>   invalidation is not supported, I don't see how a local invalidate
>   design can be made in a pipelined manner - since from the network
>   perspective the I/O is done and the target response is at your
>   hands, but until doing mr invalidation the pages are typically not
>   returned to the upper layer, so the ULP has to stall till the
>   invalidation WR is completed? I don't say it's a bug or a big issue,
>   just wondering what your thoughts are regarding this point.

I guess that's why they invented send-with-inv and read-with-inv-local.

> * Talking about remote invalidation, I understand that it requires
>   support on both sides (and hence has to be negotiated), so the
>   IB_DEVICE_SEND_W_INV device capability says that a device can
>   send-with-invalidate; do we need an IB_DEVICE_RECV_W_INV cap as
>   well?
>
> * What about ZBVA - is it orthogonal to these calls, such that no
>   enhancement of the suggested API is needed even if zbva is used, or
>   the other way around, would it work also when zbva is not used?
>
> Or
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
> * Do we want it to be a must for a consumer to invalidate a fast-reg
>   mr before reusing it? If yes, how?

The verbs specs go into exhaustive detail about the state diagram for
validity of MRs.

> * Talking about remote invalidation, I understand that it requires
>   support on both sides (and hence has to be negotiated), so the
>   IB_DEVICE_SEND_W_INV device capability says that a device can
>   send-with-invalidate; do we need an IB_DEVICE_RECV_W_INV cap as
>   well?

I think we decided that all of these related features will be indicated
by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability
bits.

> * What about ZBVA - is it orthogonal to these calls, such that no
>   enhancement of the suggested API is needed even if zbva is used, or
>   the other way around, would it work also when zbva is not used?

ZBVA would require adding some flag to request ZBVA when registering.

 - R.
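A consumer-side capability check along the lines Roland describes could
look like the sketch below. The flag value here is a mock defined for
the example, not taken from ib_verbs.h; the real bit assignment belongs
to the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Mocked capability bit -- stands in for the real
 * IB_DEVICE_MEM_MGT_EXTENSIONS flag from ib_verbs.h; the value is
 * illustrative only. */
#define MOCK_DEVICE_MEM_MGT_EXTENSIONS  (1u << 21)

/* A ULP would test the device capability flags once at setup and fall
 * back to its existing registration strategy if the bit is absent. */
static int use_fastreg(uint32_t device_cap_flags)
{
        return (device_cap_flags & MOCK_DEVICE_MEM_MGT_EXTENSIONS) != 0;
}
```

A single capability bit keeps the ULP-side test to one branch, which is
the "avoid an explosion of capability bits" point made above.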
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
> So something like this?

yeah, looks reasonable...

> static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey)
> {
>         /* iWARP: rkey == lkey */

actually I need to reread the IB spec and understand how the consumer
key part of L_Key and R_Key is supposed to work... for Mellanox
adapters at least the L_Key and R_Key are the same too.

>         if (mr->rkey == mr->lkey)
>                 mr->lkey = (mr->lkey & 0xffffff00) | newkey;
>         mr->rkey = (mr->rkey & 0xffffff00) | newkey;
> }
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
> static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey)
> {
>         /* iWARP: rkey == lkey */
>         if (mr->rkey == mr->lkey)
>                 mr->lkey = (mr->lkey & 0xffffff00) | newkey;
>         mr->rkey = (mr->rkey & 0xffffff00) | newkey;
> }

I just looked in the IB spec (1.2.1) and it talks about passing the key
to use on the new L_Key and R_Key into a fastreg work request. So I
think we can just drop the test for rkey == lkey and do

        mr->lkey = (mr->lkey & 0xffffff00) | newkey;
        mr->rkey = (mr->rkey & 0xffffff00) | newkey;

 - R.
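To make the semantics of the helper concrete, here is the same bit
manipulation compiled against a mocked-up two-field stand-in for struct
ib_mr (the real struct lives in the kernel's ib_verbs.h). The mask
keeps the 24 device-assigned high bits from MR allocation and swaps in
the consumer-owned low byte:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct ib_mr: just the two key
 * fields relevant to this discussion. */
struct mock_ib_mr {
        uint32_t lkey;
        uint32_t rkey;
};

/* Keep the upper 24 bits (assigned by the device when the MR was
 * allocated) and replace the consumer-owned low byte with newkey. */
static void update_fast_reg_key(struct mock_ib_mr *mr, uint8_t newkey)
{
        mr->lkey = (mr->lkey & 0xffffff00) | newkey;
        mr->rkey = (mr->rkey & 0xffffff00) | newkey;
}
```

Updating both keys unconditionally, as Roland concludes, works for both
the iWARP rkey == lkey case and the IB case where they may differ.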
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
 - device-specific alloc/free of physical buffer lists for use in fast
   register work requests. This allows devices to allocate this memory
   as needed (like via dma_alloc_coherent).

I'm looking at how one would implement the MM extensions for mlx4, and
it turns out that in addition to needing to allocate these fastreg page
lists in coherent memory, mlx4 is even going to need to write to the
memory (basically set the lsb of each address for internal device
reasons).

So I think we just need to update the documentation of the interface so
that not only does the page list belong to the device driver between
posting the fastreg work request and completing the request, but also
the device driver is allowed to change the page list as part of the
work request processing.

I don't see any real reason why this would cause problems for
consumers; does this seem OK to other people?
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote:
> Usage Model:
> - MR made VALID and bound to a specific page list via
>   ib_post_send(IB_WR_FAST_REG_MR)
> - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)

Hi Steve, Roland,

After discussing the rkey renew and fencing with send/rdma ops, I am
quite clear on how all this plugs well into ULPs such as SCSI or FS
low-level (interconnect) initiator/target drivers, specifically those
that use a transactional protocol. A few more points to clarify (sorry
if it became somewhat long):

* Do we want it to be a must for a consumer to invalidate a fast-reg mr
  before reusing it? If yes, how?

* If remote invalidation is supported, when the peer is done with the
  mr, it sends the response in send-with-invalidate fashion and saves
  the mapper side from doing a local invalidate. For the case of the
  mapping produced by a SCSI initiator or FS client, when remote
  invalidation is not supported, I don't see how a local invalidate
  design can be made in a pipelined manner - since from the network
  perspective the I/O is done and the target response is at your hands,
  but until doing mr invalidation the pages are typically not returned
  to the upper layer, so the ULP has to stall till the invalidation WR
  is completed? I don't say it's a bug or a big issue, just wondering
  what your thoughts are regarding this point.

* Talking about remote invalidation, I understand that it requires
  support on both sides (and hence has to be negotiated), so the
  IB_DEVICE_SEND_W_INV device capability says that a device can
  send-with-invalidate; do we need an IB_DEVICE_RECV_W_INV cap as well?

* What about ZBVA - is it orthogonal to these calls, such that no
  enhancement of the suggested API is needed even if zbva is used, or
  the other way around, would it work also when zbva is not used?

Or
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
> And then the provider updates the mr->rkey field as part of WR
> processing?

Yeah, I guess so.

Actually thinking about it, another possibility would be to wrap up the

        newrkey = (mr->rkey & 0xffffff00) | newkey;

operation in a little inline helper function so people don't screw it
up. Maybe that's the cleanest way to do it. (We would probably want the
helper for low-level driver use anyway.)

 - R.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
> Actually thinking about it, another possibility would be to wrap up
> the
>
>         newrkey = (mr->rkey & 0xffffff00) | newkey;
>
> operation in a little inline helper function so people don't screw it
> up. Maybe that's the cleanest way to do it.

If we add a key field to the work request, then it seems too easy for a
consumer to forget to set it and end up passing uninitialized garbage.
If the consumer has to explicitly update the key when posting the work
request then that failure is avoided.

HOWEVER -- if we have the consumer update the key when posting the
operation, then there is the problem of what happens when the consumer
posts multiple fastreg work requests at once (ie fastreg, local inval,
new fastreg, etc. in a pipelined way). Does the low-level driver just
take the key value given when the WR is posted, even if there's a new
value there by the time the WR is executed?

 - R.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Roland Dreier wrote:
> HOWEVER -- if we have the consumer update the key when posting the
> operation, then there is the problem of what happens when the consumer
> posts multiple fastreg work requests at once (ie fastreg, local inval,
> new fastreg, etc. in a pipelined way). Does the low-level driver just
> take the key value given when the WR is posted, even if there's a new
> value there by the time the WR is executed?

I would have to say yes. And it makes sense, I think. Say the rkey is
0x010203XX. Then a pipeline could look like:

        fastreg                      (mr->rkey is 0x01020301)
        rdma read                    (mr->rkey is 0x01020301)
        invalidate local with fence  (mr->rkey is 0x01020301)
        fastreg                      (mr->rkey is 0x01020302)
        rdma read                    (sink mr->rkey is 0x01020302)
        invalidate local with fence  (mr->rkey is 0x01020302)

So the consumer is using the correct mr->rkey at all times even though
the rnic is possibly processing the previous generation (that was
copied into a fastreg WR at an earlier point in time) at the same time
as the app is registering the next generation of the rkey.

Steve.
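Steve's generation sequence can be checked mechanically with a small
simulation (plain C with a mocked helper, not kernel code): each
fastreg/read/invalidate round bumps only the consumer key byte, while
the 24-bit index from allocation stays fixed. One consequence worth
noting is that the 8-bit key wraps after 256 generations.

```c
#include <assert.h>
#include <stdint.h>

/* Mocked rkey-update helper: the 24-bit index is fixed at allocation,
 * only the consumer key byte changes per generation. */
static uint32_t next_rkey(uint32_t rkey, uint8_t newkey)
{
        return (rkey & 0xffffff00) | newkey;
}

/* Simulate the generations in the pipeline above: each round bumps the
 * key byte. Because the key is only 8 bits, it wraps modulo 256, so a
 * generation can only be told apart from the previous 255. */
static uint32_t simulate_generations(uint32_t rkey, int rounds)
{
        uint8_t key = (uint8_t)rkey;
        int i;

        for (i = 0; i < rounds; i++)
                rkey = next_rkey(rkey, ++key);
        return rkey;
}
```

The wrap-around means a stale WR holding a key from 256 generations ago
would carry the same rkey value again, which is exactly why the
consumer copies the current value into each WR at post time.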
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Dror Goldenberg wrote:
> When you post a fast register WQE, you specify the new 8 LSBits to be
> assigned to the MR. The remaining 24 MSBits are the ones that you
> obtained while allocating the MR, and they persist throughout the
> lifetime of this MR.

OK, thanks Dror. Steve, do we agree on this point? If yes, the next
version of the patches should include the new rkey value (or just the
new 8 LSBits) in the work request.

Or.
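Dror's split can be expressed directly: the key is a 24-bit
device-assigned index in the high bits plus the 8-bit consumer key in
the low byte. A quick sanity check of that decomposition (plain C,
nothing kernel-specific; the function names are invented for the
sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Decompose an rkey per the 24/8 split described above. */
static uint32_t rkey_index(uint32_t rkey)        { return rkey >> 8; }
static uint8_t  rkey_consumer_key(uint32_t rkey) { return (uint8_t)rkey; }

/* Recombine: the index persists for the MR's lifetime; the consumer
 * key is supplied anew in each fast register WQE. */
static uint32_t rkey_make(uint32_t index, uint8_t key)
{
        return (index << 8) | key;
}
```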
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote:
> Dror Goldenberg wrote:
> > When you post a fast register WQE, you specify the new 8 LSBits to
> > be assigned to the MR. The remaining 24 MSBits are the ones that you
> > obtained while allocating the MR, and they persist throughout the
> > lifetime of this MR.
>
> OK, thanks Dror. Steve, do we agree on this point? If yes, the next
> version of the patches should include the new rkey value (or just the
> new 8 LSBits) in the work request.
>
> Or.

Are we sure we need to expose this to the user?
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote:
> Steve Wise wrote:
> > So you allocate the rkey/stag up front, allocate page_lists up
> > front, then as needed you populate your page list and bind it to the
> > rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via
> > IB_WR_INVALIDATE_MR. You can do this any number of times, and with
> > proper fencing, you can pipeline these mappings. Eventually when
> > you're done doing IO (like for NFSRDMA when the mount is unmounted)
> > you free up the page list(s) and mr/rkey/stag.
>
> Yes, that was my thought as well. Just to make sure, by proper fencing
> your understanding is that for both IB and iWARP the ULP should not
> wait for the fmr work request to complete, and should post the send
> work-request carrying the rkey/stag with the IB_SEND_FENCE flag?
> Looking in the IB spec, it seems that the fence indicator only applies
> to previous rdma-read / atomic operations, e.g. in section 11.4.1.1
> POST SEND REQUEST it says:
>
>         Fence indicator. If the fence indicator is set, then all prior
>         RDMA Read and Atomic Work Requests on the queue must be
>         completed before starting to process this Work Request.

The fast register and invalidate work requests require that they be
completed by the device _before_ processing any subsequent work
requests. So you can post subsequent SEND WRs that utilize the rkey
without problems.

In addition, invalidate allows a local fence, which means the device
will not begin processing the invalidate until all _prior_ work
requests complete (similar to a read fence, but for all prior WRs).

Steve.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote:
> Are we sure we need to expose this to the user?

I believe this is the way to go if we want to let smart ULPs generate a
new rkey/stag per mapping. Simpler ULPs could then just put the same
value for each map associated with the same mr.

Or.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote:
> Steve Wise wrote:
> > Are we sure we need to expose this to the user?
>
> I believe this is the way to go if we want to let smart ULPs generate
> a new rkey/stag per mapping. Simpler ULPs could then just put the same
> value for each map associated with the same mr.
>
> Or.

Roland, what do you think?

I'm ok with adding this.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote:
> My point is that if you do the mapping at allocation time, then the
> failure will happen when you allocate the page list vs when you post
> the send WR. Maybe it doesn't matter, but the idea, I think, is to not
> fail post_send for lack of resources. Everything should be
> pre-allocated pretty much by the time you post work requests...

Fair enough. I understand we are requiring that a page list can be
reused without being freed; just make sure it's documented.

Or.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote: Support for the IB BMME and iWARP equivalent memory extensions ... Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) AFAIK, the idea was to let the ulp post --two-- work requests, where the first creates the mapping and the second sends this mapping to the remote side, such that the second does not start before the first completes (i.e a fence). Now, the above scheme means that the ulp knows the value of the rkey/stag at the time of posting these two work requests (since it has to encode it in the second one), so something has to be clarified re the rkey/stag here, do they change each time this MR is used? how many bits can be changed, etc. The ULP knows the rkey/stag because its returned up front in the ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue which we haven't exposed yet to the ULP). The same rkey/stag can be used for multiple mappings. It can be made invalid at any point in time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the same rkey/stag advertised is not a risk. I understand that this (same rkey/stag used for all mapping produced for a specific mr) is what you are proposing, I still think there's a chance that by the spec and (not less important!) by existing HW support, its possible to have a different rkey/stag per mapping done on an mr, for example the IB spec uses a consumer owned key portion of the L_Key notation which makes me think there should be a way to have different rkey per mapping, Roland? Dror? 
10.7.2.6 FAST REGISTER PHYSICAL MR The Fast Register Physical MR Operation is allowed on Non-Shared Physical Memory Regions that were created with a Consumer owned key portion of the L_Key, and any associated R_Key Or
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote: So you allocate the rkey/stag up front, allocate page_lists up front, then as needed you populate your page list and bind it to the rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via IB_WR_INVALIDATE_MR. You can do this any number of times, and with proper fencing, you can pipeline these mappings. Eventually when you're done doing IO (like for NFSRDMA when the filesystem is unmounted) you free up the page list(s) and mr/rkey/stag. Yes, that was my thought as well. Just to make sure, by "proper fencing" your understanding is that for both IB and iWARP the ULP should not wait for the fastreg work request to complete, but rather post the send work request carrying the rkey/stag with the IB_SEND_FENCE flag? Looking in the IB spec, it seems that the fence indicator only applies to previous rdma-read / atomic operations; e.g. in section 11.4.1.1 POST SEND REQUEST it says: Fence indicator. If the fence indicator is set, then all prior RDMA Read and Atomic Work Requests on the queue must be completed before starting to process this Work Request. Talking of usage, do you plan to patch the mainline nfs-rdma code to use these verbs? Yes. Tom Tucker will be doing this. Jon Mason is implementing RDS changes to utilize this too. The hope is all this makes 2.6.27/ofed-1.4. I can also post test code (krping module) if anyone is interested. I'm developing that now. Posting this code would be very helpful (also to the discussion, I think), thanks. Or.
RE: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
-----Original Message----- From: Or Gerlitz [mailto:[EMAIL PROTECTED] Sent: Wednesday, May 21, 2008 12:25 PM To: Steve Wise Cc: [EMAIL PROTECTED]; general@lists.openfabrics.org; Dror Goldenberg Subject: Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support Steve Wise wrote: Support for the IB BMME and iWARP equivalent memory extensions ... Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) AFAIK, the idea was to let the ulp post --two-- work requests, where the first creates the mapping and the second sends this mapping to the remote side, such that the second does not start before the first completes (i.e. a fence). Now, the above scheme means that the ulp knows the value of the rkey/stag at the time of posting these two work requests (since it has to encode it in the second one), so something has to be clarified re the rkey/stag here: do they change each time this MR is used? How many bits can be changed, etc.? The ULP knows the rkey/stag because it's returned up front by ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue, which we haven't exposed yet to the ULP). The same rkey/stag can be used for multiple mappings. It can be made invalid at any point in time via IB_WR_INVALIDATE_MR, so the fact that you're leaving the same rkey/stag advertised is not a risk. I understand that this (same rkey/stag used for all mappings produced for a specific mr) is what you are proposing; I still think there's a chance that, by the spec and (not less important!) by existing HW support, it's possible to have a different rkey/stag per mapping done on an mr. For example, the IB spec uses a "consumer owned key portion of the L_Key" notation, which makes me think there should be a way to have a different rkey per mapping. Roland? Dror? 
[dg] When you post a fast register WQE, you specify the new 8 LSBits to be assigned to the MR. The remaining 24 MSBits are the ones that you obtained while allocating the MR, and they persist throughout the lifetime of this MR.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote: dma mapping would work too but then handling the map/unmap becomes an issue. I think it is way too complicated to add new verbs for map/unmap fastreg page list (in addition to the alloc/free fastreg page list that we are already adding) and force the consumer to do it. And if we expect the low-level driver to do it, then the map is easy (can be done while posting the send) but the unmap is a pain -- it would have to be done inside poll_cq when reaping the completion, and the low-level driver would have to keep some complicated extra data structure to go back from the completion to the original fast reg page list structure. And certain platforms can fail map requests (like PPC64) because they have limited resources for dma mapping. So then you'd fail a SQ work request when you might not want to... I see the point in allocating the page lists in dma consistent memory to make the mechanics of letting the HCA DMA the list easier and simpler, as I think Roland is suggesting in his post. However, I am not sure I understand how this helps in the PPC64 case; if the HCA does DMA to fetch the list, then IOMMU slots have to be consumed one way or another, correct? Or.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote: Support for the IB BMME and iWARP equivalent memory extensions to non shared memory regions. Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) Steve, I am trying to further understand what would be a real-life ULP design here, and I think there are some more issues to clarify/define for the case of a ULP which has to create a mapping for a list of pages and send this mapping (e.g. an IB rkey / iWARP stag) to a remote party that uses it for RDMA. AFAIK, the idea was to let the ulp post --two-- work requests, where the first creates the mapping and the second sends this mapping to the remote side, such that the second does not start before the first completes (i.e. a fence). Now, the above scheme means that the ulp knows the value of the rkey/stag at the time of posting these two work requests (since it has to encode it in the second one), so something has to be clarified re the rkey/stag here: do they change each time this MR is used? How many bits can be changed, etc.? I guess my questions are to some extent RTFM ones, but, first, with some quick looking in the IB spec I did not manage to get enough answers (pointers appreciated...) and second, you are proposing an implementation here, so I think it makes sense to review the actual usage model to see that all aspects needed for ULPs are covered... Talking of usage, do you plan to patch the mainline nfs-rdma code to use these verbs? Or. - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list()
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote: Steve Wise wrote: Support for the IB BMME and iWARP equivalent memory extensions to non shared memory regions. Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) Steve, I am trying to further understand what would be a real-life ULP design here, and I think there are some more issues to clarify/define for the case of a ULP which has to create a mapping for a list of pages and send this mapping (e.g. an IB rkey / iWARP stag) to a remote party that uses it for RDMA. AFAIK, the idea was to let the ulp post --two-- work requests, where the first creates the mapping and the second sends this mapping to the remote side, such that the second does not start before the first completes (i.e. a fence). Now, the above scheme means that the ulp knows the value of the rkey/stag at the time of posting these two work requests (since it has to encode it in the second one), so something has to be clarified re the rkey/stag here: do they change each time this MR is used? How many bits can be changed, etc.? The ULP knows the rkey/stag because it's returned up front by ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue, which we haven't exposed yet to the ULP). The same rkey/stag can be used for multiple mappings. It can be made invalid at any point in time via IB_WR_INVALIDATE_MR, so the fact that you're leaving the same rkey/stag advertised is not a risk. So you allocate the rkey/stag up front, allocate page_lists up front, then as needed you populate your page list and bind it to the rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via IB_WR_INVALIDATE_MR. You can do this any number of times, and with proper fencing, you can pipeline these mappings. 
Eventually when you're done doing IO (like for NFSRDMA when the filesystem is unmounted) you free up the page list(s) and mr/rkey/stag. So NFSRDMA will keep these fast_reg_mrs and page_list structs pre-allocated and hung off some context so that per RPC, they can be bound/registered, the IO executed, and then the MR invalidated as part of processing the RPC. I guess my questions are to some extent RTFM ones, but, first, with some quick looking in the IB spec I did not manage to get enough answers (pointers appreciated...) and second, you are proposing an implementation here, so I think it makes sense to review the actual usage model to see that all aspects needed for ULPs are covered... Talking of usage, do you plan to patch the mainline nfs-rdma code to use these verbs? Yes. Tom Tucker will be doing this. Jon Mason is implementing RDS changes to utilize this too. The hope is all this makes 2.6.27/ofed-1.4. I can also post test code (krping module) if anyone is interested. I'm developing that now. Steve.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Roland Dreier wrote: Yes, the point of this verb is that the low-level driver owns the page list from when the fast register work request is posted until it completes. This should be explicitly documented somewhere. OK, got it, so this is a different case compared to the SG elements, which are not owned by the driver once the posting call returns. However the reason for having the low-level driver implement it is so that all strange device-specific issues can be taken care of in the driver. For instance mlx4 is going to require that the page list be aligned to 64 bytes, and will DMA from the memory, so we need to use dma_alloc_consistent(). On the other hand cxgb3 is just going to copy in software, so kmalloc is sufficient. I see. Just wondering, in the mlx4 case, is it a must to use dma consistent memory allocation or would dma mapping work too? Or.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote: Support for the IB BMME and iWARP equivalent memory extensions to non shared memory regions. Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Steve, Does this design go hand-in-hand with remote invalidation, such that if the remote side invalidated the mapping there is no need to issue the IB_WR_INVALIDATE_MR work request? Also, does the proposed design support fmr pages of granularity different than the OS ones? For example, the OS pages are 4K and the ULP wants to use fmr of 512-byte pages (the block lists feature), etc. In that case, doesn't the size of each page have to be specified as a param to the alloc_fast_reg_mr() verb? Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. mmm, is it a must for the ULP to issue page list alloc/free per IB_WR_FAST_REG_MR call?

--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -676,6 +683,20 @@ struct ib_send_wr {
 			u16	pkey_index;	/* valid for GSI only */
 			u8	port_num;	/* valid for DR SMPs on switch only */
 		} ud;
+		struct {
+			u64				iova_start;
+			struct ib_mr			*mr;
+			struct ib_fast_reg_page_list	*page_list;
+			unsigned int			page_size;
+			unsigned int			page_list_len;
+			unsigned int			first_byte_offset;
+			u32				length;
+			int				access_flags;
+
+		} fast_reg;
+		struct {
+			struct ib_mr			*mr;
+		} local_inv;
 	} wr;
 };

I suggest to use a page_shift notation and not page_size to comply with the kernel semantics of other APIs. Or. 
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote: Steve Wise wrote: Support for the IB BMME and iWARP equivalent memory extensions to non shared memory regions. Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Steve, Does this design go hand-in-hand with remote invalidation, such that if the remote side invalidated the mapping there is no need to issue the IB_WR_INVALIDATE_MR work request? Yes. Also, does the proposed design support fmr pages of granularity different than the OS ones? For example, the OS pages are 4K and the ULP wants to use fmr of 512-byte pages (the block lists feature), etc. In that case, doesn't the size of each page have to be specified as a param to the alloc_fast_reg_mr() verb? Page size is passed in at registration time. At allocation time, the HW only needs to know what the max page list length (or PBL depth) will ever be so it can pre-allocate that at alloc time. The actual page list length, the page size of each entry in the page list, as well as the page list itself, are passed in via the post_send(IB_WR_FAST_REG_MR) work request. See the fast_reg union in struct ib_send_wr. Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. mmm, is it a must for the ULP to issue page list alloc/free per IB_WR_FAST_REG_MR call? No, they can be reused as needed. They typically will only get allocated once, used many times, then freed when the application is done. 
My point in the text above was that an application could allocate N page lists and use them in a pipeline for the same fast reg mr by fencing things appropriately in the SQ.

--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -676,6 +683,20 @@ struct ib_send_wr {
 			u16	pkey_index;	/* valid for GSI only */
 			u8	port_num;	/* valid for DR SMPs on switch only */
 		} ud;
+		struct {
+			u64				iova_start;
+			struct ib_mr			*mr;
+			struct ib_fast_reg_page_list	*page_list;
+			unsigned int			page_size;
+			unsigned int			page_list_len;
+			unsigned int			first_byte_offset;
+			u32				length;
+			int				access_flags;
+
+		} fast_reg;
+		struct {
+			struct ib_mr			*mr;
+		} local_inv;
 	} wr;
 };

I suggest to use a page_shift notation and not page_size to comply with the kernel semantics of other APIs. Ok, I wondered about that. It will also ensure a power of two. Steve.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Roland Dreier wrote: If ownership can be assumed, I suggest to have the core use the implementation of these two verbs as you did that for the Chelsio driver in case the HW driver did not implement it (i.e instead of returning ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM device (I think...) since the device is going to do DMA to read the page list, and the free_list verb should do DMA unmapping, etc. Yes, the point of this verb is that the low-level driver owns the page list from when the fast register work request is posted until it completes. This should be explicitly documented somewhere. I've added it to the comments for ib_alloc_fast_reg_page_list() as per Ralph Campbell's suggestion. However the reason for having the low-level driver implement it is so that all strange device-specific issues can be taken care of in the driver. For instance mlx4 is going to require that the page list be aligned to 64 bytes, and will DMA from the memory, so we need to use dma_alloc_consistent(). On the other hand cxgb3 is just going to copy in software, so kmalloc is sufficient. - R.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
At 09:58 AM 5/19/2008, Steve Wise wrote: Storage has been known to adopt non ^2 blocks, for instance including block checksums in sectors, etc. If transferred, these will become quite inefficient on ^2 hardware. Is this true today for any of the existing RDMA ULPs that will utilize fastreg? Ask the iSER and SRP folks. NFS won't. Tom.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Talpey, Thomas wrote: At 09:40 AM 5/19/2008, Steve Wise wrote: I suggest to use a "page_shift" notation and not "page_size" to comply with the kernel semantics of other APIs. Ok, I wondered about that. It will also ensure a power of two. Does it have to be ^2? In the iWARP spec development, we envisioned the possibility of arbitrary page sizes. I don't recall any such dependency in the protocol architecture. I didn't add block mode support since it's not available anywhere in the Linux RDMA API. I'd rather _not_ introduce that at this point. Storage has been known to adopt non ^2 blocks, for instance including block checksums in sectors, etc. If transferred, these will become quite inefficient on ^2 hardware. Is this true today for any of the existing RDMA ULPs that will utilize fastreg? Steve.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
I see. Just wondering, in the mlx4 case, is it a must to use dma consistent memory allocation or would dma mapping work too? dma mapping would work too but then handling the map/unmap becomes an issue. I think it is way too complicated to add new verbs for map/unmap fastreg page list (in addition to the alloc/free fastreg page list that we are already adding) and force the consumer to do it. And if we expect the low-level driver to do it, then the map is easy (can be done while posting the send) but the unmap is a pain -- it would have to be done inside poll_cq when reaping the completion, and the low-level driver would have to keep some complicated extra data structure to go back from the completion to the original fast reg page list structure.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Roland Dreier wrote: I see. Just wondering, in the mlx4 case, is it a must to use dma consistent memory allocation or would dma mapping work too? dma mapping would work too but then handling the map/unmap becomes an issue. I think it is way too complicated to add new verbs for map/unmap fastreg page list (in addition to the alloc/free fastreg page list that we are already adding) and force the consumer to do it. And if we expect the low-level driver to do it, then the map is easy (can be done while posting the send) but the unmap is a pain -- it would have to be done inside poll_cq when reaping the completion, and the low-level driver would have to keep some complicated extra data structure to go back from the completion to the original fast reg page list structure. And certain platforms can fail map requests (like PPC64) because they have limited resources for dma mapping. So then you'd fail a SQ work request when you might not want to... Steve.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Steve Wise wrote: - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent). Steve, Reading through the suggested API / patches and the previous threads, I was not sure whether the HW driver may assume that it has ownership of the page --list-- structure until the registration work request is completed - or not. Now, if ownership can not be assumed (e.g. as for the SG list elements pointed to by send/recv WRs), the driver has to clone it anyway, and thus I don't see the need for the ib_alloc/free_fast_reg_page_list verbs. If ownership can be assumed, I suggest to have the core use the implementation of these two verbs as you did for the Chelsio driver in case the HW driver did not implement it (i.e. instead of returning ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM device (I think...) since the device is going to do DMA to read the page list, and the free_list verb should do DMA unmapping, etc. Or.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Or Gerlitz wrote: Steve Wise wrote: - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent). Steve, Reading through the suggested API / patches and the previous threads, I was not sure whether the HW driver may assume that it has ownership of the page --list-- structure until the registration work request is completed - or not. Yes, the driver owns the page list structure until the WR completes (i.e. is reaped by the consumer via poll_cq()). Now, if ownership can not be assumed (e.g. as for the SG list elements pointed to by send/recv WRs), the driver has to clone it anyway, and thus I don't see the need for the ib_alloc/free_fast_reg_page_list verbs. If ownership can be assumed, I suggest to have the core use the implementation of these two verbs as you did for the Chelsio driver in case the HW driver did not implement it (i.e. instead of returning ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM device (I think...) since the device is going to do DMA to read the page list, and the free_list verb should do DMA unmapping, etc. Some devices don't need DMA mappings at all (chelsio for instance). The idea of a device-specific method was so the device could allocate a bigger structure to hold its own context info. So a core service that sets up DMA, in my opinion, isn't really useful. Steve.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
If ownership can be assumed, I suggest to have the core use the implementation of these two verbs as you did that for the Chelsio driver in case the HW driver did not implement it (i.e instead of returning ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM device (I think...) since the device is going to do DMA to read the page list, and the free_list verb should do DMA unmapping, etc. Yes, the point of this verb is that the low-level driver owns the page list from when the fast register work request is posted until it completes. This should be explicitly documented somewhere. However the reason for having the low-level driver implement it is so that all strange device-specific issues can be taken care of in the driver. For instance mlx4 is going to require that the page list be aligned to 64 bytes, and will DMA from the memory, so we need to use dma_alloc_consistent(). On the other hand cxgb3 is just going to copy in software, so kmalloc is sufficient. - R.
[ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
Support for the IB BMME and iWARP equivalent memory extensions to non shared memory regions. This includes:

- allocation of an ib_mr for use in fast register work requests
- device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent).
- fast register memory region work request
- invalidate local memory region work request
- read with invalidate local memory region work request (iWARP only)

Design details:

- New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature.
- New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request.
- New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr.
- New API function, ib_alloc_mr() used to allocate fast_reg memory regions.
- New API function, ib_alloc_fast_reg_page_list to allocate device-specific page lists.
- New API function, ib_free_fast_reg_page_list to free said page lists.

Usage Model:

- MR allocated with ib_alloc_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)
- MR deallocated with ib_dereg_mr()
- page lists deallocated via ib_free_fast_reg_page_list().

Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. 
Signed-off-by: Steve Wise [EMAIL PROTECTED]
---
 drivers/infiniband/core/verbs.c |   46 ++++++++++++++++++++++
 include/rdma/ib_verbs.h         |   56 +++++++++++++++++++++++++
 2 files changed, 102 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 0504208..0a334b4 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr)
 }
 EXPORT_SYMBOL(ib_dereg_mr);
 
+struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len)
+{
+	struct ib_mr *mr;
+
+	if (!pd->device->alloc_fast_reg_mr)
+		return ERR_PTR(-ENOSYS);
+
+	mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len);
+
+	if (!IS_ERR(mr)) {
+		mr->device  = pd->device;
+		mr->pd      = pd;
+		mr->uobject = NULL;
+		atomic_inc(&pd->usecnt);
+		atomic_set(&mr->usecnt, 0);
+	}
+
+	return mr;
+}
+EXPORT_SYMBOL(ib_alloc_fast_reg_mr);
+
+struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
+	struct ib_device *device, int max_page_list_len)
+{
+	struct ib_fast_reg_page_list *page_list;
+
+	if (!device->alloc_fast_reg_page_list)
+		return ERR_PTR(-ENOSYS);
+
+	page_list = device->alloc_fast_reg_page_list(device, max_page_list_len);
+
+	if (!IS_ERR(page_list)) {
+		page_list->device = device;
+		page_list->max_page_list_len = max_page_list_len;
+	}
+
+	return page_list;
+}
+EXPORT_SYMBOL(ib_alloc_fast_reg_page_list);
+
+void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
+{
+	page_list->device->free_fast_reg_page_list(page_list);
+}
+EXPORT_SYMBOL(ib_free_fast_reg_page_list);
+
 /* Memory windows */
 
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 911a661..c4ace0f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -106,6 +106,7 @@ enum ib_device_cap_flags {
 	IB_DEVICE_UD_IP_CSUM		= (1<<18),
 	IB_DEVICE_UD_TSO		= (1<<19),
 	IB_DEVICE_SEND_W_INV		= (1<<21),
+	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<22),
 };
 
 enum ib_atomic_cap {
@@ -151,6 +152,7 @@ struct ib_device_attr {
 	int			max_srq;
 	int			max_srq_wr;
 	int			max_srq_sge;
+	unsigned int		max_fast_reg_page_list_len;
 	u16			max_pkeys;
 	u8			local_ca_ack_delay;
 };
@@ -414,6 +416,8 @@ enum ib_wc_opcode {
 	IB_WC_FETCH_ADD,
 	IB_WC_BIND_MW,
 	IB_WC_LSO,
+	IB_WC_FAST_REG_MR,
+	IB_WC_INVALIDATE_MR,
 	/*
 	 * Set value of IB_WC_RECV so consumers can test if a completion is a
 	 * receive by testing (opcode & IB_WC_RECV).
@@ -628,6 +632,9 @@ enum ib_wr_opcode {
 	IB_WR_ATOMIC_FETCH_AND_ADD,
 	IB_WR_LSO,
 	IB_WR_SEND_WITH_INV,
+	IB_WR_FAST_REG_MR,
+	IB_WR_INVALIDATE_MR,
+	IB_WR_READ_WITH_INV,
 };
 
 enum ib_send_flags {
@@ -676,6 +683,20 @@ struct ib_send_wr {