Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-06-12 Thread Tom Tucker

On Mon, 2008-05-26 at 08:07 -0500, Steve Wise wrote:
 
 
 Roland Dreier wrote:
- device-specific alloc/free of physical buffer lists for use in fast
register work requests.  This allows devices to allocate this memory as
needed (like via dma_alloc_coherent).
  
  I'm looking at how one would implement the MM extensions for mlx4, and
  it turns out that in addition to needing to allocate these fastreg page
  lists in coherent memory, mlx4 is even going to need to write to the
  memory (basically set the lsb of each address for internal device
  reasons).  So I think we just need to update the documentation of the
  interface so that not only does the page list belong to the device
  driver between posting the fastreg work request and completing the
  request, but also the device driver is allowed to change the page list
  as part of the work request processing.
  
  I don't see any real reason why this would cause problems for consumers;
  does this seem OK to other people?
 
 Tom,
 
 Does this affect how you plan to implement NFSRDMA MEM_MGT_EXTENSIONS
 support?

I think this is ok. 


___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Talpey, Thomas
At 11:33 AM 5/27/2008, Tom Tucker wrote:
So I think from an NFSRDMA coding perspective it's a wash...

Just to be clear, you're talking about the NFS/RDMA server. However, it's
pretty much a wash on the client, for different reasons.

When posting the WR, we check the fastreg capabilities bit + transport
type bit:
If fastreg is true --
   Post FastReg
   If iWARP (or with a cap bit read-with-inv-flag)
   post rdma read w/ invalidate

... For iWARP's case, this means rdma-read-w-inv,
plus rdma-send-w-inv, etc... 


Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests
don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't:



   -------+-----------+-------+------+-------+-----------+--------------
   RDMA   | Message   | Tagged| STag | Queue | Invalidate| Message
   Message| Type      | Flag  | and  | Number| STag      | Length
   OpCode |           |       | TO   |       |           | Communicated
          |           |       |      |       |           | between DDP
          |           |       |      |       |           | and RDMAP
   -------+-----------+-------+------+-------+-----------+--------------
   0000b  | RDMA Write| 1     | Valid| N/A   | N/A       | Yes
          |           |       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0001b  | RDMA Read | 0     | N/A  | 1     | N/A       | Yes
          | Request   |       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0010b  | RDMA Read | 1     | Valid| N/A   | N/A       | Yes
          | Response  |       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0011b  | Send      | 0     | N/A  | 0     | N/A       | Yes
          |           |       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0100b  | Send with | 0     | N/A  | 0     | Valid     | Yes
          | Invalidate|       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0101b  | Send with | 0     | N/A  | 0     | N/A       | Yes
          | SE        |       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0110b  | Send with | 0     | N/A  | 0     | Valid     | Yes
          | SE and    |       |      |       |           |
          | Invalidate|       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   0111b  | Terminate | 0     | N/A  | 2     | N/A       | Yes
          |           |       |      |       |           |
   -------+-----------+-------+------+-------+-----------+--------------
   1000b  |           |
   to     | Reserved  |   Not Specified
   1111b  |           |
   -------+-----------+-------------------------------------------------



I want to take this opportunity to also mention that the RPC/RDMA
client-server exchange does not currently support remote-invalidate.
Because of the multiple stags supported by the rpcrdma chunking header,
and because the client needs to verify that the stags were in fact
invalidated, there is significant overhead, and the jury is out on the
benefit. In fact, I suspect it's a loss at the client.

Tom (Talpey).  



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Tom Tucker

On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote:
 On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote:
   The invalidate local stag part of a read is just a local sink-side
operation (i.e. no wire protocol change from a read).  It's not like
processing an ingress send-with-inv.  It is really functionally like a
read followed immediately by a fenced invalidate-local, but it doesn't
stall the pipe.  So the device has to remember that the read is a
with-inv-local-stag read and invalidate the stag after the read response
is placed and before the WCE is reaped by the application.
  
  Yes, understood.  My point was just that in IB, at least in theory, one
  could just use an L_Key that doesn't have any remote permissions in the
  scatter list of an RDMA read, while in iWARP, the STag used to place an
  RDMA read response has to have remote write permission.  So RDMA read
  with invalidate makes sense for iWARP, because it gives a race-free way
  to allow an STag to be invalidated immediately after an RDMA read
  response is placed, while in IB it's simpler just to never give remote
  access at all.
  
 
 So I think from an NFSRDMA coding perspective it's a wash...
 
 When creating the local data sink, we need to check the transport type.
 
 If it's IB -- only local access,
 if it's iWARP -- local + remote access.
 
 When posting the WR, we check the fastreg capabilities bit + transport
 type bit:
 If fastreg is true --
   Post FastReg
   If iWARP (or with a cap bit read-with-inv-flag)
   post rdma read w/ invalidate
   else /* IB */
   post rdma read

Steve pointed out a good optimization here. Instead of fencing the RDMA
READ here in advance of the INVALIDATE, we should post the INVALIDATE
when the READ WR completes. This will avoid stalling the SQ. Since IB
doesn't put the LKEY on the wire, there's no security issue to close. We
need to keep a bunch of fastreg MRs around anyway for concurrent RPCs.

Thoughts?
Tom

   post invalidate
   fi
 else
   ... today's logic
 fi
 
 I make the observation, however, that the transport type is now overloaded
 with a set of required verbs. For iWARP's case, this means rdma-read-w-inv,
 plus rdma-send-w-inv, etc... This also means that new transport types will
 inherit one or the other set of verbs (IB or iWARP).
 
 Tom
 
 
   - R.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Tom Tucker

On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
 At 11:33 AM 5/27/2008, Tom Tucker wrote:
 So I think from an NFSRDMA coding perspective it's a wash...
 
 Just to be clear, you're talking about the NFS/RDMA server. However, it's
 pretty much a wash on the client, for different reasons.
 
Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg?

On the performance side we are limited by the min size of the
read/write-chunk element. If the client still gives the server a 4k
chunk, the performance benefit (fewer PDU on the wire) goes away.

Tom

 When posting the WR, We check the fastreg capabilities bit + transport 
 type bit:
 If fastreg is true --
Post FastReg
If iWARP (or with a cap bit read-with-inv-flag)
post rdma read w/ invalidate
 
 ... For iWARP's case, this means rdma-read-w-inv,
 plus rdma-send-w-inv, etc... 
 
 
 Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests
 don't support remote invalidate. At least, the table in RFC5040 (p.22) 
 doesn't:
 
 
 
 [RFC 5040 opcode table quoted here; snipped]
 
 
 
 I want to take this opportunity to also mention that the RPC/RDMA
 client-server exchange does not currently support remote-invalidate.
 Because of the multiple stags supported by the rpcrdma chunking header,
 and because the client needs to verify that the stags were in fact
 invalidated, there is significant overhead, and the jury is out on the
 benefit. In fact, I suspect it's a loss at the client.
 
 Tom (Talpey).  
 


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Steve Wise

Tom Tucker wrote:

On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
  

At 11:33 AM 5/27/2008, Tom Tucker wrote:


So I think from an NFSRDMA coding perspective it's a wash...
  

Just to be clear, you're talking about the NFS/RDMA server. However, it's
pretty much a wash on the client, for different reasons.



Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg?

On the performance side we are limited by the min size of the
read/write-chunk element. If the client still gives the server a 4k
chunk, the performance benefit (fewer PDU on the wire) goes away.

Tom

  


I would hope that dma_mr usage will be replaced with fast_reg on both 
the client and the server. 

When posting the WR, We check the fastreg capabilities bit + transport 
type bit:

If fastreg is true --
  Post FastReg
  If iWARP (or with a cap bit read-with-inv-flag)
  post rdma read w/ invalidate
  
... For iWARP's case, this means rdma-read-w-inv,
plus rdma-send-w-inv, etc... 
  

Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests
don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't:



   [RFC 5040 opcode table quoted here; snipped]



I want to take this opportunity to also mention that the RPC/RDMA
client-server exchange does not currently support remote-invalidate.
Because of the multiple stags supported by the rpcrdma chunking header,
and because the client needs to verify that the stags were in fact
invalidated, there is significant overhead, and the jury is out on the
benefit. In fact, I suspect it's a loss at the client.

Tom (Talpey).  




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-27 Thread Talpey, Thomas
At 02:58 PM 5/27/2008, Tom Tucker wrote:

On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
 At 11:33 AM 5/27/2008, Tom Tucker wrote:
 So I think from an NFSRDMA coding perspective it's a wash...
 
 Just to be clear, you're talking about the NFS/RDMA server. However, it's
 pretty much a wash on the client, for different reasons.
 
Tom:

What client side memory registration strategy do you recommend if the
default on the server side is fastreg?

Whatever is fastest and safest. Given that the client and server won't
necessarily be using the same hardware, nor the same kernel for that
matter, I don't think we can or should legislate it.

That said, I am hopeful that fastreg does turn out to be fast and
therefore will become the only logical choice for the NFS/RDMA Linux
client. But the future Linux client is only one such system. I cannot
speak for others.

Tom.


On the performance side we are limited by the min size of the
read/write-chunk element. If the client still gives the server a 4k
chunk, the performance benefit (fewer PDU on the wire) goes away.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-25 Thread Steve Wise



Or Gerlitz wrote:

After discussing the rkey renewal and fencing with send/rdma ops, it's
quite clear to me how all this plugs into ULPs such as SCSI or FS
low-level (interconnect) initiator/target drivers, specifically those
that use a transactional protocol. A few more points to clarify (sorry
if this got somewhat long):


* Do we want to make it a must for a consumer to invalidate a fast-reg
mr before reusing it? If yes, how?


The verbs specs mandate that the mr be in the invalid state when the 
fast-reg work request is processed.  So I think that means yes.  And the 
consumer invalidates it via the INVALIDATE_MR work request.




* If remote invalidation is supported, then when the peer is done with
the mr, it sends the response send-with-invalidate and saves the mapper
side from doing a local invalidate. For the case of a mapping produced
by a SCSI initiator or FS client, when remote invalidation is not
supported, I don't see how a local-invalidate design can be made
pipelined: from the network perspective the I/O is done and the target
response is at hand, but until the mr is invalidated the pages are
typically not returned to the upper layer, so the ULP has to stall until
the invalidation WR completes. I'm not saying it's a bug or a big issue,
just wondering what your thoughts are on this point.




I guess that's why they invented send-with-inv, and read-with-inv-local.

* Talking about remote invalidation, I understand that it requires
support on both sides (and hence has to be negotiated). The
IB_DEVICE_SEND_W_INV device capability says that a device can
send-with-invalidate; do we need an IB_DEVICE_RECV_W_INV cap as well?


* What about ZBVA: is it orthogonal to these calls, so that no
enhancement of the suggested API is needed even if ZBVA is used, and it
would also work when ZBVA is not used?


Or



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-25 Thread Roland Dreier
  * Do we want it to be a must for a consumer to invalidate a fast-reg
  mr before reusing it? if yes, how?

The verbs specs go into exhaustive detail about the state diagram for
validity of MRs.

  * talking about remote invalidation, I understand that it requires
  support of both sides (and hence has to be negotiated), so the
  IB_DEVICE_SEND_W_INV device capability says that a device can
  send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well?

I think we decided that all of these related features will be indicated
by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits.

  * what about ZBVA, is it orthogonal to these calls, no enhancement of
  the suggested API is needed even if zbva is used, or the other way, it
  would work also when zbva is not used?

ZBVA would require adding some flag to request ZBVA when registering.

 - R.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-25 Thread Roland Dreier
  So something like this?

yeah, looks reasonable...

  static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey)
  {
   /* iWARP: rkey == lkey */

actually I need to reread the IB spec and understand how the consumer
key part of L_Key and R_Key is supposed to work... for Mellanox adapters
at least the L_Key and R_Key are the same too.

   if (mr->rkey == mr->lkey)
           mr->lkey = (mr->lkey & 0xffffff00) | newkey;
   mr->rkey = (mr->rkey & 0xffffff00) | newkey;
  }


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-25 Thread Roland Dreier
static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey)
{
  /* iWARP: rkey == lkey */
  
  actually I need to reread the IB spec and understand how the consumer
  key part of L_Key and R_Key is supposed to work... for Mellanox adapters
  at least the L_Key and R_Key are the same too.
  
   if (mr->rkey == mr->lkey)
           mr->lkey = (mr->lkey & 0xffffff00) | newkey;
   mr->rkey = (mr->rkey & 0xffffff00) | newkey;
}

I just looked in the IB spec (1.2.1) and it talks about passing the Key
to use on the new L_Key and R_Key into a fastreg work request.

So I think we can just drop the test for rkey == lkey and do

mr->lkey = (mr->lkey & 0xffffff00) | newkey;
mr->rkey = (mr->rkey & 0xffffff00) | newkey;

 - R.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-25 Thread Roland Dreier
  - device-specific alloc/free of physical buffer lists for use in fast
  register work requests.  This allows devices to allocate this memory as
  needed (like via dma_alloc_coherent).

I'm looking at how one would implement the MM extensions for mlx4, and
it turns out that in addition to needing to allocate these fastreg page
lists in coherent memory, mlx4 is even going to need to write to the
memory (basically set the lsb of each address for internal device
reasons).  So I think we just need to update the documentation of the
interface so that not only does the page list belong to the device
driver between posting the fastreg work request and completing the
request, but also the device driver is allowed to change the page list
as part of the work request processing.

I don't see any real reason why this would cause problems for consumers;
does this seem OK to other people?


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-24 Thread Or Gerlitz

Steve Wise wrote:

Usage Model:
- MR made VALID and bound to a specific page list via 
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)

Hi Steve, Roland,

After discussing the rkey renewal and fencing with send/rdma ops, it's
quite clear to me how all this plugs into ULPs such as SCSI or FS
low-level (interconnect) initiator/target drivers, specifically those
that use a transactional protocol. A few more points to clarify (sorry
if this got somewhat long):


* Do we want to make it a must for a consumer to invalidate a fast-reg
mr before reusing it? If yes, how?


* If remote invalidation is supported, then when the peer is done with
the mr, it sends the response send-with-invalidate and saves the mapper
side from doing a local invalidate. For the case of a mapping produced
by a SCSI initiator or FS client, when remote invalidation is not
supported, I don't see how a local-invalidate design can be made
pipelined: from the network perspective the I/O is done and the target
response is at hand, but until the mr is invalidated the pages are
typically not returned to the upper layer, so the ULP has to stall until
the invalidation WR completes. I'm not saying it's a bug or a big issue,
just wondering what your thoughts are on this point.


* Talking about remote invalidation, I understand that it requires
support on both sides (and hence has to be negotiated). The
IB_DEVICE_SEND_W_INV device capability says that a device can
send-with-invalidate; do we need an IB_DEVICE_RECV_W_INV cap as well?


* What about ZBVA: is it orthogonal to these calls, so that no
enhancement of the suggested API is needed even if ZBVA is used, and it
would also work when ZBVA is not used?


Or



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-23 Thread Roland Dreier
  And then the provider updates the mr->rkey field as part of WR processing?

Yeah, I guess so.

Actually thinking about it, another possibility would be to wrap up the

  newrkey = (mr->rkey & 0xffffff00) | newkey;

operation in a little inline helper function so people don't screw it
up.  Maybe that's the cleanest way to do it.

(We would probably want the helper for low-level driver use anyway)

 - R.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-23 Thread Roland Dreier
  Actually thinking about it, another possibility would be to wrap up the

  newrkey = (mr->rkey & 0xffffff00) | newkey;

  operation in a little inline helper function so people don't screw it
  up.  Maybe that's the cleanest way to do it.

If we add a key field to the work request, then it seems too easy for
a consumer to forget to set it and end up passing uninitialized garbage.
If the consumer has to explicitly update the key when posting the work
request then that failure is avoided.

HOWEVER -- if we have the consumer update the key when posting the
operation, then there is the problem of what happens when the consumer
posts multiple fastreg work requests at once (ie fastreg, local inval,
new fastreg, etc. in a pipelined way).  Does the low-level driver just
take the key value given when the WR is posted, even if there's a
new value there by the time the WR is executed?

 - R.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-23 Thread Steve Wise



Roland Dreier wrote:

  Actually thinking about it, another possibility would be to wrap up the

  newrkey = (mr->rkey & 0xffffff00) | newkey;

  operation in a little inline helper function so people don't screw it
  up.  Maybe that's the cleanest way to do it.

If we add a key field to the work request, then it seems too easy for
a consumer to forget to set it and end up passing uninitialized garbage.
If the consumer has to explicitly update the key when posting the work
request then that failure is avoided.

HOWEVER -- if we have the consumer update the key when posting the
operation, then there is the problem of what happens when the consumer
posts multiple fastreg work requests at once (ie fastreg, local inval,
new fastreg, etc. in a pipelined way).  Does the low-level driver just
take the key value given when the WR is posted, even if there's a
new value there by the time the WR is executed?



I would have to say yes.  And it makes sense, I think.

Say the rkey is 0x010203XX.  Then a pipeline could look like:

fastreg (mr->rkey is 0x01020301)
rdma read (mr->rkey is 0x01020301)
invalidate local with fence (mr->rkey is 0x01020301)
fastreg (mr->rkey is 0x01020302)
rdma read (sink mr->rkey is 0x01020302)
invalidate local with fence (mr->rkey is 0x01020302)

So the consumer is using the correct mr->rkey at all times, even though
the rnic may be processing the previous generation (copied into a
fastreg WR at an earlier point in time) at the same time as the app is
registering the next generation of the rkey.


Steve.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-22 Thread Or Gerlitz

Dror Goldenberg wrote:

When you post a fast register WQE, you specify the new 8 LSBits to
be assigned to the MR. The remaining 24 MSBits are the ones you obtained
while allocating the MR, and they persist throughout the lifetime of
this MR.

OK, thanks Dror.

Steve, do we agree on this point? If yes, the next version of the
patches should include the new rkey value (or just the new 8 LSBits) in
the work request.


Or.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-22 Thread Steve Wise



Or Gerlitz wrote:

Dror Goldenberg wrote:

When you post a fast register WQE, you specify the new 8 LSBits to
be assigned to the MR. The remaining 24 MSBits are the ones you obtained
while allocating the MR, and they persist throughout the lifetime of
this MR.

OK, thanks Dror.

Steve, do we agree on this point? if yes, the next version of the 
patches should include the new rkey value (or just the new 8 LSbits) in 
the work request.




Are we sure we need to expose this to the user?


Or.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-22 Thread Steve Wise



Or Gerlitz wrote:

Steve Wise wrote:
So you allocate the rkey/stag up front, allocate page_lists up front, 
then as needed you populate your page list and bind it to the 
rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via 
IB_WR_INVALIDATE_MR.  You can do this any number of times, and with 
proper fencing, you can pipeline these mappings.   Eventually when 
you're done doing IO (like for NFSRDMA when the mount is unmounted) 
you free up the page list(s) and mr/rkey/stag.

Yes, that was my thought as well.

Just to make sure: by proper fencing, your understanding is that for
both IB and iWARP the ULP should not wait for the fmr work request to
complete, but instead post the send work-request carrying the rkey/stag
with the IB_SEND_FENCE flag?


Looking in the IB spec, it seems that the fence indicator only applies
to previous rdma-read / atomic operations, e.g. in section 11.4.1.1 POST
SEND REQUEST it says:
Fence indicator. If the fence indicator is set, then all prior RDMA 
Read and Atomic Work Requests on the queue must be completed before 
starting to process this Work Request.




The fast register and invalidate work requests require that they be 
completed by the device _before_ processing any subsequent work 
requests.  So you can post subsequent SEND WRs that utilize the rkey 
without problems.


In addition, invalidate allows a local fence, which means the device
will not begin processing the invalidate until all _prior_ work requests
complete (similar to a read fence, but for all prior WRs).


Steve.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-22 Thread Or Gerlitz

Steve Wise wrote:

Are we sure we need to expose this to the user?
I believe this is the way to go if we want to let smart ULPs generate 
new rkey/stag per mapping. Simpler ULPs could then just put the same 
value for each map associated with the same mr.


Or.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-22 Thread Steve Wise



Or Gerlitz wrote:

Steve Wise wrote:

Are we sure we need to expose this to the user?
I believe this is the way to go if we want to let smart ULPs generate 
new rkey/stag per mapping. Simpler ULPs could then just put the same 
value for each map associated with the same mr.


Or.



Roland, what do you think?  I'm ok with adding this.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-21 Thread Or Gerlitz

Steve Wise wrote:
My point is that if you do the mapping at allocation time, then the 
failure will happen when you allocate the page list vs when you post 
the send WR.  Maybe it doesn't matter, but the idea, I think, is to 
not fail post_send for lack of resources.  Everything should be 
pre-allocated pretty much by the time you post work requests...
fair enough. I understand we are requiring that a page list can be 
reused without being freed; just make sure it's documented.


Or.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-21 Thread Or Gerlitz

Steve Wise wrote:
Support for the IB BMME and iWARP equivalent memory extensions ... 
Usage Model:

- MR allocated with ib_alloc_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID & bound to a specific page list via 
ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via 
ib_post_send(IB_WR_INVALIDATE_MR)
AFAIK, the idea was to let the ulp post --two-- work requests, where 
the first creates the mapping and the second sends this mapping to 
the remote side, such that the second does not start before the first 
completes (i.e a fence).


Now, the above scheme means that the ulp knows the value of the 
rkey/stag at the time of posting these two work requests (since it 
has to encode it in the second one), so something has to be clarified 
re the rkey/stag here, do they change each time this MR is used? how 
many bits can be changed, etc.


The ULP knows the rkey/stag because it's returned up front by 
ib_alloc_fast_reg_mr().  And it doesn't change (ignoring the key issue, 
which we haven't exposed yet to the ULP).  The same rkey/stag can be 
used for multiple mappings.  It can be made invalid at any point in 
time via IB_WR_INVALIDATE_MR, so the fact that you're leaving the 
same rkey/stag advertised is not a risk.
I understand that this (same rkey/stag used for all mappings produced for 
a specific mr) is what you are proposing. I still think there's a chance 
that, by the spec and (not less important!) by existing HW support, it's 
possible to have a different rkey/stag per mapping done on an mr; for 
example the IB spec uses a consumer owned key portion of the L_Key 
notation, which makes me think there should be a way to have a different 
rkey per mapping. Roland? Dror?

10.7.2.6 FAST REGISTER PHYSICAL MR
The Fast Register Physical MR Operation is allowed on Non-Shared 
Physical Memory Regions that were created with a Consumer owned key 
portion of the L_Key, and any associated R_Key

Or




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-21 Thread Or Gerlitz

Steve Wise wrote:
So you allocate the rkey/stag up front, allocate page_lists up front, 
then as needed you populate your page list and bind it to the 
rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via 
IB_WR_INVALIDATE_MR.  You can do this any number of times, and with 
proper fencing, you can pipeline these mappings.   Eventually when 
you're done doing IO (like for NFSRDMA when the mount is unmounted) 
you free up the page list(s) and mr/rkey/stag.

Yes, that was my thought as well.

Just to make sure: by proper fencing, your understanding is that for 
both IB and iWARP the ULP should not wait for the fast-reg work request 
to complete, but should instead post the send work request carrying the 
rkey/stag with the IB_SEND_FENCE flag?


Looking at the IB spec, it seems that the fence indicator only applies 
to previous rdma-read / atomic operations; e.g. in section 11.4.1.1 POST 
SEND REQUEST it says:
Fence indicator. If the fence indicator is set, then all prior RDMA 
Read and Atomic Work Requests on the queue must be completed before 
starting to process this Work Request.


Talking on usage, do you plan to patch the mainline nfs-rdma code to 
use these verbs?
Yes.  Tom Tucker will be doing this.  Jon Mason is implementing RDS 
changes to utilize this too.  The hope is all this makes 2.6.27/ofed-1.4.


I can also post test code (krping module) if anyone is interested.  
I'm developing that now.


Posting this code would be very helpful (also to the discussion, I 
think), thanks.


Or.






RE: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-21 Thread Dror Goldenberg
 

-Original Message-
From: Or Gerlitz [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 21, 2008 12:25 PM
To: Steve Wise
Cc: [EMAIL PROTECTED]; general@lists.openfabrics.org; Dror Goldenberg
Subject: Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:
MEM_MGT_EXTENSIONS support

Steve Wise wrote:
 Support for the IB BMME and iWARP equivalent memory extensions ... 
 Usage Model:
 - MR allocated with ib_alloc_mr()
 - Page lists allocated via ib_alloc_fast_reg_page_list().
 MR made VALID & bound to a specific page list via
 ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via
 ib_post_send(IB_WR_INVALIDATE_MR)
 AFAIK, the idea was to let the ulp post --two-- work requests, where 
 the first creates the mapping and the second sends this mapping to 
 the remote side, such that the second does not start before the first

 completes (i.e a fence).

 Now, the above scheme means that the ulp knows the value of the 
 rkey/stag at the time of posting these two work requests (since it 
 has to encode it in the second one), so something has to be clarified

 re the rkey/stag here, do they change each time this MR is used? how 
 many bits can be changed, etc.

 The ULP knows the rkey/stag because its returned up front in the 
 ib_alloc_fast_reg_mr().  And it doesn't change (ignoring the key issue

 which we haven't exposed yet to the ULP).  The same rkey/stag can be 
 used for multiple mappings.  It can be made invalid at any point in 
 time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the 
 same rkey/stag advertised is not a risk.
I understand that this (same rkey/stag used for all mapping produced for
a specific mr) is what you are proposing, I still think there's a chance
that by the spec and (not less important!) by existing HW support, its
possible to have a different rkey/stag per mapping done on an mr, for
example the IB spec uses a consumer owned key portion of the L_Key 
notation which makes me think there should be a way to have different
rkey per mapping, Roland? Dror?


[dg] When you post a fast register WQE, you specify the new 8 LSBits to
be assigned to the MR. The remaining 24 MSBits are the ones that you obtained
while allocating the MR, and they persist throughout the lifetime of
this MR.





Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-20 Thread Or Gerlitz

Steve Wise wrote:

dma mapping would work too but then handling the map/unmap becomes an
issue.  I think it is way too complicated to add new verbs for
map/unmap fastreg page list (in addition to the alloc/free fastreg page
list that we are already adding) and force the consumer to do it.  And
if we expect the low-level driver to do it, then the map is easy (can be
done while posting the send) but the unmap is a pain -- it would have to
be done inside poll_cq when reaping the completion, and the low-level
driver would have to keep some complicated extra data structure to go
back from the completion to the original fast reg page list structure.
  
And certain platforms can fail map requests (like PPC64) because they 
have limited resources for dma mapping.  So then you'd fail a SQ work 
request when you might not want to...
I see the point in allocating the page lists in dma-consistent memory to 
make the mechanics of letting the HCA DMA the list easier and 
simpler, as I think Roland is suggesting in his post. However, I am not 
sure I understand how this helps in the PPC64 case; if the HCA does DMA 
to fetch the list, then IOMMU slots have to be consumed one way or 
another, correct?


Or.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-20 Thread Or Gerlitz

Steve Wise wrote:
Support for the IB BMME and iWARP equivalent memory extensions to 
non shared memory regions.  Usage Model:

- MR allocated with ib_alloc_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via 
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)

Steve,

I am trying to further understand what a real-life ULP design would look 
like here, and I think there are some more issues to clarify/define for the 
case of a ULP which has to create a mapping for a list of pages and send 
this mapping (e.g. IB/rkey, iWARP/stag) to a remote party that uses it for 
RDMA.


AFAIK, the idea was to let the ulp post --two-- work requests, where the 
first creates the mapping and the second sends this mapping to the 
remote side, such that the second does not start before the first 
completes (i.e a fence).


Now, the above scheme means that the ulp knows the value of the 
rkey/stag at the time of posting these two work requests (since it has 
to encode it in the second one), so something has to be clarified re the 
rkey/stag here, do they change each time this MR is used? how many bits 
can be changed, etc.


I guess my questions are to some extent RTFM ones, but, first, with some 
quick looking in the IB spec I did not manage to get enough answers 
(pointers appreciated...) and second, you are proposing an 
implementation here, so I think it makes sense to review the actual 
usage model to see all aspects needed for ULPs are covered...


Talking on usage, do you plan to patch the mainline nfs-rdma code to use 
these verbs?


Or.

- MR deallocated with ib_dereg_mr()
- page lists dealloced via ib_free_fast_reg_page_list()




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-20 Thread Steve Wise

Or Gerlitz wrote:

Steve Wise wrote:
Support for the IB BMME and iWARP equivalent memory extensions to non 
shared memory regions.  Usage Model:

- MR allocated with ib_alloc_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via 
ib_post_send(IB_WR_FAST_REG_MR)

- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)

Steve,

I am trying to further understand what would be a real life ULP design 
here, and I think there are some more issues to clarify/define for the 
case of ULP which has to create a mapping for a list of pages and send 
this mapping (eg IB/rkey iWARP/stag) to a remote party that uses it 
for RDMA.


AFAIK, the idea was to let the ulp post --two-- work requests, where 
the first creates the mapping and the second sends this mapping to the 
remote side, such that the second does not start before the first 
completes (i.e a fence).


Now, the above scheme means that the ulp knows the value of the 
rkey/stag at the time of posting these two work requests (since it has 
to encode it in the second one), so something has to be clarified re 
the rkey/stag here, do they change each time this MR is used? how many 
bits can be changed, etc.


The ULP knows the rkey/stag because it's returned up front by 
ib_alloc_fast_reg_mr().  And it doesn't change (ignoring the key issue, 
which we haven't exposed yet to the ULP).  The same rkey/stag can be 
used for multiple mappings.  It can be made invalid at any point in time 
via IB_WR_INVALIDATE_MR, so the fact that you're leaving the same 
rkey/stag advertised is not a risk.


So you allocate the rkey/stag up front, allocate page_lists up front, 
then as needed you populate your page list and bind it to the rkey/stag 
via IB_WR_FAST_REG_MR, and invalidate that mapping via 
IB_WR_INVALIDATE_MR.  You can do this any number of times, and with 
proper fencing, you can pipeline these mappings.   Eventually when 
you're done doing IO (like for NFSRDMA when the mount is unmounted) you 
free up the page list(s) and mr/rkey/stag.


So NFSRDMA will keep these fast_reg_mrs and page_list structs 
pre-allocated and hung off some context so that per RPC, they can be 
bound/registered, the IO executed, and then the MR invalidated as part 
of processing the RPC.




I guess my questions are to some extent RTFM ones, but, first, with 
some quick looking in the IB spec I did not manage to get enough 
answers (pointers appreciated...) and second, you are proposing an 
implementation here, so I think it makes sense to review the actual 
usage model to see all aspects needed for ULPs are covered...


Talking on usage, do you plan to patch the mainline nfs-rdma code to 
use these verbs?


Yes.  Tom Tucker will be doing this.  Jon Mason is implementing RDS 
changes to utilize this too.  The hope is all this makes 2.6.27/ofed-1.4.


I can also post test code (krping module) if anyone is interested.  I'm 
developing that now.


Steve.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Or Gerlitz

Roland Dreier wrote:


Yes, the point of this verb is that the low-level driver owns the page
list from when the fast register work request is posted until it
completes.  This should be explicitly documented somewhere.

OK, got it, so this is a different case compared to the SG elements, which 
are not owned by the driver once the posting call returns.


However the reason for having the low-level driver implement it is so
that all strange device-specific issues can be taken care of in the
driver.  For instance mlx4 is going to require that the page list be
aligned to 64 bytes, and will DMA from the memory, so we need to use
dma_alloc_consistent().  On the other hand cxgb3 is just going to copy
in software, so kmalloc is sufficient.

I see. Just wondering, in the mlx4 case, is it a must to use dma-consistent 
memory allocation, or would dma mapping work too?


Or.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Or Gerlitz

Steve Wise wrote:
Support for the IB BMME and iWARP equivalent memory extensions to 
non shared memory regions. Usage Model:


- MR allocated with ib_alloc_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via 
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)
- MR deallocated with ib_dereg_mr()
- page lists dealloced via ib_free_fast_reg_page_list().

Steve,

Does this design go hand-in-hand with remote invalidation, such that 
if the remote side invalidates the mapping there is no need to issue the 
IB_WR_INVALIDATE_MR work request?


Also, does the proposed design support fmr pages of granularity 
different from the OS ones? For example the OS pages are 4K and the ULP 
wants to use fmr of 512-byte pages (the block lists feature), etc. In 
that case, doesn't the size of each page have to be specified as a 
param to the alloc_fast_reg_mr() verb?


Applications can allocate a fast_reg mr once, and then can repeatedly
bind the mr to different physical memory SGLs via posting work requests
to the send queue.  For each outstanding mr-to-pbl binding in the SQ
pipe, a fast_reg_page_list needs to be allocated.  Thus pipelining can
be achieved while still allowing device-specific page_list processing.
mmm, is it a must for the ULP to issue page list alloc/free per 
IB_WR_FAST_REG_MR call?



--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -676,6 +683,20 @@ struct ib_send_wr {
			u16	pkey_index;	/* valid for GSI only */
			u8	port_num;	/* valid for DR SMPs on switch only */
		} ud;
+		struct {
+			u64				iova_start;
+			struct ib_mr			*mr;
+			struct ib_fast_reg_page_list	*page_list;
+			unsigned int			page_size;
+			unsigned int			page_list_len;
+			unsigned int			first_byte_offset;
+			u32				length;
+			int				access_flags;
+		} fast_reg;
+		struct {
+			struct ib_mr			*mr;
+		} local_inv;
	} wr;
 };
I suggest to use a page_shift notation and not page_size to comply 
with the kernel semantics of other APIs.



Or.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Steve Wise

Or Gerlitz wrote:

Steve Wise wrote:
Support for the IB BMME and iWARP equivalent memory extensions to non 
shared memory regions. Usage Model:


- MR allocated with ib_alloc_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via 
ib_post_send(IB_WR_FAST_REG_MR)

- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)
- MR deallocated with ib_dereg_mr()
- page lists dealloced via ib_free_fast_reg_page_list().

Steve,

Does this design go hand-in-hand with remote invalidation, such that 
if the remote side invalidates the mapping there is no need to issue the 
IB_WR_INVALIDATE_MR work request?




Yes.

Also, does the proposed design support fmr pages of granularity 
different from the OS ones? For example the OS pages are 4K and the 
ULP wants to use fmr of 512-byte pages (the block lists feature), 
etc. In that case, doesn't the size of each page have to be specified 
as a param to the alloc_fast_reg_mr() verb?


Page size is passed in at registration time.  At allocation time, 
the HW only needs to know what the max page list length (or PBL depth) 
will ever be, so it can pre-allocate that at alloc time.  The actual 
page list length, the page size of each entry in the page list, as well 
as the page list itself, are passed in via the 
post_send(IB_WR_FAST_REG_MR) work request.  See the fast_reg union in 
struct ib_send_wr.





Applications can allocate a fast_reg mr once, and then can repeatedly
bind the mr to different physical memory SGLs via posting work requests
to the send queue.  For each outstanding mr-to-pbl binding in the SQ
pipe, a fast_reg_page_list needs to be allocated.  Thus pipelining can
be achieved while still allowing device-specific page_list processing.
mmm, is it a must for the ULP to issue page list alloc/free per 
IB_WR_FAST_REG_MR call?


No, they can be reused as needed.  They typically will only get allocated 
once, used many times, then freed when the application is done.  My 
point in the text above was that an application could allocate N page 
lists and use them in a pipeline for the same fast reg mr by fencing 
things appropriately in the SQ.




--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -676,6 +683,20 @@ struct ib_send_wr {
			u16	pkey_index;	/* valid for GSI only */
			u8	port_num;	/* valid for DR SMPs on switch only */
		} ud;
+		struct {
+			u64				iova_start;
+			struct ib_mr			*mr;
+			struct ib_fast_reg_page_list	*page_list;
+			unsigned int			page_size;
+			unsigned int			page_list_len;
+			unsigned int			first_byte_offset;
+			u32				length;
+			int				access_flags;
+		} fast_reg;
+		struct {
+			struct ib_mr			*mr;
+		} local_inv;
	} wr;
 };
I suggest to use a page_shift notation and not page_size to comply 
with the kernel semantics of other APIs.



Ok, I wondered about that.  It will also ensure a power of two.

Steve.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Steve Wise

Roland Dreier wrote:

  If ownership can be assumed, I suggest to have the core use the
  implementation of these two verbs as you did that for the Chelsio
  driver in case the HW driver did not implement it (i.e instead of
  returning ENOSYS). In that case, the alloc_list verb should do DMA
  mapping FROM device (I think...) since the device is going to do DMA
  to read the page list, and the free_list verb should do DMA unmapping,
  etc.

Yes, the point of this verb is that the low-level driver owns the page
list from when the fast register work request is posted until it
completes.  This should be explicitly documented somewhere.

  


I've added it to the comments for ib_alloc_fast_reg_page_list() as per 
Ralph Campbell's suggestion.




However the reason for having the low-level driver implement it is so
that all strange device-specific issues can be taken care of in the
driver.  For instance mlx4 is going to require that the page list be
aligned to 64 bytes, and will DMA from the memory, so we need to use
dma_alloc_consistent().  On the other hand cxgb3 is just going to copy
in software, so kmalloc is sufficient.

 - R.
  




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Talpey, Thomas
At 09:58 AM 5/19/2008, Steve Wise wrote:
Storage has been known to adopt non ^2 blocks, for instance including
block checksums in sectors, etc. If transferred, these will become quite
inefficient on ^2 hardware.

  
Is this true today for any of the existing RDMA ULPs that will utilize fastreg?


Ask the iSER and SRP folks. NFS won't.

Tom.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Steve Wise




Talpey, Thomas wrote:

  At 09:40 AM 5/19/2008, Steve Wise wrote:
  
  

  I suggest to use a "page_shift" notation and not "page_size" to comply 
with the kernel semantics of other APIs.

  

Ok, I wondered about that.  It will also ensure a power of two.

  
  
Does it have to be ^2? In the iWARP spec development, we envisioned
the possibility of arbitrary page sizes. I don't recall any such dependency
in the protocol architecture.

  


I didn't add block mode support since it's not available anywhere in the
Linux RDMA API. I'd rather _not_ introduce that at this point.


  Storage has been known to adopt non ^2 blocks, for instance including
block checksums in sectors, etc. If transferred, these will become quite
inefficient on ^2 hardware.

  

Is this true today for any of the existing RDMA ULPs that will utilize
fastreg?

Steve.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Roland Dreier
  I see. Just wondering, in the mlx4 case, is it a must to use dma
  consistent memory allocation or dma mapping would work too?

dma mapping would work too but then handling the map/unmap becomes an
issue.  I think it is way too complicated to add new verbs for
map/unmap fastreg page list (in addition to the alloc/free fastreg page
list that we are already adding) and force the consumer to do it.  And
if we expect the low-level driver to do it, then the map is easy (can be
done while posting the send) but the unmap is a pain -- it would have to
be done inside poll_cq when reaping the completion, and the low-level
driver would have to keep some complicated extra data structure to go
back from the completion to the original fast reg page list structure.


Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-19 Thread Steve Wise

Roland Dreier wrote:

  I see. Just wondering, in the mlx4 case, is it a must to use dma
  consistent memory allocation or dma mapping would work too?

dma mapping would work too but then handling the map/unmap becomes an
issue.  I think it is way too complicated to add new verbs for
map/unmap fastreg page list (in addition to the alloc/free fastreg page
list that we are already adding) and force the consumer to do it.  And
if we expect the low-level driver to do it, then the map is easy (can be
done while posting the send) but the unmap is a pain -- it would have to
be done inside poll_cq when reaping the completion, and the low-level
driver would have to keep some complicated extra data structure to go
back from the completion to the original fast reg page list structure.
  


And certain platforms can fail map requests (like PPC64) because they 
have limited resources for dma mapping.  So then you'd fail a SQ work 
request when you might not want to...


Steve.



Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-18 Thread Or Gerlitz

Steve Wise wrote:

- device-specific alloc/free of physical buffer lists for use in fast
register work requests.  This allows devices to allocate this memory as
needed (like via dma_alloc_coherent).


Steve,

Reading through the suggested API / patches and the previous threads, I 
was not sure whether or not the HW driver may assume that it has 
ownership of the page --list-- structure until the registration work 
request is completed.


Now, if ownership cannot be assumed (e.g. as for the SG list elements 
pointed to by send/recv WRs), the driver has to clone it anyway, and thus I 
don't see the need for the ib_alloc/free_fast_reg_page_list verbs.


If ownership can be assumed, I suggest having the core fall back to the 
implementation of these two verbs that you did for the Chelsio driver 
in case the HW driver does not implement them (i.e. instead of returning 
ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM 
device (I think...) since the device is going to do DMA to read the page 
list, and the free_list verb should do DMA unmapping, etc.


Or.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-18 Thread Steve Wise



Or Gerlitz wrote:

Steve Wise wrote:

- device-specific alloc/free of physical buffer lists for use in fast
register work requests.  This allows devices to allocate this memory as
needed (like via dma_alloc_coherent).


Steve,

Reading through the suggested API / patches and the previous threads, I 
was not sure whether or not the HW driver may assume that it has 
ownership of the page --list-- structure until the registration work 
request is completed.




Yes, the driver owns the page list structure until the WR completes (i.e. 
until it is reaped by the consumer via poll_cq()).


Now, if ownership can not be assumed (eg as for the SG list elements 
pointed by send/recv WR), the driver has to clone it anyway, and thus I 
don't see the need in the ib_alloc/free_fast_reg_page_list verbs.


If ownership can be assumed, I suggest to have the core use the 
implementation of these two verbs as you did that for the Chelsio driver 
in case the HW driver did not implement it (i.e instead of returning 
ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM 
device (I think...) since the device is going to do DMA to read the page 
list, and the free_list verb should do DMA unmapping, etc.




Some devices don't need DMA mappings at all (chelsio for instance). The 
idea of a device-specific method was so the device could allocate a 
bigger structure to hold its own context info.  So a core service that 
sets up DMA, in my opinion, isn't really useful.


Steve.





Or.




Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-18 Thread Roland Dreier
  If ownership can be assumed, I suggest to have the core use the
  implementation of these two verbs as you did that for the Chelsio
  driver in case the HW driver did not implement it (i.e instead of
  returning ENOSYS). In that case, the alloc_list verb should do DMA
  mapping FROM device (I think...) since the device is going to do DMA
  to read the page list, and the free_list verb should do DMA unmapping,
  etc.

Yes, the point of this verb is that the low-level driver owns the page
list from when the fast register work request is posted until it
completes.  This should be explicitly documented somewhere.

However the reason for having the low-level driver implement it is so
that all strange device-specific issues can be taken care of in the
driver.  For instance mlx4 is going to require that the page list be
aligned to 64 bytes, and will DMA from the memory, so we need to use
dma_alloc_consistent().  On the other hand cxgb3 is just going to copy
in software, so kmalloc is sufficient.

 - R.


[ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support

2008-05-16 Thread Steve Wise

Support for the IB BMME and iWARP equivalent memory extensions to 
non shared memory regions.  This includes:

- allocation of an ib_mr for use in fast register work requests

- device-specific alloc/free of physical buffer lists for use in fast
register work requests.  This allows devices to allocate this memory as
needed (like via dma_alloc_coherent).

- fast register memory region work request

- invalidate local memory region work request

- read with invalidate local memory region work request (iWARP only)


Design details:

- New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates
device support for this feature.

- New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request.

- New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr.

- New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory
regions.

- New API function, ib_alloc_fast_reg_page_list to allocate
device-specific page lists.

- New API function, ib_free_fast_reg_page_list to free said page lists.


Usage Model:

- MR allocated with ib_alloc_fast_reg_mr()

- Page lists allocated via ib_alloc_fast_reg_page_list().

- MR made VALID and bound to a specific page list via
ib_post_send(IB_WR_FAST_REG_MR)

- MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)

- MR deallocated with ib_dereg_mr()

- Page lists deallocated via ib_free_fast_reg_page_list().

Applications can allocate a fast_reg mr once, and then can repeatedly
bind the mr to different physical memory SGLs via posting work requests
to the send queue.  For each outstanding mr-to-pbl binding in the SQ
pipe, a fast_reg_page_list needs to be allocated.  Thus pipelining can
be achieved while still allowing device-specific page_list processing.

Signed-off-by: Steve Wise [EMAIL PROTECTED]
---

 drivers/infiniband/core/verbs.c |   46 
 include/rdma/ib_verbs.h |   56 +++
 2 files changed, 102 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 0504208..0a334b4 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr)
 }
 EXPORT_SYMBOL(ib_dereg_mr);
 
+struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len)
+{
+	struct ib_mr *mr;
+
+	if (!pd->device->alloc_fast_reg_mr)
+		return ERR_PTR(-ENOSYS);
+
+	mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len);
+
+	if (!IS_ERR(mr)) {
+		mr->device  = pd->device;
+		mr->pd      = pd;
+		mr->uobject = NULL;
+		atomic_inc(&pd->usecnt);
+		atomic_set(&mr->usecnt, 0);
+	}
+
+	return mr;
+}
+EXPORT_SYMBOL(ib_alloc_fast_reg_mr);
+
+struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
+	struct ib_device *device, int max_page_list_len)
+{
+	struct ib_fast_reg_page_list *page_list;
+
+	if (!device->alloc_fast_reg_page_list)
+		return ERR_PTR(-ENOSYS);
+
+	page_list = device->alloc_fast_reg_page_list(device, max_page_list_len);
+
+	if (!IS_ERR(page_list)) {
+		page_list->device = device;
+		page_list->max_page_list_len = max_page_list_len;
+	}
+
+	return page_list;
+}
+EXPORT_SYMBOL(ib_alloc_fast_reg_page_list);
+
+void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
+{
+	page_list->device->free_fast_reg_page_list(page_list);
+}
+EXPORT_SYMBOL(ib_free_fast_reg_page_list);
+
 /* Memory windows */
 
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 911a661..c4ace0f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -106,6 +106,7 @@ enum ib_device_cap_flags {
 	IB_DEVICE_UD_IP_CSUM		= (1<<18),
 	IB_DEVICE_UD_TSO		= (1<<19),
 	IB_DEVICE_SEND_W_INV		= (1<<21),
+	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<22),
 };
 
 enum ib_atomic_cap {
@@ -151,6 +152,7 @@ struct ib_device_attr {
int max_srq;
int max_srq_wr;
int max_srq_sge;
+	unsigned int		max_fast_reg_page_list_len;
u16 max_pkeys;
u8  local_ca_ack_delay;
 };
@@ -414,6 +416,8 @@ enum ib_wc_opcode {
IB_WC_FETCH_ADD,
IB_WC_BIND_MW,
IB_WC_LSO,
+   IB_WC_FAST_REG_MR,
+   IB_WC_INVALIDATE_MR,
 /*
  * Set value of IB_WC_RECV so consumers can test if a completion is a
  * receive by testing (opcode & IB_WC_RECV).
@@ -628,6 +632,9 @@ enum ib_wr_opcode {
IB_WR_ATOMIC_FETCH_AND_ADD,
IB_WR_LSO,
IB_WR_SEND_WITH_INV,
+   IB_WR_FAST_REG_MR,
+   IB_WR_INVALIDATE_MR,
+   IB_WR_READ_WITH_INV,
 };
 
 enum ib_send_flags {
@@ -676,6 +683,20 @@ struct ib_send_wr {