Hey All,
This patchset introduces a feature called indirect memory registration.
The RDMA stack has always imposed constraints on the nature of a given list
of buffers for memory registration. Every SG element in the scatter list must
be aligned to a minimum block size (page_shift); only the first SG element is
allowed to start at a byte offset.
This can make life hard for ULPs that want to support arbitrary scattered
lists that don't meet the above constraint. Two immediate examples are iser
and srp, which can be handed SG lists that are not "nicely" aligned, since
several components in the IO stack (e.g. block, scsi) support arbitrary SG
lists. Some workloads can yield such SG lists quite commonly.
This introduces a challenge for RDMA storage protocols.
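To make the constraint concrete, here is a minimal user-space sketch (not code from this patchset) of the check a ULP would effectively have to pass before a normal registration. The struct sg_elem type and the function name are hypothetical stand-ins, and the rule that only the last element may end short of a block boundary is my reading of the constraint rather than something stated above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scatter/gather element; not a kernel type. */
struct sg_elem {
	uint64_t addr;   /* start address of the buffer */
	uint32_t length; /* buffer length in bytes */
};

/*
 * Returns true if the list can be described as one run of blocks of
 * size (1 << page_shift): only the first element may start at an
 * offset inside a block, and (assumption) only the last element may
 * end short of a block boundary.
 */
static bool sg_list_is_block_aligned(const struct sg_elem *sg, size_t n,
				     unsigned int page_shift)
{
	const uint64_t mask = ((uint64_t)1 << page_shift) - 1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i > 0 && (sg[i].addr & mask))
			return false; /* non-first element starts mid-block */
		if (i + 1 < n && ((sg[i].addr + sg[i].length) & mask))
			return false; /* non-last element ends mid-block */
	}
	return true;
}
```

With page_shift = 12, a list whose second element starts at 0x5200 fails this check, while the same list passes with page_shift = 9 (512B blocks).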
There are a couple of possible (sub-optimal) solutions to handle this limitation:
- srp can use multiple memory regions to register a single SG list and send an
  indirect data buffer descriptor. The down-side of this approach is that in
  certain workloads, srp may not have sufficient available resources (i.e.
  memory regions) to register a scsi SG list, which will cause the dynamic
  queue size to shrink and cause unpredictable latencies. Another (minor)
  down-side is that an indirect descriptor will cause the target to initiate
  multiple rdma reads/writes (one for each rkey).
- This is not possible for iser. The iser protocol mandates that only a single
  stag can be sent for a unidirectional IO. Two possible solutions are:
  * Allocate a well aligned buffer list and copy the data to/from this SG list
    of buffers (has the obvious down-side of not being zero-copy, and
    introduces atomic allocations in the IO path).
  * Hold another pool of MRs with the minimal block alignment guaranteed from
    scsi (512B) and resort to this pool for an unaligned SG list. The
    down-sides here are:
    - This does not cover SG-IO where there is no minimal alignment
    - Involves a heuristic approach for this pool size
    - Impact of cache misses imposed by longer page lists registered in the
      device translation tables
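As a rough illustration of the "longer page lists" point for the 512B pool, the sketch below counts how many block-sized entries are needed to cover a buffer at a given page_shift. The function name is hypothetical and this is plain arithmetic, not driver code.

```c
#include <stdint.h>

/* Number of (1 << page_shift)-sized entries needed to cover [addr, addr + len). */
static uint64_t page_list_entries(uint64_t addr, uint64_t len,
				  unsigned int page_shift)
{
	const uint64_t block = (uint64_t)1 << page_shift;
	const uint64_t first = addr & ~(block - 1);                    /* round start down */
	const uint64_t last = (addr + len + block - 1) & ~(block - 1); /* round end up */

	return (last - first) >> page_shift;
}
```

For an aligned 1MB buffer this gives 256 entries with 4KB blocks (page_shift = 12) and 2048 entries with 512B blocks (page_shift = 9), i.e. an 8x longer list for the device to walk.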
Indirect memory registration solves this problem by allowing the
application/ULP to pass a list of ib_sge elements which can be byte aligned.
The proposed API attempts to follow the well-known fast registration scheme
and can be easily adopted in any application.
Note: We ran out of capability bits in device_cap_flags, so I modified the
field to be a u64. Alternatively, I can introduce a second field,
device_cap_flags2, if this has negative effects on user-space.
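To show why the width of the field matters, here is a small stand-alone sketch. The flag names are hypothetical (they are not the definitions in ib_verbs.h), and the truncation case is only my assumption about one way a 32-bit consumer could be affected.

```c
#include <stdint.h>

/* Hypothetical flag names for illustration; not the kernel's definitions. */
#define EXAMPLE_CAP_LAST_32BIT   ((uint64_t)1 << 31) /* last bit that fits in 32 bits */
#define EXAMPLE_CAP_INDIRECT_REG ((uint64_t)1 << 32) /* needs a 64-bit field */

/* Test a capability bit in a 64-bit capability mask. */
static int cap_is_set(uint64_t device_cap_flags, uint64_t cap)
{
	return (device_cap_flags & cap) != 0;
}

/* What a consumer that still stores the flags in 32 bits would see. */
static uint32_t truncate_caps(uint64_t device_cap_flags)
{
	return (uint32_t)device_cap_flags;
}
```

Any capability defined at bit 32 or above is silently lost after truncate_caps(), which is the kind of breakage a separate device_cap_flags2 field would avoid.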
See the earlier discussion of the RFC version of this patchset at:
http://marc.info/?l=linux-rdma&w=2&r=1&s=indirect+fast+memory+registration&q=b
I'd appreciate the community's code review.

Adir Lev (1):
  IB/iser: Add indirect registration support

Sagi Grimberg (4):
  IB/core: Introduce Fast Indirect Memory Registration verbs API
  IB/mlx5: Implement Fast Indirect Memory Registration Feature
  IB/iser: Pass iser device to registration routines
  IB/iser: Add debug prints to the various memory registration methods

drivers/infiniband/core/verbs.c | 28 +++++++
drivers/infiniband/hw/mlx5/cq.c | 2 +
drivers/infiniband/hw/mlx5/main.c | 4 +
drivers/infiniband/hw/mlx5/mlx5_ib.h | 19 +++++
drivers/infiniband/hw/mlx5/mr.c | 66 +++++++++++++++++
drivers/infiniband/hw/mlx5/qp.c | 106 +++++++++++++++++++++++++++
drivers/infiniband/ulp/iser/iscsi_iser.h | 8 ++
drivers/infiniband/ulp/iser/iser_memory.c | 112 +++++++++++++++++++++++++++--
drivers/infiniband/ulp/iser/iser_verbs.c | 53 ++++++++++++--
include/rdma/ib_verbs.h | 52 +++++++++++++-
10 files changed, 434 insertions(+), 16 deletions(-)