Re: [dm-devel] [PATCH v16 03/12] block: add copy offload support

2023-09-23 Thread Nitesh Shetty
On Fri, Sep 22, 2023 at 06:56:50PM +0900, Jinyoung Choi wrote:
> > +/*
> > + * This must only be called once all bios have been issued so that the 
> > refcount
> > + * can only decrease. This just waits for all bios to complete.
> > + * Returns the length of bytes copied or error
> > + */
> > +static ssize_t blkdev_copy_wait_io_completion(struct blkdev_copy_io *cio)
> 
> Hi, Nitesh,
> 
> don't functions waiting for completion usually set their names to 
> 'wait_for_completion_'?
> (e.g. blkdev_copy_wait_for_completion_io)
> 
> 
> > +ssize_t blkdev_copy_offload(struct block_device *bdev, loff_t pos_in,
> > +                            loff_t pos_out, size_t len,
> > +                            void (*endio)(void *, int, ssize_t),
> > +                            void *private, gfp_t gfp)
> > +{
> > +        struct blkdev_copy_io *cio;
> > +        struct blkdev_copy_offload_io *offload_io;
> > +        struct bio *src_bio, *dst_bio;
> > +        ssize_t rem, chunk, ret;
> > +        ssize_t max_copy_bytes = bdev_max_copy_sectors(bdev) << 
> > SECTOR_SHIFT;
> 
> wouldn't it be better to use size_t for variables that don't return?
> values such as chunk and max_copy_bytes may be defined as 'unsigned'.

Agreed, we will keep ret as ssize_t and move the others to size_t.
Acked on all other comments; we will address them in the next version.

Thank You,
Nitesh Shetty


[dm-devel] [PATCH v16 00/12] Implement copy offload support

2023-09-20 Thread Nitesh Shetty
The patch series covers the points discussed in the past and most recently
at LSFMM'23 [0].
We have covered the initially agreed requirements in this patch set and
further additional features suggested by the community.

This is the next iteration of our previous patch set v15 [1].
Copy offload is performed using two bios:
1. Take a plug.
2. The first bio, containing the source info, is prepared and sent;
a request is formed.
3. This is followed by preparing and sending the second bio containing the
destination info.
4. This bio is merged with the request containing the source info.
5. The plug is released, and the request containing the source and destination
bios is sent to the driver.
This design helps to avoid putting a payload (token) in the request,
as sending a payload that is not data to the device is considered a bad
approach.

Hence copy offload works only for request-based storage drivers.
Copy emulation can be used in case the copy offload capability is
absent.
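
As a rough illustration of the flow above, here is a hedged sketch of how the
two bios could be prepared and submitted under a plug. Error handling, chunking
to max_copy_sectors, reference counting and the dst endio wiring are omitted;
the actual implementation is blkdev_copy_offload() in patch 03, and the
function name here is only illustrative:

static void copy_offload_submit_sketch(struct block_device *bdev,
				       loff_t pos_in, loff_t pos_out,
				       size_t len, gfp_t gfp)
{
	struct blk_plug plug;
	struct bio *src_bio, *dst_bio;

	blk_start_plug(&plug);				/* 1. take a plug */

	/* 2. source bio: carries only the source sector and length */
	src_bio = bio_alloc(bdev, 0, REQ_OP_COPY_SRC, gfp);
	src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
	src_bio->bi_iter.bi_size = len;
	submit_bio(src_bio);				/* a request is formed */

	/* 3. destination bio: carries the destination sector and length */
	dst_bio = bio_alloc(bdev, 0, REQ_OP_COPY_DST, gfp);
	dst_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
	dst_bio->bi_iter.bi_size = len;
	submit_bio(dst_bio);				/* 4. merged with the src request */

	blk_finish_plug(&plug);				/* 5. merged request goes to the driver */
}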

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file back ends).

2. Block layer
- Block-generic copy (REQ_OP_COPY_SRC/DST) operation, with an
  interface accommodating two block devices
- Merging of copy requests in the request layer
- Emulation, for in-kernel users when native offload is
absent
- dm-linear support (for cases not requiring a split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVMe Simple Copy (TP 4065), by setting the nvme-ns
parameters mssrl, mcl and msrc. For more info see [2].
b. Null block device
c. NVMe Fabrics loopback
d. blktests [3]

Emulation can be tested on any device.

fio with copy offload support is available at [4].

Infra and plumbing:
===
We populate the copy_file_range callback in def_blk_fops.
For devices that support copy offload, we use blkdev_copy_offload to
achieve an in-device copy.
However, for cases where the device doesn't support offload,
we use generic_copy_file_range.
For in-kernel users (like NVMe fabrics), we use blkdev_copy_offload
if the device is copy-offload capable, or else fall back to emulation
using blkdev_copy_emulation.
Checks in generic_copy_file_range are modified to support block devices.
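
For the user-space path, a minimal sketch of exercising this via
copy_file_range(2) on a block device. The device path, offsets and length are
only examples; both fds must refer to the same copy-offload-capable device and
be opened with O_DIRECT for the offload path in patch 06 to be taken,
otherwise the generic path is used:

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Example only: adjust the device and offsets for your setup. */
	int fd_in = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	int fd_out = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
	loff_t off_in = 0, off_out = 1 << 20;	/* copy 1 MiB to offset 1 MiB */
	ssize_t ret;

	if (fd_in < 0 || fd_out < 0) {
		perror("open");
		return 1;
	}

	/* Maps to blkdev_copy_offload() when offload is available,
	 * otherwise falls back to generic_copy_file_range(). */
	ret = copy_file_range(fd_in, &off_in, fd_out, &off_out, 1 << 20, 0);
	if (ret < 0)
		perror("copy_file_range");
	else
		printf("copied %zd bytes\n", ret);

	close(fd_in);
	close(fd_out);
	return ret < 0;
}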

Blktests [3]
==
tests/block/035-040: Run copy offload and emulation on a null
  block device.
tests/block/050,055: Run copy offload and emulation on a test
  nvme block device.
tests/nvme/056-067: Create a loop-backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to test copy offload
- update man pages for copy_file_range
- expand in-kernel users of copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/all/20230906163844.18754-1-nj.she...@samsung.com/
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v15
[4] https://github.com/OpenMPDK/fio/tree/copyoffload-3.35-v14

Thanks for review!

Changes since v15:
=
- fs, nvmet: don't fall back to copy emulation on copy offload IO
failure; the user can still use emulation by disabling
device offload (Hannes)
- block: patch and function description changes (Hannes)
- added Reviewed-by from Hannes and Luis.

Changes since v14:
=
- block: (Bart Van Assche)
1. BLK_ prefix addition to COPY_MAX_BYTES and COPY_MAX_SEGMENTS
2. Improved function, patch and cover-letter descriptions
3. Simplified refcount updating.
- null-blk, nvme:
4. static warning fixes (kernel test robot)

Changes since v13:
=
- block:
1. Simplified copy offload and emulation helpers, now
  caller needs to decide between offload/emulation fallback
2. src,dst bio order change (Christoph Hellwig)
3. refcount changes similar to dio (Christoph Hellwig)
4. Single outstanding IO for copy

[dm-devel] [PATCH v16 08/12] nvmet: add copy command support for bdev and file ns

2023-09-20 Thread Nitesh Shetty
Add support for handling the nvme_cmd_copy command on the target.

For bdev-ns, if the backing device supports copy offload, we call the device
copy offload helper (blkdev_copy_offload).
In the absence of a device copy offload capability, we use copy emulation
(blkdev_copy_emulation).

For file-ns we call vfs_copy_file_range to service the request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS.

The loop target has copy support, which can be used to test copy offload.
Trace event support is added for nvme_cmd_copy.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 +++-
 drivers/nvme/target/io-cmd-bdev.c | 71 +++
 drivers/nvme/target/io-cmd-file.c | 50 ++
 drivers/nvme/target/nvmet.h   |  1 +
 drivers/nvme/target/trace.c   | 19 +
 5 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..4e1a6ca09937 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 468833675cc9..2d5cef6788be 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -449,6 +461,61 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+static void nvmet_bdev_copy_endio(void *private, int status,
+   ssize_t copied)
+{
+   struct nvmet_req *rq = (struct nvmet_req *)private;
+   u16 nvme_status;
+
+   if (copied == rq->copy_len)
+   rq->cqe->result.u32 = cpu_to_le32(1);
+   else
+   rq->cqe->result.u32 = cpu_to_le32(0);
+
+   nvme_status = errno_to_nvme_status(rq, status);
+   nvmet_req_complete(rq, nvme_status);
+}
+
+/*
+ * At present we handle only one range entry, since copy offload is aligned 
with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *rq)
+{
+   struct nvme_copy_range range;
+   struct nvme_command *cmd = rq->cmd;
+   ssize_t ret;
+   off_t dst, src;
+
+   u16 status;
+
+   status = nvmet_copy_from_sgl(rq, 0, &range, sizeof(range));
+   if (status)
+   goto err_rq_complete;
+
+   dst = le64_to_cpu(cmd->copy.sdlba) << rq->ns->blksize_shift;
+   src = le64_to_cpu(range.slba) << rq->ns->blksize_shift;
+   rq->copy_len = (range.nlb + 1) << rq->ns->blksize_shift;
+
+   if (bdev_max_copy_sectors(rq->ns->bdev)) {
+   ret = blkdev_copy_offload(rq->ns->bdev, dst, src, rq->copy_len,
+ nvmet_bdev_copy_endio,
+ (void *)rq, GFP

[dm-devel] [PATCH v16 12/12] null_blk: add support for copy offload

2023-09-20 Thread Nitesh Shetty
The implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver-supported maximum copy limit.
Only the request-based queue mode supports copy offload.
Tracefs support is added for copy IO tracing.

Reviewed-by: Hannes Reinecke 
Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 Documentation/block/null_blk.rst  |  5 ++
 drivers/block/null_blk/main.c | 97 ++-
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/trace.h| 23 
 4 files changed, 123 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES(=128MB at present) is defined in fs.h
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c56bef0edc5e..22361f4d5f71 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -160,6 +160,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = BLK_COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -412,6 +416,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -553,6 +558,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_queue_mode,
&nullb_device_attr_blocksize,
&nullb_device_attr_max_sectors,
+   &nullb_device_attr_copy_max_bytes,
&nullb_device_attr_irqmode,
&nullb_device_attr_hw_queue_depth,
&nullb_device_attr_index,
@@ -659,7 +665,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -725,6 +732,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1274,6 +1282,81 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline int nullb_setup_copy(struct nullb *nullb, struct request *req,
+  bool is_fua)
+{
+   sector_t sector_in = 0, sector_out = 0;
+   loff_t offset_in, offset_out;
+   void *in, *out;
+   ssize_t chunk, rem = 0;
+   struct bio *bio;
+   struct nullb_page *t_page_in, *t_page_out;
+   u16 seg = 1;
+   int status = -EIO;
+
+   if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
+   return status;
+
+   /*
+* First bio contains information about source and last bio contains
+* information about destination.
+*/
+   __rq_for_each_bio(bio, req) {
+   if (seg == blk_rq_nr_phys_segments(req)) {
+   sector_out = bio->bi_iter.bi_sector;
+   if (rem != bio->bi_iter.bi_size)
+   return status;
+   } else {
+   sector_in = bio->bi_iter.bi_sector;
+   rem = bio->bi_

[dm-devel] [PATCH v16 02/12] Add infrastructure for copy offload in block and request layer.

2023-09-20 Thread Nitesh Shetty
We add two new opcodes, REQ_OP_COPY_SRC and REQ_OP_COPY_DST.
Since copy is a composite operation involving src and dst sectors/LBAs,
each needs to be represented by a separate bio to make it compatible
with device mapper.
We expect the caller to take a plug and send a bio with the source
information, followed by a bio with the destination information.
Once the src bio arrives we form a request and wait for the destination
bio. Upon arrival of the destination bio we merge the two bios and send the
corresponding request down to the device driver.
Merging of non-copy-offload bios is avoided by checking for the copy-specific
opcodes in the merge function.
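
blk_copy_offload_mergable(), referenced in the hunks below, is added in
block/blk.h (not shown in this excerpt). Conceptually it is expected to amount
to something like the following hedged sketch: only a request already carrying
the source may merge with a bio carrying the destination:

/* Hedged sketch of the copy-merge check; the real helper lives in block/blk.h. */
static inline bool blk_copy_offload_mergable(struct request *req,
					     struct bio *bio)
{
	return req_op(req) == REQ_OP_COPY_SRC &&
	       bio_op(bio) == REQ_OP_COPY_DST;
}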

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-core.c  |  7 +++
 block/blk-merge.c | 41 +++
 block/blk.h   | 16 +++
 block/elevator.h  |  1 +
 include/linux/bio.h   |  6 +-
 include/linux/blk_types.h | 10 ++
 6 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9d51e9894ece..33aadafdb7f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -121,6 +121,8 @@ static const char *const blk_op_name[] = {
REQ_OP_NAME(ZONE_FINISH),
REQ_OP_NAME(ZONE_APPEND),
REQ_OP_NAME(WRITE_ZEROES),
+   REQ_OP_NAME(COPY_SRC),
+   REQ_OP_NAME(COPY_DST),
REQ_OP_NAME(DRV_IN),
REQ_OP_NAME(DRV_OUT),
 };
@@ -792,6 +794,11 @@ void submit_bio_noacct(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   if (!q->limits.max_copy_sectors)
+   goto not_supported;
+   break;
default:
break;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..bcb55ba48107 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,6 +158,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *bio_split_copy(struct bio *bio,
+ const struct queue_limits *lim,
+ unsigned int *nsegs)
+{
+   *nsegs = 1;
+   if (bio_sectors(bio) <= lim->max_copy_sectors)
+   return NULL;
+   /*
+* We don't support splitting for a copy bio. End it with EIO if
+* splitting is required and return an error pointer.
+*/
+   return ERR_PTR(-EIO);
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -366,6 +380,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
case REQ_OP_WRITE_ZEROES:
split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   split = bio_split_copy(bio, lim, nr_segs);
+   if (IS_ERR(split))
+   return NULL;
+   break;
default:
split = bio_split_rw(bio, lim, nr_segs, bs,
get_max_io_size(bio, lim) << SECTOR_SHIFT);
@@ -922,6 +942,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (!rq_mergeable(rq) || !bio_mergeable(bio))
return false;
 
+   if (blk_copy_offload_mergable(rq, bio))
+   return true;
+
if (req_op(rq) != bio_op(bio))
return false;
 
@@ -951,6 +974,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio 
*bio)
 {
if (blk_discard_mergable(rq))
return ELEVATOR_DISCARD_MERGE;
+   else if (blk_copy_offload_mergable(rq, bio))
+   return ELEVATOR_COPY_OFFLOAD_MERGE;
else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
return ELEVATOR_BACK_MERGE;
else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
@@ -1053,6 +1078,20 @@ static enum bio_merge_status 
bio_attempt_discard_merge(struct request_queue *q,
return BIO_MERGE_FAILED;
 }
 
+static enum bio_merge_status bio_attempt_copy_offload_merge(struct request 
*req,
+   struct bio *bio)
+{
+   if (req->__data_len != bio->bi_iter.bi_size)
+   return BIO_MERGE_FAILED;
+
+   req->biotail->bi_next = bio;
+   req->biotail = bio;
+   req->nr_phys_segments++;
+   req->__data_len += bio->bi_iter.bi_size;
+
+   return BIO_MERGE_OK;
+}
+
 static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
   struct request *rq,
   struct bio *bio,

[dm-devel] [PATCH v16 04/12] block: add emulation for copy

2023-09-20 Thread Nitesh Shetty
For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor is
not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination.
At present the in-kernel user of emulation is fabrics.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c| 223 +
 include/linux/blkdev.h |   4 +
 2 files changed, 227 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 50d10fa3c4c5..da3594d25a3f 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,20 @@ struct blkdev_copy_offload_io {
loff_t offset;
 };
 
+/* Keeps track of single outstanding copy emulation IO */
+struct blkdev_copy_emulation_io {
+   struct blkdev_copy_io *cio;
+   struct work_struct emulation_work;
+   void *buf;
+   ssize_t buf_len;
+   loff_t pos_in;
+   loff_t pos_out;
+   ssize_t len;
+   struct block_device *bdev_in;
+   struct block_device *bdev_out;
+   gfp_t gfp;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -316,6 +330,215 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, 
loff_t pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
+  gfp_t gfp)
+{
+   int min_size = PAGE_SIZE;
+   char *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static struct bio *bio_map_buf(void *data, unsigned int len, gfp_t gfp)
+{
+   unsigned long kaddr = (unsigned long)data;
+   unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   unsigned long start = kaddr >> PAGE_SHIFT;
+   const int nr_pages = end - start;
+   bool is_vmalloc = is_vmalloc_addr(data);
+   struct page *page;
+   int offset, i;
+   struct bio *bio;
+
+   bio = bio_kmalloc(nr_pages, gfp);
+   if (!bio)
+   return ERR_PTR(-ENOMEM);
+   bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, 0);
+
+   if (is_vmalloc) {
+   flush_kernel_vmap_range(data, len);
+   bio->bi_private = data;
+   }
+
+   offset = offset_in_page(kaddr);
+   for (i = 0; i < nr_pages; i++) {
+   unsigned int bytes = PAGE_SIZE - offset;
+
+   if (len <= 0)
+   break;
+
+   if (bytes > len)
+   bytes = len;
+
+   if (!is_vmalloc)
+   page = virt_to_page(data);
+   else
+   page = vmalloc_to_page(data);
+   if (bio_add_page(bio, page, bytes, offset) < bytes) {
+   /* we don't support partial mappings */
+   bio_uninit(bio);
+   kfree(bio);
+   return ERR_PTR(-EINVAL);
+   }
+
+   data += bytes;
+   len -= bytes;
+   offset = 0;
+   }
+
+   return bio;
+}
+
+static void blkdev_copy_emulation_work(struct work_struct *work)
+{
+   struct blkdev_copy_emulation_io *emulation_io = container_of(work,
+   struct blkdev_copy_emulation_io, emulation_work);
+   struct blkdev_copy_io *cio = emulation_io->cio;
+   struct bio *read_bio, *write_bio;
+   loff_t pos_in = emulation_io->pos_in, pos_out = emulation_io->pos_out;
+   ssize_t rem, chunk;
+   int ret = 0;
+
+   for (rem = emulation_io->len; rem > 0; rem -= chunk) {
+   chunk = min_t(int, emulation_io->buf_len, rem);
+
+   read_bio = bio_map_buf(emulation_io->buf,
+  emulation_io->buf_len,
+  emulation_io->gfp);
+   if (IS_ERR(read_bio)) {
+   ret = PTR_ERR(read_bio);
+   break;
+   }
+   read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+   bio_set_dev(read_bio, emulation_io->bdev_in);
+   read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+   read_bio->bi_iter.bi_size = chunk;
+   ret = submit_bio_wait(read_bio);
+   kfree(read_bio);
+   if (ret)
+   break;
+
+   write_bio = bio_map_buf(emulation_io->buf,
+   emulation_io->buf_len,
+ 

[dm-devel] [PATCH v16 09/12] dm: Add support for copy offload

2023-09-20 Thread Nitesh Shetty
Before enabling copy for a dm target, check if the underlying devices and
the dm target support copy. Avoid a split happening inside the dm target.
Fail early if the request needs a split; currently splitting a copy
request is not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 37 +++
 drivers/md/dm.c   |  7 +++
 include/linux/device-mapper.h |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 37b48f63ae6a..8803c351624c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1878,6 +1878,38 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+  sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !q->limits.max_copy_sectors;
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+   ti->type->iterate_devices(ti, device_not_copy_capable, 
NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1960,6 +1992,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_hw_sectors = 0;
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 64a1f306c96c..eca336487d44 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1714,6 +1714,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+ ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 69d0435c7ebb..98db52d1c773 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -396,6 +396,9 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /* copy offload is supported */
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v16 05/12] fs/read_write: Enable copy_file_range for block device.

2023-09-20 Thread Nitesh Shetty
From: Anuj Gupta 

This is a prep patch. Allow copy_file_range to work for block devices.
Relaxing generic_copy_file_checks allows us to reuse the existing infra,
instead of adding a new user interface for block copy offload.
Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
and inode_out. Allow block device in generic_file_rw_checks.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 fs/read_write.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 4771701c896b..f0f52bf48f57 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1405,8 +1405,8 @@ static int generic_copy_file_checks(struct file *file_in, 
loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t *req_count, unsigned int flags)
 {
-   struct inode *inode_in = file_inode(file_in);
-   struct inode *inode_out = file_inode(file_out);
+   struct inode *inode_in = file_in->f_mapping->host;
+   struct inode *inode_out = file_out->f_mapping->host;
uint64_t count = *req_count;
loff_t size_in;
int ret;
@@ -1708,7 +1708,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+   if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
+   return -EINVAL;
+   if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v16 10/12] dm: Enable copy offload for dm-linear target

2023-09-20 Thread Nitesh Shetty
Set the copy_offload_supported flag to enable offload.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v16 06/12] fs, block: copy_file_range for def_blk_ops for direct block device

2023-09-20 Thread Nitesh Shetty
For a block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, or use generic_copy_file_range in case the
device copy offload capability is absent or the device files are not opened
with O_DIRECT.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/fops.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/block/fops.c b/block/fops.c
index acff3d5d22d4..6aa537c0e24f 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -735,6 +735,30 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   ssize_t copied = 0;
+
+   if ((in_bdev == out_bdev) && bdev_max_copy_sectors(in_bdev) &&
+   (file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT)) {
+   copied = blkdev_copy_offload(in_bdev, pos_in, pos_out, len,
+NULL, NULL, GFP_KERNEL);
+   if (copied < 0)
+   copied = 0;
+   } else {
+   copied = generic_copy_file_range(file_in, pos_in + copied,
+file_out, pos_out + copied,
+len - copied, flags);
+   }
+
+   return copied;
+}
+
 #defineBLKDEV_FALLOC_FL_SUPPORTED  
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -828,6 +852,7 @@ const struct file_operations def_blk_fops = {
.splice_read= filemap_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v16 03/12] block: add copy offload support

2023-09-20 Thread Nitesh Shetty
Introduce blkdev_copy_offload to perform copy offload.
Issue REQ_OP_COPY_SRC with the source info while holding a plug.
This flows down to the request layer and waits for the dst bio to arrive.
Issue REQ_OP_COPY_DST with the destination info; this bio reaches the request
layer and merges with the src request.
If, for any reason, a request reaches the driver with only one of the src/dst
bios, we fail the copy offload.

A larger copy is divided into chunks, based on the max_copy_sectors limit.
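
For in-kernel callers, usage looks roughly like the hedged sketch below. The
surrounding function and variable names are illustrative; the signature and
return conventions are taken from the kernel-doc in this patch:

/* Completion callback used only for the asynchronous form. */
static void my_copy_endio(void *private, int status, ssize_t copied)
{
	/* e.g. complete() the caller's context, record status/copied */
}

static ssize_t my_copy_example(struct block_device *bdev, loff_t pos_in,
			       loff_t pos_out, size_t len, void *priv)
{
	ssize_t ret;

	/* Synchronous: endio/private are NULL; returns bytes copied or error. */
	ret = blkdev_copy_offload(bdev, pos_in, pos_out, len,
				  NULL, NULL, GFP_KERNEL);
	if (ret < 0)
		return ret;

	/* Asynchronous: returns -EIOCBQUEUED on submission; my_copy_endio()
	 * is called once the copy completes. */
	return blkdev_copy_offload(bdev, pos_in, pos_out, len,
				   my_copy_endio, priv, GFP_KERNEL);
}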

Reviewed-by: Hannes Reinecke 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 201 +
 include/linux/blkdev.h |   4 +
 2 files changed, 205 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..50d10fa3c4c5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,6 +10,22 @@
 
 #include "blk.h"
 
+/* Keeps track of all outstanding copy IO */
+struct blkdev_copy_io {
+   atomic_t refcount;
+   ssize_t copied;
+   int status;
+   struct task_struct *waiter;
+   void (*endio)(void *private, int status, ssize_t copied);
+   void *private;
+};
+
+/* Keeps track of single outstanding copy offload IO */
+struct blkdev_copy_offload_io {
+   struct blkdev_copy_io *cio;
+   loff_t offset;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -115,6 +131,191 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+static inline ssize_t blkdev_copy_sanity_check(struct block_device *bdev_in,
+  loff_t pos_in,
+  struct block_device *bdev_out,
+  loff_t pos_out, size_t len)
+{
+   unsigned int align = max(bdev_logical_block_size(bdev_out),
+bdev_logical_block_size(bdev_in)) - 1;
+
+   if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+   len >= BLK_COPY_MAX_BYTES)
+   return -EINVAL;
+
+   return 0;
+}
+
+static inline void blkdev_copy_endio(struct blkdev_copy_io *cio)
+{
+   if (cio->endio) {
+   cio->endio(cio->private, cio->status, cio->copied);
+   kfree(cio);
+   } else {
+   struct task_struct *waiter = cio->waiter;
+
+   WRITE_ONCE(cio->waiter, NULL);
+   blk_wake_io_task(waiter);
+   }
+}
+
+/*
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to complete.
+ * Returns the length of bytes copied or error
+ */
+static ssize_t blkdev_copy_wait_io_completion(struct blkdev_copy_io *cio)
+{
+   ssize_t ret;
+
+   for (;;) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   if (!READ_ONCE(cio->waiter))
+   break;
+   blk_io_schedule();
+   }
+   __set_current_state(TASK_RUNNING);
+   ret = cio->copied;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_dst_endio(struct bio *bio)
+{
+   struct blkdev_copy_offload_io *offload_io = bio->bi_private;
+   struct blkdev_copy_io *cio = offload_io->cio;
+
+   if (bio->bi_status) {
+   cio->copied = min_t(ssize_t, offload_io->offset, cio->copied);
+   if (!cio->status)
+   cio->status = blk_status_to_errno(bio->bi_status);
+   }
+   bio_put(bio);
+
+   if (atomic_dec_and_test(&cio->refcount))
+   blkdev_copy_endio(cio);
+}
+
+/*
+ * @bdev:  block device
+ * @pos_in:source offset
+ * @pos_out:   destination offset
+ * @len:   length in bytes to be copied
+ * @endio: endio function to be called on completion of copy operation,
+ * for synchronous operation this should be NULL
+ * @private:   endio function will be called with this private data,
+ * for synchronous operation this should be NULL
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ * Copy source offset to destination offset within block device, using
+ * device's native copy offload feature.
+ * We perform copy operation using 2 bio's.
+ * 1. We take a plug and send a REQ_OP_COPY_SRC bio along with source
+ * sector and length. Once this bio reaches request layer, we form a
+ * request and wait for dst bio to arrive.
+ * 2. We issue REQ_OP_COPY_DST bio along with destination sector, length.
+ * Once this bio reaches request layer and find a request with previously
+ * sent source info we merge the des

[dm-devel] [PATCH v16 11/12] null: Enable trace capability for null block

2023-09-20 Thread Nitesh Shetty
This is a prep patch to enable the copy trace capability.
At present only the zoned null_blk code uses tracing, so we decouple the
trace and zoned dependency to make it usable in the rest of the null_blk
driver as well.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/block/null_blk/Makefile | 2 --
 drivers/block/null_blk/main.c   | 3 +++
 drivers/block/null_blk/trace.h  | 2 ++
 drivers/block/null_blk/zoned.c  | 1 -
 4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/null_blk/Makefile b/drivers/block/null_blk/Makefile
index 84c36e512ab8..672adcf0ad24 100644
--- a/drivers/block/null_blk/Makefile
+++ b/drivers/block/null_blk/Makefile
@@ -5,7 +5,5 @@ ccflags-y   += -I$(src)
 
 obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 null_blk-objs  := main.o
-ifeq ($(CONFIG_BLK_DEV_ZONED), y)
 null_blk-$(CONFIG_TRACING) += trace.o
-endif
 null_blk-$(CONFIG_BLK_DEV_ZONED) += zoned.o
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 968090935eb2..c56bef0edc5e 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -11,6 +11,9 @@
 #include 
 #include "null_blk.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt)"null_blk: " fmt
 
diff --git a/drivers/block/null_blk/trace.h b/drivers/block/null_blk/trace.h
index 6b2b370e786f..91446c34eac2 100644
--- a/drivers/block/null_blk/trace.h
+++ b/drivers/block/null_blk/trace.h
@@ -30,6 +30,7 @@ static inline void __assign_disk_name(char *name, struct 
gendisk *disk)
 }
 #endif
 
+#ifdef CONFIG_BLK_DEV_ZONED
 TRACE_EVENT(nullb_zone_op,
TP_PROTO(struct nullb_cmd *cmd, unsigned int zone_no,
 unsigned int zone_cond),
@@ -67,6 +68,7 @@ TRACE_EVENT(nullb_report_zones,
TP_printk("%s nr_zones=%u",
  __print_disk_name(__entry->disk), __entry->nr_zones)
 );
+#endif /* CONFIG_BLK_DEV_ZONED */
 
 #endif /* _TRACE_NULLB_H */
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 55c5b48bc276..9694461a31a4 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -3,7 +3,6 @@
 #include 
 #include "null_blk.h"
 
-#define CREATE_TRACE_POINTS
 #include "trace.h"
 
 #undef pr_fmt
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v16 01/12] block: Introduce queue limits and sysfs for copy-offload support

2023-09-20 Thread Nitesh Shetty
Add device limits as sysfs entries:
- copy_max_bytes (RW)
- copy_max_hw_bytes (RO)

The above limits help to split the copy payload in the block layer.
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_hw_bytes: reflects the device-supported maximum limit.
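
As a hedged illustration of how a driver would advertise its native limit
(the driver function name and the 4 MiB value are hypothetical;
blk_queue_max_copy_hw_sectors() is the helper added below):

/*
 * Hypothetical probe-path snippet: advertise a 4 MiB native copy limit.
 * The helper caps the value at BLK_COPY_MAX_BYTES and initialises both
 * max_copy_hw_sectors and the user-tunable max_copy_sectors.
 */
static void mydrv_set_copy_limit(struct request_queue *q)
{
	blk_queue_max_copy_hw_sectors(q, (4 * 1024 * 1024) >> SECTOR_SHIFT);
}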

Reviewed-by: Hannes Reinecke 
Reviewed-by: Luis Chamberlain 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 23 ++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 36 
 include/linux/blkdev.h   | 13 ++
 4 files changed, 96 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..96ba701e57da 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,29 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block//queue/copy_max_bytes
+Date:  August 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes that the block layer
+   will allow for a copy request. This is always smaller or
+   equal to the maximum size allowed by the hardware, indicated by
+   'copy_max_hw_bytes'. An attempt to set a value higher than
+   'copy_max_hw_bytes' will truncate this to 'copy_max_hw_bytes'.
+   Writing '0' to this file will disable offloading copies for this
+   device, instead copy is done via emulation.
+
+
+What:  /sys/block//queue/copy_max_hw_bytes
+Date:  August 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes that the hardware
+   will allow for single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..4441711ac364 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_hw_sectors = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_hw_sectors = UINT_MAX;
+   lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/*
+ * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
+ * @q: the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ */
+void blk_queue_max_copy_hw_sectors(struct request_queue *q,
+  unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_hw_sectors = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_hw_sectors = min(t->max_copy_hw_sectors,
+b->max_copy_hw_sectors);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 63e481262336..4840e21adefa 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -199,6 +199,37 @@ static ssize_t queue_discard_zeroes_data_show(struct 
request_queue *q, char *pag
return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_hw_max_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%llu\n", (unsigned long long)
+  q->limits.max_copy_hw_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue

[dm-devel] [PATCH v16 07/12] nvme: add copy offload support

2023-09-20 Thread Nitesh Shetty
The current design only supports a single source range.
We receive a request with REQ_OP_COPY_SRC.
Parse this request, which consists of the src (1st) and dst (2nd) bios.
Form a copy command (TP 4065).

Trace event support is added for nvme_cmd_copy.
The device copy limits are set as queue limits.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |  1 +
 drivers/nvme/host/core.c  | 79 +++
 drivers/nvme/host/trace.c | 19 +
 include/linux/blkdev.h|  1 +
 include/linux/nvme.h  | 43 +--
 5 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 20f46c230885..2f504a2b1fe8 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21783aa2ee8e..4522c702610b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -763,6 +763,63 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_offload(struct nvme_ns *ns,
+  struct request *req,
+  struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio;
+   u64 dst_lba = 0, src_lba, n_lba;
+   u16 nr_range = 1, control = 0, seg = 1;
+
+   if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
+   return BLK_STS_IOERR;
+
+   /*
+* First bio contains information about source and last bio contains
+* information about destination.
+*/
+   __rq_for_each_bio(bio, req) {
+   if (seg == blk_rq_nr_phys_segments(req)) {
+   dst_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   if (n_lba != bio->bi_iter.bi_size >> ns->lba_shift)
+   return BLK_STS_IOERR;
+   } else {
+   src_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   n_lba = bio->bi_iter.bi_size >> ns->lba_shift;
+   }
+   seg++;
+   }
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.control = cpu_to_le16(control);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+   cmnd->copy.nr_range = 0;
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+ GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -1005,6 +1062,11 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
case REQ_OP_ZONE_APPEND:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
break;
+   case REQ_OP_COPY_SRC:
+   ret = nvme_setup_copy_offload(ns, req, cmd);
+   break;
+   case REQ_OP_COPY_DST:
+   return BLK_STS_IOERR;
default:
WARN_ON_ONCE(1);
return BLK_STS_IOERR;
@@ -1745,6 +1807,21 @@ static void nvme_config_discard(struct gendisk *disk, 
struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+   struct nvme_id_ns *id)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   struct request_queue *q = disk->queue;
+
+   if (!(ctr

Re: [dm-devel] [PATCH v15 04/12] block: add emulation for copy

2023-09-11 Thread Nitesh Shetty

On 11/09/23 09:39AM, Hannes Reinecke wrote:

On 9/11/23 09:09, Nitesh Shetty wrote:

On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:

On 9/6/23 18:38, Nitesh Shetty wrote:

For the devices which does not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where file descriptor is
not available and hence they can't use copy_file_range.
Copy-emulation is implemented by reading from source into memory and
writing to the corresponding destination.
Also emulation can be used, if copy offload fails or partially completes.
At present in kernel user of emulation is NVMe fabrics.


Leave out the last sentence; I really would like to see it enabled for SCSI,
too (we do have copy offload commands for SCSI ...).


Sure, will do that


And it raises all the questions which have bogged us down right from the
start: where is the point in calling copy offload if copy offload is not
implemented or slower than copying it by hand?
And how can the caller differentiate whether copy offload bring a benefit to
him?

IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
available?


Present approach treats copy as a background operation and the idea is to
maximize the chances of achieving copy by falling back to emulation.
Having said that, it should be possible to return -EOPNOTSUPP,
in case of offload IO failure or device not supporting offload.
We will update this in next version.

That is also what I meant with my comments to patch 09/12: I don't see 
it as a benefit to _always_ fall back to a generic copy-offload 
emulation. After all, that hardly brings any benefit.


Agreed, we will correct this by returning an error to the user in case copy
offload fails, instead of falling back to block-layer emulation.

We do need block-layer emulation for fabrics, where we call emulation
if the target doesn't support offload. In fabrics scenarios, sending an
offload command from the host and achieving the copy using block-layer
emulation on the target is better than sending read+write from the host.

Where I do see a benefit is to tie in the generic copy-offload 
_infrastructure_ to existing mechanisms (like dm-kcopyd).
But if there is no copy-offload infrastructure available then we 
really should return -EOPNOTSUPP as it really is not supported.



Agreed, we will add this in the next phase, once the present series gets merged.


In the end, copy offload is not a command which 'always works'.
It's a command which _might_ deliver benefits (ie better performance) 
if dedicated implementations are available and certain parameters are 
met. If not then copy offload is not the best choice, and applications 
will need to be made aware of that.


Agreed. We will leave the choice to the user, to use either block-layer offload
or emulation.


Thank you,
Nitesh Shetty


Re: [dm-devel] [PATCH v15 04/12] block: add emulation for copy

2023-09-11 Thread Nitesh Shetty
On Fri, Sep 08, 2023 at 08:06:38AM +0200, Hannes Reinecke wrote:
> On 9/6/23 18:38, Nitesh Shetty wrote:
> > For the devices which does not support copy, copy emulation is added.
> > It is required for in-kernel users like fabrics, where file descriptor is
> > not available and hence they can't use copy_file_range.
> > Copy-emulation is implemented by reading from source into memory and
> > writing to the corresponding destination.
> > Also emulation can be used, if copy offload fails or partially completes.
> > At present in kernel user of emulation is NVMe fabrics.
> > 
> Leave out the last sentence; I really would like to see it enabled for SCSI,
> too (we do have copy offload commands for SCSI ...).
> 
Sure, will do that

> And it raises all the questions which have bogged us down right from the
> start: where is the point in calling copy offload if copy offload is not
> implemented or slower than copying it by hand?
> And how can the caller differentiate whether copy offload bring a benefit to
> him?
> 
> IOW: wouldn't it be better to return -EOPNOTSUPP if copy offload is not
> available?

The present approach treats copy as a background operation, and the idea is to
maximize the chances of achieving the copy by falling back to emulation.
Having said that, it should be possible to return -EOPNOTSUPP
in case of an offload IO failure or a device not supporting offload.
We will update this in the next version.

Thank you,
Nitesh Shetty


Re: [dm-devel] [PATCH v15 09/12] dm: Add support for copy offload

2023-09-11 Thread Nitesh Shetty
On Fri, Sep 08, 2023 at 08:13:37AM +0200, Hannes Reinecke wrote:
> On 9/6/23 18:38, Nitesh Shetty wrote:
> > Before enabling copy for dm target, check if underlying devices and
> > dm target support copy. Avoid split happening inside dm target.
> > Fail early if the request needs split, currently splitting copy
> > request is not supported.
> > 
> And here is where I would have expected the emulation to take place;
> didn't you have it in one of the earlier iterations?

No, but it was the other way round.
In dm-kcopyd we used device offload, if that was possible, before using default
dm-mapper copy. It was dropped in the current series,
to streamline the patches and make the series easier to review.

> After all, device-mapper already has the infrastructure for copying
> data between devices, so adding a copy-offload emulation for device-mapper
> should be trivial.
I did not understand this; can you please elaborate?

Thank you,
Nitesh Shetty


Re: [dm-devel] [PATCH v15 03/12] block: add copy offload support

2023-09-07 Thread Nitesh Shetty

On 07/09/23 07:49AM, Hannes Reinecke wrote:

On 9/6/23 18:38, Nitesh Shetty wrote:

Hmm. That looks a bit odd. Why do you have to use wait_for_completion?


wait_for_completion waits for all the copy IOs to complete
when the caller does not pass an endio handler.
Copy IO submissions are still async, as in previous revisions.

Can't you submit the 'src' bio, and then submit the 'dst' bio from the 
endio handler of the 'src' bio?

We can't do this with the current bio-merging approach.
The 'src' bio waits for the 'dst' bio to arrive in the request layer.
Note that both bios must be present in the request reaching the driver
to form the copy command.

Thank You,
Nitesh Shetty


Re: [dm-devel] [PATCH v15 02/12] Add infrastructure for copy offload in block and request layer.

2023-09-07 Thread Nitesh Shetty

On 07/09/23 07:39AM, Hannes Reinecke wrote:

On 9/6/23 18:38, Nitesh Shetty wrote:

We add two new opcode REQ_OP_COPY_SRC, REQ_OP_COPY_DST.
Since copy is a composite operation involving src and dst sectors/lba,
each needs to be represented by a separate bio to make it compatible
with device mapper.
We expect caller to take a plug and send bio with source information,
followed by bio with destination information.
Once the src bio arrives we form a request and wait for destination
bio. Upon arrival of destination we merge these two bio's and send
corresponding request down to device driver.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-core.c  |  7 +++
 block/blk-merge.c | 41 +++
 block/blk.h   | 16 +++
 block/elevator.h  |  1 +
 include/linux/bio.h   |  6 +-
 include/linux/blk_types.h | 10 ++
 6 files changed, 76 insertions(+), 5 deletions(-)


Having two separate bios is okay, and what one would expect.
What is slightly strange is the merging functionality;
That could do with some more explanation why this approach was taken.


Combining the two bios is necessary to form a single copy command.
That's what we do by putting the two bios in a single request and sending
this down to the driver.
It helps to avoid putting a payload (token) in the request.
This change came from feedback, as sending a payload that is not data to the
device is considered a bad idea [1].
The current approach is similar to bio merging for discard.

And also some checks in the merging code to avoid merging non-copy 
offload  bios.

blk_copy_offload_mergable takes care of this, as it checks for REQ_OP_COPY_SRC
and REQ_OP_COPY_DST.

[1] 
https://lore.kernel.org/lkml/20230605121732.28468-1-nj.she...@samsung.com/T/#mfa7104c5f5f8579cd20f668a9d5e83b4ac8bc58a

Thank You,
Nitesh Shetty



[dm-devel] [PATCH v15 10/12] dm: Enable copy offload for dm-linear target

2023-09-06 Thread Nitesh Shetty
Set the copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2




[dm-devel] [PATCH v15 11/12] null: Enable trace capability for null block

2023-09-06 Thread Nitesh Shetty
This is a prep patch to enable the copy trace capability.
At present only the zoned null_blk code uses tracing, so we decouple the
trace and zoned dependency to make it usable in the rest of the null_blk
driver as well.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/block/null_blk/Makefile | 2 --
 drivers/block/null_blk/main.c   | 3 +++
 drivers/block/null_blk/trace.h  | 2 ++
 drivers/block/null_blk/zoned.c  | 1 -
 4 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/null_blk/Makefile b/drivers/block/null_blk/Makefile
index 84c36e512ab8..672adcf0ad24 100644
--- a/drivers/block/null_blk/Makefile
+++ b/drivers/block/null_blk/Makefile
@@ -5,7 +5,5 @@ ccflags-y   += -I$(src)
 
 obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 null_blk-objs  := main.o
-ifeq ($(CONFIG_BLK_DEV_ZONED), y)
 null_blk-$(CONFIG_TRACING) += trace.o
-endif
 null_blk-$(CONFIG_BLK_DEV_ZONED) += zoned.o
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 864013019d6b..b48901b2b573 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -11,6 +11,9 @@
 #include 
 #include "null_blk.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt)"null_blk: " fmt
 
diff --git a/drivers/block/null_blk/trace.h b/drivers/block/null_blk/trace.h
index 6b2b370e786f..91446c34eac2 100644
--- a/drivers/block/null_blk/trace.h
+++ b/drivers/block/null_blk/trace.h
@@ -30,6 +30,7 @@ static inline void __assign_disk_name(char *name, struct 
gendisk *disk)
 }
 #endif
 
+#ifdef CONFIG_BLK_DEV_ZONED
 TRACE_EVENT(nullb_zone_op,
TP_PROTO(struct nullb_cmd *cmd, unsigned int zone_no,
 unsigned int zone_cond),
@@ -67,6 +68,7 @@ TRACE_EVENT(nullb_report_zones,
TP_printk("%s nr_zones=%u",
  __print_disk_name(__entry->disk), __entry->nr_zones)
 );
+#endif /* CONFIG_BLK_DEV_ZONED */
 
 #endif /* _TRACE_NULLB_H */
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 55c5b48bc276..9694461a31a4 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -3,7 +3,6 @@
 #include 
 #include "null_blk.h"
 
-#define CREATE_TRACE_POINTS
 #include "trace.h"
 
 #undef pr_fmt
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v15 12/12] null_blk: add support for copy offload

2023-09-06 Thread Nitesh Shetty
The implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver supported maximum copy limit.
Only the request-based queue mode supports copy offload.
Added tracefs support for copy IO tracing.

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 Documentation/block/null_blk.rst  |  5 ++
 drivers/block/null_blk/main.c | 97 ++-
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/trace.h| 23 
 4 files changed, 123 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES(=128MB at present) is defined in fs.h
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index b48901b2b573..26124f2baadc 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -160,6 +160,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = BLK_COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -412,6 +416,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -553,6 +558,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
	&nullb_device_attr_queue_mode,
	&nullb_device_attr_blocksize,
	&nullb_device_attr_max_sectors,
+	&nullb_device_attr_copy_max_bytes,
	&nullb_device_attr_irqmode,
	&nullb_device_attr_hw_queue_depth,
	&nullb_device_attr_index,
@@ -659,7 +665,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -725,6 +732,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1274,6 +1282,81 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline int nullb_setup_copy(struct nullb *nullb, struct request *req,
+  bool is_fua)
+{
+   sector_t sector_in = 0, sector_out = 0;
+   loff_t offset_in, offset_out;
+   void *in, *out;
+   ssize_t chunk, rem = 0;
+   struct bio *bio;
+   struct nullb_page *t_page_in, *t_page_out;
+   u16 seg = 1;
+   int status = -EIO;
+
+   if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
+   return status;
+
+   /*
+* First bio contains information about source and last bio contains
+* information about destination.
+*/
+   __rq_for_each_bio(bio, req) {
+   if (seg == blk_rq_nr_phys_segments(req)) {
+   sector_out = bio->bi_iter.bi_sector;
+   if (rem != bio->bi_iter.bi_size)
+   return status;
+   } else {
+   sector_in = bio->bi_iter.bi_sector;
+   rem = bio->bi_iter.bi_size;
+   }
+   seg++;

[dm-devel] [PATCH v15 09/12] dm: Add support for copy offload

2023-09-06 Thread Nitesh Shetty
Before enabling copy for a dm target, check whether the underlying devices
and the dm target support copy. Avoid a split happening inside the dm target.
Fail early if the request needs a split; currently splitting a copy
request is not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 37 +++
 drivers/md/dm.c   |  7 +++
 include/linux/device-mapper.h |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 7d208b2b1a19..a192c19b68e4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1862,6 +1862,38 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+  sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !q->limits.max_copy_sectors;
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+   ti->type->iterate_devices(ti, device_not_copy_capable, 
NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1944,6 +1976,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_hw_sectors = 0;
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f0f118ab20fa..f9d6215e6d4d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1732,6 +1732,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+ ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 69d0435c7ebb..98db52d1c773 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -396,6 +396,9 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /* copy offload is supported */
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v15 07/12] nvme: add copy offload support

2023-09-06 Thread Nitesh Shetty
The current design only supports a single source range.
We receive a request with REQ_OP_COPY_SRC,
parse this request, which consists of the src (1st) and dst (2nd) bios,
and form a copy command (TP 4065).

Adds trace event support for nvme_copy_cmd.
Sets the device copy limits to queue limits.

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |  1 +
 drivers/nvme/host/core.c  | 79 +++
 drivers/nvme/host/trace.c | 19 +
 include/linux/blkdev.h|  1 +
 include/linux/nvme.h  | 43 +--
 5 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 20f46c230885..2f504a2b1fe8 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3a01b79148c..ca47af74afcc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -763,6 +763,63 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_offload(struct nvme_ns *ns,
+  struct request *req,
+  struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio;
+   u64 dst_lba = 0, src_lba, n_lba;
+   u16 nr_range = 1, control = 0, seg = 1;
+
+   if (blk_rq_nr_phys_segments(req) != BLK_COPY_MAX_SEGMENTS)
+   return BLK_STS_IOERR;
+
+   /*
+* First bio contains information about source and last bio contains
+* information about destination.
+*/
+   __rq_for_each_bio(bio, req) {
+   if (seg == blk_rq_nr_phys_segments(req)) {
+   dst_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   if (n_lba != bio->bi_iter.bi_size >> ns->lba_shift)
+   return BLK_STS_IOERR;
+   } else {
+   src_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   n_lba = bio->bi_iter.bi_size >> ns->lba_shift;
+   }
+   seg++;
+   }
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.control = cpu_to_le16(control);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+   cmnd->copy.nr_range = 0;
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+ GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -1005,6 +1062,11 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
case REQ_OP_ZONE_APPEND:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
break;
+   case REQ_OP_COPY_SRC:
+   ret = nvme_setup_copy_offload(ns, req, cmd);
+   break;
+   case REQ_OP_COPY_DST:
+   return BLK_STS_IOERR;
default:
WARN_ON_ONCE(1);
return BLK_STS_IOERR;
@@ -1745,6 +1807,21 @@ static void nvme_config_discard(struct gendisk *disk, 
struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+   struct nvme_id_ns *id)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   struct request_queue *q = disk->queue;
+
+   if (!(ctrl->oncs & NVME

[dm-devel] [PATCH v15 08/12] nvmet: add copy command support for bdev and file ns

2023-09-06 Thread Nitesh Shetty
Add support for handling the nvme_cmd_copy command on the target.

For bdev-ns, if the backing device supports copy offload, we call device copy
offload (blkdev_copy_offload).
In case of partial completion from the above, or absence of device copy offload
capability, we fall back to copy emulation (blkdev_copy_emulation).

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS.

The loop target has copy support, which can be used to test copy offload.
Adds trace event support for nvme_cmd_copy.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 ++-
 drivers/nvme/target/io-cmd-bdev.c | 97 +++
 drivers/nvme/target/io-cmd-file.c | 50 
 drivers/nvme/target/nvmet.h   |  4 ++
 drivers/nvme/target/trace.c   | 19 ++
 5 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..4e1a6ca09937 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 468833675cc9..47df9222 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -449,6 +461,87 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+static void nvmet_bdev_copy_emulation_endio(void *private, int status,
+   ssize_t copied)
+{
+   struct nvmet_req *rq = (struct nvmet_req *)private;
+   u16 nvme_status;
+
+   if (rq->copied + copied == rq->copy_len)
+   rq->cqe->result.u32 = cpu_to_le32(1);
+   else
+   rq->cqe->result.u32 = cpu_to_le32(0);
+
+   nvme_status = errno_to_nvme_status(rq, status);
+   nvmet_req_complete(rq, nvme_status);
+}
+
+static void nvmet_bdev_copy_offload_endio(void *private, int status,
+ ssize_t copied)
+{
+   struct nvmet_req *rq = (struct nvmet_req *)private;
+   u16 nvme_status;
+   ssize_t ret;
+
+   if (copied == rq->copy_len) {
+   rq->cqe->result.u32 = cpu_to_le32(1);
+   nvme_status = errno_to_nvme_status(rq, status);
+   } else {
+   rq->copied = copied;
+   ret = blkdev_copy_emulation(rq->ns->bdev, rq->copy_dst + copied,
+   rq->ns->bdev, rq->copy_src + copied,
+   rq->copy_len - copied,
+   nvmet_bdev_copy_emulation_endio,
+   (void *)rq, GFP_KERNEL);
+   if (ret == -EIOCBQUEUED)
+   return;
+   rq->cqe->result.u32 = cpu_to_le32(0);
+

[dm-devel] [PATCH v15 06/12] fs, block: copy_file_range for def_blk_ops for direct block device

2023-09-06 Thread Nitesh Shetty
For a block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fall back to generic_copy_file_range in case
the device copy offload capability is absent or the device files are not
opened with O_DIRECT.

Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/fops.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/block/fops.c b/block/fops.c
index a24a624d3bf7..2d96459f3277 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -739,6 +739,30 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   ssize_t copied = 0;
+
+   if ((in_bdev == out_bdev) && bdev_max_copy_sectors(in_bdev) &&
+   (file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT)) {
+   copied = blkdev_copy_offload(in_bdev, pos_in, pos_out, len,
+NULL, NULL, GFP_KERNEL);
+   if (copied < 0)
+   copied = 0;
+   }
+   if (copied != len)
+   copied = generic_copy_file_range(file_in, pos_in + copied,
+file_out, pos_out + copied,
+len - copied, flags);
+
+   return copied;
+}
+
 #defineBLKDEV_FALLOC_FL_SUPPORTED  
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -832,6 +856,7 @@ const struct file_operations def_blk_fops = {
.splice_read= filemap_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v15 05/12] fs/read_write: Enable copy_file_range for block device.

2023-09-06 Thread Nitesh Shetty
From: Anuj Gupta 

This is a prep patch. Allow copy_file_range to work for block devices.
Relaxing generic_copy_file_checks allows us to reuse the existing infra,
instead of adding a new user interface for block copy offload.
Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
and inode_out. Allow block devices in generic_file_rw_checks.

Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 fs/read_write.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 4771701c896b..f0f52bf48f57 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1405,8 +1405,8 @@ static int generic_copy_file_checks(struct file *file_in, 
loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t *req_count, unsigned int flags)
 {
-   struct inode *inode_in = file_inode(file_in);
-   struct inode *inode_out = file_inode(file_out);
+   struct inode *inode_in = file_in->f_mapping->host;
+   struct inode *inode_out = file_out->f_mapping->host;
uint64_t count = *req_count;
loff_t size_in;
int ret;
@@ -1708,7 +1708,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+   if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
+   return -EINVAL;
+   if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v15 04/12] block: add emulation for copy

2023-09-06 Thread Nitesh Shetty
For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor is
not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination.
Emulation can also be used if copy offload fails or partially completes.
At present the in-kernel user of emulation is NVMe fabrics.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c| 223 +
 include/linux/blkdev.h |   4 +
 2 files changed, 227 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index d22e1e7417ca..b18871ea7281 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,20 @@ struct blkdev_copy_offload_io {
loff_t offset;
 };
 
+/* Keeps track of single outstanding copy emulation IO */
+struct blkdev_copy_emulation_io {
+   struct blkdev_copy_io *cio;
+   struct work_struct emulation_work;
+   void *buf;
+   ssize_t buf_len;
+   loff_t pos_in;
+   loff_t pos_out;
+   ssize_t len;
+   struct block_device *bdev_in;
+   struct block_device *bdev_out;
+   gfp_t gfp;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -317,6 +331,215 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, 
loff_t pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
+  gfp_t gfp)
+{
+   int min_size = PAGE_SIZE;
+   char *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static struct bio *bio_map_buf(void *data, unsigned int len, gfp_t gfp)
+{
+   unsigned long kaddr = (unsigned long)data;
+   unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   unsigned long start = kaddr >> PAGE_SHIFT;
+   const int nr_pages = end - start;
+   bool is_vmalloc = is_vmalloc_addr(data);
+   struct page *page;
+   int offset, i;
+   struct bio *bio;
+
+   bio = bio_kmalloc(nr_pages, gfp);
+   if (!bio)
+   return ERR_PTR(-ENOMEM);
+   bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, 0);
+
+   if (is_vmalloc) {
+   flush_kernel_vmap_range(data, len);
+   bio->bi_private = data;
+   }
+
+   offset = offset_in_page(kaddr);
+   for (i = 0; i < nr_pages; i++) {
+   unsigned int bytes = PAGE_SIZE - offset;
+
+   if (len <= 0)
+   break;
+
+   if (bytes > len)
+   bytes = len;
+
+   if (!is_vmalloc)
+   page = virt_to_page(data);
+   else
+   page = vmalloc_to_page(data);
+   if (bio_add_page(bio, page, bytes, offset) < bytes) {
+   /* we don't support partial mappings */
+   bio_uninit(bio);
+   kfree(bio);
+   return ERR_PTR(-EINVAL);
+   }
+
+   data += bytes;
+   len -= bytes;
+   offset = 0;
+   }
+
+   return bio;
+}
+
+static void blkdev_copy_emulation_work(struct work_struct *work)
+{
+   struct blkdev_copy_emulation_io *emulation_io = container_of(work,
+   struct blkdev_copy_emulation_io, emulation_work);
+   struct blkdev_copy_io *cio = emulation_io->cio;
+   struct bio *read_bio, *write_bio;
+   loff_t pos_in = emulation_io->pos_in, pos_out = emulation_io->pos_out;
+   ssize_t rem, chunk;
+   int ret = 0;
+
+   for (rem = emulation_io->len; rem > 0; rem -= chunk) {
+   chunk = min_t(int, emulation_io->buf_len, rem);
+
+   read_bio = bio_map_buf(emulation_io->buf,
+  emulation_io->buf_len,
+  emulation_io->gfp);
+   if (IS_ERR(read_bio)) {
+   ret = PTR_ERR(read_bio);
+   break;
+   }
+   read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+   bio_set_dev(read_bio, emulation_io->bdev_in);
+   read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+   read_bio->bi_iter.bi_size = chunk;
+   ret = submit_bio_wait(read_bio);
+   kfree(read_bio);
+   if (ret)
+   break;
+
+   write_bio = bio_map_

[dm-devel] [PATCH v15 03/12] block: add copy offload support

2023-09-06 Thread Nitesh Shetty
Introduce blkdev_copy_offload to perform copy offload.
Issue REQ_OP_COPY_SRC with the source info while holding a plug.
This flows down to the request layer and waits for the dst bio to arrive.
Issue REQ_OP_COPY_DST with the destination info; this bio reaches the request
layer and merges with the src request.
If, for any reason, a request reaches the driver with only one of the src/dst
bios, we fail the copy offload.

A larger copy will be divided, based on the max_copy_sectors limit.

Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 202 +
 include/linux/blkdev.h |   4 +
 2 files changed, 206 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..d22e1e7417ca 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,6 +10,22 @@
 
 #include "blk.h"
 
+/* Keeps track of all outstanding copy IO */
+struct blkdev_copy_io {
+   atomic_t refcount;
+   ssize_t copied;
+   int status;
+   struct task_struct *waiter;
+   void (*endio)(void *private, int status, ssize_t copied);
+   void *private;
+};
+
+/* Keeps track of single outstanding copy offload IO */
+struct blkdev_copy_offload_io {
+   struct blkdev_copy_io *cio;
+   loff_t offset;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -115,6 +131,192 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+static inline ssize_t blkdev_copy_sanity_check(struct block_device *bdev_in,
+  loff_t pos_in,
+  struct block_device *bdev_out,
+  loff_t pos_out, size_t len)
+{
+   unsigned int align = max(bdev_logical_block_size(bdev_out),
+bdev_logical_block_size(bdev_in)) - 1;
+
+   if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+   len >= BLK_COPY_MAX_BYTES)
+   return -EINVAL;
+
+   return 0;
+}
+
+static inline void blkdev_copy_endio(struct blkdev_copy_io *cio)
+{
+   if (cio->endio) {
+   cio->endio(cio->private, cio->status, cio->copied);
+   kfree(cio);
+   } else {
+   struct task_struct *waiter = cio->waiter;
+
+   WRITE_ONCE(cio->waiter, NULL);
+   blk_wake_io_task(waiter);
+   }
+}
+
+/*
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to complete.
+ * Returns the length of bytes copied or error
+ */
+static ssize_t blkdev_copy_wait_io_completion(struct blkdev_copy_io *cio)
+{
+   ssize_t ret;
+
+   for (;;) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   if (!READ_ONCE(cio->waiter))
+   break;
+   blk_io_schedule();
+   }
+   __set_current_state(TASK_RUNNING);
+   ret = cio->copied;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_dst_endio(struct bio *bio)
+{
+   struct blkdev_copy_offload_io *offload_io = bio->bi_private;
+   struct blkdev_copy_io *cio = offload_io->cio;
+
+   if (bio->bi_status) {
+   cio->copied = min_t(ssize_t, offload_io->offset, cio->copied);
+   if (!cio->status)
+   cio->status = blk_status_to_errno(bio->bi_status);
+   }
+   bio_put(bio);
+
+	if (atomic_dec_and_test(&cio->refcount))
+   blkdev_copy_endio(cio);
+}
+
+/*
+ * @bdev:  block device
+ * @pos_in:source offset
+ * @pos_out:   destination offset
+ * @len:   length in bytes to be copied
+ * @endio: endio function to be called on completion of copy operation,
+ * for synchronous operation this should be NULL
+ * @private:   endio function will be called with this private data,
+ * for synchronous operation this should be NULL
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ * Copy source offset to destination offset within block device, using
+ * device's native copy offload feature. This function can fail, and
+ * in that case the caller can fallback to emulation.
+ * We perform copy operation using 2 bio's.
+ * 1. We take a plug and send a REQ_OP_COPY_SRC bio along with source
+ * sector and length. Once this bio reaches request layer, we form a
+ * request and wait for dst bio to arrive.
+ * 2. We issue REQ_OP_COPY_DST bio along with destination sector, length.
+ * Once this bio reaches request layer and find a reques

[dm-devel] [PATCH v15 02/12] Add infrastructure for copy offload in block and request layer.

2023-09-06 Thread Nitesh Shetty
We add two new opcodes, REQ_OP_COPY_SRC and REQ_OP_COPY_DST.
Since copy is a composite operation involving src and dst sectors/LBAs,
each needs to be represented by a separate bio to make it compatible
with device mapper.
We expect the caller to take a plug and send a bio with the source information,
followed by a bio with the destination information.
Once the src bio arrives we form a request and wait for the destination
bio. Upon arrival of the destination we merge these two bios and send the
corresponding request down to the device driver.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-core.c  |  7 +++
 block/blk-merge.c | 41 +++
 block/blk.h   | 16 +++
 block/elevator.h  |  1 +
 include/linux/bio.h   |  6 +-
 include/linux/blk_types.h | 10 ++
 6 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9d51e9894ece..33aadafdb7f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -121,6 +121,8 @@ static const char *const blk_op_name[] = {
REQ_OP_NAME(ZONE_FINISH),
REQ_OP_NAME(ZONE_APPEND),
REQ_OP_NAME(WRITE_ZEROES),
+   REQ_OP_NAME(COPY_SRC),
+   REQ_OP_NAME(COPY_DST),
REQ_OP_NAME(DRV_IN),
REQ_OP_NAME(DRV_OUT),
 };
@@ -792,6 +794,11 @@ void submit_bio_noacct(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   if (!q->limits.max_copy_sectors)
+   goto not_supported;
+   break;
default:
break;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..bcb55ba48107 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,6 +158,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *bio_split_copy(struct bio *bio,
+ const struct queue_limits *lim,
+ unsigned int *nsegs)
+{
+   *nsegs = 1;
+   if (bio_sectors(bio) <= lim->max_copy_sectors)
+   return NULL;
+   /*
+* We don't support splitting for a copy bio. End it with EIO if
+* splitting is required and return an error pointer.
+*/
+   return ERR_PTR(-EIO);
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -366,6 +380,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
case REQ_OP_WRITE_ZEROES:
split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   split = bio_split_copy(bio, lim, nr_segs);
+   if (IS_ERR(split))
+   return NULL;
+   break;
default:
split = bio_split_rw(bio, lim, nr_segs, bs,
get_max_io_size(bio, lim) << SECTOR_SHIFT);
@@ -922,6 +942,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (!rq_mergeable(rq) || !bio_mergeable(bio))
return false;
 
+   if (blk_copy_offload_mergable(rq, bio))
+   return true;
+
if (req_op(rq) != bio_op(bio))
return false;
 
@@ -951,6 +974,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio 
*bio)
 {
if (blk_discard_mergable(rq))
return ELEVATOR_DISCARD_MERGE;
+   else if (blk_copy_offload_mergable(rq, bio))
+   return ELEVATOR_COPY_OFFLOAD_MERGE;
else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
return ELEVATOR_BACK_MERGE;
else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
@@ -1053,6 +1078,20 @@ static enum bio_merge_status 
bio_attempt_discard_merge(struct request_queue *q,
return BIO_MERGE_FAILED;
 }
 
+static enum bio_merge_status bio_attempt_copy_offload_merge(struct request 
*req,
+   struct bio *bio)
+{
+   if (req->__data_len != bio->bi_iter.bi_size)
+   return BIO_MERGE_FAILED;
+
+   req->biotail->bi_next = bio;
+   req->biotail = bio;
+   req->nr_phys_segments++;
+   req->__data_len += bio->bi_iter.bi_size;
+
+   return BIO_MERGE_OK;
+}
+
 static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
   struct request *rq,
   struct bio *bio,
@@ -1073,6 +1112,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct 
request_queue *q,
   

[dm-devel] [PATCH v15 01/12] block: Introduce queue limits and sysfs for copy-offload support

2023-09-06 Thread Nitesh Shetty
Add device limits as sysfs entries:
- copy_max_bytes (RW)
- copy_max_hw_bytes (RO)

The above limits help to split the copy payload in the block layer.
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_hw_bytes: reflects the device-supported maximum limit.
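
As an illustration (not part of this patch), a driver that supports copy
offload would advertise its limit during queue setup roughly as follows;
q and dev_copy_limit_bytes are placeholder names for the driver's request
queue and whatever limit the hardware reports:

	/*
	 * Illustrative only: expose the hardware copy limit to the block
	 * layer. blk_queue_max_copy_hw_sectors() caps the value at
	 * BLK_COPY_MAX_BYTES and initialises both max_copy_hw_sectors and
	 * max_copy_sectors.
	 */
	blk_queue_max_copy_hw_sectors(q, dev_copy_limit_bytes >> SECTOR_SHIFT);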

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 23 ++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 36 
 include/linux/blkdev.h   | 13 ++
 4 files changed, 96 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..96ba701e57da 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,29 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block/<disk>/queue/copy_max_bytes
+Date:  August 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes that the block layer
+   will allow for a copy request. This is always smaller or
+   equal to the maximum size allowed by the hardware, indicated by
+   'copy_max_hw_bytes'. An attempt to set a value higher than
+   'copy_max_hw_bytes' will truncate this to 'copy_max_hw_bytes'.
+   Writing '0' to this file will disable offloading copies for this
+   device, instead copy is done via emulation.
+
+
+What:  /sys/block/<disk>/queue/copy_max_hw_bytes
+Date:  August 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes that the hardware
+   will allow for single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
What:  /sys/block/<disk>/queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..4441711ac364 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_hw_sectors = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_hw_sectors = UINT_MAX;
+   lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/*
+ * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
+ * @q: the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ */
+void blk_queue_max_copy_hw_sectors(struct request_queue *q,
+  unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (BLK_COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = BLK_COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_hw_sectors = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_hw_sectors = min(t->max_copy_hw_sectors,
+b->max_copy_hw_sectors);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 63e481262336..4840e21adefa 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -199,6 +199,37 @@ static ssize_t queue_discard_zeroes_data_show(struct 
request_queue *q, char *pag
return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_hw_max_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%llu\n", (unsigned long long)
+  q->limits.max_copy_hw_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_

[dm-devel] [PATCH v15 00/12] Implement copy offload support

2023-09-06 Thread Nitesh Shetty
The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patch set and
further additional features suggested by community.

This is next iteration of our previous patch set v14[1].
Copy offload is performed using two bio's -
1. Take a plug
2. The first bio containing source info is prepared and sent,
a request is formed.
3. This is followed by preparing and sending the second bio containing the
destination info.
4. This bio is merged with the request containing the source info.
5. The plug is released, and the request containing source and destination
bio's is sent to the driver.

So copy offload works only for request based storage drivers.
For failures, partial completion, or absence of copy offload capability,
we can fall back to copy emulation.
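
A rough caller-side sketch of this sequence (based on the helpers used in
patch 03/12; bdev, pos_in, pos_out, chunk and gfp are placeholders, and error
handling plus the endio private data are omitted):

	struct blk_plug plug;
	struct bio *src_bio, *dst_bio;

	blk_start_plug(&plug);				/* step 1 */

	/* step 2: describe the source; no data pages are attached */
	src_bio = bio_alloc(bdev, 0, REQ_OP_COPY_SRC, gfp);
	src_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
	src_bio->bi_iter.bi_size = chunk;

	/*
	 * step 3: blk_next_bio() submits src_bio (a request is formed and
	 * sits in the plug) and returns the dst bio chained to it
	 */
	dst_bio = blk_next_bio(src_bio, bdev, 0, REQ_OP_COPY_DST, gfp);
	dst_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
	dst_bio->bi_iter.bi_size = chunk;
	dst_bio->bi_end_io = blkdev_copy_offload_dst_endio;

	submit_bio(dst_bio);		/* step 4: dst merges into the src request */
	blk_finish_plug(&plug);		/* step 5: merged request reaches the driver */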

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file back end).

2. Block layer
- Block-generic copy (REQ_OP_COPY_DST/SRC), operation with
  interface accommodating two block-devs
- Merging copy requests in request layer
- Emulation, for in-kernel user when offload is natively 
absent
- dm-linear support (for cases not requiring split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
parameters mssrl,mcl, msrc. For more info [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3]

Emulation can be tested on any device.

fio[4].

Infra and plumbing:
===
We populate copy_file_range callback in def_blk_fops. 
For devices that support copy-offload, use blkdev_copy_offload to
achieve in-device copy.
However, for cases where the device doesn't support offload,
we fall back to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), blkdev_copy_offload is used
if the device is copy-offload capable, or else we fall back to emulation
using blkdev_copy_emulation.
Checks in generic_copy_file_range are modified to support block devices.
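
To illustrate the user-facing path, a minimal userspace test could look like
the following (the device path, offsets and length are placeholders and must
be logical-block aligned; both descriptors must refer to the same block device
and be opened with O_DIRECT for the offload path to be taken):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
	off64_t pos_in = 0, pos_out = 1 << 20;	/* copy 1 MiB from offset 0 to 1 MiB */
	ssize_t ret;

	if (fd < 0)
		return 1;

	/*
	 * Routed to blkdev_copy_file_range(); offloads if the device supports
	 * it, otherwise falls back to generic_copy_file_range().
	 */
	ret = copy_file_range(fd, &pos_in, fd, &pos_out, 1 << 20, 0);
	printf("copied %zd bytes\n", ret);

	close(fd);
	return ret < 0;
}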

Blktests[3]
==
tests/block/035-040: Runs copy offload and emulation on null
  block device.
tests/block/050,055: Runs copy offload and emulation on test
  nvme block device.
tests/nvme/056-067: Create a loop backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to test copy offload
- update man pages for copy_file_range
- expand in-kernel users of copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/lkml/20230811105300.15889-1-nj.she...@samsung.com/T/#t
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v15
[4] https://github.com/OpenMPDK/fio/tree/copyoffload-3.35-v14

Changes since v14:
=
- block: (Bart Van Assche)
1. BLK_ prefix addition to COPY_MAX_BYES and COPY_MAX_SEGMENTS
2. Improved function,patch,cover-letter description
3. Simplified refcount updating.
- null-blk, nvme:
4. static warning fixes (kernel test robot)

Changes since v13:
=
- block:
1. Simplified copy offload and emulation helpers, now
  caller needs to decide between offload/emulation fallback
2. src,dst bio order change (Christoph Hellwig)
3. refcount changes similar to dio (Christoph Hellwig)
4. Single outstanding IO for copy emulation (Christoph Hellwig)
5. use copy_max_sectors to identify copy offload
  capability and other reviews (Damien, Christoph)
6. Return status in endio handler (Christoph Hellwig)
- nvme-fabrics: fallback to emulation in case of partial
  offload completion
- in kernel user addition (Ming lei)
- indentation, documentation, minor fixes, misc changes (Damien,
  Christoph)

Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-15 Thread Nitesh Shetty
We had kept this as part of blk_types.h because we saw some other functions
doing similar things inside this file (op_is_write/flush/discard).
But it should be okay for us to move it to blk-mq.h if that's the right way.

Thank you,
Nitesh Shetty


On Mon, Aug 14, 2023 at 8:28 PM Bart Van Assche  wrote:
>
> On 8/14/23 05:18, Nitesh Shetty wrote:
> > On 23/08/11 02:25PM, Bart Van Assche wrote:
> >> On 8/11/23 03:52, Nitesh Shetty wrote:
> >>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> >>> index 0bad62cca3d0..de0ad7a0d571 100644
> >>> +static inline bool op_is_copy(blk_opf_t op)
> >>> +{
> >>> +return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
> >>> +(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
> >>> +}
> >>> +
> >>
> >> The above function should be moved into include/linux/blk-mq.h below the
> >> definition of req_op() such that it can use req_op() instead of
> >> open-coding it.
> >>
> > We use this later for dm patches(patch 9) as well, and we don't have
> > request at
> > that time.
>
> My understanding is that include/linux/blk_types.h should only contain
> data types and constants and hence that inline functions like
> op_is_copy() should be moved elsewhere.
>
> Bart.
>

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 03/11] block: add copy offload support

2023-08-14 Thread Nitesh Shetty
On Sat, Aug 12, 2023 at 3:10 AM Bart Van Assche  wrote:
>
> On 8/11/23 03:52, Nitesh Shetty wrote:
> > + * Description:
> > + *   Copy source offset to destination offset within block device, using
> > + *   device's native copy offload feature.
>
> Offloading the copy operation is not guaranteed so I think that needs to
> be reflected in the above comment.
>
Acked.
> > + *   We perform copy operation by sending 2 bio's.
> > + *   1. We take a plug and send a REQ_OP_COPY_SRC bio along with source
> > + *   sector and length. Once this bio reaches request layer, we form a
> > + *   request and wait for dst bio to arrive.
>
> What will happen if the queue depth of the request queue at the bottom
> is one?
>
If, for any reason, a request reaches the driver with only one of the src/dst bios,
the copy will fail. This design requires only one request to do a copy,
so it should work fine even with a queue depth of one.

> > + blk_start_plug(&plug);
> > + dst_bio = blk_next_bio(src_bio, bdev, 0, REQ_OP_COPY_DST, 
> > gfp);
>
> blk_next_bio() can return NULL so its return value should be checked.
>
Acked.

Thank you,
Nitesh Shetty

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 04/11] block: add emulation for copy

2023-08-14 Thread Nitesh Shetty
On Sat, Aug 12, 2023 at 4:25 AM Bart Van Assche  wrote:
>
> On 8/11/23 03:52, Nitesh Shetty wrote:
> > + schedule_work(&emulation_io->emulation_work);
>
> schedule_work() uses system_wq. This won't work for all users since
> there are no latency guarantees for system_wq.
>
At present copy is treated as a background operation, so we went ahead
with the current approach.

Thank you,
Nitesh Shetty

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 00/11] Implement copy offload support

2023-08-14 Thread Nitesh Shetty
On Sat, Aug 12, 2023 at 3:42 AM Bart Van Assche  wrote:
>
> On 8/11/23 03:52, Nitesh Shetty wrote:
> > We achieve copy offload by sending 2 bio's with source and destination
> > info and merge them to form a request. This request is sent to driver.
> > So this design works only for request based storage drivers.
>
> [ ... ]
>
> > Overall series supports:
> > 
> >   1. Driver
> >   - NVMe Copy command (single NS, TP 4065), including support
> >   in nvme-target (for block and file back end).
> >
> >   2. Block layer
> >   - Block-generic copy (REQ_OP_COPY_DST/SRC), operation with
> >interface accommodating two block-devs
> >  - Merging copy requests in request layer
> >   - Emulation, for in-kernel user when offload is natively
> >  absent
> >   - dm-linear support (for cases not requiring split)
> >
> >   3. User-interface
> >   - copy_file_range
>
> Is this sufficient? The combination of dm-crypt, dm-linear and the NVMe
> driver is very common. What is the plan for supporting dm-crypt?

The plan is to add offload support for other dm targets as part of a subsequent
series once the current patchset merges; at present dm targets can use emulation
to achieve the same.

> Shouldn't bio splitting be supported for dm-linear?
Handling a split is tricky in this case: if we allow splitting, there is
no easy way to match/merge the different src/dst bios. Once we have multi-range
support, we feel at least the src bios can be split. But in this series, a split
won't work.

Thank you,
Nitesh Shetty

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-14 Thread Nitesh Shetty

On 23/08/11 02:25PM, Bart Van Assche wrote:

On 8/11/23 03:52, Nitesh Shetty wrote:

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0bad62cca3d0..de0ad7a0d571 100644
+static inline bool op_is_copy(blk_opf_t op)
+{
+   return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
+   (op & REQ_OP_MASK) == REQ_OP_COPY_DST);
+}
+


The above function should be moved into include/linux/blk-mq.h below the
definition of req_op() such that it can use req_op() instead of 
open-coding it.



We use this later for the dm patches (patch 9) as well, and we don't have a
request at that time.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-14 Thread Nitesh Shetty

On 23/08/11 02:58PM, Bart Van Assche wrote:

On 8/11/23 03:52, Nitesh Shetty wrote:

We expect caller to take a plug and send bio with source information,
followed by bio with destination information.
Once the src bio arrives we form a request and wait for destination
bio. Upon arrival of destination we merge these two bio's and send
corresponding request down to device driver.


Is the above description up-to-date? In the cover letter there is a 
different description of how copy offloading works.



Acked. This description is up to date.
We need to update the description in the cover letter.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 01/11] block: Introduce queue limits and sysfs for copy-offload support

2023-08-14 Thread Nitesh Shetty

On 23/08/11 02:56PM, Bart Van Assche wrote:

On 8/11/23 03:52, Nitesh Shetty wrote:

+/* maximum copy offload length, this is set to 128MB based on current testing 
*/
+#define COPY_MAX_BYTES (1 << 27)


Since the COPY_MAX_BYTES constant is only used in source file
block/blk-settings.c it should be moved into that file. If you really
want to keep it in include/linux/blkdev.h, a BLK_ prefix should
be added.


We are using this in other files, so we will add the BLK_ prefix.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 03/11] block: add copy offload support

2023-08-14 Thread Nitesh Shetty

On 23/08/11 03:06PM, Bart Van Assche wrote:

On 8/11/23 03:52, Nitesh Shetty wrote:

+   if (rem != chunk)
> > +   atomic_inc(&cio->refcount);


This code will be easier to read if the above if-test is left out
and if the following code is added below the for-loop:

if (atomic_dec_and_test(&cio->refcount))
blkdev_copy_endio(cio);


Acked

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH v14 11/11] null_blk: add support for copy offload

2023-08-11 Thread Nitesh Shetty
The implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver supported maximum copy limit.
Only the request-based queue mode supports copy offload.
Added tracefs support for copy IO tracing.

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 Documentation/block/null_blk.rst  |  5 ++
 drivers/block/null_blk/main.c | 99 ++-
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/trace.h| 23 +++
 4 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES(=128MB at present) is defined in fs.h
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 864013019d6b..afc14aa20305 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -11,6 +11,8 @@
 #include 
 #include "null_blk.h"
 
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt)"null_blk: " fmt
 
@@ -157,6 +159,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +415,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +557,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
	&nullb_device_attr_queue_mode,
	&nullb_device_attr_blocksize,
	&nullb_device_attr_max_sectors,
+	&nullb_device_attr_copy_max_bytes,
	&nullb_device_attr_irqmode,
	&nullb_device_attr_hw_queue_depth,
	&nullb_device_attr_index,
@@ -656,7 +664,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +731,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1281,81 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline int nullb_setup_copy(struct nullb *nullb, struct request *req,
+  bool is_fua)
+{
+   sector_t sector_in, sector_out;
+   loff_t offset_in, offset_out;
+   void *in, *out;
+   ssize_t chunk, rem = 0;
+   struct bio *bio;
+   struct nullb_page *t_page_in, *t_page_out;
+   u16 seg = 1;
+   int status = -EIO;
+
+   if (blk_rq_nr_phys_segments(req) != COPY_MAX_SEGMENTS)
+   return status;
+
+   /*
+* First bio contains information about source and last bio contains
+* information about destination.
+*/
+   __rq_for_each_bio(bio, req) {
+   if (seg == blk_rq_nr_phys_segments(req)) {
+   sector_out = bio->bi_iter.bi_sector;
+   if (rem != bio->bi_iter.bi_size)
+   return status;
+   } else {

[dm-devel] [PATCH v14 08/11] nvmet: add copy command support for bdev and file ns

2023-08-11 Thread Nitesh Shetty
Add support for handling the nvme_cmd_copy command on the target.

For bdev-ns, if the backing device supports copy offload, we call device copy
offload (blkdev_copy_offload).
In case of partial completion from the above, or absence of device copy offload
capability, we fall back to copy emulation (blkdev_copy_emulation).

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS.

The loop target has copy support, which can be used to test copy offload.
Adds trace event support for nvme_cmd_copy.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 ++-
 drivers/nvme/target/io-cmd-bdev.c | 97 +++
 drivers/nvme/target/io-cmd-file.c | 50 
 drivers/nvme/target/nvmet.h   |  4 ++
 drivers/nvme/target/trace.c   | 19 ++
 5 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..4e1a6ca09937 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 2733e0158585..3e9dfdfd6aa5 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -450,6 +462,87 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+static void nvmet_bdev_copy_emulation_endio(void *private, int status,
+   ssize_t copied)
+{
+   struct nvmet_req *rq = (struct nvmet_req *)private;
+   u16 nvme_status;
+
+   if (rq->copied + copied == rq->copy_len)
+   rq->cqe->result.u32 = cpu_to_le32(1);
+   else
+   rq->cqe->result.u32 = cpu_to_le32(0);
+
+   nvme_status = errno_to_nvme_status(rq, status);
+   nvmet_req_complete(rq, nvme_status);
+}
+
+static void nvmet_bdev_copy_offload_endio(void *private, int status,
+ ssize_t copied)
+{
+   struct nvmet_req *rq = (struct nvmet_req *)private;
+   u16 nvme_status;
+   ssize_t ret;
+
+   if (copied == rq->copy_len) {
+   rq->cqe->result.u32 = cpu_to_le32(1);
+   nvme_status = errno_to_nvme_status(rq, status);
+   } else {
+   rq->copied = copied;
+   ret = blkdev_copy_emulation(rq->ns->bdev, rq->copy_dst + copied,
+   rq->ns->bdev, rq->copy_src + copied,
+   rq->copy_len - copied,
+   nvmet_bdev_copy_emulation_endio,
+   (void *)rq, GFP_KERNEL);
+   if (ret == -EIOCBQUEUED)
+   return;
+   rq->cqe->result.u32 = cpu_to_le32(0);
+ 

[dm-devel] [PATCH v14 10/11] dm: Enable copy offload for dm-linear target

2023-08-11 Thread Nitesh Shetty
Setting copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v14 09/11] dm: Add support for copy offload

2023-08-11 Thread Nitesh Shetty
Before enabling copy for a dm target, check if the underlying devices and
the dm target support copy. Avoid a split happening inside the dm target.
Fail early if the request needs a split; splitting a copy request is
currently not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 37 +++
 drivers/md/dm.c   |  7 +++
 include/linux/device-mapper.h |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 7d208b2b1a19..a192c19b68e4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1862,6 +1862,38 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+  sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !q->limits.max_copy_sectors;
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+   ti->type->iterate_devices(ti, device_not_copy_capable, 
NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1944,6 +1976,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_hw_sectors = 0;
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f0f118ab20fa..f9d6215e6d4d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1732,6 +1732,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+ ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 69d0435c7ebb..98db52d1c773 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -396,6 +396,9 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /* copy offload is supported */
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v14 07/11] nvme: add copy offload support

2023-08-11 Thread Nitesh Shetty
The current design only supports a single source range.
We receive a request with REQ_OP_COPY_SRC.
Parse this request, which consists of the src (1st) and dst (2nd) bios.
Form a copy command (TP 4065).

Add trace event support for nvme_copy_cmd.
Set the device copy limits to queue limits.
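
As a rough sketch of the limits plumbing (illustrative only; the exact code
lives in nvme_config_copy() in this patch, and the logical-block-to-sector
conversion via ns->lba_shift is an assumption):

static void sketch_nvme_set_copy_limits(struct request_queue *q,
					struct nvme_ns *ns,
					struct nvme_id_ns *id)
{
	/* MSSRL is in logical blocks; queue limits are in 512B sectors. */
	u32 max_copy_sectors = le16_to_cpu(id->mssrl) <<
			       (ns->lba_shift - SECTOR_SHIFT);

	blk_queue_max_copy_hw_sectors(q, max_copy_sectors);
}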

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |  1 +
 drivers/nvme/host/core.c  | 79 +++
 drivers/nvme/host/trace.c | 19 +
 include/linux/blkdev.h|  1 +
 include/linux/nvme.h  | 43 +--
 5 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 20f46c230885..2f504a2b1fe8 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 37b6fa746662..214628356e44 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -763,6 +763,63 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_offload(struct nvme_ns *ns,
+  struct request *req,
+  struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio;
+   u64 dst_lba, src_lba, n_lba;
+   u16 nr_range = 1, control = 0, seg = 1;
+
+   if (blk_rq_nr_phys_segments(req) != COPY_MAX_SEGMENTS)
+   return BLK_STS_IOERR;
+
+   /*
+* First bio contains information about source and last bio contains
+* information about destination.
+*/
+   __rq_for_each_bio(bio, req) {
+   if (seg == blk_rq_nr_phys_segments(req)) {
+   dst_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   if (n_lba != bio->bi_iter.bi_size >> ns->lba_shift)
+   return BLK_STS_IOERR;
+   } else {
+   src_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   n_lba = bio->bi_iter.bi_size >> ns->lba_shift;
+   }
+   seg++;
+   }
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.control = cpu_to_le16(control);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+   cmnd->copy.nr_range = 0;
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+ GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -1005,6 +1062,11 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
case REQ_OP_ZONE_APPEND:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
break;
+   case REQ_OP_COPY_SRC:
+   ret = nvme_setup_copy_offload(ns, req, cmd);
+   break;
+   case REQ_OP_COPY_DST:
+   return BLK_STS_IOERR;
default:
WARN_ON_ONCE(1);
return BLK_STS_IOERR;
@@ -1745,6 +1807,21 @@ static void nvme_config_discard(struct gendisk *disk, 
struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+   struct nvme_id_ns *id)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   struct request_queue *q = disk->queue;
+
+   if (!(ctrl->oncs & NVME

[dm-devel] [PATCH v14 06/11] fs, block: copy_file_range for def_blk_ops for direct block device

2023-08-11 Thread Nitesh Shetty
For a direct block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fall back to generic_copy_file_range in case
device copy offload capability is absent or the device files are not opened
with O_DIRECT.
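
A minimal userspace example of exercising this path (illustrative; the device
path and offsets are placeholders, and offsets/length must be aligned to the
device's logical block size):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* WARNING: this writes to the device at the destination offset. */
	int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
	off_t src = 0, dst = 1 << 20;	/* copy 1 MiB from offset 0 to 1 MiB */
	ssize_t ret;

	if (fd < 0)
		return 1;
	ret = copy_file_range(fd, &src, fd, &dst, 1 << 20, 0);
	printf("copied %zd bytes\n", ret);
	close(fd);
	return ret < 0;
}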

Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/fops.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/block/fops.c b/block/fops.c
index eaa98a987213..f5cf061ea91b 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -738,6 +738,30 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   ssize_t copied = 0;
+
+   if ((in_bdev == out_bdev) && bdev_max_copy_sectors(in_bdev) &&
+   (file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT)) {
+   copied = blkdev_copy_offload(in_bdev, pos_in, pos_out, len,
+NULL, NULL, GFP_KERNEL);
+   if (copied < 0)
+   copied = 0;
+   }
+   if (copied != len)
+   copied = generic_copy_file_range(file_in, pos_in + copied,
+file_out, pos_out + copied,
+len - copied, flags);
+
+   return copied;
+}
+
#define BLKDEV_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -831,6 +855,7 @@ const struct file_operations def_blk_fops = {
.splice_read= filemap_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v14 05/11] fs/read_write: Enable copy_file_range for block device.

2023-08-11 Thread Nitesh Shetty
From: Anuj Gupta 

This is a prep patch. Allow copy_file_range to work for block devices.
Relaxing generic_copy_file_checks allows us to reuse the existing infra,
instead of adding a new user interface for block copy offload.
Change generic_copy_file_checks to use ->f_mapping->host for both inode_in
and inode_out. Allow block device in generic_file_rw_checks.

Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 fs/read_write.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b07de77ef126..eaeb481477f4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1405,8 +1405,8 @@ static int generic_copy_file_checks(struct file *file_in, 
loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t *req_count, unsigned int flags)
 {
-   struct inode *inode_in = file_inode(file_in);
-   struct inode *inode_out = file_inode(file_out);
+   struct inode *inode_in = file_in->f_mapping->host;
+   struct inode *inode_out = file_out->f_mapping->host;
uint64_t count = *req_count;
loff_t size_in;
int ret;
@@ -1708,7 +1708,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+   if (!S_ISREG(inode_in->i_mode) && !S_ISBLK(inode_in->i_mode))
+   return -EINVAL;
+   if ((inode_in->i_mode & S_IFMT) != (inode_out->i_mode & S_IFMT))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v14 04/11] block: add emulation for copy

2023-08-11 Thread Nitesh Shetty
For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor is
not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination.
Emulation can also be used if copy offload fails or partially completes.
At present the in-kernel user of emulation is NVMe fabrics.
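
For an in-kernel user the synchronous form is a call with a NULL endio and
private pointer, e.g. (sketch; the dst-before-src argument order mirrors the
call sites in this series and is an assumption):

/* Copy len bytes within one bdev via emulation, waiting for completion. */
static ssize_t sketch_copy_emulated(struct block_device *bdev, loff_t dst,
				    loff_t src, size_t len)
{
	return blkdev_copy_emulation(bdev, dst, bdev, src, len,
				     NULL, NULL, GFP_KERNEL);
}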

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c| 223 +
 include/linux/blkdev.h |   4 +
 2 files changed, 227 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index ad512293730b..5eeb5af20a9a 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,6 +26,20 @@ struct blkdev_copy_offload_io {
loff_t offset;
 };
 
+/* Keeps track of single outstanding copy emulation IO */
+struct blkdev_copy_emulation_io {
+   struct blkdev_copy_io *cio;
+   struct work_struct emulation_work;
+   void *buf;
+   ssize_t buf_len;
+   loff_t pos_in;
+   loff_t pos_out;
+   ssize_t len;
+   struct block_device *bdev_in;
+   struct block_device *bdev_out;
+   gfp_t gfp;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -311,6 +325,215 @@ ssize_t blkdev_copy_offload(struct block_device *bdev, 
loff_t pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+static void *blkdev_copy_alloc_buf(ssize_t req_size, ssize_t *alloc_size,
+  gfp_t gfp)
+{
+   int min_size = PAGE_SIZE;
+   char *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static struct bio *bio_map_buf(void *data, unsigned int len, gfp_t gfp)
+{
+   unsigned long kaddr = (unsigned long)data;
+   unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+   unsigned long start = kaddr >> PAGE_SHIFT;
+   const int nr_pages = end - start;
+   bool is_vmalloc = is_vmalloc_addr(data);
+   struct page *page;
+   int offset, i;
+   struct bio *bio;
+
+   bio = bio_kmalloc(nr_pages, gfp);
+   if (!bio)
+   return ERR_PTR(-ENOMEM);
+   bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, 0);
+
+   if (is_vmalloc) {
+   flush_kernel_vmap_range(data, len);
+   bio->bi_private = data;
+   }
+
+   offset = offset_in_page(kaddr);
+   for (i = 0; i < nr_pages; i++) {
+   unsigned int bytes = PAGE_SIZE - offset;
+
+   if (len <= 0)
+   break;
+
+   if (bytes > len)
+   bytes = len;
+
+   if (!is_vmalloc)
+   page = virt_to_page(data);
+   else
+   page = vmalloc_to_page(data);
+   if (bio_add_page(bio, page, bytes, offset) < bytes) {
+   /* we don't support partial mappings */
+   bio_uninit(bio);
+   kfree(bio);
+   return ERR_PTR(-EINVAL);
+   }
+
+   data += bytes;
+   len -= bytes;
+   offset = 0;
+   }
+
+   return bio;
+}
+
+static void blkdev_copy_emulation_work(struct work_struct *work)
+{
+   struct blkdev_copy_emulation_io *emulation_io = container_of(work,
+   struct blkdev_copy_emulation_io, emulation_work);
+   struct blkdev_copy_io *cio = emulation_io->cio;
+   struct bio *read_bio, *write_bio;
+   loff_t pos_in = emulation_io->pos_in, pos_out = emulation_io->pos_out;
+   ssize_t rem, chunk;
+   int ret = 0;
+
+   for (rem = emulation_io->len; rem > 0; rem -= chunk) {
+   chunk = min_t(int, emulation_io->buf_len, rem);
+
+   read_bio = bio_map_buf(emulation_io->buf,
+  emulation_io->buf_len,
+  emulation_io->gfp);
+   if (IS_ERR(read_bio)) {
+   ret = PTR_ERR(read_bio);
+   break;
+   }
+   read_bio->bi_opf = REQ_OP_READ | REQ_SYNC;
+   bio_set_dev(read_bio, emulation_io->bdev_in);
+   read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+   read_bio->bi_iter.bi_size = chunk;
+   ret = submit_bio_wait(read_bio);
+   kfree(read_bio);
+   if (ret)
+   break;
+
+   write_bio = bio_map_

[dm-devel] [PATCH v14 03/11] block: add copy offload support

2023-08-11 Thread Nitesh Shetty
Introduce blkdev_copy_offload to perform copy offload.
Issue REQ_OP_COPY_SRC with the source info while holding a plug.
This flows down to the request layer and waits for the dst bio to arrive.
Issue REQ_OP_COPY_DST with the destination info; this bio reaches the request
layer and merges with the src request.
If, for any reason, a request reaches the driver with only one of the src/dst
bios, we fail the copy offload.

Larger copies are divided based on the max_copy_sectors limit.
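
For reference, an in-kernel caller that wants only the native offload (no
emulation fallback) would look roughly like this sketch, using the interface
introduced here:

/*
 * Sketch: synchronous single-bdev copy offload. A NULL endio makes
 * blkdev_copy_offload() wait inline and return the bytes copied or a
 * negative error.
 */
static ssize_t sketch_copy_offload(struct block_device *bdev, loff_t src,
				   loff_t dst, size_t len)
{
	if (!bdev_max_copy_sectors(bdev))
		return -EOPNOTSUPP;

	return blkdev_copy_offload(bdev, src, dst, len, NULL, NULL,
				   GFP_KERNEL);
}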

Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 196 +
 include/linux/blkdev.h |   4 +
 2 files changed, 200 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..ad512293730b 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -10,6 +10,22 @@
 
 #include "blk.h"
 
+/* Keeps track of all outstanding copy IO */
+struct blkdev_copy_io {
+   atomic_t refcount;
+   ssize_t copied;
+   int status;
+   struct task_struct *waiter;
+   void (*endio)(void *private, int status, ssize_t copied);
+   void *private;
+};
+
+/* Keeps track of single outstanding copy offload IO */
+struct blkdev_copy_offload_io {
+   struct blkdev_copy_io *cio;
+   loff_t offset;
+};
+
 static sector_t bio_discard_limit(struct block_device *bdev, sector_t sector)
 {
unsigned int discard_granularity = bdev_discard_granularity(bdev);
@@ -115,6 +131,186 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+static inline ssize_t blkdev_copy_sanity_check(struct block_device *bdev_in,
+  loff_t pos_in,
+  struct block_device *bdev_out,
+  loff_t pos_out, size_t len)
+{
+   unsigned int align = max(bdev_logical_block_size(bdev_out),
+bdev_logical_block_size(bdev_in)) - 1;
+
+   if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+   len >= COPY_MAX_BYTES)
+   return -EINVAL;
+
+   return 0;
+}
+
+static inline void blkdev_copy_endio(struct blkdev_copy_io *cio)
+{
+   if (cio->endio) {
+   cio->endio(cio->private, cio->status, cio->copied);
+   kfree(cio);
+   } else {
+   struct task_struct *waiter = cio->waiter;
+
+   WRITE_ONCE(cio->waiter, NULL);
+   blk_wake_io_task(waiter);
+   }
+}
+
+/*
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to complete.
+ * Returns the length of bytes copied or error
+ */
+static ssize_t blkdev_copy_wait_io_completion(struct blkdev_copy_io *cio)
+{
+   ssize_t ret;
+
+   for (;;) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   if (!READ_ONCE(cio->waiter))
+   break;
+   blk_io_schedule();
+   }
+   __set_current_state(TASK_RUNNING);
+   ret = cio->copied;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_dst_endio(struct bio *bio)
+{
+   struct blkdev_copy_offload_io *offload_io = bio->bi_private;
+   struct blkdev_copy_io *cio = offload_io->cio;
+
+   if (bio->bi_status) {
+   cio->copied = min_t(ssize_t, offload_io->offset, cio->copied);
+   if (!cio->status)
+   cio->status = blk_status_to_errno(bio->bi_status);
+   }
+   bio_put(bio);
+
+   if (atomic_dec_and_test(&cio->refcount))
+   blkdev_copy_endio(cio);
+}
+
+/*
+ * @bdev:  block device
+ * @pos_in:source offset
+ * @pos_out:   destination offset
+ * @len:   length in bytes to be copied
+ * @endio: endio function to be called on completion of copy operation,
+ * for synchronous operation this should be NULL
+ * @private:   endio function will be called with this private data,
+ * for synchronous operation this should be NULL
+ * @gfp_mask:  memory allocation flags (for bio_alloc)
+ *
+ * For synchronous operation returns the length of bytes copied or error
+ * For asynchronous operation returns -EIOCBQUEUED or error
+ *
+ * Description:
+ * Copy source offset to destination offset within block device, using
+ * device's native copy offload feature.
+ * We perform copy operation by sending 2 bio's.
+ * 1. We take a plug and send a REQ_OP_COPY_SRC bio along with source
+ * sector and length. Once this bio reaches request layer, we form a
+ * request and wait for dst bio to arrive.
+ * 2. We issue REQ_OP_COPY_DST bio along with destination sector, length.
+ * Once this bio reaches request layer and find a request with previously
+ * sent source info we merge the destination bio and return.
+ *   

[dm-devel] [PATCH v14 01/11] block: Introduce queue limits and sysfs for copy-offload support

2023-08-11 Thread Nitesh Shetty
Add device limits as sysfs entries,
- copy_max_bytes (RW)
- copy_max_hw_bytes (RO)

The above limits help to split the copy payload in the block layer.
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_hw_bytes: reflects the device-supported maximum limit.
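
As a usage sketch, a driver that knows its device-side copy limit in bytes
would seed the queue limits roughly like this (names are illustrative):

/* Called from a driver's queue/limits setup path. */
static void sketch_register_copy_limit(struct request_queue *q,
				       u64 dev_copy_max_bytes)
{
	/* The helper clamps to COPY_MAX_BYTES and seeds both copy limits. */
	blk_queue_max_copy_hw_sectors(q, dev_copy_max_bytes >> SECTOR_SHIFT);
}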

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 23 ++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 36 
 include/linux/blkdev.h   | 13 ++
 4 files changed, 96 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..1728b5ceabcb 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,29 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block//queue/copy_max_bytes
+Date:  August 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes that the block layer
+   will allow for a copy request. This is always smaller than or
+   equal to the maximum size allowed by the hardware, indicated by
+   'copy_max_hw_bytes'. An attempt to set a value higher than
+   'copy_max_hw_bytes' will truncate this to 'copy_max_hw_bytes'.
+   Writing '0' to this file will disable offloading copies for this
+   device; copies are instead done via emulation.
+
+
+What:  /sys/block//queue/copy_max_hw_bytes
+Date:  August 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes that the hardware
+   will allow for a single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..7c6aaa4df565 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_hw_sectors = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_hw_sectors = UINT_MAX;
+   lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/*
+ * blk_queue_max_copy_hw_sectors - set max sectors for a single copy payload
+ * @q: the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ */
+void blk_queue_max_copy_hw_sectors(struct request_queue *q,
+  unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_hw_sectors = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_hw_sectors);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_hw_sectors = min(t->max_copy_hw_sectors,
+b->max_copy_hw_sectors);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 63e481262336..4840e21adefa 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -199,6 +199,37 @@ static ssize_t queue_discard_zeroes_data_show(struct 
request_queue *q, char *pag
return queue_var_show(0, page);
 }
 
+static ssize_t queue_copy_hw_max_show(struct request_queue *q, char *page)
+{
+   return sprintf(page, "%llu\n", (unsigned long long)
+  q->limits.max_copy_hw_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue 

[dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-11 Thread Nitesh Shetty
We add two new opcodes, REQ_OP_COPY_SRC and REQ_OP_COPY_DST.
Since copy is a composite operation involving src and dst sectors/LBAs,
each needs to be represented by a separate bio to make it compatible
with device mapper.
We expect the caller to take a plug and send a bio with the source
information, followed by a bio with the destination information.
Once the src bio arrives we form a request and wait for the destination
bio. Upon arrival of the destination bio we merge the two bios and send
the corresponding request down to the device driver.
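
To make the expected submission pattern concrete, a rough sketch of the
caller side (simplified; completion handling and the per-copy bookkeeping
added by the blkdev_copy_offload helper built on top of this are omitted):

static void sketch_submit_copy(struct block_device *bdev, sector_t src,
			       sector_t dst, unsigned int sectors)
{
	struct blk_plug plug;
	struct bio *src_bio, *dst_bio;

	blk_start_plug(&plug);

	/* 1. Source bio: forms the request and waits for the dst bio. */
	src_bio = bio_alloc(bdev, 0, REQ_OP_COPY_SRC, GFP_KERNEL);
	src_bio->bi_iter.bi_sector = src;
	src_bio->bi_iter.bi_size = sectors << SECTOR_SHIFT;
	submit_bio(src_bio);

	/* 2. Destination bio: merges into the request formed above. */
	dst_bio = bio_alloc(bdev, 0, REQ_OP_COPY_DST, GFP_KERNEL);
	dst_bio->bi_iter.bi_sector = dst;
	dst_bio->bi_iter.bi_size = sectors << SECTOR_SHIFT;
	submit_bio(dst_bio);

	blk_finish_plug(&plug);
}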

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-core.c  |  7 +++
 block/blk-merge.c | 41 +++
 block/blk.h   | 16 +++
 block/elevator.h  |  1 +
 include/linux/bio.h   |  6 +-
 include/linux/blk_types.h | 10 ++
 6 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 90de50082146..2bcd06686560 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -121,6 +121,8 @@ static const char *const blk_op_name[] = {
REQ_OP_NAME(ZONE_FINISH),
REQ_OP_NAME(ZONE_APPEND),
REQ_OP_NAME(WRITE_ZEROES),
+   REQ_OP_NAME(COPY_SRC),
+   REQ_OP_NAME(COPY_DST),
REQ_OP_NAME(DRV_IN),
REQ_OP_NAME(DRV_OUT),
 };
@@ -796,6 +798,11 @@ void submit_bio_noacct(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   if (!q->limits.max_copy_sectors)
+   goto not_supported;
+   break;
default:
break;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..bcb55ba48107 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,6 +158,20 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *bio_split_copy(struct bio *bio,
+ const struct queue_limits *lim,
+ unsigned int *nsegs)
+{
+   *nsegs = 1;
+   if (bio_sectors(bio) <= lim->max_copy_sectors)
+   return NULL;
+   /*
+* We don't support splitting for a copy bio. End it with EIO if
+* splitting is required and return an error pointer.
+*/
+   return ERR_PTR(-EIO);
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -366,6 +380,12 @@ struct bio *__bio_split_to_limits(struct bio *bio,
case REQ_OP_WRITE_ZEROES:
split = bio_split_write_zeroes(bio, lim, nr_segs, bs);
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   split = bio_split_copy(bio, lim, nr_segs);
+   if (IS_ERR(split))
+   return NULL;
+   break;
default:
split = bio_split_rw(bio, lim, nr_segs, bs,
get_max_io_size(bio, lim) << SECTOR_SHIFT);
@@ -922,6 +942,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
if (!rq_mergeable(rq) || !bio_mergeable(bio))
return false;
 
+   if (blk_copy_offload_mergable(rq, bio))
+   return true;
+
if (req_op(rq) != bio_op(bio))
return false;
 
@@ -951,6 +974,8 @@ enum elv_merge blk_try_merge(struct request *rq, struct bio 
*bio)
 {
if (blk_discard_mergable(rq))
return ELEVATOR_DISCARD_MERGE;
+   else if (blk_copy_offload_mergable(rq, bio))
+   return ELEVATOR_COPY_OFFLOAD_MERGE;
else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
return ELEVATOR_BACK_MERGE;
else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
@@ -1053,6 +1078,20 @@ static enum bio_merge_status 
bio_attempt_discard_merge(struct request_queue *q,
return BIO_MERGE_FAILED;
 }
 
+static enum bio_merge_status bio_attempt_copy_offload_merge(struct request 
*req,
+   struct bio *bio)
+{
+   if (req->__data_len != bio->bi_iter.bi_size)
+   return BIO_MERGE_FAILED;
+
+   req->biotail->bi_next = bio;
+   req->biotail = bio;
+   req->nr_phys_segments++;
+   req->__data_len += bio->bi_iter.bi_size;
+
+   return BIO_MERGE_OK;
+}
+
 static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
   struct request *rq,
   struct bio *bio,
@@ -1073,6 +1112,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct 
request_queue *q,
   

[dm-devel] [PATCH v14 00/11] Implement copy offload support

2023-08-11 Thread Nitesh Shetty
The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patch set and
further additional features suggested by community.

This is the next iteration of our previous patch set, v13[1].
We achieve copy offload by sending two bios with source and destination
info and merging them to form a request. This request is sent to the driver.
So this design works only for request-based storage drivers.

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file back end).

2. Block layer
- Block-generic copy (REQ_OP_COPY_DST/SRC), operation with
  interface accommodating two block-devs
- Merging copy requests in request layer
- Emulation, for in-kernel user when offload is natively 
absent
- dm-linear support (for cases not requiring split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
parameters mssrl,mcl, msrc. For more info [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3]

Emulation can be tested on any device.

fio[4].

Infra and plumbing:
===
We populate copy_file_range callback in def_blk_fops. 
For devices that support copy-offload, use blkdev_copy_offload to
achieve in-device copy.
However for cases, where device doesn't support offload,
fall back to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), use blkdev_copy_offload
if the device is copy offload capable, or else fall back to emulation
using blkdev_copy_emulation.
Modify checks in generic_copy_file_range to support block-device.

Blktests[3]
==
tests/block/035-040: Runs copy offload and emulation on null
  block device.
tests/block/050,055: Runs copy offload and emulation on test
  nvme block device.
tests/nvme/056-067: Create a loop backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to test copy offload
- update man pages for copy_file_range
- expand in-kernel users of copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/linux-nvme/20230627183629.26571-1-nj.she...@samsung.com/
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v14
[4] https://github.com/OpenMPDK/fio/tree/copyoffload-3.35-v14

Changes since v13:
=
- block:
1. Simplified copy offload and emulation helpers, now
  caller needs to decide between offload/emulation fallback
2. src,dst bio order change (Christoph Hellwig)
3. refcount changes similar to dio (Christoph Hellwig)
4. Single outstanding IO for copy emulation (Christoph Hellwig)
5. use copy_max_sectors to identify copy offload
  capability and other reviews (Damien, Christoph)
6. Return status in endio handler (Christoph Hellwig)
- nvme-fabrics: fallback to emulation in case of partial
  offload completion
- in-kernel user addition (Ming Lei)
- indentation, documentation, minor fixes, misc changes (Damien,
  Christoph)
- blktests changes to test kernel changes

Changes since v12:
=
- block,nvme: Replaced token based approach with request based
  single namespace capable approach (Christoph Hellwig)

Changes since v11:
=
- Documentation: Improved documentation (Damien Le Moal)
- block,nvme: ssize_t return values (Darrick J. Wong)
- block: token is allocated to SECTOR_SIZE (Matthew Wilcox)
- block: mem leak fix (Maurizio Lombardi)

Changes since v10:
=
- NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
- NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=
- null_blk, improved

Re: [dm-devel] [PATCH v13 3/9] block: add emulation for copy

2023-08-01 Thread Nitesh Shetty

On 23/07/20 09:50AM, Christoph Hellwig wrote:

+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+   gfp_t gfp_mask)
+{
+   int min_size = PAGE_SIZE;
+   void *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp_mask);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   /* retry half the requested size */
+   req_size >>= 1;
+   }
+
+   return NULL;


Is there any good reason for using vmalloc instead of a bunch
of discontiguous pages?



kvmalloc seemed convenient for the purpose. 
We will need to call alloc_page in a loop to guarantee discontiguous pages.
Do you prefer that over kvmalloc?



+   ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+   if (!ctx)
+   goto err_ctx;


I'd suspect it would be better to just allocate a single buffer and
only have a single outstanding copy.  That will reduce the bandwidth
you can theoretically get, but copies tend to be background operations
anyway.  It will reduce the required memory, and thus the chance for
this operation to fail on a loaded system.  It will also dramatically
reduce the effect on memory management.



Next version will have that change.

Thank You,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v13 2/9] block: Add copy offload support infrastructure

2023-07-27 Thread Nitesh Shetty
On Thu, Jul 20, 2023 at 1:12 PM Christoph Hellwig  wrote:
> > Suggested-by: Christoph Hellwig 
>
> Hmm, I'm not sure I suggested adding copy offload..
>
We meant it for the request-based design; we will remove it.

> >  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
> >  {
> >   if (req_op(rq) == REQ_OP_DISCARD)
> > @@ -303,6 +310,8 @@ static inline bool bio_may_exceed_limits(struct bio 
> > *bio,
> >   break;
> >   }
> >
> > + if (unlikely(op_is_copy(bio->bi_opf)))
> > + return false;
>
> This looks wrong to me.  I think the copy ops need to be added to the
> switch statement above as they have non-trivial splitting decisions.
> Or at least should have those as we're missing the code to split
> copy commands right now.
>

Agreed, copy will have non-trivial splitting decisions. But I
couldn't think of scenarios where this could happen, as we check the
queue limits before issuing a copy. Do you see scenarios where a split
could happen for copy here?

Acked for all other review comments.

Thank you,
Nitesh Shetty

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device

2023-07-24 Thread Nitesh Shetty

On 23/07/20 09:57AM, Christoph Hellwig wrote:

+/* Copy source offset from source block device to destination block
+ * device. Returns the length of bytes copied.
+ */
+ssize_t blkdev_copy_offload_failfast(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, gfp_t gfp_mask)


This is an odd and very misnamed interface.

Either we have a blkdev_copy() interface that automatically falls back
to a fallback (maybe with an opt-out), or we have separate
blkdev_copy_offload/blkdev_copy_emulated interface and let the caller
decide.  But none of that really is "failfast".

Also this needs to go into the helpers patch and not a patch that is
supposed to just wire copying up for block device node.


Acked.


index b07de77ef126..d27148a2543f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, 
loff_t pos_in,
return -EOVERFLOW;

/* Shorten the copy to EOF */
-   size_in = i_size_read(inode_in);
+   size_in = i_size_read(file_in->f_mapping->host);


generic_copy_file_checks needs to be fixed to use ->mapping->host for both
inode_in and inode_out at the top of the file instead of this
band aid.  And that needs to be a separate patch with a Fixes tag.


Addressed below.


@@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+   if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+   (!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))


This is using weird indentation, and might also not be doing
exactly what we want.  I think the better thing to do here is to:

1) check for the acceptable types only on the in inode
2) have a check that the mode matches for the in and out inodes

And please do this as a separate prep patch instead of hiding it here.


Agreed. We will send a separate patch, that enables copy_file_range on
block devices.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v12 5/9] nvme: add copy offload support

2023-07-10 Thread Nitesh Shetty

On 23/06/08 09:24PM, Christoph Hellwig wrote:

On Thu, Jun 08, 2023 at 05:38:17PM +0530, Nitesh Shetty wrote:

Sure, we can do away with subsys and realign more on single namespace copy.
We are planning to use a token to store source info, such as src sector,
len and namespace. Something like below,

struct nvme_copy_token {
struct nvme_ns *ns; // to make sure we are copying within same namespace
/* store source info during *IN operation, will be used by *OUT operation */
sector_t src_sector;
sector_t sectors;
};
Do you have any better way to handle this in mind ?


In general every time we tried to come up with a request payload that is
not just data passed to the device it has been a nightmare.

So my gut feeling would be that bi_sector and bi_iter.bi_size are the
ranges, with multiple bios being allowed to form the input data, similar
to how we implement discard merging.

The interesting part is how we'd match up these bios.  One idea would
be that since copy by definition doesn't need integrity data we just
add a copy_id that unions it, and use a simple per-gendisk copy ID
allocator, but I'm not entirely sure how well that interacts with stacking
drivers.


V13[1] implements that route. Please see if that matches with what you had
in mind?

[1] 
https://lore.kernel.org/linux-nvme/20230627183629.26571-1-nj.she...@samsung.com/

Thank you, 
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v13 3/9] block: add emulation for copy

2023-06-30 Thread Nitesh Shetty

On 23/06/29 04:33PM, Ming Lei wrote:

Hi Nitesh,

On Wed, Jun 28, 2023 at 12:06:17AM +0530, Nitesh Shetty wrote:

For the devices which does not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where file descriptor is


I can understand copy command does help for FS GC and fabrics storages,
but still not very clear why copy emulation is needed for kernel users,
is it just for covering both copy command and emulation in single
interface? Or other purposes?

I'd suggest to add more words about in-kernel users of copy emulation.



As you mentioned above, we need a single interface covering both the
copy command and emulation.
This is needed in the fabrics case, as we expose even target devices that do
not support the copy command as copy capable, so we fall back to emulation
once we receive a copy from the host/initiator.
Agreed, we will add more description to convey the same info.


not available and hence they can't use copy_file_range.
Copy-emulation is implemented by reading from source into memory and
writing to the corresponding destination asynchronously.
Also emulation is used, if copy offload fails or partially completes.


Per my understanding, this kind of emulation may not be as efficient
as doing it in userspace (two linked io_uring SQEs, read & write with a
shared buffer). But it is fine if there are real in-kernel users like this.



We do have plans for a uring-based copy interface in the next phase,
once the current series is merged.
With the current design we really see the advantage of emulation in the
fabrics case.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v13 2/9] block: Add copy offload support infrastructure

2023-06-29 Thread Nitesh Shetty

On 23/06/28 03:45PM, Damien Le Moal wrote:

On 6/28/23 03:36, Nitesh Shetty wrote:

Introduce blkdev_copy_offload which takes similar arguments as
copy_file_range and performs copy offload between two bdevs.


I am confused... I thought it was discussed to only allow copy offload
within a single bdev for now... Did I miss something?



Yes, you are right. Copy is supported within a single bdev only.
We will update this.


Introduce REQ_OP_COPY_DST, REQ_OP_COPY_SRC operation.
Issue REQ_OP_COPY_DST with destination info along with taking a plug.
This flows till request layer and waits for src bio to get merged.
Issue REQ_OP_COPY_SRC with source info and this bio reaches request
layer and merges with dst request.
For any reason, if request comes to driver with either only one of src/dst
info we fail the copy offload.

Larger copy will be divided, based on max_copy_sectors limit.

Suggested-by: Christoph Hellwig 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-core.c  |   5 ++
 block/blk-lib.c   | 177 ++
 block/blk-merge.c |  21 +
 block/blk.h   |   9 ++
 block/elevator.h  |   1 +
 include/linux/bio.h   |   4 +-
 include/linux/blk_types.h |  21 +
 include/linux/blkdev.h|   4 +
 8 files changed, 241 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 99d8b9812b18..e6714391c93f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -796,6 +796,11 @@ void submit_bio_noacct(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   if (!blk_queue_copy(q))
+   goto not_supported;
+   break;
default:
break;
}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..10c3eadd5bf6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,183 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);

+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_(offload/emulate)_(read/write)_endio.
+ */
+static ssize_t blkdev_copy_wait_io_completion(struct cio *cio)
+{
+   ssize_t ret;
+
+   if (cio->endio)
+   return 0;
+
+   if (atomic_read(&cio->refcount)) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_io_schedule();
+   }
+
+   ret = cio->comp_len;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *bio)
+{
+   struct cio *cio = bio->bi_private;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   bio_put(bio);
+
+   if (!atomic_dec_and_test(&cio->refcount))
+   return;
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);


Curly brackets around else missing.



Acked.


+}
+
+/*
+ * __blkdev_copy_offload   - Use device's native copy offload feature.
+ * we perform copy operation by sending 2 bio.
+ * 1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
+ * sector and length. Once this bio reaches request layer, we form a request 
and
+ * wait for src bio to arrive.
+ * 2. We issue REQ_OP_COPY_SRC bio along with source sector and length. Once
+ * this bio reaches request layer and find a request with previously sent
+ * destination info we merge the source bio and return.
+ * 3. Release the plug and request is sent to driver
+ *
+ * Returns the length of bytes copied or error if encountered
+ */
+static ssize_t __blkdev_copy_offload(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct cio *cio;
+   struct bio *read_bio, *write_bio;
+   sector_t rem, copy_len, max_copy_len;
+   struct blk_plug plug;
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+   bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;


According to patch 1, this can end up b

Re: [dm-devel] [PATCH v13 3/9] block: add emulation for copy

2023-06-29 Thread Nitesh Shetty

On 23/06/28 03:50PM, Damien Le Moal wrote:

On 6/28/23 03:36, Nitesh Shetty wrote:

For the devices which does not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where file descriptor is
not available and hence they can't use copy_file_range.
Copy-emulation is implemented by reading from source into memory and
writing to the corresponding destination asynchronously.
Also emulation is used, if copy offload fails or partially completes.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c   | 183 +-
 block/blk-map.c   |   4 +-
 include/linux/blk_types.h |   5 ++
 include/linux/blkdev.h|   3 +
 4 files changed, 192 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 10c3eadd5bf6..09e0d5d51d03 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -234,6 +234,180 @@ static ssize_t __blkdev_copy_offload(
return blkdev_copy_wait_io_completion(cio);
 }

+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+   gfp_t gfp_mask)
+{
+   int min_size = PAGE_SIZE;
+   void *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp_mask);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   /* retry half the requested size */


Kind of obvious :)


Acked. will remove.




+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static void blkdev_copy_emulate_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   kfree(bvec_virt(&bio->bi_io_vec[0]));
+   bio_map_kern_endio(bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);


Curly brackets.


acked


+   }
+}
+
+static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+   cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   kfree(bvec_virt(&read_bio->bi_io_vec[0]));
+   bio_map_kern_endio(read_bio);
+   kfree(ctx);
+
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);


Same.


acked




+   }
+   }
+   schedule_work(&ctx->dispatch_work);


ctx may have been freed above.


acked, will fix this.




+   kfree(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+   struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+   dispatch_work);


Please align the argument, or even better: split the line after "=".



acked.

Thank you
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device

2023-06-29 Thread Nitesh Shetty

On 23/06/28 03:51PM, Damien Le Moal wrote:

On 6/28/23 03:36, Nitesh Shetty wrote:

For direct block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fallback to generic_copy_file_range incase
device copy offload capability is absent.


...if the device does not support copy offload or the device files are not open
with O_DIRECT.

No ?


Yes, you're right. We will fall back to generic_copy_file_range in either of
these cases.


Modify checks to allow bdevs to use copy_file_range.

Suggested-by: Ming Lei 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 26 ++
 block/fops.c   | 20 
 fs/read_write.c|  7 +--
 include/linux/blkdev.h |  4 
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 09e0d5d51d03..7d8e09a99254 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -473,6 +473,32 @@ ssize_t blkdev_copy_offload(
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);

+/* Copy source offset from source block device to destination block
+ * device. Returns the length of bytes copied.
+ */


Multi-line comment style: start with a "/*" line please.


acked


+ssize_t blkdev_copy_offload_failfast(


What is the "failfast" in the name for ?


We don't want failed copy offload IOs to fall back to block layer copy emulation.
We wanted an API that returns an error if offload fails.




+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, gfp_t gfp_mask)
+{
+   struct request_queue *in_q = bdev_get_queue(bdev_in);
+   struct request_queue *out_q = bdev_get_queue(bdev_out);
+   ssize_t ret = 0;


You do not need this initialization.



We need this initialization because __blkdev_copy_offload returns the number
of bytes copied or an error value.
So we cannot return 0 in case of success/partial completion.
blkdev_copy_offload_failfast is expected to return the number of bytes copied.


+
+   if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+   return 0;
+
+   if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {


Given that I think we do not allow copies between different devices, in_q and
out_q should always be the same, no ?


acked, will update this.




+   ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+   len, NULL, NULL, gfp_mask);


Same here. Why pass 2 bdevs if we only allow copies within the same device ?



acked, will update function arguments to take single bdev.


+   if (ret < 0)
+   return 0;
+   }
+
+   return ret;


return 0;



Nack, explained above.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v13 1/9] block: Introduce queue limits for copy-offload support

2023-06-29 Thread Nitesh Shetty

On 23/06/28 03:40PM, Damien Le Moal wrote:

On 6/28/23 03:36, Nitesh Shetty wrote:

Add device limits as sysfs entries,
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)

Above limits help to split the copy payload in block layer.
copy_offload: used for setting copy offload(1) or emulation(0).
copy_max_bytes: maximum total length of copy in single payload.
copy_max_bytes_hw: Reflects the device supported maximum limit.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 33 +++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 63 
 include/linux/blkdev.h   | 12 ++
 include/uapi/linux/fs.h  |  3 ++
 5 files changed, 135 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..3c97303f658b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.


+What:  /sys/block//queue/copy_offload
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] When read, this file shows whether offloading copy to a
+   device is enabled (1) or disabled (0). Writing '0' to this
+   file will disable offloading copies for this device.
+   Writing any '1' value will enable this feature. If the device
+   does not support offloading, then writing 1, will result in an
+   error.


I am still not convinced that this one is really necessary. copy_max_bytes_hw !=
0 indicates that the device supports copy offload. And setting copy_max_bytes
to 0 can be used to disable copy offload (which probably should be the default
for now).



Agreed, we will do this in next iteration.


+
+
+What:  /sys/block//queue/copy_max_bytes
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes that the block layer
+   will allow for a copy request. This will is always smaller or


will is -> is



acked


+   equal to the maximum size allowed by the hardware, indicated by
+   'copy_max_bytes_hw'. An attempt to set a value higher than
+   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:  /sys/block//queue/copy_max_bytes_hw


Nit: In keeping with the spirit of attributes like
max_hw_sectors_kb/max_sectors_kb, I would call this one copy_max_hw_bytes.



acked, will update in next iteration.


+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes that the hardware
+   will allow for single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dd59059b788..738cd3f21259 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_sectors_hw = 0;
+   lim->max_copy_sectors = 0;
 }

 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_sectors_hw = UINT_MAX;
+   lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);

@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);

+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+   unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_sectors_hw = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(stru

[dm-devel] [PATCH v13 6/9] nvmet: add copy command support for bdev and file ns

2023-06-28 Thread Nitesh Shetty
Add support for handling the nvme_cmd_copy command on the target.
For bdev-ns we call into blkdev_copy_offload, which the block layer
completes either by an offloaded copy request to the backend bdev or by
emulating the request.

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS.

The loop target has copy support, which can be used to test copy offload.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 -
 drivers/nvme/target/io-cmd-bdev.c | 62 +++
 drivers/nvme/target/io-cmd-file.c | 52 ++
 drivers/nvme/target/nvmet.h   |  1 +
 4 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..8e644b8ec0fd 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 2733e0158585..5c4c6a460cfa 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +196,21 @@ static void nvmet_bio_done(struct bio *bio)
nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int comp_len)
+{
+   struct nvmet_req *req = (struct nvmet_req *)private;
+   u16 status;
+
+   if (comp_len == req->copy_len) {
+   req->cqe->result.u32 = cpu_to_le32(1);
+   status = errno_to_nvme_status(req, 0);
+   } else {
+   req->cqe->result.u32 = cpu_to_le32(0);
+   status = errno_to_nvme_status(req, (__force u16)BLK_STS_IOERR);
+   }
+   nvmet_req_complete(req, status);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
struct sg_mapping_iter *miter)
@@ -450,6 +477,37 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+/* At present we handle only one range entry, since copy offload is aligned 
with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+   struct nvme_copy_range range;
+   struct nvme_command *cmd = req->cmd;
+   ssize_t ret;
+   u16 status;
+
+   status = nvmet_copy_from_sgl(req, 0, &range, sizeof(range));
+   if (status)
+   goto out;
+
+   ret = blkdev_copy_offload(req->ns->bdev,
+   le64_to_cpu(cmd->copy.sdlba) << req->ns->blksize_shift,
+   req->ns->bdev,
+   le64_to_cpu(range.slba) << req->ns->blksize_shift,
+   (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift,
+   nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+   if (ret) {
+

[dm-devel] [PATCH v13 5/9] nvme: add copy offload support

2023-06-28 Thread Nitesh Shetty
The current design only supports a single source range.
We receive a request with REQ_OP_COPY_DST.
We parse this request, which consists of the dst (1st) and src (2nd) bios,
and form a copy command (TP 4065).

Trace event support is added for nvme_copy_cmd.
The device copy limits are set as the queue limits.
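Note on the length arithmetic: after merging, both the dst and src bios
contribute to blk_rq_bytes(), so the setup path halves it. For instance, a
1 MiB copy on a namespace with 4 KiB LBAs (lba_shift = 12) arrives with
blk_rq_bytes() == 2 MiB and is encoded as 2 MiB >> 13 = 256 LBAs.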

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |  1 +
 drivers/nvme/host/core.c  | 79 +++
 drivers/nvme/host/trace.c | 19 +
 include/linux/nvme.h  | 43 +--
 4 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 5e4f8848dce0..311ad67e9cf3 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 98bfb3d9c22a..d4063e981492 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -763,6 +763,60 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+  struct request *req, struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio;
+   u64 dst_lba, src_lba, n_lba;
+   u16 nr_range = 1, control = 0;
+
+   if (blk_rq_nr_phys_segments(req) != 2)
+   return BLK_STS_IOERR;
+
+   /* +1 shift as dst+src length is added in request merging, we send copy
+* for half the length.
+*/
+   n_lba = blk_rq_bytes(req) >> (ns->lba_shift + 1);
+   if (WARN_ON(!n_lba))
+   return BLK_STS_NOTSUPP;
+
+   dst_lba = nvme_sect_to_lba(ns, blk_rq_pos(req));
+   __rq_for_each_bio(bio, req) {
+   src_lba = nvme_sect_to_lba(ns, bio->bi_iter.bi_sector);
+   if (n_lba != bio->bi_iter.bi_size >> ns->lba_shift)
+   return BLK_STS_IOERR;
+   }
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.control = cpu_to_le16(control);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+   cmnd->copy.nr_range = 0;
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+   GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -1005,6 +1059,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
case REQ_OP_ZONE_APPEND:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
break;
+   case REQ_OP_COPY_DST:
+   ret = nvme_setup_copy_write(ns, req, cmd);
+   break;
default:
WARN_ON_ONCE(1);
return BLK_STS_IOERR;
@@ -1742,6 +1799,26 @@ static void nvme_config_discard(struct gendisk *disk, 
struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+  struct nvme_id_ns *id)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   struct request_queue *q = disk->queue;
+
+   if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+   blk_queue_max_copy_sectors_hw(q, 0);
+   blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+   return;
+   }
+
+   /* setting copy limits */
+   if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+   return;
+
+   blk_queue_max_copy_sectors_hw(q,
+   nvme_lba_to_sect(ns, le16_to_c

[dm-devel] [PATCH v13 9/9] null_blk: add support for copy offload

2023-06-28 Thread Nitesh Shetty
Implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver supported maximum copy limit.
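
For illustration, a userspace sketch that sets this limit through configfs
before powering on a test device (assumptions: null_blk is already loaded,
e.g. with nr_devices=0, configfs is mounted at /sys/kernel/config, and
"nullb1" is just an example name; a shell 'echo' into the same files does
the same thing):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write one configfs attribute of the hypothetical test device. */
static void write_attr(const char *dir, const char *attr, const char *val)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "%s/%s", dir, attr);
	fd = open(path, O_WRONLY);
	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	const char *dev = "/sys/kernel/config/nullb/nullb1";

	mkdir(dev, 0755);				/* create the device in configfs */
	write_attr(dev, "copy_max_bytes", "1048576");	/* 1 MiB copy limit */
	write_attr(dev, "power", "1");			/* bring the device online */
	return 0;
}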

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 Documentation/block/null_blk.rst  |  5 ++
 drivers/block/null_blk/main.c | 85 +--
 drivers/block/null_blk/null_blk.h |  1 +
 3 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES(=128MB at present) is defined in fs.h
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 864013019d6b..e9461bd4dc2c 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_queue_mode,
&nullb_device_attr_blocksize,
&nullb_device_attr_max_sectors,
+   &nullb_device_attr_copy_max_bytes,
&nullb_device_attr_irqmode,
&nullb_device_attr_hw_queue_depth,
&nullb_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1279,67 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+   struct request *req, bool is_fua)
+{
+   sector_t sector_in, sector_out;
+   void *in, *out;
+   size_t rem, temp;
+   struct bio *bio;
+   unsigned long offset_in, offset_out;
+   struct nullb_page *t_page_in, *t_page_out;
+   int ret = -EIO;
+
+   sector_out = blk_rq_pos(req);
+
+   __rq_for_each_bio(bio, req) {
+   sector_in = bio->bi_iter.bi_sector;
+   rem = bio->bi_iter.bi_size;
+   }
+
+   if (WARN_ON(!rem))
+   return BLK_STS_NOTSUPP;
+
+   spin_lock_irq(&nullb->lock);
+   while (rem > 0) {
+   temp = min_t(size_t, nullb->dev->blocksize, rem);
+   offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+   offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+   if (null_cache_active(nullb) && !is_fua)
+   null_make_cache_space(nullb, PAGE_SIZE);
+
+   t_page_in = null_lookup_page(nullb, sector_in, false,
+   !null_cache_active(nullb));
+   if (!t_page_in)
+   goto err

[dm-devel] [PATCH v13 8/9] dm: Enable copy offload for dm-linear target

2023-06-28 Thread Nitesh Shetty
Setting copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v13 7/9] dm: Add support for copy offload

2023-06-28 Thread Nitesh Shetty
Before enabling copy for a dm target, check whether the underlying devices
and the dm target support copy. Avoid splits happening inside the dm target.
Fail early if the request needs a split; splitting a copy request is
currently not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 41 +++
 drivers/md/dm.c   |  7 ++
 include/linux/device-mapper.h |  5 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 7d208b2b1a19..2d08a890d7e1 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1862,6 +1862,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+ti->type->iterate_devices(ti,
+device_not_copy_capable, NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1944,6 +1977,14 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_sectors_hw = 0;
+   } else {
+   blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f0f118ab20fa..6245e16bf066 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1732,6 +1732,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+   ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 69d0435c7ebb..8ffee7e8cd06 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -396,6 +396,11 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /*
+* copy offload is supported
+*/
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v13 4/9] fs, block: copy_file_range for def_blk_ops for direct block device

2023-06-27 Thread Nitesh Shetty
For a block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fall back to generic_copy_file_range in case
the device copy offload capability is absent.
Modify checks to allow bdevs to use copy_file_range.
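
A minimal userspace sketch of this path (assumptions: the series is applied
and /dev/nullb0 is a scratch test device; the device name, offsets and
length are illustrative, and data at the destination offset is overwritten):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Both fds must be opened with O_DIRECT to take the offload path. */
	int fd_in = open("/dev/nullb0", O_RDONLY | O_DIRECT);
	int fd_out = open("/dev/nullb0", O_WRONLY | O_DIRECT);
	size_t len = 1 << 20;			/* copy 1 MiB ... */
	ssize_t ret;

	if (fd_in < 0 || fd_out < 0)
		return 1;

	/* ... from offset 0 to offset 1 MiB on the same device. */
	if (lseek(fd_out, 1 << 20, SEEK_SET) < 0)
		return 1;

	ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
	printf("copied %zd bytes\n", ret);

	close(fd_in);
	close(fd_out);
	return ret == (ssize_t)len ? 0 : 1;
}

If the device advertises copy offload and the offsets are logical-block
aligned, the copy is serviced in-device; otherwise it falls back to
generic_copy_file_range.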

Suggested-by: Ming Lei 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 26 ++
 block/fops.c   | 20 
 fs/read_write.c|  7 +--
 include/linux/blkdev.h |  4 
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 09e0d5d51d03..7d8e09a99254 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -473,6 +473,32 @@ ssize_t blkdev_copy_offload(
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+/* Copy source offset from source block device to destination block
+ * device. Returns the length of bytes copied.
+ */
+ssize_t blkdev_copy_offload_failfast(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, gfp_t gfp_mask)
+{
+   struct request_queue *in_q = bdev_get_queue(bdev_in);
+   struct request_queue *out_q = bdev_get_queue(bdev_out);
+   ssize_t ret = 0;
+
+   if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+   return 0;
+
+   if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
+   ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+   len, NULL, NULL, gfp_mask);
+   if (ret < 0)
+   return 0;
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload_failfast);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
diff --git a/block/fops.c b/block/fops.c
index a286bf3325c5..a1576304f269 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -621,6 +621,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   ssize_t comp_len = 0;
+
+   if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT))
+   comp_len = blkdev_copy_offload_failfast(in_bdev, pos_in,
+   out_bdev, pos_out, len, GFP_KERNEL);
+   if (comp_len != len)
+   comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
+   file_out, pos_out + comp_len, len - comp_len, flags);
+
+   return comp_len;
+}
+
+#define BLKDEV_FALLOC_FL_SUPPORTED
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -714,6 +733,7 @@ const struct file_operations def_blk_fops = {
.splice_read= filemap_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/read_write.c b/fs/read_write.c
index b07de77ef126..d27148a2543f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, 
loff_t pos_in,
return -EOVERFLOW;
 
/* Shorten the copy to EOF */
-   size_in = i_size_read(inode_in);
+   size_in = i_size_read(file_in->f_mapping->host);
+
if (pos_in >= size_in)
count = 0;
else
@@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+   if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+   (!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c176bf6173c5..850168cad080 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1047,6 +1047,10 @@ ssize_t blkdev_copy_offload(
struct block_device *bdev_in, loff_t pos_in,
struct block_device *bdev_out, loff_t pos_out,

[dm-devel] [PATCH v13 2/9] block: Add copy offload support infrastructure

2023-06-27 Thread Nitesh Shetty
Introduce blkdev_copy_offload, which takes similar arguments as
copy_file_range and performs copy offload between two bdevs.
Introduce the REQ_OP_COPY_DST and REQ_OP_COPY_SRC operations.
Issue REQ_OP_COPY_DST with the destination info while holding a plug.
This flows down to the request layer and waits for the src bio to get merged.
Issue REQ_OP_COPY_SRC with the source info; this bio reaches the request
layer and merges with the dst request.
If, for any reason, a request reaches the driver with only one of the
src/dst infos, we fail the copy offload.

A larger copy is divided based on the max_copy_sectors limit.
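
For illustration, a sketch of an in-kernel caller of this interface (the
eight-argument form and the endio signature mirror the nvmet user later in
the series; everything else here, including the function names, is
hypothetical):

#include <linux/blkdev.h>

/* Completion callback: comp_len is the number of bytes actually copied. */
static void my_copy_endio(void *private, int comp_len)
{
	pr_info("copy completed: %d bytes\n", comp_len);
}

/*
 * Copy 'len' bytes from 'pos_in' to 'pos_out' on the same bdev. Offsets and
 * length are assumed to be aligned to the logical block size.
 */
static ssize_t my_copy(struct block_device *bdev, loff_t pos_in,
		       loff_t pos_out, size_t len)
{
	return blkdev_copy_offload(bdev, pos_in, bdev, pos_out, len,
				   my_copy_endio, NULL, GFP_KERNEL);
}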

Suggested-by: Christoph Hellwig 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-core.c  |   5 ++
 block/blk-lib.c   | 177 ++
 block/blk-merge.c |  21 +
 block/blk.h   |   9 ++
 block/elevator.h  |   1 +
 include/linux/bio.h   |   4 +-
 include/linux/blk_types.h |  21 +
 include/linux/blkdev.h|   4 +
 8 files changed, 241 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 99d8b9812b18..e6714391c93f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -796,6 +796,11 @@ void submit_bio_noacct(struct bio *bio)
if (!q->limits.max_write_zeroes_sectors)
goto not_supported;
break;
+   case REQ_OP_COPY_SRC:
+   case REQ_OP_COPY_DST:
+   if (!blk_queue_copy(q))
+   goto not_supported;
+   break;
default:
break;
}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..10c3eadd5bf6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,183 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_(offload/emulate)_(read/write)_endio.
+ */
+static ssize_t blkdev_copy_wait_io_completion(struct cio *cio)
+{
+   ssize_t ret;
+
+   if (cio->endio)
+   return 0;
+
+   if (atomic_read(&cio->refcount)) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_io_schedule();
+   }
+
+   ret = cio->comp_len;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *bio)
+{
+   struct cio *cio = bio->bi_private;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   bio_put(bio);
+
+   if (!atomic_dec_and_test(&cio->refcount))
+   return;
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+}
+
+/*
+ * __blkdev_copy_offload   - Use device's native copy offload feature.
+ * we perform copy operation by sending 2 bio.
+ * 1. We take a plug and send a REQ_OP_COPY_DST bio along with destination
+ * sector and length. Once this bio reaches request layer, we form a request 
and
+ * wait for src bio to arrive.
+ * 2. We issue REQ_OP_COPY_SRC bio along with source sector and length. Once
+ * this bio reaches request layer and find a request with previously sent
+ * destination info we merge the source bio and return.
+ * 3. Release the plug and request is sent to driver
+ *
+ * Returns the length of bytes copied or error if encountered
+ */
+static ssize_t __blkdev_copy_offload(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct cio *cio;
+   struct bio *read_bio, *write_bio;
+   sector_t rem, copy_len, max_copy_len;
+   struct blk_plug plug;
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+   bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
+
+   cio->pos_in = pos_in;
+   cio->pos_out = pos_out;
+   /* If there is a error, comp_len will be set to least successfully
+* completed copied length
+*/
+   cio->comp_len = len;
+   for (rem = len; rem > 0; rem -= copy_len) {
+   copy_len = min(rem, max_copy_len);
+
+   write_bio = bio_alloc(bdev_out, 0, REQ_

[dm-devel] [PATCH v13 0/9] Implement copy offload support

2023-06-27 Thread Nitesh Shetty
The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patchset and
further additional features suggested by community.

This is the next iteration of our previous patchset v12[1].
We have changed the token-based approach to a request-based approach,
instead of storing the info in a token. We now try to merge the copy bios
in the request layer and send them to the driver.
So this design works only for request-based storage drivers.

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file backend).

2. Block layer
- Block-generic copy (REQ_OP_COPY_DST/SRC), operation with
  interface accommodating two block-devs
- Merging copy requests in request layer
- Emulation, for in-kernel user when offload is natively 
absent
- dm-linear support (for cases not requiring split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
parameters mssrl,mcl, msrc. For more info [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3] (tests block/035-038, nvme/050-053)

Emulation can be tested on any device.

fio[4].

Infra and plumbing:
===
We populate the copy_file_range callback in def_blk_fops.
For devices that support copy offload, use __blkdev_copy_offload to
achieve in-device copy.
However, for cases where the device doesn't support offload,
we fall back to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), we use blkdev_issue_copy,
which implements its own emulation, as an fd is not available.
Modify checks in generic_copy_file_range to support block devices.

Blktests[3]
==
tests/block/035,036: Runs copy offload and emulation on block
  device.
tests/block/037,038: Runs copy offload and emulation on null
  block device.
tests/nvme/050-053: Create a loop backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to test copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/linux-block/20230605121732.28468-1-nj.she...@samsung.com/T/#m4db1801c86a5490dc736266609f8458fd52b9eb5
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v13
[4] https://github.com/vincentkfu/fio/commits/copyoffload-3.35-v13

Changes since v12:
=
- block,nvme: Replaced token based approach with request based
  single namespace capable approach (Christoph Hellwig)

Changes since v11:
=
- Documentation: Improved documentation (Damien Le Moal)
- block,nvme: ssize_t return values (Darrick J. Wong)
- block: token is allocated to SECTOR_SIZE (Matthew Wilcox)
- block: mem leak fix (Maurizio Lombardi)

Changes since v10:
=
- NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
- NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=
- null_blk, improved documentation, minor fixes(Chaitanya Kulkarni)
- fio, expanded testing and minor fixes (Vincent Fu)

Changes since v8:
=
- null_blk, copy_max_bytes_hw is made config fs parameter
  (Damien Le Moal)
- Negative error handling in copy_file_range (Christian Brauner)
- minor fixes, better documentation (Damien Le Moal)
- fio upgraded to 3.34 (Vincent Fu)

Changes since v7:
=
- null block copy offload support for testing (Damien Le Moal)
- adding direct flag check for copy offload to block device,
  as we are using generic_copy_file_range for cached cases.
- Minor fixes

Changes since v6:
=
- copy_file_range instead of ioctl for direct block device
- Remove support for multi range (vectored) copy
- Remove ioctl interface for copy.
- Remove

[dm-devel] [PATCH v13 1/9] block: Introduce queue limits for copy-offload support

2023-06-27 Thread Nitesh Shetty
Add device limits as sysfs entries,
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)

These limits help to split the copy payload in the block layer.
copy_offload: used for setting copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_bytes_hw: reflects the device-supported maximum limit.
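
For illustration, a small userspace sketch that reads the three limits for
one disk ("nullb0" is only a placeholder name; 'cat' on the same sysfs files
gives the same output):

#include <stdio.h>

int main(void)
{
	const char *attrs[] = { "copy_offload", "copy_max_bytes",
				"copy_max_bytes_hw" };
	char path[128], val[64];

	for (int i = 0; i < 3; i++) {
		FILE *f;

		/* substitute the disk under test for "nullb0" */
		snprintf(path, sizeof(path),
			 "/sys/block/nullb0/queue/%s", attrs[i]);
		f = fopen(path, "r");
		if (f && fgets(val, sizeof(val), f))
			printf("%s: %s", attrs[i], val);
		if (f)
			fclose(f);
	}
	return 0;
}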

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 33 +++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 63 
 include/linux/blkdev.h   | 12 ++
 include/uapi/linux/fs.h  |  3 ++
 5 files changed, 135 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..3c97303f658b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block/<disk>/queue/copy_offload
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] When read, this file shows whether offloading copy to a
+   device is enabled (1) or disabled (0). Writing '0' to this
+   file will disable offloading copies for this device.
+   Writing any '1' value will enable this feature. If the device
+   does not support offloading, then writing 1, will result in an
+   error.
+
+
+What:  /sys/block/<disk>/queue/copy_max_bytes
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes that the block layer
+   will allow for a copy request. This will is always smaller or
+   equal to the maximum size allowed by the hardware, indicated by
+   'copy_max_bytes_hw'. An attempt to set a value higher than
+   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:  /sys/block/<disk>/queue/copy_max_bytes_hw
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes that the hardware
+   will allow for single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block/<disk>/queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dd59059b788..738cd3f21259 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_sectors_hw = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_sectors_hw = UINT_MAX;
+   lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+   unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_sectors_hw = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+   b->max_copy_sectors_hw);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index afc797f

[dm-devel] [PATCH v13 3/9] block: add emulation for copy

2023-06-27 Thread Nitesh Shetty
For devices which do not support copy offload, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor is
not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination asynchronously.
Emulation is also used if copy offload fails or completes partially.
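
Conceptually the emulation is the following loop (a much simplified,
synchronous sketch; read_from_bdev()/write_to_bdev() are hypothetical
stand-ins for the bio submission and completion handling that the real,
asynchronous code below performs):

/* Copy 'len' bytes through a bounce buffer of 'buf_len' bytes. */
static ssize_t copy_emulate_model(struct block_device *bdev_in, loff_t pos_in,
				  struct block_device *bdev_out, loff_t pos_out,
				  size_t len, void *buf, size_t buf_len)
{
	size_t copied = 0;

	while (copied < len) {
		size_t chunk = min(buf_len, len - copied);

		if (read_from_bdev(bdev_in, pos_in + copied, buf, chunk))
			break;	/* return what was copied so far */
		if (write_to_bdev(bdev_out, pos_out + copied, buf, chunk))
			break;
		copied += chunk;
	}
	return copied;
}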

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c   | 183 +-
 block/blk-map.c   |   4 +-
 include/linux/blk_types.h |   5 ++
 include/linux/blkdev.h|   3 +
 4 files changed, 192 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 10c3eadd5bf6..09e0d5d51d03 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -234,6 +234,180 @@ static ssize_t __blkdev_copy_offload(
return blkdev_copy_wait_io_completion(cio);
 }
 
+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+   gfp_t gfp_mask)
+{
+   int min_size = PAGE_SIZE;
+   void *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp_mask);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   /* retry half the requested size */
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static void blkdev_copy_emulate_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   kfree(bvec_virt(&bio->bi_io_vec[0]));
+   bio_map_kern_endio(bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+}
+
+static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+   cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   kfree(bvec_virt(&read_bio->bi_io_vec[0]));
+   bio_map_kern_endio(read_bio);
+   kfree(ctx);
+
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   }
+   schedule_work(&ctx->dispatch_work);
+   kfree(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+   struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+   dispatch_work);
+
+   submit_bio(ctx->write_bio);
+}
+
+/*
+ * If native copy offload feature is absent, this function tries to emulate,
+ * by copying data from source to a temporary buffer and from buffer to
+ * destination device.
+ * Returns the length of bytes copied or error if encountered
+ */
+static ssize_t __blkdev_copy_emulate(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct request_queue *in = bdev_get_queue(bdev_in);
+   struct request_queue *out = bdev_get_queue(bdev_out);
+   struct bio *read_bio, *write_bio;
+   void *buf = NULL;
+   struct copy_ctx *ctx;
+   struct cio *cio;
+   sector_t buf_len, req_len, rem = 0;
+   sector_t max_src_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(in),
+   queue_max_segments(in) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_dst_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(out),
+   queue_max_segments(out) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_hw_len = min_t(unsigned int,
+   max_src_hw_len, max_dst_hw_len);
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+  

Re: [dm-devel] [PATCH v12 5/9] nvme: add copy offload support

2023-06-08 Thread Nitesh Shetty

Hi Christoph and Martin,

On 23/06/07 12:12AM, Christoph Hellwig wrote:

On Tue, Jun 06, 2023 at 05:05:35PM +0530, Nitesh Shetty wrote:

Downside will be duplicating checks which are present for read, write in
block layer, device-mapper and zoned devices.
But we can do this, shouldn't be an issue.


Yes.  Please never overload operations, this is just causing problems
everywhere, and that why I split the operations from the flag a few
years ago.



Sure, we will add REQ_COPY_IN/OUT and send a new version.


The idea behind subsys is to prevent copy across different subsystems.
For example, copy across the nvme subsystem and the scsi subsystem. [1]
At present, we don't support inter-namespace copy (copy across NVMe namespaces),
but after community feedback on a previous series we left scope for it.


Never leave scope for something that isn't actually added.  That just
creates a giant maintainance nightmare.  Cross-device copies are giant
nightmare in general, and in the case of NVMe completely unusable
as currently done in the working group.  Messing up something that
is entirely reasonable (local copy) for something like that is a sure
way to never get this series in.


Sure, we can do away with subsys and realign more on single namespace copy.
We are planning to use token to store source info, such as src sector,
len and namespace. Something like below,

struct nvme_copy_token {
struct nvme_ns *ns; // to make sure we are copying within same namespace
/* store source info during *IN operation, will be used by *OUT operation */
sector_t src_sector;
sector_t sectors;
};
Do you have any better way to handle this in mind ?


Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v12 5/9] nvme: add copy offload support

2023-06-06 Thread Nitesh Shetty

On 23/06/05 06:43AM, Christoph Hellwig wrote:

break;
case REQ_OP_READ:
-   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+   if (unlikely(req->cmd_flags & REQ_COPY))
+   nvme_setup_copy_read(ns, req);
+   else
+   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
break;
case REQ_OP_WRITE:
-   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+   if (unlikely(req->cmd_flags & REQ_COPY))
+   ret = nvme_setup_copy_write(ns, req, cmd);
+   else
+   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);


Yikes.  Overloading REQ_OP_READ and REQ_OP_WRITE with something entirely
different brings us back the horrors of the block layer 15 years ago.
Don't do that.  Please add separate REQ_COPY_IN/OUT (or maybe
SEND/RECEIVE or whatever) methods.



Downside will be duplicating checks which are present for read, write in
block layer, device-mapper and zoned devices.
But we can do this, shouldn't be an issue.


+   /* setting copy limits */
+   if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))


I don't understand this comment.



It was a mistake. The comment is misplaced; it should have been
"setting copy flag" instead of "setting copy limits".
Anyway, we now feel this comment is redundant and will remove it.
Also, we should have used blk_queue_flag_set to enable copy offload.


+struct nvme_copy_token {
+   char *subsys;
+   struct nvme_ns *ns;
+   sector_t src_sector;
+   sector_t sectors;
+};


Why do we need a subsys token?  Inter-namespace copy is pretty crazy,
and not really anything we should aim for.  But this whole token design
is pretty odd anyway.  The only thing we'd need is a sequence number /
idr / etc to find an input and output side match up, as long as we
stick to the proper namespace scope.



The idea behind subsys is to prevent copy across different subsystems.
For example, copy across the nvme subsystem and the scsi subsystem. [1]
At present, we don't support inter-namespace copy (copy across NVMe namespaces),
but after community feedback on a previous series we left scope for it.
About an idr per namespace: it will be similar to the namespace check that
we are doing to prevent copy across namespaces.
We went with the current structure for the token, as it solves the above
issues as well as providing a placeholder for storing the source LBA and
number of sectors.
Do you have any suggestions on how we can store the source info, if we go
with an idr-based approach?

[1] 
https://lore.kernel.org/all/alpine.lrh.2.02.2202011327350.22...@file01.intranet.prod.int.rdu2.redhat.com/T/#m407f24fb4454d35c3283a5e51fdb04f1600463af


+   if (unlikely((req->cmd_flags & REQ_COPY) &&
+   (req_op(req) == REQ_OP_READ))) {
+   blk_mq_start_request(req);
+   return BLK_STS_OK;
+   }


This really needs to be hiden inside of nvme_setup_cmd.  And given
that other drivers might need similar handling the best way is probably
to have a new magic BLK_STS_* value for request started but we're
not actually sending it to hardware.


Sure we will add new BLK_STS_* for completion and move the snippet.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH v12 9/9] null_blk: add support for copy offload

2023-06-05 Thread Nitesh Shetty
Implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver supported maximum copy limit.

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 Documentation/block/null_blk.rst  |   5 ++
 drivers/block/null_blk/main.c | 108 --
 drivers/block/null_blk/null_blk.h |   8 +++
 3 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index 4dd78f24d10a..6153e02fcf13 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -149,3 +149,8 @@ zone_size=[MB]: Default: 256
 zone_nr_conv=[nr_conv]: Default: 0
   The number of conventional zones to create when block device is zoned.  If
   zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
+
+copy_max_bytes=[size in bytes]: Default: COPY_MAX_BYTES
+  A module and configfs parameter which can be used to set hardware/driver
+  supported maximum copy offload limit.
+  COPY_MAX_BYTES(=128MB at present) is defined in fs.h
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index b3fedafe301e..34e009b3ebd5 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_queue_mode,
&nullb_device_attr_blocksize,
&nullb_device_attr_max_sectors,
+   &nullb_device_attr_copy_max_bytes,
&nullb_device_attr_irqmode,
&nullb_device_attr_hw_queue_depth,
&nullb_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1279,78 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline void nullb_setup_copy_read(struct nullb *nullb, struct bio *bio)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+   token->subsys = "nullb";
+   token->sector_in = bio->bi_iter.bi_sector;
+   token->nullb = nullb;
+   token->sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+}
+
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+   struct bio *bio, bool is_fua)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+   sector_t sector_in, sector_out;
+   void *in, *out;
+   size_t rem, temp;
+   unsigned long offset_in, offset_out;
+   struct nullb_page *t_page_in, *t_page_out;
+   int ret = -EIO;
+
+   if (unlikely(memcmp(token->subsys, "nullb", 5)))
+   return -EINVAL;
+   if (unlikely(token->nullb != nullb))
+   return -EINVAL;
+   if (WARN_ON(token->sectors != bio->bi_iter.bi_size >> SECTOR_SHIFT))
+   return -EINVAL;
+
+   sector_in = token->sector_in;
+   sector_out = bio->bi_iter.bi_sector;
+   rem = token->sectors << SECTOR_SHI

[dm-devel] [PATCH v12 6/9] nvmet: add copy command support for bdev and file ns

2023-06-05 Thread Nitesh Shetty
Add support for handling nvme_cmd_copy command on target.
For bdev-ns we call into blkdev_issue_copy, which the block layer
completes either by an offloaded copy request to the backend bdev or by
emulating the request.

For file-ns we call vfs_copy_file_range to service our request.

Currently target always shows copy capability by setting
NVME_CTRL_ONCS_COPY in controller ONCS.

loop target has copy support, which can be used to test copy offload.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 -
 drivers/nvme/target/io-cmd-bdev.c | 62 +++
 drivers/nvme/target/io-cmd-file.c | 52 ++
 drivers/nvme/target/loop.c|  6 +++
 drivers/nvme/target/nvmet.h   |  1 +
 5 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..8e644b8ec0fd 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..92b5accf0743 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +196,21 @@ static void nvmet_bio_done(struct bio *bio)
nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int comp_len)
+{
+   struct nvmet_req *req = (struct nvmet_req *)private;
+   u16 status;
+
+   if (comp_len == req->copy_len) {
+   req->cqe->result.u32 = cpu_to_le32(1);
+   status = errno_to_nvme_status(req, 0);
+   } else {
+   req->cqe->result.u32 = cpu_to_le32(0);
+   status = errno_to_nvme_status(req, (__force u16)BLK_STS_IOERR);
+   }
+   nvmet_req_complete(req, status);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
struct sg_mapping_iter *miter)
@@ -450,6 +477,37 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+/* At present we handle only one range entry, since copy offload is aligned 
with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+   struct nvme_copy_range range;
+   struct nvme_command *cmd = req->cmd;
+   ssize_t ret;
+   u16 status;
+
+   status = nvmet_copy_from_sgl(req, 0, &range, sizeof(range));
+   if (status)
+   goto out;
+
+   ret = blkdev_copy_offload(req->ns->bdev,
+   le64_to_cpu(cmd->copy.sdlba) << req->ns->blksize_shift,
+   req->ns->bdev,
+   le64_to_cpu(range.slba) << req->ns->blksize_shift,
+   (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift,
+   nvmet_bdev_copy_end_io, (vo

[dm-devel] [PATCH v12 8/9] dm: Enable copy offload for dm-linear target

2023-06-05 Thread Nitesh Shetty
Setting copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v12 7/9] dm: Add support for copy offload

2023-06-05 Thread Nitesh Shetty
Before enabling copy for a dm target, check whether the underlying devices
and the dm target support copy. Avoid splits happening inside the dm target.
Fail early if the request needs a split; splitting a copy request is
currently not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 41 +++
 drivers/md/dm.c   |  7 ++
 include/linux/device-mapper.h |  5 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 1398f1d6e83e..b3269271e761 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1867,6 +1867,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+ti->type->iterate_devices(ti,
+device_not_copy_capable, NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1949,6 +1982,14 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_sectors_hw = 0;
+   } else {
+   blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4361a01bff3a..d9f45a1f0a77 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+   ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a52d2b9a6846..04016bd76e73 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -398,6 +398,11 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /*
+* copy offload is supported
+*/
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v12 4/9] fs, block: copy_file_range for def_blk_ops for direct block device

2023-06-05 Thread Nitesh Shetty
For a direct block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fall back to generic_copy_file_range in case
the device copy offload capability is absent.
Modify checks to allow bdevs to use copy_file_range.
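
For illustration only (not part of this patch), a minimal userspace sketch of
how this path can be exercised; the device path, offsets and length are
made-up examples, and both fds are opened with O_DIRECT so the offload path is
attempted before the generic fallback:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	loff_t pos_in = 0, pos_out = 1 << 20;	/* copy within one namespace */
	int in = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	int out = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
	ssize_t copied;

	if (in < 0 || out < 0)
		return 1;

	copied = copy_file_range(in, &pos_in, out, &pos_out, 1 << 20, 0);
	if (copied < 0)
		perror("copy_file_range");
	else
		printf("copied %zd bytes\n", copied);

	close(in);
	close(out);
	return 0;
}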

Suggested-by: Ming Lei 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 26 ++
 block/fops.c   | 20 
 fs/read_write.c|  7 +--
 include/linux/blkdev.h |  4 
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 99b65af8bfc1..31cfd5026367 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -534,6 +534,32 @@ ssize_t blkdev_copy_offload(
 }
 EXPORT_SYMBOL_GPL(blkdev_copy_offload);
 
+/* Copy source offset from source block device to destination block
+ * device. Returns the length of bytes copied.
+ */
+ssize_t blkdev_copy_offload_failfast(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, gfp_t gfp_mask)
+{
+   struct request_queue *in_q = bdev_get_queue(bdev_in);
+   struct request_queue *out_q = bdev_get_queue(bdev_out);
+   ssize_t ret = 0;
+
+   if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+   return 0;
+
+   if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
+   ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+   len, NULL, NULL, gfp_mask);
+   if (ret < 0)
+   return 0;
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload_failfast);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
diff --git a/block/fops.c b/block/fops.c
index f56811a925a0..9189f3239c9c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -599,6 +599,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   ssize_t comp_len = 0;
+
+   if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT))
+   comp_len = blkdev_copy_offload_failfast(in_bdev, pos_in,
+   out_bdev, pos_out, len, GFP_KERNEL);
+   if (comp_len != len)
+   comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
+   file_out, pos_out + comp_len, len - comp_len, flags);
+
+   return comp_len;
+}
+
 #defineBLKDEV_FALLOC_FL_SUPPORTED  
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -692,6 +711,7 @@ const struct file_operations def_blk_fops = {
.splice_read= filemap_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/read_write.c b/fs/read_write.c
index b07de77ef126..d27148a2543f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1447,7 +1447,8 @@ static int generic_copy_file_checks(struct file *file_in, 
loff_t pos_in,
return -EOVERFLOW;
 
/* Shorten the copy to EOF */
-   size_in = i_size_read(inode_in);
+   size_in = i_size_read(file_in->f_mapping->host);
+
if (pos_in >= size_in)
count = 0;
else
@@ -1708,7 +1709,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+   if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+   (!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69fe977afdc9..a634768a2318 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1057,6 +1057,10 @@ ssize_t blkdev_copy_offload(
struct block_device *bdev_in, loff_t pos_in,
struct block_device *bdev_out, loff_t pos_out,

[dm-devel] [PATCH v12 5/9] nvme: add copy offload support

2023-06-05 Thread Nitesh Shetty
For devices supporting native copy, the nvme driver receives read and
write requests with the BLK_COPY op flag.
For a read request the nvme driver populates the payload with the source
information.
For a write request the driver converts it to an nvme copy command using the
source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token based copy.

Add trace event support for nvme_copy_cmd.
Set the device copy limits to the queue limits.
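
As a rough illustration of the single-range encoding this produces, here is a
tiny standalone example; the 4K LBA format and the sector numbers are
hypothetical, and the shift mirrors nvme_sect_to_lba():

#include <stdio.h>

int main(void)
{
	unsigned int lba_shift = 12;	/* hypothetical 4096-byte LBA format */
	unsigned long long src_sector = 2048, dst_sector = 8192, n_sectors = 256;

	/* 512B sectors -> LBAs; nlb is 0-based as in the copy range entry */
	unsigned long long slba  = src_sector >> (lba_shift - 9);
	unsigned long long sdlba = dst_sector >> (lba_shift - 9);
	unsigned long long nlb   = (n_sectors >> (lba_shift - 9)) - 1;

	printf("copy: sdlba=%llu, range[0].slba=%llu, range[0].nlb=%llu, nr_range=0\n",
	       sdlba, slba, nlb);
	return 0;
}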

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c  | 103 +-
 drivers/nvme/host/fc.c|   5 ++
 drivers/nvme/host/nvme.h  |   7 +++
 drivers/nvme/host/pci.c   |  27 -
 drivers/nvme/host/rdma.c  |   7 +++
 drivers/nvme/host/tcp.c   |  16 ++
 drivers/nvme/host/trace.c |  19 +++
 include/linux/nvme.h  |  43 +-
 9 files changed, 220 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index 5e4f8848dce0..311ad67e9cf3 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1715a508496c..ce1fec07dda6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -763,6 +763,77 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline void nvme_setup_copy_read(struct nvme_ns *ns, struct request 
*req)
+{
+   struct bio *bio = req->bio;
+   struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+   token->subsys = "nvme";
+   token->ns = ns;
+   token->src_sector = bio->bi_iter.bi_sector;
+   token->sectors = bio->bi_iter.bi_size >> 9;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+  struct request *req, struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio = req->bio;
+   struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+   sector_t src_sector, dst_sector, n_sectors;
+   u64 src_lba, dst_lba, n_lba;
+   unsigned short nr_range = 1;
+   u16 control = 0;
+
+   if (unlikely(memcmp(token->subsys, "nvme", 4)))
+   return BLK_STS_NOTSUPP;
+   if (unlikely(token->ns != ns))
+   return BLK_STS_NOTSUPP;
+
+   src_sector = token->src_sector;
+   dst_sector = bio->bi_iter.bi_sector;
+   n_sectors = token->sectors;
+   if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+   return BLK_STS_NOTSUPP;
+
+   src_lba = nvme_sect_to_lba(ns, src_sector);
+   dst_lba = nvme_sect_to_lba(ns, dst_sector);
+   n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+   if (WARN_ON(!n_lba))
+   return BLK_STS_NOTSUPP;
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+   GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   cmnd->copy.nr_range = 0;
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   cmnd->copy.control = cpu_to_le16(control);
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -997,10 +1068,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
ret = nvme_setup_discard(ns, req, cmd);
break;
case REQ_OP_READ:
-   ret 

[dm-devel] [PATCH v12 3/9] block: add emulation for copy

2023-06-05 Thread Nitesh Shetty
For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor is
not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination asynchronously.
Emulation is also used if copy offload fails or partially completes.
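
To make the flow concrete, below is a synchronous userspace sketch of the same
idea; the kernel version issues the read and write as bios and completes them
asynchronously, and the 1 MiB starting buffer and 4 KiB floor are illustrative
values only:

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t copy_emulate(int fd_in, off_t pos_in, int fd_out, off_t pos_out,
			    size_t len)
{
	size_t buf_len = 1 << 20;	/* start big, like a full copy payload */
	size_t done = 0;
	void *buf = NULL;

	/* retry half the requested size on allocation failure */
	while (buf_len >= 4096 && !(buf = malloc(buf_len)))
		buf_len >>= 1;
	if (!buf)
		return -1;

	while (done < len) {
		size_t chunk = (len - done < buf_len) ? len - done : buf_len;
		ssize_t n = pread(fd_in, buf, chunk, pos_in + done);

		if (n <= 0 || pwrite(fd_out, buf, n, pos_out + done) != n)
			break;	/* return what completed, like comp_len */
		done += n;
	}
	free(buf);
	return done;
}

A caller would invoke this as copy_emulate(src_fd, 0, dst_fd, 0, length) and
treat a short return as a partial copy.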

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c| 178 -
 block/blk-map.c|   4 +-
 include/linux/blkdev.h |   3 +
 3 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index b8e11997b5bf..99b65af8bfc1 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -300,6 +300,175 @@ static ssize_t __blkdev_copy_offload(
return blkdev_copy_wait_completion(cio);
 }
 
+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+   gfp_t gfp_mask)
+{
+   int min_size = PAGE_SIZE;
+   void *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp_mask);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   /* retry half the requested size */
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static void blkdev_copy_emulate_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   kfree(bvec_virt(&bio->bi_io_vec[0]));
+   bio_map_kern_endio(bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+}
+
+static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+   cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   kfree(bvec_virt(&read_bio->bi_io_vec[0]));
+   bio_map_kern_endio(read_bio);
+   kfree(ctx);
+
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   }
+   schedule_work(&ctx->dispatch_work);
+   kfree(read_bio);
+}
+
+/*
+ * If native copy offload feature is absent, this function tries to emulate,
+ * by copying data from source to a temporary buffer and from buffer to
+ * destination device.
+ * Returns the length of bytes copied or error if encountered
+ */
+static ssize_t __blkdev_copy_emulate(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct request_queue *in = bdev_get_queue(bdev_in);
+   struct request_queue *out = bdev_get_queue(bdev_out);
+   struct bio *read_bio, *write_bio;
+   void *buf = NULL;
+   struct copy_ctx *ctx;
+   struct cio *cio;
+   sector_t buf_len, req_len, rem = 0;
+   sector_t max_src_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(in),
+   queue_max_segments(in) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_dst_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(out),
+   queue_max_segments(out) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_hw_len = min_t(unsigned int,
+   max_src_hw_len, max_dst_hw_len);
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->pos_in = pos_in;
+   cio->pos_out = pos_out;
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   for (rem = len; rem > 0; rem -= buf_len) {
+   req_len = min_t(int, max_h

[dm-devel] [PATCH v12 1/9] block: Introduce queue limits for copy-offload support

2023-06-05 Thread Nitesh Shetty
Add device limits as sysfs entries,
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)

Above limits help to split the copy payload in the block layer.
copy_offload: used for setting copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of copy in a single payload.
copy_max_bytes_hw: Reflects the device supported maximum limit.
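
For reference, a small userspace sketch of how these knobs can be queried and
set; the device name is a made-up example, and writing '1' to copy_offload is
expected to fail when copy_max_bytes_hw is 0:

#include <stdio.h>

int main(void)
{
	char buf[32];
	FILE *f;

	/* copy_max_bytes_hw == 0 means the device cannot offload copies */
	f = fopen("/sys/block/nvme0n1/queue/copy_max_bytes_hw", "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fclose(f);
	printf("hw copy limit: %s", buf);

	/* switch from emulation (0) to offload (1) */
	f = fopen("/sys/block/nvme0n1/queue/copy_offload", "w");
	if (!f)
		return 1;
	fputs("1\n", f);
	fclose(f);
	return 0;
}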

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 33 +++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 63 
 include/linux/blkdev.h   | 12 ++
 include/uapi/linux/fs.h  |  3 ++
 5 files changed, 135 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..3c97303f658b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block//queue/copy_offload
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] When read, this file shows whether offloading copy to a
+   device is enabled (1) or disabled (0). Writing '0' to this
+   file will disable offloading copies for this device.
+   Writing any '1' value will enable this feature. If the device
+   does not support offloading, then writing 1 will result in an
+   error.
+
+
+What:  /sys/block//queue/copy_max_bytes
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes that the block layer
+   will allow for a copy request. This is always smaller than or
+   equal to the maximum size allowed by the hardware, indicated by
+   'copy_max_bytes_hw'. An attempt to set a value higher than
+   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:  /sys/block//queue/copy_max_bytes_hw
+Date:  June 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes that the hardware
+   will allow for a single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dd59059b788..738cd3f21259 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_sectors_hw = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_sectors_hw = UINT_MAX;
+   lim->max_copy_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+   unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_sectors_hw = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+   b->max_copy_sectors_hw);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a642085

[dm-devel] [PATCH v12 2/9] block: Add copy offload support infrastructure

2023-06-05 Thread Nitesh Shetty
Introduce blkdev_issue_copy, which takes arguments similar to
copy_file_range and performs copy offload between two bdevs.
Introduce the REQ_COPY copy offload operation flag. Create a read-write
bio pair with a token as payload, submitted to the device in order.
The read request populates the token with source-specific information,
which is then passed with the write request.
This design is courtesy of Mikulas Patocka's token based copy.

Larger copies will be divided, based on the max_copy_sectors limit.
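
For illustration, the token carried as the bio payload looks roughly like the
struct below; the field names mirror the nvme/null_blk implementations later
in the series and are a sketch, not a stable ABI:

#include <stdio.h>

struct copy_token {
	const char		*subsys;	/* e.g. "nvme", sanity-checked by the driver */
	void			*ns;		/* source device/namespace, must match on write */
	unsigned long long	src_sector;	/* recorded while handling the read bio */
	unsigned long long	sectors;	/* copy length in 512B sectors */
};

int main(void)
{
	/* the read bio fills in the source; the write bio carries the
	 * destination sector, so the driver can emit one copy command */
	struct copy_token t = { .subsys = "nvme", .src_sector = 0, .sectors = 8 };

	printf("copy of %llu sectors starting at sector %llu\n",
	       t.sectors, t.src_sector);
	return 0;
}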

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c   | 243 ++
 block/blk.h   |   2 +
 include/linux/blk_types.h |  25 
 include/linux/blkdev.h|   4 +
 include/uapi/linux/fs.h   |   5 +-
 5 files changed, 278 insertions(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..b8e11997b5bf 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,249 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_(offload/emulate)_write_endio.
+ */
+static ssize_t blkdev_copy_wait_completion(struct cio *cio)
+{
+   ssize_t ret;
+
+   if (cio->endio)
+   return 0;
+
+   if (atomic_read(&cio->refcount)) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_io_schedule();
+   }
+
+   ret = cio->comp_len;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   kfree(bvec_virt(&bio->bi_io_vec[0]));
+   bio_put(bio);
+
+   kfree(ctx);
+   if (!atomic_dec_and_test(&cio->refcount))
+   return;
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT)
+   - cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   kfree(bvec_virt(&read_bio->bi_io_vec[0]));
+   bio_put(ctx->write_bio);
+   bio_put(read_bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   return;
+   }
+
+   schedule_work(&ctx->dispatch_work);
+   bio_put(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+   struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+   dispatch_work);
+
+   submit_bio(ctx->write_bio);
+}
+
+/*
+ * __blkdev_copy_offload   - Use device's native copy offload feature.
+ * we perform copy operation by sending 2 bio.
+ * 1. First we send a read bio with REQ_COPY flag along with a token and source
+ * and length. Once read bio reaches driver layer, device driver adds all the
+ * source info to token and does a fake completion.
+ * 2. Once read operation completes, we issue write with REQ_COPY flag with 
same
+ * token. In driver layer, token info is used to form a copy offload command.
+ *
+ * Returns the length of bytes copied or error if encountered
+ */
+static ssize_t __blkdev_copy_offload(
+   struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out,
+   size_t len, cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct cio *cio;
+   struct copy_ctx *ctx;
+   struct bio *read_bio, *write_bio;
+   void *token;
+   sector_t copy_len;
+   sector_t rem, max_copy_len;
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private 

[dm-devel] [PATCH v12 0/9] Implement copy offload support

2023-06-05 Thread Nitesh Shetty
The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patchset and
further additional features suggested by community.
Patchset borrows Mikulas's token based approach for 2 bdev implementation.

This is next iteration of our previous patchset v11[1].

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file backend).

2. Block layer
- Block-generic copy (REQ_COPY flag), with interface
accommodating two block-devs
- Emulation, for in-kernel user when offload is natively 
absent
- dm-linear support (for cases not requiring split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVMe simple copy (TP 4065), by setting the nvme-ns
parameters mssrl, mcl, msrc. For more info see [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3] (tests block/034-037, nvme/050-053)

Emulation can be tested on any device.

fio[4].

Infra and plumbing:
===
We populate the copy_file_range callback in def_blk_fops.
For devices that support copy-offload, use blkdev_copy_offload to
achieve in-device copy.
However, for cases where the device doesn't support offload,
fall back to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), we use blkdev_issue_copy,
which implements its own emulation, as an fd is not available.
Modify checks in generic_copy_file_range to support block devices.

Performance:

The major benefit of this copy-offload/emulation framework is
observed in a fabrics setup, for copy workloads across the network.
The host sends an offload command over the network and the actual copy
is achieved using emulation on the target.
This results in higher performance and lower network consumption,
compared to reads and writes travelling across the network.
With the async design of copy-offload/emulation we see the
following improvements over userspace read + write on an
NVMeOF TCP setup:

Setup1: Network Speed: 1000Mb/s
Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Target PC: AMD Ryzen 9 5900X 12-Core Processor
block size 8k:
710% improvement in IO BW (108 MiB/s to 876 MiB/s).
Network utilisation drops from  97% to 15%.
block-size 1M:
2532% improvement in IO BW (101 MiB/s to 2659 MiB/s).
Network utilisation drops from 89% to 0.62%.

Setup2: Network Speed: 100Gb/s
Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
(host and target have the same configuration)
block-size 8k:
17.5% improvement in IO BW (794 MiB/s to 933 MiB/s).
Network utilisation drops from  6.75% to 0.16%.

Blktests[3]
==
tests/block/034,035: Runs copy offload and emulation on block
  device.
tests/block/036,037: Runs copy offload and emulation on null
  block device.
tests/nvme/050-053: Create a loop backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to test copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/all/20230522104146.2856-1-nj.she...@samsung.com/
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v12
[4] https://github.com/vincentkfu/fio/tree/copyoffload-3.35-v12

Changes since v11:
=
- Documentation: Improved documentation (Damien Le Moal)
- block,nvme: ssize_t return values (Darrick J. Wong)
- block: token is allocated to SECTOR_SIZE (Matthew Wilcox)
- block: mem leak fix (Maurizio Lombardi)

Changes since v10:
=
- NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
- NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=
- null_blk, improved documentation, minor fixes

Re: [dm-devel] [PATCH v11 2/9] block: Add copy offload support infrastructure

2023-05-30 Thread Nitesh Shetty

On 23/05/29 06:55PM, Matthew Wilcox wrote:

On Mon, May 22, 2023 at 04:11:33PM +0530, Nitesh Shetty wrote:

+   token = alloc_page(gfp_mask);


Why is PAGE_SIZE the right size for 'token'?  That seems quite unlikely.
I could understand it being SECTOR_SIZE or something that's dependent on
the device, but I cannot fathom it being dependent on the host's page size.


This has nothing to do with the block device at this point; it is merely a
placeholder to store information about copy offload such as source sectors and length.
The token will be typecast by the driver to get the copy info.
SECTOR_SIZE should also work in this case, will update in the next version.
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v11 2/9] block: Add copy offload support infrastructure

2023-05-30 Thread Nitesh Shetty
> > +/*
> > + * @bdev_in: source block device
> > + * @pos_in:  source offset
> > + * @bdev_out:destination block device
> > + * @pos_out: destination offset
> > + * @len: length in bytes to be copied
> > + * @endio:   endio function to be called on completion of copy operation,
> > + *   for synchronous operation this should be NULL
> > + * @private: endio function will be called with this private data, should 
> > be
> > + *   NULL, if operation is synchronous in nature
> > + * @gfp_mask:   memory allocation flags (for bio_alloc)
> > + *
> > + * Returns the length of bytes copied or error if encountered
> > + *
> > + * Description:
> > + *   Copy source offset from source block device to destination block
> > + *   device. Max total length of copy is limited to MAX_COPY_TOTAL_LENGTH
> > + */
> > +int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
>
> I'd have thought you'd return ssize_t here.  If the two block devices
> are loopmounted xfs files, we can certainly reflink "copy" more than 2GB
> at a time.
>
> --D
>

Sure, we will add this to make the API future proof, but at present we do have
a limit for copy: COPY_MAX_BYTES (=128MB). This limit is based
on our internal testing; we have plans to increase/remove this
limit in future phases.

Thank you,
Nitesh Shetty

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v11 2/9] block: Add copy offload support infrastructure

2023-05-30 Thread Nitesh Shetty

On 23/05/30 01:29PM, Maurizio Lombardi wrote:

On Mon, 22 May 2023 at 13:17, Nitesh Shetty wrote:


+static int __blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out, size_t len,
+   cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct cio *cio;
+   struct copy_ctx *ctx;
+   struct bio *read_bio, *write_bio;
+   struct page *token;
+   sector_t copy_len;
+   sector_t rem, max_copy_len;
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+   bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
+
+   cio->pos_in = pos_in;
+   cio->pos_out = pos_out;
+   /* If there is a error, comp_len will be set to least successfully
+* completed copied length
+*/
+   cio->comp_len = len;
+   for (rem = len; rem > 0; rem -= copy_len) {
+   copy_len = min(rem, max_copy_len);
+
+   token = alloc_page(gfp_mask);
+   if (unlikely(!token))
+   goto err_token;


[...]


+err_token:
+   cio->comp_len = min_t(sector_t, cio->comp_len, (len - rem));
+   if (!atomic_read(&cio->refcount))
+   return -ENOMEM;
+   /* Wait for submitted IOs to complete */
+   return blkdev_copy_wait_completion(cio);
+}


Suppose the first call to "token = alloc_page()" fails (and
cio->refcount == 0), isn't "cio" going to be leaked here?

Maurizio



Agreed, will free it in "err_token", and will update next version.

Thank you,
Nitesh Shetty
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v11 1/9] block: Introduce queue limits for copy-offload support

2023-05-23 Thread Nitesh Shetty
On Mon, May 22, 2023 at 08:45:44PM +0900, Damien Le Moal wrote:
> On 5/22/23 19:41, Nitesh Shetty wrote:
> > Add device limits as sysfs entries,
> > - copy_offload (RW)
> > - copy_max_bytes (RW)
> > - copy_max_bytes_hw (RO)
> > 
> > Above limits help to split the copy payload in block layer.
> > copy_offload: used for setting copy offload(1) or emulation(0).
> > copy_max_bytes: maximum total length of copy in single payload.
> > copy_max_bytes_hw: Reflects the device supported maximum limit.
> > 
> > Reviewed-by: Hannes Reinecke 
> > Signed-off-by: Nitesh Shetty 
> > Signed-off-by: Kanchan Joshi 
> > Signed-off-by: Anuj Gupta 
> > ---
> >  Documentation/ABI/stable/sysfs-block | 33 ++
> >  block/blk-settings.c | 24 +++
> >  block/blk-sysfs.c| 64 
> >  include/linux/blkdev.h   | 12 ++
> >  include/uapi/linux/fs.h  |  3 ++
> >  5 files changed, 136 insertions(+)
> > 
> > diff --git a/Documentation/ABI/stable/sysfs-block 
> > b/Documentation/ABI/stable/sysfs-block
> > index c57e5b7cb532..e4d31132f77c 100644
> > --- a/Documentation/ABI/stable/sysfs-block
> > +++ b/Documentation/ABI/stable/sysfs-block
> > @@ -155,6 +155,39 @@ Description:
> > last zone of the device which may be smaller.
> >  
> >  
> > +What:  /sys/block//queue/copy_offload
> > +Date:  April 2023
> > +Contact:   linux-bl...@vger.kernel.org
> > +Description:
> > +   [RW] When read, this file shows whether offloading copy to a
> > +   device is enabled (1) or disabled (0). Writing '0' to this
> > +   file will disable offloading copies for this device.
> > +   Writing any '1' value will enable this feature. If the device
> > +   does not support offloading, then writing 1, will result in
> > +   error.
> 
> will result is an error.
> 

acked

> > +
> > +
> > +What:  /sys/block//queue/copy_max_bytes
> > +Date:  April 2023
> > +Contact:   linux-bl...@vger.kernel.org
> > +Description:
> > +   [RW] This is the maximum number of bytes, that the block layer
> 
> Please drop the comma after block.

You mean after bytes? Acked.

> 
> > +   will allow for copy request. This will be smaller or equal to
> 
> will allow for a copy request. This value is always smaller...
> 

acked

> > +   the maximum size allowed by the hardware, indicated by
> > +   'copy_max_bytes_hw'. Attempt to set value higher than
> 
> An attempt to set a value higher than...
> 

acked

> > +   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
> > +
> > +
> > +What:  /sys/block//queue/copy_max_bytes_hw
> > +Date:  April 2023
> > +Contact:   linux-bl...@vger.kernel.org
> > +Description:
> > +   [RO] This is the maximum number of bytes, that the hardware
> 
> drop the comma after bytes
> 

acked

> > +   will allow in a single data copy request.
> 
> will allow for
> 

acked

> > +   A value of 0 means that the device does not support
> > +   copy offload.
> 
> Given that you do have copy emulation for devices that do not support hw
> offload, how is the user supposed to know the maximum size of a copy request
> when it is emulated ? This is not obvious from looking at these parameters.
> 

This was a little tricky for us as well.
There are multiple limits (such as max_hw_sectors, max_segments,
buffer allocation size) which decide what the emulated copy size is.
Moreover, this limit was supposed to reflect the device copy offload
size/capability.
Let me know if you have something in mind which can make this look better.
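
As a rough illustration of how those limits combine in the emulation path (the
numbers below are made-up example values, not real device limits):

#include <stdio.h>

int main(void)
{
	unsigned long sector = 512, page = 4096;
	/* per-device limits, example values only */
	unsigned long hw_sectors_in = 2048, segments_in = 128;
	unsigned long hw_sectors_out = 1024, segments_out = 64;

	unsigned long in_max = (hw_sectors_in < segments_in * (page / sector) ?
				hw_sectors_in : segments_in * (page / sector)) * sector;
	unsigned long out_max = (hw_sectors_out < segments_out * (page / sector) ?
				 hw_sectors_out : segments_out * (page / sector)) * sector;

	/* the chunk is further capped by how large a bounce buffer we can allocate */
	printf("emulated copy chunk <= %lu bytes\n",
	       in_max < out_max ? in_max : out_max);
	return 0;
}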

> > +
> > +
> >  What:  /sys/block//queue/crypto/
> >  Date:  February 2022
> >  Contact:   linux-bl...@vger.kernel.org
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index 896b4654ab00..23aff2d4dcba 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
> > lim->zoned = BLK_ZONED_NONE;
> > lim->zone_write_granularity = 0;
> > lim->dma_alignment = 511;
> > +   lim->max_copy_sectors_hw = 0;
> > +   lim->max_copy_sectors = 0;
> >  }
> >  
> >  /**
> > @@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
>

[dm-devel] [PATCH v11 6/9] nvmet: add copy command support for bdev and file ns

2023-05-22 Thread Nitesh Shetty
Add support for handling the nvme_cmd_copy command on the target.
For bdev-ns we call into blkdev_issue_copy, which the block layer
completes by an offloaded copy request to the backend bdev or by emulating the
request.

For file-ns we call vfs_copy_file_range to service our request.

Currently the target always shows copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS.

The loop target has copy support, which can be used to test copy offload.

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/target/admin-cmd.c   |  9 -
 drivers/nvme/target/io-cmd-bdev.c | 62 +++
 drivers/nvme/target/io-cmd-file.c | 52 ++
 drivers/nvme/target/loop.c|  6 +++
 drivers/nvme/target/nvmet.h   |  1 +
 5 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..8e644b8ec0fd 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-   NVME_CTRL_ONCS_WRITE_ZEROES);
-
+   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
 
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
*req)
 
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+   else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+   (PAGE_SHIFT - SECTOR_SHIFT));
+   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+   }
 
/*
 * We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..d92dfe86c647 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct 
nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+   if (bdev_max_copy_sectors(bdev)) {
+   id->msrc = id->msrc;
+   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   } else {
+   id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+   bdev_logical_block_size(bdev));
+   id->mcl = cpu_to_le32((__force u32)id->mssrl);
+   }
 }
 
 void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +196,21 @@ static void nvmet_bio_done(struct bio *bio)
nvmet_req_bio_put(req, bio);
 }
 
+static void nvmet_bdev_copy_end_io(void *private, int comp_len)
+{
+   struct nvmet_req *req = (struct nvmet_req *)private;
+   u16 status;
+
+   if (comp_len == req->copy_len) {
+   req->cqe->result.u32 = cpu_to_le32(1);
+   status = errno_to_nvme_status(req, 0);
+   } else {
+   req->cqe->result.u32 = cpu_to_le32(0);
+   status = errno_to_nvme_status(req, (__force u16)BLK_STS_IOERR);
+   }
+   nvmet_req_complete(req, status);
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
struct sg_mapping_iter *miter)
@@ -450,6 +477,37 @@ static void nvmet_bdev_execute_write_zeroes(struct 
nvmet_req *req)
}
 }
 
+/* At present we handle only one range entry, since copy offload is aligned 
with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+   struct nvme_copy_range range;
+   struct nvme_command *cmd = req->cmd;
+   int ret;
+   u16 status;
+
+   status = nvmet_copy_from_sgl(req, 0, &range, sizeof(range));
+   if (status)
+   goto out;
+
+   ret = blkdev_issue_copy(req->ns->bdev,
+   le64_to_cpu(cmd->copy.sdlba) << req->ns->blksize_shift,
+   req->ns->bdev,
+   le64_to_cpu(range.slba) << req->ns->blksize_shift,
+   (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift,
+   nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);

[dm-devel] [PATCH v11 9/9] null_blk: add support for copy offload

2023-05-22 Thread Nitesh Shetty
Implementation is based on the existing read and write infrastructure.
copy_max_bytes: A new configfs and module parameter is introduced, which
can be used to set the hardware/driver supported maximum copy limit.
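
For reference, a userspace sketch of setting the new attribute on a
configfs-created null_blk device; it assumes configfs is mounted at
/sys/kernel/config, a device named nullb0 already exists, and the 8 MiB value
is only an example:

#include <stdio.h>

int main(void)
{
	/* typically set before writing 1 to the device's 'power' attribute */
	FILE *f = fopen("/sys/kernel/config/nullb/nullb0/copy_max_bytes", "w");

	if (!f)
		return 1;
	fputs("8388608\n", f);	/* 8 MiB per copy command */
	fclose(f);
	return 0;
}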

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 drivers/block/null_blk/main.c | 108 --
 drivers/block/null_blk/null_blk.h |   8 +++
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index b3fedafe301e..34e009b3ebd5 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
	&nullb_device_attr_queue_mode,
	&nullb_device_attr_blocksize,
	&nullb_device_attr_max_sectors,
+	&nullb_device_attr_copy_max_bytes,
	&nullb_device_attr_irqmode,
	&nullb_device_attr_hw_queue_depth,
	&nullb_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1279,78 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline void nullb_setup_copy_read(struct nullb *nullb, struct bio *bio)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+   token->subsys = "nullb";
+   token->sector_in = bio->bi_iter.bi_sector;
+   token->nullb = nullb;
+   token->sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+}
+
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+   struct bio *bio, bool is_fua)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+   sector_t sector_in, sector_out;
+   void *in, *out;
+   size_t rem, temp;
+   unsigned long offset_in, offset_out;
+   struct nullb_page *t_page_in, *t_page_out;
+   int ret = -EIO;
+
+   if (unlikely(memcmp(token->subsys, "nullb", 5)))
+   return -EINVAL;
+   if (unlikely(token->nullb != nullb))
+   return -EINVAL;
+   if (WARN_ON(token->sectors != bio->bi_iter.bi_size >> SECTOR_SHIFT))
+   return -EINVAL;
+
+   sector_in = token->sector_in;
+   sector_out = bio->bi_iter.bi_sector;
+   rem = token->sectors << SECTOR_SHIFT;
+
+   spin_lock_irq(&nullb->lock);
+   while (rem > 0) {
+   temp = min_t(size_t, nullb->dev->blocksize, rem);
+   offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+   offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+   if (null_cache_active(nullb) && !is_fua)
+   null_make_cache_space(nullb, PAGE_SIZE);
+
+   t_page_in = null_lookup_page(nullb, sector_in, false,
+   !null_cache_active(nullb));
+   if (!t_page_in)
+   goto err;
+   t_page_out = null_insert_page(nullb, sector_out,
+   !null_cache_a

[dm-devel] [PATCH v11 7/9] dm: Add support for copy offload

2023-05-22 Thread Nitesh Shetty
Before enabling copy for a dm target, check if the underlying devices and
the dm target support copy. Avoid splits happening inside the dm target.
Fail early if the request needs a split; currently splitting a copy
request is not supported.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-table.c | 41 +++
 drivers/md/dm.c   |  7 ++
 include/linux/device-mapper.h |  5 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 1398f1d6e83e..b3269271e761 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1867,6 +1867,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
 }
 
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+   struct request_queue *q = bdev_get_queue(dev->bdev);
+
+   return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+   struct dm_target *ti;
+   unsigned int i;
+
+   for (i = 0; i < t->num_targets; i++) {
+   ti = dm_table_get_target(t, i);
+
+   if (!ti->copy_offload_supported)
+   return false;
+
+   /*
+* target provides copy support (as implied by setting
+* 'copy_offload_supported')
+* and it relies on _all_ data devices having copy support.
+*/
+   if (!ti->type->iterate_devices ||
+ti->type->iterate_devices(ti,
+device_not_copy_capable, NULL))
+   return false;
+   }
+
+   return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
  sector_t start, sector_t len, void *data)
 {
@@ -1949,6 +1982,14 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
q->limits.discard_misaligned = 0;
}
 
+   if (!dm_table_supports_copy(t)) {
+   blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+   q->limits.max_copy_sectors = 0;
+   q->limits.max_copy_sectors_hw = 0;
+   } else {
+   blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+   }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3b694ba3a106..ab9069090a7d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,13 @@ static blk_status_t __split_and_process_bio(struct 
clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
 
+   if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+   max_io_len(ti, ci->sector) < ci->sector_count)) {
+   DMERR("Error, IO size(%u) > max target size(%llu)\n",
+   ci->sector_count, max_io_len(ti, ci->sector));
+   return BLK_STS_IOERR;
+   }
+
/*
 * Only support bio polling for normal IO, and the target io is
 * exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a52d2b9a6846..04016bd76e73 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -398,6 +398,11 @@ struct dm_target {
 * bio_set_dev(). NOTE: ideally a target should _not_ need this.
 */
bool needs_bio_set_dev:1;
+
+   /*
+* copy offload is supported
+*/
+   bool copy_offload_supported:1;
 };
 
 void *dm_per_bio_data(struct bio *bio, size_t data_size);
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v11 1/9] block: Introduce queue limits for copy-offload support

2023-05-22 Thread Nitesh Shetty
Add device limits as sysfs entries,
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)

Above limits help to split the copy payload in the block layer.
copy_offload: used for setting copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of copy in a single payload.
copy_max_bytes_hw: Reflects the device supported maximum limit.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 33 ++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 64 
 include/linux/blkdev.h   | 12 ++
 include/uapi/linux/fs.h  |  3 ++
 5 files changed, 136 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..e4d31132f77c 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block//queue/copy_offload
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] When read, this file shows whether offloading copy to a
+   device is enabled (1) or disabled (0). Writing '0' to this
+   file will disable offloading copies for this device.
+   Writing any '1' value will enable this feature. If the device
+   does not support offloading, then writing 1, will result in
+   error.
+
+
+What:  /sys/block//queue/copy_max_bytes
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes, that the block layer
+   will allow for copy request. This will be smaller or equal to
+   the maximum size allowed by the hardware, indicated by
+   'copy_max_bytes_hw'. Attempt to set value higher than
+   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:  /sys/block//queue/copy_max_bytes_hw
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes, that the hardware
+   will allow in a single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..23aff2d4dcba 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_sectors_hw = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_sectors_hw = ULONG_MAX;
+   lim->max_copy_sectors = ULONG_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+   unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_sectors_hw = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+   b->max_copy_sectors_hw);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..8

[dm-devel] [PATCH v11 4/9] fs, block: copy_file_range for def_blk_ops for direct block device

2023-05-22 Thread Nitesh Shetty
For a direct block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fall back to generic_copy_file_range in case
the device copy offload capability is absent.
Modify checks to allow bdevs to use copy_file_range.

Suggested-by: Ming Lei 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
---
 block/blk-lib.c| 23 +++
 block/fops.c   | 20 
 fs/read_write.c| 11 +--
 include/linux/blkdev.h |  3 +++
 mm/filemap.c   | 11 ---
 5 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index ba32545eb8d5..7d6ef85692a6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -523,6 +523,29 @@ int blkdev_issue_copy(struct block_device *bdev_in, loff_t 
pos_in,
 }
 EXPORT_SYMBOL_GPL(blkdev_issue_copy);
 
+/* Returns the length of bytes copied */
+int blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ gfp_t gfp_mask)
+{
+   struct request_queue *in_q = bdev_get_queue(bdev_in);
+   struct request_queue *out_q = bdev_get_queue(bdev_out);
+   int ret = 0;
+
+   if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+   return 0;
+
+   if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
+   ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+   len, NULL, NULL, gfp_mask);
+   if (ret < 0)
+   return 0;
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload);
+
 static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
diff --git a/block/fops.c b/block/fops.c
index ab750e8a040f..df8985675ed1 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -614,6 +614,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
return ret;
 }
 
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+   struct file *file_out, loff_t pos_out,
+   size_t len, unsigned int flags)
+{
+   struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+   struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+   int comp_len = 0;
+
+   if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
+   (file_out->f_iocb_flags & IOCB_DIRECT))
+   comp_len = blkdev_copy_offload(in_bdev, pos_in, out_bdev,
+pos_out, len, GFP_KERNEL);
+   if (comp_len != len)
+   comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
+   file_out, pos_out + comp_len, len - comp_len, flags);
+
+   return comp_len;
+}
+
 #defineBLKDEV_FALLOC_FL_SUPPORTED  
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -697,6 +716,7 @@ const struct file_operations def_blk_fops = {
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
.fallocate  = blkdev_fallocate,
+   .copy_file_range = blkdev_copy_file_range,
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/read_write.c b/fs/read_write.c
index a21ba3be7dbe..47e848fcfd42 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #include 
@@ -1447,7 +1448,11 @@ static int generic_copy_file_checks(struct file 
*file_in, loff_t pos_in,
return -EOVERFLOW;
 
/* Shorten the copy to EOF */
-   size_in = i_size_read(inode_in);
+   if (S_ISBLK(inode_in->i_mode))
+   size_in = bdev_nr_bytes(I_BDEV(file_in->f_mapping->host));
+   else
+   size_in = i_size_read(inode_in);
+
if (pos_in >= size_in)
count = 0;
else
@@ -1708,7 +1713,9 @@ int generic_file_rw_checks(struct file *file_in, struct 
file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
-   if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+   if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+   (!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a95c26faa8b6..a9bb7e3a8c79 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -105

[dm-devel] [PATCH v11 5/9] nvme: add copy offload support

2023-05-22 Thread Nitesh Shetty
For devices supporting native copy, the nvme driver receives read and
write requests with the BLK_COPY op flag.
For a read request the nvme driver populates the payload with the source
information.
For a write request the driver converts it to an nvme copy command using the
source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token based copy.

Add trace event support for nvme_copy_cmd.
Set the device copy limits to the queue limits.

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Javier González 
Signed-off-by: Anuj Gupta 
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c  | 103 +-
 drivers/nvme/host/fc.c|   5 ++
 drivers/nvme/host/nvme.h  |   7 +++
 drivers/nvme/host/pci.c   |  27 -
 drivers/nvme/host/rdma.c  |   7 +++
 drivers/nvme/host/tcp.c   |  16 ++
 drivers/nvme/host/trace.c |  19 +++
 include/linux/nvme.h  |  43 +-
 9 files changed, 220 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index bc523ca02254..01be882b726f 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+   [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Management Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccb6eb1282f8..aef7b59dbd61 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -754,6 +754,77 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
+static inline void nvme_setup_copy_read(struct nvme_ns *ns, struct request 
*req)
+{
+   struct bio *bio = req->bio;
+   struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+   token->subsys = "nvme";
+   token->ns = ns;
+   token->src_sector = bio->bi_iter.bi_sector;
+   token->sectors = bio->bi_iter.bi_size >> 9;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+  struct request *req, struct nvme_command *cmnd)
+{
+   struct nvme_copy_range *range = NULL;
+   struct bio *bio = req->bio;
+   struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+   sector_t src_sector, dst_sector, n_sectors;
+   u64 src_lba, dst_lba, n_lba;
+   unsigned short nr_range = 1;
+   u16 control = 0;
+
+   if (unlikely(memcmp(token->subsys, "nvme", 4)))
+   return BLK_STS_NOTSUPP;
+   if (unlikely(token->ns != ns))
+   return BLK_STS_NOTSUPP;
+
+   src_sector = token->src_sector;
+   dst_sector = bio->bi_iter.bi_sector;
+   n_sectors = token->sectors;
+   if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+   return BLK_STS_NOTSUPP;
+
+   src_lba = nvme_sect_to_lba(ns, src_sector);
+   dst_lba = nvme_sect_to_lba(ns, dst_sector);
+   n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+   if (WARN_ON(!n_lba))
+   return BLK_STS_NOTSUPP;
+
+   if (req->cmd_flags & REQ_FUA)
+   control |= NVME_RW_FUA;
+
+   if (req->cmd_flags & REQ_FAILFAST_DEV)
+   control |= NVME_RW_LR;
+
+   memset(cmnd, 0, sizeof(*cmnd));
+   cmnd->copy.opcode = nvme_cmd_copy;
+   cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+   cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+   range = kmalloc_array(nr_range, sizeof(*range),
+   GFP_ATOMIC | __GFP_NOWARN);
+   if (!range)
+   return BLK_STS_RESOURCE;
+
+   range[0].slba = cpu_to_le64(src_lba);
+   range[0].nlb = cpu_to_le16(n_lba - 1);
+
+   cmnd->copy.nr_range = 0;
+
+   req->special_vec.bv_page = virt_to_page(range);
+   req->special_vec.bv_offset = offset_in_page(range);
+   req->special_vec.bv_len = sizeof(*range) * nr_range;
+   req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+   cmnd->copy.control = cpu_to_le16(control);
+
+   return BLK_STS_OK;
+}
+
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
 {
@@ -988,10 +1059,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req)
ret = nvme_setup_discard(ns, req, cmd);
break;
case REQ_O

[dm-devel] [PATCH v11 0/9] Implement copy offload support

2023-05-22 Thread Nitesh Shetty
The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patchset and
further additional features suggested by community.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.

This is the next iteration of our previous patchset v10[1].

Overall series supports:

1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file backend).

2. Block layer
- Block-generic copy (REQ_COPY flag), with interface
accommodating two block-devs
- Emulation, for in-kernel user when offload is natively 
absent
- dm-linear support (for cases not requiring split)

3. User-interface
- copy_file_range

Testing
===
Copy offload can be tested on:
a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
parameters mssrl,mcl, msrc. For more info [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3] (tests block/034-037, nvme/050-053)

Emulation can be tested on any device.

fio[4].

Infra and plumbing:
===
We populate copy_file_range callback in def_blk_fops. 
For devices that support copy-offload, use blkdev_copy_offload to
achieve in-device copy.
However, for cases where the device doesn't support offload, we fall
back to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), we use blkdev_issue_copy,
which implements its own emulation, as a file descriptor is not available.
Modify checks in generic_copy_file_range to support block devices.
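
As a minimal userspace sketch of the interface (assuming the plumbing
above, i.e. both block devices opened with O_DIRECT so the bdev
copy_file_range path can attempt offload before falling back):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	loff_t off_in = 0, off_out = 0;
	ssize_t copied;
	int fd_in, fd_out;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <src_bdev> <dst_bdev> <bytes>\n", argv[0]);
		return 1;
	}

	/* O_DIRECT on both fds so the block layer may attempt copy offload */
	fd_in = open(argv[1], O_RDONLY | O_DIRECT);
	fd_out = open(argv[2], O_WRONLY | O_DIRECT);
	if (fd_in < 0 || fd_out < 0) {
		perror("open");
		return 1;
	}

	copied = copy_file_range(fd_in, &off_in, fd_out, &off_out,
				 strtoul(argv[3], NULL, 0), 0);
	if (copied < 0)
		perror("copy_file_range");
	else
		printf("copied %zd bytes\n", copied);

	close(fd_in);
	close(fd_out);
	return copied < 0 ? 1 : 0;
}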

Performance:

The major benefit of this copy-offload/emulation framework is
observed in a fabrics setup, for copy workloads across the network.
The host sends the offload command over the network, and the actual copy
can be achieved using emulation on the target.
This results in higher performance and lower network consumption,
compared to reads and writes travelling across the network.
With the async design of copy offload/emulation, we see the
following improvements compared to userspace read + write on an
NVMe-oF TCP setup:

Setup1: Network Speed: 1000Mb/s
Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Target PC: AMD Ryzen 9 5900X 12-Core Processor
block size 8k:
710% improvement in IO BW (108 MiB/s to 876 MiB/s).
Network utilisation drops from  97% to 15%.
block-size 1M:
2532% improvement in IO BW (101 MiB/s to 2659 MiB/s).
Network utilisation drops from 89% to 0.62%.

Setup2: Network Speed: 100Gb/s
Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
(host and target have the same configuration)
block-size 8k:
17.5% improvement in IO BW (794 MiB/s to 933 MiB/s).
Network utilisation drops from  6.75% to 0.16%.

Blktests[3]
==
tests/block/034,035: Runs copy offload and emulation on a block
  device.
tests/block/036,037: Runs copy offload and emulation on a null
  block device.
tests/nvme/050-053: Create a loop backed fabrics device and
  run copy offload and emulation.

Future Work
===
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to use copy offload

These are to be taken up after this minimal series is agreed upon.

Additional links:
=
[0] 
https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zt14mmpgmivczugb33c60...@mail.gmail.com/

https://lore.kernel.org/linux-nvme/f0e19ae4-b37a-e9a3-2be7-a5afb334a...@nvidia.com/

https://lore.kernel.org/linux-nvme/20230113094648.15614-1-nj.she...@samsung.com/
[1] 
https://lore.kernel.org/all/20230419114320.13674-1-nj.she...@samsung.com/
[2] 
https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v11
[4] https://github.com/vincentkfu/fio/commits/copyoffload-3.34-v11

Changes since v10:
=
- NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
- NVMeOF: sparse warnings (kernel test robot)

Changes since v9:
=
- null_blk, improved documentation, minor fixes(Chaitanya Kulkarni)
- fio, expanded testing and minor fixes (Vincent Fu)

Changes since v8:
=
- null_blk, copy_max_bytes_hw is made config fs parameter
  (Damien Le Moal)
- Negative error handling in copy_file_range (Christian Brauner

[dm-devel] [PATCH v11 3/9] block: add emulation for copy

2023-05-22 Thread Nitesh Shetty
For devices which do not support copy, copy emulation is added.
It is required for in-kernel users like fabrics, where a file descriptor
is not available and hence they can't use copy_file_range.
Copy emulation is implemented by reading from the source into memory and
writing to the corresponding destination asynchronously.
Emulation is also used if copy offload fails or completes partially.
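
Conceptually, the emulation is a chunked read-into-buffer then write-out
loop. A simplified, synchronous userspace sketch of the same idea (using
pread/pwrite and an assumed 1 MiB buffer; this is not the kernel code
below, which works on bios and is asynchronous):

#include <stdlib.h>
#include <unistd.h>

static ssize_t copy_emulate_sketch(int fd_in, off_t pos_in,
				   int fd_out, off_t pos_out, size_t len)
{
	size_t chunk = 1 << 20;			/* assumed buffer size */
	char *buf = malloc(chunk);
	size_t done = 0;

	if (!buf)
		return -1;

	while (done < len) {
		size_t todo = len - done < chunk ? len - done : chunk;
		ssize_t n = pread(fd_in, buf, todo, pos_in + done);

		if (n <= 0)
			break;
		if (pwrite(fd_out, buf, n, pos_out + done) != n)
			break;
		done += n;
	}
	free(buf);
	return done;				/* bytes copied so far */
}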

Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c| 175 -
 block/blk-map.c|   4 +-
 include/linux/blkdev.h |   3 +
 3 files changed, 179 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index ed089e703cb1..ba32545eb8d5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -295,6 +295,172 @@ static int __blkdev_copy_offload(struct block_device 
*bdev_in, loff_t pos_in,
return blkdev_copy_wait_completion(cio);
 }
 
+static void *blkdev_copy_alloc_buf(sector_t req_size, sector_t *alloc_size,
+   gfp_t gfp_mask)
+{
+   int min_size = PAGE_SIZE;
+   void *buf;
+
+   while (req_size >= min_size) {
+   buf = kvmalloc(req_size, gfp_mask);
+   if (buf) {
+   *alloc_size = req_size;
+   return buf;
+   }
+   /* retry half the requested size */
+   req_size >>= 1;
+   }
+
+   return NULL;
+}
+
+static void blkdev_copy_emulate_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   kvfree(page_address(bio->bi_io_vec[0].bv_page));
+   bio_map_kern_endio(bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+}
+
+static void blkdev_copy_emulate_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT) -
+   cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   __free_page(read_bio->bi_io_vec[0].bv_page);
+   bio_map_kern_endio(read_bio);
+   kfree(ctx);
+
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   }
+   schedule_work(&ctx->dispatch_work);
+   kfree(read_bio);
+}
+
+/*
+ * If native copy offload feature is absent, this function tries to emulate,
+ * by copying data from source to a temporary buffer and from buffer to
+ * destination device.
+ * Returns the length of bytes copied or error if encountered
+ */
+static int __blkdev_copy_emulate(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct request_queue *in = bdev_get_queue(bdev_in);
+   struct request_queue *out = bdev_get_queue(bdev_out);
+   struct bio *read_bio, *write_bio;
+   void *buf = NULL;
+   struct copy_ctx *ctx;
+   struct cio *cio;
+   sector_t buf_len, req_len, rem = 0;
+   sector_t max_src_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(in),
+   queue_max_segments(in) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_dst_hw_len = min_t(unsigned int,
+   queue_max_hw_sectors(out),
+   queue_max_segments(out) << (PAGE_SHIFT - SECTOR_SHIFT))
+   << SECTOR_SHIFT;
+   sector_t max_hw_len = min_t(unsigned int,
+   max_src_hw_len, max_dst_hw_len);
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->pos_in = pos_in;
+   cio->pos_out = pos_out;
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   for (rem = len; rem > 0; rem -= buf_len) 

[dm-devel] [PATCH v11 8/9] dm: Enable copy offload for dm-linear target

2023-05-22 Thread Nitesh Shetty
Set the copy_offload_supported flag to enable offload.

Signed-off-by: Nitesh Shetty 
---
 drivers/md/dm-linear.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+   ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
 
-- 
2.35.1.500.gb896f729e2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH v11 2/9] block: Add copy offload support infrastructure

2023-05-22 Thread Nitesh Shetty
Introduce blkdev_issue_copy, which takes similar arguments as
copy_file_range and performs copy offload between two bdevs.
Introduce the REQ_COPY copy offload operation flag. A read-write bio
pair with a token as payload is created and submitted to the device in
order. The read request populates the token with source-specific
information, which is then passed with the write request.
This design is courtesy of Mikulas Patocka's token-based copy.

Larger copies will be divided based on the max_copy_sectors limit.
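
For an in-kernel caller, usage would look roughly like the sketch below.
This assumes blkdev_issue_copy() takes the same argument list as
__blkdev_copy_offload() in the diff, with a NULL endio meaning the call
waits for completion and returns the number of bytes copied:

static int copy_one_extent(struct block_device *src, loff_t src_off,
			   struct block_device *dst, loff_t dst_off,
			   size_t len)
{
	int copied;

	/* synchronous: endio == NULL, so this waits for all bios */
	copied = blkdev_issue_copy(src, src_off, dst, dst_off, len,
				   NULL, NULL, GFP_KERNEL);
	if (copied < 0)
		return copied;		/* error */
	if (copied != len)
		return -EIO;		/* partial copy */
	return 0;
}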

Signed-off-by: Nitesh Shetty 
Signed-off-by: Anuj Gupta 
---
 block/blk-lib.c   | 235 ++
 block/blk.h   |   2 +
 include/linux/blk_types.h |  25 
 include/linux/blkdev.h|   3 +
 4 files changed, 265 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..ed089e703cb1 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,241 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_write_endio.
+ */
+static int blkdev_copy_wait_completion(struct cio *cio)
+{
+   int ret;
+
+   if (cio->endio)
+   return 0;
+
+   if (atomic_read(&cio->refcount)) {
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   blk_io_schedule();
+   }
+
+   ret = cio->comp_len;
+   kfree(cio);
+
+   return ret;
+}
+
+static void blkdev_copy_offload_write_endio(struct bio *bio)
+{
+   struct copy_ctx *ctx = bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (bio->bi_status) {
+   clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   }
+   __free_page(bio->bi_io_vec[0].bv_page);
+   bio_put(bio);
+
+   kfree(ctx);
+   if (!atomic_dec_and_test(&cio->refcount))
+   return;
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *read_bio)
+{
+   struct copy_ctx *ctx = read_bio->bi_private;
+   struct cio *cio = ctx->cio;
+   sector_t clen;
+
+   if (read_bio->bi_status) {
+   clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT)
+   - cio->pos_in;
+   cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+   __free_page(read_bio->bi_io_vec[0].bv_page);
+   bio_put(ctx->write_bio);
+   bio_put(read_bio);
+   kfree(ctx);
+   if (atomic_dec_and_test(&cio->refcount)) {
+   if (cio->endio) {
+   cio->endio(cio->private, cio->comp_len);
+   kfree(cio);
+   } else
+   blk_wake_io_task(cio->waiter);
+   }
+   return;
+   }
+
+   schedule_work(&ctx->dispatch_work);
+   bio_put(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+   struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+   dispatch_work);
+
+   submit_bio(ctx->write_bio);
+}
+
+/*
+ * __blkdev_copy_offload   - Use device's native copy offload feature.
+ * we perform copy operation by sending 2 bio.
+ * 1. First we send a read bio with REQ_COPY flag along with a token and source
+ * and length. Once read bio reaches driver layer, device driver adds all the
+ * source info to token and does a fake completion.
+ * 2. Once read operation completes, we issue write with REQ_COPY flag with 
same
+ * token. In driver layer, token info is used to form a copy offload command.
+ *
+ * Returns the length of bytes copied or error if encountered
+ */
+static int __blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+   struct block_device *bdev_out, loff_t pos_out, size_t len,
+   cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+   struct cio *cio;
+   struct copy_ctx *ctx;
+   struct bio *read_bio, *write_bio;
+   struct page *token;
+   sector_t copy_len;
+   sector_t rem, max_copy_len;
+
+   cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+   if (!cio)
+   return -ENOMEM;
+   atomic_set(&cio->refcount, 0);
+   cio->waiter = current;
+   cio->endio = endio;
+   cio->private = private;
+
+   max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+ 

Re: [dm-devel] [PATCH v9 6/9] nvmet: add copy command support for bdev and file ns

2023-04-26 Thread Nitesh Shetty
On Tue, Apr 25, 2023 at 06:36:51AM +, Chaitanya Kulkarni wrote:
> On 4/11/23 01:10, Anuj Gupta wrote:
> > From: Nitesh Shetty 
> >
> > Add support for handling target command on target.
> 
> what is target command ?
> 
> command that you have added is :nvme_cmd_copy
>

Acked. It was supposed to be nvme_cmd_copy.

> > For bdev-ns we call into blkdev_issue_copy, which the block layer
> > completes by a offloaded copy request to backend bdev or by emulating the
> > request.
> >
> > For file-ns we call vfs_copy_file_range to service our request.
> >
> > Currently target always shows copy capability by setting
> > NVME_CTRL_ONCS_COPY in controller ONCS.
> 
> there is nothing mentioned about target/loop.c in commit log ?
>

Acked, will add a description for the loop device.

> > Signed-off-by: Nitesh Shetty 
> > Signed-off-by: Anuj Gupta 
> > ---
> >   drivers/nvme/target/admin-cmd.c   |  9 +++--
> >   drivers/nvme/target/io-cmd-bdev.c | 58 +++
> >   drivers/nvme/target/io-cmd-file.c | 52 +++
> >   drivers/nvme/target/loop.c|  6 
> >   drivers/nvme/target/nvmet.h   |  1 +
> >   5 files changed, 124 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvme/target/admin-cmd.c 
> > b/drivers/nvme/target/admin-cmd.c
> > index 80099df37314..978786ec6a9e 100644
> > --- a/drivers/nvme/target/admin-cmd.c
> > +++ b/drivers/nvme/target/admin-cmd.c
> > @@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct 
> > nvmet_req *req)
> > id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
> > id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
> > id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
> > -   NVME_CTRL_ONCS_WRITE_ZEROES);
> > -
> > +   NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
> > /* XXX: don't report vwc if the underlying device is write through */
> > id->vwc = NVME_CTRL_VWC_PRESENT;
> >   
> > @@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req 
> > *req)
> >   
> > if (req->ns->bdev)
> > nvmet_bdev_set_limits(req->ns->bdev, id);
> > +   else {
> > +   id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
> > +   id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
> > +   (PAGE_SHIFT - SECTOR_SHIFT));
> > +   id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
> > +   }
> >   
> > /*
> >  * We just provide a single LBA format that matches what the
> > diff --git a/drivers/nvme/target/io-cmd-bdev.c 
> > b/drivers/nvme/target/io-cmd-bdev.c
> > index c2d6cea0236b..0af273097aa4 100644
> > --- a/drivers/nvme/target/io-cmd-bdev.c
> > +++ b/drivers/nvme/target/io-cmd-bdev.c
> > @@ -46,6 +46,19 @@ void nvmet_bdev_set_limits(struct block_device *bdev, 
> > struct nvme_id_ns *id)
> > id->npda = id->npdg;
> > /* NOWS = Namespace Optimal Write Size */
> > id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
> > +
> > +   /*Copy limits*/
> 
> above comment doesn't make any sense ...
>

Acked, will remove it in the next version.

> > +   if (bdev_max_copy_sectors(bdev)) {
> > +   id->msrc = id->msrc;
> > +   id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
> > +   SECTOR_SHIFT) / bdev_logical_block_size(bdev));
> > +   id->mcl = cpu_to_le32(id->mssrl);
> > +   } else {
> > +   id->msrc = (u8)to0based(BIO_MAX_VECS - 1);
> > +   id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
> > +   bdev_logical_block_size(bdev));
> > +   id->mcl = cpu_to_le32(id->mssrl);
> > +   }
> >   }
> >   
> >   void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
> > @@ -184,6 +197,19 @@ static void nvmet_bio_done(struct bio *bio)
> > nvmet_req_bio_put(req, bio);
> >   }
> >   
> > +static void nvmet_bdev_copy_end_io(void *private, int comp_len)
> > +{
> > +   struct nvmet_req *req = (struct nvmet_req *)private;
> > +
> > +   if (comp_len == req->copy_len) {
> > +   req->cqe->result.u32 = cpu_to_le32(1);
> > +   nvmet_req_complete(req, errno_to_nvme_status(req, 0));
> > +   } else {
> > +   req->cqe->result.u32 = cpu_to_le32(0);
> > +   nvmet_req_complete(req, blk_to_nvme_status(req, BLK_STS_IOERR));
> > +   }
> &

[dm-devel] [PATCH v10 1/9] block: Introduce queue limits for copy-offload support

2023-04-20 Thread Nitesh Shetty
Add device limits as sysfs entries:
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)

The above limits help to split the copy payload in the block layer.
copy_offload: used for selecting copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_bytes_hw: reflects the maximum limit supported by the device.
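
A driver that has probed a hardware copy limit would advertise it through
the helper added below; the sysfs entries are then derived from these
queue limits. In this sketch the driver function and hw_copy_bytes value
are made up, while blk_queue_max_copy_sectors_hw() and its clamping to
COPY_MAX_BYTES come from the diff:

static void foo_set_copy_limits(struct request_queue *q,
				unsigned int hw_copy_bytes)
{
	/* sets max_copy_sectors_hw and the soft max_copy_sectors */
	blk_queue_max_copy_sectors_hw(q, hw_copy_bytes >> SECTOR_SHIFT);
}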

Reviewed-by: Hannes Reinecke 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Kanchan Joshi 
Signed-off-by: Anuj Gupta 
---
 Documentation/ABI/stable/sysfs-block | 33 ++
 block/blk-settings.c | 24 +++
 block/blk-sysfs.c| 64 
 include/linux/blkdev.h   | 12 ++
 include/uapi/linux/fs.h  |  3 ++
 5 files changed, 136 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block 
b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..e4d31132f77c 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.
 
 
+What:  /sys/block//queue/copy_offload
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] When read, this file shows whether offloading copy to a
+   device is enabled (1) or disabled (0). Writing '0' to this
+   file will disable offloading copies for this device.
+   Writing any '1' value will enable this feature. If the device
+   does not support offloading, then writing 1, will result in
+   error.
+
+
+What:  /sys/block//queue/copy_max_bytes
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RW] This is the maximum number of bytes, that the block layer
+   will allow for copy request. This will be smaller or equal to
+   the maximum size allowed by the hardware, indicated by
+   'copy_max_bytes_hw'. Attempt to set value higher than
+   'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What:  /sys/block//queue/copy_max_bytes_hw
+Date:  April 2023
+Contact:   linux-bl...@vger.kernel.org
+Description:
+   [RO] This is the maximum number of bytes, that the hardware
+   will allow in a single data copy request.
+   A value of 0 means that the device does not support
+   copy offload.
+
+
 What:  /sys/block//queue/crypto/
 Date:  February 2022
 Contact:   linux-bl...@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..23aff2d4dcba 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+   lim->max_copy_sectors_hw = 0;
+   lim->max_copy_sectors = 0;
 }
 
 /**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->max_copy_sectors_hw = ULONG_MAX;
+   lim->max_copy_sectors = ULONG_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q:  the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+   unsigned int max_copy_sectors)
+{
+   if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+   max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+   q->limits.max_copy_sectors_hw = max_copy_sectors;
+   q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
   b->max_segment_size);
 
+   t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+   t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+   b->max_copy_sectors_hw);
+
t->misaligned |= b->misaligned;
 
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..8

[dm-devel] [PATCH v10 9/9] null_blk: add support for copy offload

2023-04-20 Thread Nitesh Shetty
Implementation is based on the existing read and write infrastructure.
copy_max_bytes: a new configfs and module parameter is introduced, which
can be used to set the hardware/driver-supported maximum copy limit.

Suggested-by: Damien Le Moal 
Signed-off-by: Anuj Gupta 
Signed-off-by: Nitesh Shetty 
Signed-off-by: Vincent Fu 
---
 drivers/block/null_blk/main.c | 108 --
 drivers/block/null_blk/null_blk.h |   8 +++
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 5d54834d8690..9028aae1efbd 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
 module_param_named(max_sectors, g_max_sectors, int, 0444);
 MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
 
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
 static unsigned int nr_devices = 1;
 module_param(nr_devices, uint, 0444);
 MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
 NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
 NULLB_DEVICE_ATTR(blocksize, uint, NULL);
 NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
 NULLB_DEVICE_ATTR(irqmode, uint, NULL);
 NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
 NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 &nullb_device_attr_queue_mode,
 &nullb_device_attr_blocksize,
 &nullb_device_attr_max_sectors,
+   &nullb_device_attr_copy_max_bytes,
 &nullb_device_attr_irqmode,
 &nullb_device_attr_hw_queue_depth,
 &nullb_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item 
*item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
-   "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+   "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+   "copy_max_bytes\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+   dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1279,78 @@ static int null_transfer(struct nullb *nullb, struct 
page *page,
return err;
 }
 
+static inline void nullb_setup_copy_read(struct nullb *nullb, struct bio *bio)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+   token->subsys = "nullb";
+   token->sector_in = bio->bi_iter.bi_sector;
+   token->nullb = nullb;
+   token->sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+}
+
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+   struct bio *bio, bool is_fua)
+{
+   struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+   sector_t sector_in, sector_out;
+   void *in, *out;
+   size_t rem, temp;
+   unsigned long offset_in, offset_out;
+   struct nullb_page *t_page_in, *t_page_out;
+   int ret = -EIO;
+
+   if (unlikely(memcmp(token->subsys, "nullb", 5)))
+   return -EINVAL;
+   if (unlikely(token->nullb != nullb))
+   return -EINVAL;
+   if (WARN_ON(token->sectors != bio->bi_iter.bi_size >> SECTOR_SHIFT))
+   return -EINVAL;
+
+   sector_in = token->sector_in;
+   sector_out = bio->bi_iter.bi_sector;
+   rem = token->sectors << SECTOR_SHIFT;
+
+   spin_lock_irq(&nullb->lock);
+   while (rem > 0) {
+   temp = min_t(size_t, nullb->dev->blocksize, rem);
+   offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+   offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+   if (null_cache_active(nullb) && !is_fua)
+   null_make_cache_space(nullb, PAGE_SIZE);
+
+   t_page_in = null_lookup_page(nullb, sector_in, false,
+   !null_cache_active(nullb));
+   if (!t_page_in)
+   goto err;
+   t_page_out = null_insert_page(nullb, sector_out,
+   !null_cache_a
