[dm-devel] [PATCH v2 1/4] nvme: return BLK_STS_DO_NOT_RETRY if the DNR bit is set

2021-04-15 Thread Mike Snitzer
If the DNR bit is set we should not retry the command.

We care about the retryable vs. not-retryable distinction at the block
layer, so propagate the equivalent of the DNR bit by introducing
BLK_STS_DO_NOT_RETRY. Update blk_path_error() to return false (i.e. do
_not_ retry) if it is set.

This change runs with the suggestion made here:
https://lore.kernel.org/linux-nvme/20190813170144.ga10...@lst.de/
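
For illustration, this is how a blk_path_error() caller behaves once the
new status exists -- a minimal sketch, not part of this patch, and the
helper name is hypothetical:

    /* Sketch: a multipath consumer deciding whether to retry on another
     * path.  BLK_STS_DO_NOT_RETRY joins the non-path errors for which
     * blk_path_error() returns false, so the error is surfaced to the
     * submitter instead of being retried on another path. */
    static bool should_failover(blk_status_t error)
    {
            if (error == BLK_STS_OK)
                    return false;   /* success, nothing to retry */
            if (!blk_path_error(error))
                    return false;   /* non-path error (incl. DNR): fail upward */
            return true;            /* path-related error: try another path */
    }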

Suggested-by: Christoph Hellwig 
Signed-off-by: Mike Snitzer 
---
 drivers/nvme/host/core.c  | 3 +++
 include/linux/blk_types.h | 8 ++++++++
 2 files changed, 11 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0896e21642be..540d6fd8ffef 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -237,6 +237,9 @@ static void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
 
 static blk_status_t nvme_error_status(u16 status)
 {
+   if (unlikely(status & NVME_SC_DNR))
+   return BLK_STS_DO_NOT_RETRY;
+
switch (status & 0x7ff) {
case NVME_SC_SUCCESS:
return BLK_STS_OK;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..1ca724948c56 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -142,6 +142,13 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_ZONE_ACTIVE_RESOURCE   ((__force blk_status_t)16)
 
+/*
+ * BLK_STS_DO_NOT_RETRY is returned from the driver in the completion path
+ * if the device returns a status indicating that the command is expected
+ * to fail if re-submitted.
+ */
+#define BLK_STS_DO_NOT_RETRY   ((__force blk_status_t)17)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -157,6 +164,7 @@ typedef u8 __bitwise blk_status_t;
 static inline bool blk_path_error(blk_status_t error)
 {
switch (error) {
+   case BLK_STS_DO_NOT_RETRY:
case BLK_STS_NOTSUPP:
case BLK_STS_NOSPC:
case BLK_STS_TARGET:
-- 
2.15.0




[dm-devel] nvme: decouple basic ANA log page re-read support from native multipathing

2021-04-15 Thread Mike Snitzer
BZ: 1948690
Upstream Status: RHEL-only

Signed-off-by: Mike Snitzer 

rhel-8.git commit b904f4b8e0f90613bf1b2b9d9ccad3c015741daf
Author: Mike Snitzer 
Date:   Tue Aug 25 21:52:47 2020 -0400

[nvme] nvme: decouple basic ANA log page re-read support from native 
multipathing

Message-id: <20200825215248.2291-10-snit...@redhat.com>
Patchwork-id: 325179
Patchwork-instance: patchwork
O-Subject: [RHEL8.3 PATCH 09/10] nvme: decouple basic ANA log page re-read 
support from native multipathing
Bugzilla: 1843515
RH-Acked-by: David Milburn 
RH-Acked-by: Gopal Tiwari 
RH-Acked-by: Ewan Milne 

BZ: 1843515
Upstream Status: RHEL-only

Whether or not ANA is present is a choice of the target implementation;
the host (and whether it supports multipathing) has _zero_ influence on
this.  If the target declares a path as 'inaccessible' the path _is_
inaccessible to the host.  As such, ANA support should be functional
even if native multipathing is not.

Introduce the ability to always re-read the ANA log page as required due
to an ANA error, and make the current ANA state available via sysfs --
even if native multipathing is disabled on the host
(e.g. nvme_core.multipath=N).

This affords userspace access to the current ANA state independent of
which layer might be doing multipathing.  It also allows multipath-tools
to rely on the NVMe driver for ANA support while dm-multipath takes care
of multipathing.

And as always, if embedded NVMe users do not want any performance
overhead associated with ANA or native NVMe multipathing they can
disable CONFIG_NVME_MULTIPATH.
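
For reference, the ANA-error classification the code below relies on
(the upstream helper of this era, quoted for context):

    static inline bool nvme_is_ana_error(u16 status)
    {
            switch (status & 0x7ff) {
            case NVME_SC_ANA_TRANSITION:
            case NVME_SC_ANA_INACCESSIBLE:
            case NVME_SC_ANA_PERSISTENT_LOSS:
                    return true;
            default:
                    return false;
            }
    }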

Signed-off-by: Mike Snitzer 
Signed-off-by: Frantisek Hrbata 

---
 drivers/nvme/host/core.c  |2 ++
 drivers/nvme/host/multipath.c |   23 ++++++++++++++++++-----
 drivers/nvme/host/nvme.h  |4 ++++
 3 files changed, 24 insertions(+), 5 deletions(-)

Index: linux-rhel9/drivers/nvme/host/core.c
===================================================================
--- linux-rhel9.orig/drivers/nvme/host/core.c
+++ linux-rhel9/drivers/nvme/host/core.c
@@ -347,6 +347,8 @@ static inline void nvme_end_req_with_fai
if (unlikely(nvme_status & NVME_SC_DNR))
goto out;
 
+   nvme_update_ana(req);
+
if (!blk_path_error(status)) {
pr_debug("Request meant for failover but blk_status_t (errno=%d) was not retryable.\n",
 blk_status_to_errno(status));
Index: linux-rhel9/drivers/nvme/host/multipath.c
===================================================================
--- linux-rhel9.orig/drivers/nvme/host/multipath.c
+++ linux-rhel9/drivers/nvme/host/multipath.c
@@ -65,10 +65,25 @@ void nvme_set_disk_name(char *disk_name,
}
 }
 
+static inline void __nvme_update_ana(struct nvme_ns *ns)
+{
+   if (!ns->ctrl->ana_log_buf)
+   return;
+
+   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
+   queue_work(nvme_wq, &ns->ctrl->ana_work);
+}
+
+
+void nvme_update_ana(struct request *req)
+{
+   if (nvme_is_ana_error(nvme_req(req)->status))
+   __nvme_update_ana(req->q->queuedata);
+}
+
 void nvme_failover_req(struct request *req)
 {
struct nvme_ns *ns = req->q->queuedata;
-   u16 status = nvme_req(req)->status & 0x7ff;
unsigned long flags;
 
nvme_mpath_clear_current_path(ns);
@@ -78,10 +93,8 @@ void nvme_failover_req(struct request *r
 * ready to serve this namespace.  Kick of a re-read of the ANA
 * information page, and just try any other available path for now.
 */
-   if (nvme_is_ana_error(status) && ns->ctrl->ana_log_buf) {
-   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
-   queue_work(nvme_wq, &ns->ctrl->ana_work);
-   }
+   if (nvme_is_ana_error(nvme_req(req)->status))
+   __nvme_update_ana(ns);
 
spin_lock_irqsave(&ns->head->requeue_lock, flags);
blk_steal_bios(&ns->head->requeue_list, req);
Index: linux-rhel9/drivers/nvme/host/nvme.h
===================================================================
--- linux-rhel9.orig/drivers/nvme/host/nvme.h
+++ linux-rhel9/drivers/nvme/host/nvme.h
@@ -664,6 +664,7 @@ void nvme_mpath_start_freeze(struct nvme
 void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
struct nvme_ctrl *ctrl, int *flags);
 void nvme_failover_req(struct request *req);
+void nvme_update_ana(struct request *req);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id);
@@ -714,6 +715,9 @@ static inline void nvme_set_disk_name(ch
 static inline void nvme_failover_req(struct request *req)
 {
 }
+static inline void nvme_update_ana(struct request *req)
+{
+}
 static

[dm-devel] nvme: update failover handling to work with REQ_FAILFAST_TRANSPORT

2021-04-15 Thread Mike Snitzer
BZ: 1948690
Upstream Status: RHEL-only

Signed-off-by: Mike Snitzer 

rhel-8.git commit f8fb6ea1226e2abc525c88da13b346118d548eea
Author: Mike Snitzer 
Date:   Tue Aug 25 21:52:46 2020 -0400

[nvme] nvme: update failover handling to work with REQ_FAILFAST_TRANSPORT

Message-id: <20200825215248.2291-9-snit...@redhat.com>
Patchwork-id: 325177
Patchwork-instance: patchwork
O-Subject: [RHEL8.3 PATCH 08/10] nvme: update failover handling to work 
with REQ_FAILFAST_TRANSPORT
Bugzilla: 1843515
RH-Acked-by: David Milburn 
RH-Acked-by: Gopal Tiwari 
RH-Acked-by: Ewan Milne 

BZ: 1843515
Upstream Status: RHEL-only

If REQ_FAILFAST_TRANSPORT is set it means the driver should not retry
IO that completed with transport errors.  REQ_FAILFAST_TRANSPORT is
set by multipathing software (e.g. dm-multipath) before it issues IO.
Update NVMe to prepare for failover of requests marked with either
REQ_NVME_MPATH or REQ_FAILFAST_TRANSPORT.  This allows such requests
to be given a disposition of FAILOVER.

Introduce nvme_end_req_with_failover() for use in nvme_complete_rq()
if REQ_NVME_MPATH isn't set.  nvme_end_req_with_failover() ensures
request is completed with a retryable IO error when appropriate.
__nvme_end_req() was factored out for use by both nvme_end_req() and
nvme_end_req_with_failover().
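
For context, the disposition helper this patch modifies ends up looking
like this (reconstructed from the hunks below plus the upstream helper
they apply to; treat it as a sketch, not the verbatim RHEL code):

    enum nvme_disposition { COMPLETE, RETRY, FAILOVER };

    static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
    {
            if (likely(nvme_req(req)->status == 0))
                    return COMPLETE;

            if (blk_noretry_request(req) ||
                (nvme_req(req)->status & NVME_SC_DNR) ||
                nvme_req(req)->retries >= nvme_max_retries)
                    return COMPLETE;

            /* failover now covers REQ_FAILFAST_TRANSPORT, not just REQ_NVME_MPATH */
            if (req->cmd_flags & (REQ_NVME_MPATH | REQ_FAILFAST_TRANSPORT)) {
                    if (nvme_is_path_error(nvme_req(req)->status) ||
                        blk_queue_dying(req->q))
                            return FAILOVER;
            } else {
                    if (blk_queue_dying(req->q))
                            return COMPLETE;
            }

            return RETRY;
    }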

Signed-off-by: Mike Snitzer 
Signed-off-by: Frantisek Hrbata 

---
 drivers/nvme/host/core.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

Index: linux-rhel9/drivers/nvme/host/core.c
===================================================================
--- linux-rhel9.orig/drivers/nvme/host/core.c
+++ linux-rhel9/drivers/nvme/host/core.c
@@ -311,7 +311,7 @@ static inline enum nvme_disposition nvme
nvme_req(req)->retries >= nvme_max_retries)
return COMPLETE;
 
-   if (req->cmd_flags & REQ_NVME_MPATH) {
+   if (req->cmd_flags & (REQ_NVME_MPATH | REQ_FAILFAST_TRANSPORT)) {
if (nvme_is_path_error(nvme_req(req)->status) ||
blk_queue_dying(req->q))
return FAILOVER;
@@ -323,10 +323,8 @@ static inline enum nvme_disposition nvme
return RETRY;
 }
 
-static inline void nvme_end_req(struct request *req)
+static inline void __nvme_end_req(struct request *req, blk_status_t status)
 {
-   blk_status_t status = nvme_error_status(nvme_req(req)->status);
-
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
req_op(req) == REQ_OP_ZONE_APPEND)
req->__sector = nvme_lba_to_sect(req->q->queuedata,
@@ -336,6 +334,28 @@ static inline void nvme_end_req(struct r
blk_mq_end_request(req, status);
 }
 
+static inline void nvme_end_req(struct request *req)
+{
+   __nvme_end_req(req, nvme_error_status(nvme_req(req)->status));
+}
+
+static inline void nvme_end_req_with_failover(struct request *req)
+{
+   u16 nvme_status = nvme_req(req)->status;
+   blk_status_t status = nvme_error_status(nvme_status);
+
+   if (unlikely(nvme_status & NVME_SC_DNR))
+   goto out;
+
+   if (!blk_path_error(status)) {
+           pr_debug("Request meant for failover but blk_status_t (errno=%d) was not retryable.\n",
+blk_status_to_errno(status));
+   status = BLK_STS_IOERR;
+   }
+out:
+   __nvme_end_req(req, status);
+}
+
 void nvme_complete_rq(struct request *req)
 {
trace_nvme_complete_rq(req);
@@ -352,7 +372,10 @@ void nvme_complete_rq(struct request *re
nvme_retry_req(req);
return;
case FAILOVER:
-   nvme_failover_req(req);
+   if (req->cmd_flags & REQ_NVME_MPATH)
+   nvme_failover_req(req);
+   else
+   nvme_end_req_with_failover(req);
return;
}
 }




[dm-devel] nvme: allow retry for requests with REQ_FAILFAST_TRANSPORT set

2021-04-15 Thread Mike Snitzer
BZ: 1948690
Upstream Status: RHEL-only

Signed-off-by: Mike Snitzer 

rhel-8.git commit 7dadadb072515f243868e6fe2f7e9c97fd3516c9
Author: Mike Snitzer 
Date:   Tue Aug 25 21:52:48 2020 -0400

[nvme] nvme: allow retry for requests with REQ_FAILFAST_TRANSPORT set

Message-id: <20200825215248.2291-11-snit...@redhat.com>
Patchwork-id: 325180
Patchwork-instance: patchwork
O-Subject: [RHEL8.3 PATCH 10/10] nvme: allow retry for requests with 
REQ_FAILFAST_TRANSPORT set
Bugzilla: 1843515
RH-Acked-by: David Milburn 
RH-Acked-by: Gopal Tiwari 
RH-Acked-by: Ewan Milne 

BZ: 1843515
Upstream Status: RHEL-only

Based on a patch that was proposed upstream but ultimately rejected, see:
https://www.spinics.net/lists/linux-block/msg57490.html

I'd have made this change even if it hadn't already been posted, but I
figured I'd give proper attribution given their public post of the same
code change.

Author: Chao Leng 
Date:   Wed Aug 12 16:18:55 2020 +0800

nvme: allow retry for requests with REQ_FAILFAST_TRANSPORT set

REQ_FAILFAST_TRANSPORT was arguably designed for SCSI, because the SCSI
protocol does not define a local retry mechanism. SCSI implements a
fuzzy local retry mechanism, so REQ_FAILFAST_TRANSPORT is needed to
allow higher-level multipathing software to perform failover/retry.

NVMe differs from SCSI in this respect. It defines a local retry
mechanism and path error codes, so NVMe should retry locally for
non-path errors. For path-related errors, whether to retry and how to
retry is still determined by the higher-level multipathing's failover.

Unlike SCSI, NVMe shouldn't prevent retry just because
REQ_FAILFAST_TRANSPORT is set, because NVMe's local retry is needed --
as is NVMe-specific logic to categorize whether an error is path
related.

With this change, NVMe multipath and other multipathing solutions
behave equivalently: non-path errors are retried locally, and
path-related errors are handled by multipath.
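
For reference, blk_noretry_request() tests all three failfast flags (per
the include/linux/blkdev.h definition of this era), which is why the
hunk below open-codes only REQ_FAILFAST_DEV and REQ_FAILFAST_DRIVER:

    #define blk_noretry_request(rq) \
            ((rq)->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | \
                                REQ_FAILFAST_DRIVER))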

Signed-off-by: Chao Leng 
[snitzer: edited header for grammar and to make clearer]
Signed-off-by: Mike Snitzer 

Signed-off-by: Mike Snitzer 
Signed-off-by: Frantisek Hrbata 

---
 drivers/nvme/host/core.c |9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux-rhel9/drivers/nvme/host/core.c
===================================================================
--- linux-rhel9.orig/drivers/nvme/host/core.c
+++ linux-rhel9/drivers/nvme/host/core.c
@@ -306,7 +306,14 @@ static inline enum nvme_disposition nvme
if (likely(nvme_req(req)->status == 0))
return COMPLETE;
 
-   if (blk_noretry_request(req) ||
+   /*
+* REQ_FAILFAST_TRANSPORT is set by upper layer software that
+* handles multipathing. Unlike SCSI, NVMe's error handling was
+* specifically designed to handle local retry for non-path errors.
+* As such, allow NVMe's local retry mechanism to be used for
+* requests marked with REQ_FAILFAST_TRANSPORT.
+*/
+   if ((req->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_DRIVER)) ||
(nvme_req(req)->status & NVME_SC_DNR) ||
nvme_req(req)->retries >= nvme_max_retries)
return COMPLETE;




[dm-devel] nvme: Return BLK_STS_TARGET if the DNR bit is set

2021-04-15 Thread Mike Snitzer
BZ: 1948690
Upstream Status: RHEL-only

Signed-off-by: Mike Snitzer 

rhel-8.git commit ef4ab90c12db5e0e50800ec323736b95be7a6ff5
Author: Mike Snitzer 
Date:   Tue Aug 25 21:52:45 2020 -0400

[nvme] nvme: Return BLK_STS_TARGET if the DNR bit is set

Message-id: <20200825215248.2291-8-snit...@redhat.com>
Patchwork-id: 325178
Patchwork-instance: patchwork
O-Subject: [RHEL8.3 PATCH 07/10] nvme: Return BLK_STS_TARGET if the DNR bit 
is set
Bugzilla: 1843515
RH-Acked-by: David Milburn 
RH-Acked-by: Gopal Tiwari 
RH-Acked-by: Ewan Milne 

BZ: 1843515
Upstream Status: RHEL-only

If the DNR bit is set we should not retry the command, even if the
standard status evaluation would otherwise indicate a retry.
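
For reference, the bit being tested (per upstream include/linux/nvme.h):

    NVME_SC_DNR = 0x4000,   /* Do Not Retry bit in the NVMe status field */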

SUSE is carrying this patch in their kernel:
https://lwn.net/Articles/800370/

Based on a patch posted for upstream inclusion but rejected:
v1: https://lore.kernel.org/linux-nvme/20190806111036.113233-1-h...@suse.de/
v2: https://lore.kernel.org/linux-nvme/20190807071208.101882-1-h...@suse.de/
v2-keith: 
https://lore.kernel.org/linux-nvme/20190807144725.GB25621@localhost.localdomain/
v3: https://lore.kernel.org/linux-nvme/20190812075147.79598-1-h...@suse.de/
v3-keith: 
https://lore.kernel.org/linux-nvme/20190813141510.GB32686@localhost.localdomain/

This commit's change is basically "v3-keith".

    Signed-off-by: Mike Snitzer 
Signed-off-by: Frantisek Hrbata 

---
 drivers/nvme/host/core.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-rhel9/drivers/nvme/host/core.c
===================================================================
--- linux-rhel9.orig/drivers/nvme/host/core.c
+++ linux-rhel9/drivers/nvme/host/core.c
@@ -237,6 +237,9 @@ static void nvme_delete_ctrl_sync(struct
 
 static blk_status_t nvme_error_status(u16 status)
 {
+   if (unlikely(status & NVME_SC_DNR))
+   return BLK_STS_TARGET;
+
switch (status & 0x7ff) {
case NVME_SC_SUCCESS:
return BLK_STS_OK;




[dm-devel] [git pull] device mapper fix for 5.12 final

2021-04-14 Thread Mike Snitzer
Hi Linus,

The following changes since commit d434405aaab7d0ebc516b68a8fc4100922d7f5ef:

  Linux 5.12-rc7 (2021-04-11 15:16:13 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.12/dm-fixes-3

for you to fetch changes up to 8ca7cab82bda4eb0b8064befeeeaa38106cac637:

  dm verity fec: fix misaligned RS roots IO (2021-04-14 14:28:29 -0400)

Please pull, thanks.
Mike


Fix DM verity target FEC support's RS roots IO to always be
aligned. This fixes a previous stable@ fix that overcorrected for a
different configuration that also resulted in misaligned roots IO.


Jaegeuk Kim (1):
  dm verity fec: fix misaligned RS roots IO

 drivers/md/dm-verity-fec.c | 11 ++++++++---
 drivers/md/dm-verity-fec.h |  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)




Re: [dm-devel] [PATCH v8 0/4] block device interposer

2021-04-09 Thread Mike Snitzer
On Fri, Apr 09 2021 at  7:48am -0400,
Sergei Shtepa  wrote:

> I think I'm ready to suggest the next version of the block device
> interposer (blk_interposer). It allows redirecting bio requests to other
> block devices.
> 
> In this series of patches, I reviewed the process of attaching and
> detaching device mapper via blk_interposer.
> 
> Now the dm-target is attached to the interposed block device when the
> interposer dm-target is fully ready to accept requests, and the interposed
> block device queue is locked, and the file system on it is frozen.
> The detaching is also performed when the file system on the interposed
> block device is in a frozen state, the queue is locked, and the interposer
> dm-target is suspended.
> 
> To make it possible to lock the receipt of new bio requests without locking
> the processing of bio requests that the interposer creates, I had to change
> the submit_bio_noacct() function and add a lock. To minimize the impact of
> locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
> afraid it's impossible.
> 
> Checking the operation of the interposer, I did not limit myself to
> a simple dm-linear. When I experimented with dm-era, I noticed that it
> accepts two block devices. Since Mike was against changing the logic in
> the dm-targets themselves to support the interposer, I decided to add the
> [interpose] option to the block device path.
> 
>  echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
>   dmsetup create dm-era --interpose
> 
> I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
> we could assume that if the device cannot be opened with FMODE_EXCL,
> then it is considered an interposed device, but that algorithm seems
> unsafe to me. I hope to get Mike's opinion on this.
> 
> I have successfully tried taking snapshots. But I ran into a problem
> when I removed origin-target:
> [   49.031156] [ cut here ]
> [   49.031180] kernel BUG at block/bio.c:1476!
> [   49.031198] invalid opcode: 0000 [#1] SMP NOPTI
> [   49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: GE 
> 5.12.0-rc6-ip+ #52
> [   49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
> VirtualBox 12/01/2006
> [   49.031257] RIP: 0010:bio_split+0x74/0x80
> [   49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff 
> ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 
> 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
> [   49.031322] RSP: 0018:9a6100993ab0 EFLAGS: 00010246
> [   49.031337] RAX: 0008 RBX:  RCX: 
> 8e26938f96d8
> [   49.031357] RDX: 0c00 RSI:  RDI: 
> 8e26937d1300
> [   49.031375] RBP: 8e2692ddc000 R08:  R09: 
> 
> [   49.031394] R10: 8e2692b1de00 R11: 8e2692b1de58 R12: 
> 8e26937d1300
> [   49.031413] R13: 8e2692ddcd18 R14: 8e2691d22140 R15: 
> 8e26937d1300
> [   49.031432] FS:  7efffa6e7800() GS:8e269bc8() 
> knlGS:
> [   49.031453] CS:  0010 DS:  ES:  CR0: 80050033
> [   49.031470] CR2: 7efffa96cda0 CR3: 000114bd CR4: 
> 000506e0
> [   49.031490] Call Trace:
> [   49.031501]  dm_submit_bio+0x383/0x500 [dm_mod]
> [   49.031522]  submit_bio_noacct+0x370/0x770
> [   49.031537]  submit_bh_wbc+0x160/0x190
> [   49.031550]  __sync_dirty_buffer+0x65/0x130
> [   49.031564]  ext4_commit_super+0xbc/0x120 [ext4]
> [   49.031602]  ext4_freeze+0x54/0x80 [ext4]
> [   49.031631]  freeze_super+0xc8/0x160
> [   49.031643]  freeze_bdev+0xb2/0xc0
> [   49.031654]  lock_bdev_fs+0x1c/0x30 [dm_mod]
> [   49.031671]  __dm_suspend+0x2b9/0x3b0 [dm_mod]
> [   49.032095]  dm_suspend+0xed/0x160 [dm_mod]
> [   49.032496]  ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
> [   49.032897]  ? remove_all+0x30/0x30 [dm_mod]
> [   49.033299]  dev_remove+0x4c/0x1c0 [dm_mod]
> [   49.033679]  ctl_ioctl+0x1a5/0x470 [dm_mod]
> [   49.034067]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
> [   49.034432]  __x64_sys_ioctl+0x83/0xb0
> [   49.034785]  do_syscall_64+0x33/0x80
> [   49.035139]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> When suspend is executed for the origin target before the interposer is
> detached, the value of o->split_boundary in the origin_map() function is
> zero, since no snapshots were connected to it.
> I think that if no snapshots are connected, then it does not make sense
> to split the bio request into parts.
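
A minimal sketch of the guard being suggested, assuming dm-snap's
origin_map() and its split-boundary field (untested, for illustration
only):

    /* In origin_map(), before computing the split size: if no snapshot
     * has set a split boundary, pass the bio through whole. */
    if (!o->split_boundary)
            return DM_MAPIO_REMAPPED;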

The dm-snapshot code requires careful order of operations.  You say you
removed the origin target... please show exactly what you did.  Your 4th
patch shouldn't be tied to this patchset; it can be dealt with
independently.

> Changes summary for this patchset v7:
>   * The attaching and detaching to interposed device moved to
> __dm_suspend() and __dm_resume() functions.

Why? Those hooks are inherently more constrained.  And in the 

[dm-devel] [git pull] device mapper fixes for 5.12-rc5

2021-03-26 Thread Mike Snitzer
Hi Linus,

The following changes since commit 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b:

  Linux 5.12-rc4 (2021-03-21 14:56:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.12/dm-fixes-2

for you to fetch changes up to 4edbe1d7bcffcd6269f3b5eb63f710393ff2ec7a:

  dm ioctl: fix out of bounds array access when no devices (2021-03-26 14:51:50 
-0400)

Please pull, thanks.
Mike


- Fix DM verity target's optional argument processing.

- Fix DM core's zoned model and zone sectors checks.

- Fix spurious "detected capacity change" pr_info() when creating new
  DM device.

- Fix DM ioctl out of bounds array access in handling of
  DM_LIST_DEVICES_CMD when no devices exist.


JeongHyeon Lee (1):
  dm verity: fix DM_VERITY_OPTS_MAX value

Mikulas Patocka (2):
  dm: don't report "detected capacity change" on device creation
  dm ioctl: fix out of bounds array access when no devices

Shin'ichiro Kawasaki (1):
  dm table: Fix zoned model check and zone sectors check

 drivers/md/dm-ioctl.c |  2 +-
 drivers/md/dm-table.c | 33 +
 drivers/md/dm-verity-target.c |  2 +-
 drivers/md/dm-zoned-target.c  |  2 +-
 drivers/md/dm.c   |  5 -
 include/linux/device-mapper.h | 15 ++-
 6 files changed, 46 insertions(+), 13 deletions(-)




Re: [dm-devel] [PATCH V3 13/13] dm: support IO polling for bio-based dm device

2021-03-25 Thread Mike Snitzer
On Wed, Mar 24 2021 at  8:19am -0400,
Ming Lei  wrote:

> From: Jeffle Xu 
> 
> IO polling is enabled when all underlying target devices are capable
> of IO polling. The sanity check supports the stacked device model, in
> which one dm device may be built upon another dm device. In this case,
> the mapped device will check if the underlying dm target device
> supports IO polling.
> 
> Signed-off-by: Jeffle Xu 
> Signed-off-by: Ming Lei 
> ---
>  drivers/md/dm-table.c | 24 
>  drivers/md/dm.c   | 14 ++
>  include/linux/device-mapper.h |  1 +
>  3 files changed, 39 insertions(+)
> 

...

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..fe6893b078dc 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>   return ret;
>  }
>  
> +static bool dm_bio_poll_capable(struct gendisk *disk)
> +{
> + int ret, srcu_idx;
> + struct mapped_device *md = disk->private_data;
> + struct dm_table *t;
> +
> + t = dm_get_live_table(md, &srcu_idx);
> + ret = dm_table_supports_poll(t);
> + dm_put_live_table(md, srcu_idx);
> +
> + return ret;
> +}
> +

I know this code will only get called by blk-core if bio-based but there
isn't anything about this method's implementation that is inherently
bio-based only.

So please rename from dm_bio_poll_capable to dm_poll_capable

Other than that:

Reviewed-by: Mike Snitzer 

>  /*-
>   * An IDR is used to keep track of allocated minor numbers.
>   *---*/
> @@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
>  };
>  
>  static const struct block_device_operations dm_blk_dops = {
> + .poll_capable = dm_bio_poll_capable,
>   .submit_bio = dm_submit_bio,
>   .open = dm_blk_open,
>   .release = dm_blk_close,
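
dm_table_supports_poll() itself isn't shown in this excerpt; a plausible
implementation, assuming the iterate_devices-based capability-check
pattern dm tables use elsewhere (both helper names here are assumptions,
not the posted code):

    static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
                                       sector_t start, sector_t len, void *data)
    {
            struct request_queue *q = bdev_get_queue(dev->bdev);

            return !blk_queue_poll(q);
    }

    bool dm_table_supports_poll(struct dm_table *t)
    {
            /* poll only if no underlying device lacks poll support */
            return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
    }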




Re: [dm-devel] md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-25 Thread Mike Snitzer
On Wed, Mar 24 2021 at  9:21pm -0400,
Zhiqiang Liu  wrote:

> 
> 
> On 2021/3/22 22:22, Mike Snitzer wrote:
> > On Mon, Mar 22 2021 at  4:11am -0400,
> > Christoph Hellwig  wrote:
> > 
> >> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
> >>> From: Zhiqiang Liu 
> >>>
> >>> When we make IO stress test on multipath device, there will
> >>> be a metadata err because of wrong path. In the test, we
> >>> concurrent execute 'iscsi device login|logout' and
> >>> 'multipath -r' command with IO stress on multipath device.
> >>> In some case, systemd-udevd may have not time to process
> >>> uevents of iscsi device logout|login, and then 'multipath -r'
> >>> command triggers multipathd daemon calls ioctl to load table
> >>> with incorrect old device info from systemd-udevd.
> >>> Then, one iscsi path may be incorrectly attached to another
> >>> multipath which has different uuid. Finally, the metadata err
> >>> occurs when umounting filesystem to down write metadata on
> >>> the iscsi device which is actually not owned by the multipath
> >>> device.
> >>>
> >>> So we need to check whether all pgpaths of one multipath have
> >>> the same uuid, if not, we should throw a error.
> >>>
> >>> Signed-off-by: Zhiqiang Liu 
> >>> Signed-off-by: lixiaokeng 
> >>> Signed-off-by: linfeilong 
> >>> Signed-off-by: Wubo 
> >>> ---
> >>>  drivers/md/dm-mpath.c   | 52 +
> >>>  drivers/scsi/scsi_lib.c |  1 +
> >>>  2 files changed, 53 insertions(+)
> >>>
> >>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> >>> index bced42f082b0..f0b995784b53 100644
> >>> --- a/drivers/md/dm-mpath.c
> >>> +++ b/drivers/md/dm-mpath.c
> >>> @@ -24,6 +24,7 @@
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> +#include 
> >>>  #include 
> >>>  #include 
> >>>
> >>> @@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, 
> >>> struct multipath *m)
> >>>   return r;
> >>>  }
> >>>
> >>> +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
> >>> +#define MPATH_UUID_PREFIX_LEN 7
> >>> +static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
> >>> +{
> >>> + char pgpath_uuid[DM_UUID_LEN] = {0};
> >>> + struct request_queue *q;
> >>> + struct pgpath *pgpath;
> >>> + struct scsi_device *sdev;
> >>> + ssize_t count;
> >>> + int r = 0;
> >>> +
> >>> + list_for_each_entry(pgpath, &pg->pgpaths, list) {
> >>> + q = bdev_get_queue(pgpath->path.dev->bdev);
> >>> + sdev = scsi_device_from_queue(q);
> >>
> >> Common dm-multipath code should never poke into scsi internals.  This
> >> is something for the device handler to check.  It probably also won't
> >> work for all older devices.
> > 
> > Definitely.
> > 
> > But that aside, userspace (multipathd) _should_ be able to do extra
> > validation, _before_ pushing down a new table to the kernel, rather than
> > forcing the kernel to do it.
> 
> As your said, it is better to do extra validation in userspace (multipathd).
> However, in some cases, the userspace cannot see the real-time present devices
> info as Martin (committer of multipath-tools) said.
> In addition, the kernel can see right device info in the table at any time,
> so the uuid check in kernel can ensure one multipath is composed with paths 
> mapped to
> the same device.
> 
> Considering the severity of the wrong path in multipath, I think it worths 
> more
> checking.

As already said: this should be fixable in userspace.  Please work with
multipath-tools developers to address this.

Mike




Re: [dm-devel] dm-integrity - add the "reset_recalculate" flag

2021-03-23 Thread Mike Snitzer
On Tue, Mar 23 2021 at 10:59am -0400,
Mikulas Patocka  wrote:

> This patch adds a new flag "reset_recalculate" that will restart
> recalculating from the beginning of the device. It can be used if we want
> to change the hash function. Example:
> 
> #!/bin/sh
> dmsetup remove_all
> rmmod brd
> set -e
> modprobe brd rd_size=1048576
> dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 internal_hash:sha256 recalculate'
> sleep 10
> dmsetup status
> dmsetup remove in
> dmsetup create in --table '0 200 integrity /dev/ram0 0 16 J 2 internal_hash:sha3-256 reset_recalculate'
> 
> Signed-off-by: Mikulas Patocka 
> 
> Index: linux-2.6/drivers/md/dm-integrity.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-integrity.c
> +++ linux-2.6/drivers/md/dm-integrity.c
> @@ -262,6 +262,7 @@ struct dm_integrity_c {
>   bool journal_uptodate;
>   bool just_formatted;
>   bool recalculate_flag;
> + bool reset_recalculate_flag;
>   bool discard;
>   bool fix_padding;
>   bool fix_hmac;
> @@ -3134,7 +3135,8 @@ static void dm_integrity_resume(struct d
>   rw_journal_sectors(ic, REQ_OP_READ, 0, 0,
>  ic->n_bitmap_blocks * (BITMAP_BLOCK_SIZE >> SECTOR_SHIFT), NULL);
>   if (ic->mode == 'B') {
> - if (ic->sb->log2_blocks_per_bitmap_bit == ic->log2_blocks_per_bitmap_bit) {
> + if (ic->sb->log2_blocks_per_bitmap_bit == ic->log2_blocks_per_bitmap_bit &&
> +     !ic->reset_recalculate_flag) {
>   block_bitmap_copy(ic, ic->recalc_bitmap, ic->journal);
>   block_bitmap_copy(ic, ic->may_write_bitmap, ic->journal);
>   if (!block_bitmap_op(ic, ic->journal, 0, ic->provided_data_sectors,
> @@ -3156,7 +3158,8 @@ static void dm_integrity_resume(struct d
>   }
>   } else {
>   if (!(ic->sb->log2_blocks_per_bitmap_bit == ic->log2_blocks_per_bitmap_bit &&
> -   block_bitmap_op(ic, ic->journal, 0, ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR))) {
> +   block_bitmap_op(ic, ic->journal, 0, ic->provided_data_sectors, BITMAP_OP_TEST_ALL_CLEAR)) ||
> +     ic->reset_recalculate_flag) {
>   ic->sb->flags |= cpu_to_le32(SB_FLAG_RECALCULATING);
>   ic->sb->recalc_sector = cpu_to_le64(0);
>   }
> @@ -3169,6 +3172,10 @@ static void dm_integrity_resume(struct d
>   dm_integrity_io_error(ic, "writing superblock", r);
>   } else {
>   replay_journal(ic);
> + if (ic->reset_recalculate_flag) {
> + ic->sb->flags |= cpu_to_le32(SB_FLAG_RECALCULATING);
> + ic->sb->recalc_sector = cpu_to_le64(0);
> + }
>   if (ic->mode == 'B') {
>   ic->sb->flags |= cpu_to_le32(SB_FLAG_DIRTY_BITMAP);
>   ic->sb->log2_blocks_per_bitmap_bit = ic->log2_blocks_per_bitmap_bit;
> @@ -3242,6 +3249,7 @@ static void dm_integrity_status(struct d
>   arg_count += !!ic->meta_dev;
>   arg_count += ic->sectors_per_block != 1;
>   arg_count += !!(ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING));
> + arg_count += ic->reset_recalculate_flag;
>   arg_count += ic->discard;
>   arg_count += ic->mode == 'J';
>   arg_count += ic->mode == 'J';
> @@ -3261,6 +3269,8 @@ static void dm_integrity_status(struct d
>   DMEMIT(" block_size:%u", ic->sectors_per_block << SECTOR_SHIFT);
>   if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING))
>   DMEMIT(" recalculate");
> + if (ic->reset_recalculate_flag)
> + DMEMIT(" reset_recalculate");
>   if (ic->discard)
>   DMEMIT(" allow_discards");
>   DMEMIT(" journal_sectors:%u", ic->initial_sectors - SB_SECTORS);
> @@ -4058,6 +4068,9 @@ static int dm_integrity_ctr(struct dm_ta
>   goto bad;
>   } else if (!strcmp(opt_string, "recalculate")) {
>   ic->recalculate_flag = true;
> + } else if (!strcmp(opt_string, "reset_recalculate")) {
> + ic->recalculate_flag = true;
> + ic->reset_recalculate_flag = true;
>   } else if (!strcmp(opt_string, "allow_discards")) {
>   ic->discard = true;
>   } else if (!strcmp(opt_string, "fix_padding")) {

Do you need to bump the number of feature args supported (from 17 to
18)?

Mike
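
For context, dm targets bound their feature args with a dm_arg table in
the constructor; a sketch assuming dm-integrity follows the usual
pattern (the current max of 17 is taken from the question above):

    static const struct dm_arg _args[] = {
            {0, 17, "Invalid number of feature args"},  /* would need to become 18 */
    };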




Re: [dm-devel] md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-22 Thread Mike Snitzer
On Mon, Mar 22 2021 at  4:11am -0400,
Christoph Hellwig  wrote:

> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
> > From: Zhiqiang Liu 
> > 
> > When we make IO stress test on multipath device, there will
> > be a metadata err because of wrong path. In the test, we
> > concurrent execute 'iscsi device login|logout' and
> > 'multipath -r' command with IO stress on multipath device.
> > In some case, systemd-udevd may have not time to process
> > uevents of iscsi device logout|login, and then 'multipath -r'
> > command triggers multipathd daemon calls ioctl to load table
> > with incorrect old device info from systemd-udevd.
> > Then, one iscsi path may be incorrectly attached to another
> > multipath which has different uuid. Finally, the metadata err
> > occurs when umounting filesystem to down write metadata on
> > the iscsi device which is actually not owned by the multipath
> > device.
> > 
> > So we need to check whether all pgpaths of one multipath have
> > the same uuid, if not, we should throw a error.
> > 
> > Signed-off-by: Zhiqiang Liu 
> > Signed-off-by: lixiaokeng 
> > Signed-off-by: linfeilong 
> > Signed-off-by: Wubo 
> > ---
> >  drivers/md/dm-mpath.c   | 52 +
> >  drivers/scsi/scsi_lib.c |  1 +
> >  2 files changed, 53 insertions(+)
> > 
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index bced42f082b0..f0b995784b53 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -24,6 +24,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> > 
> > @@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, 
> > struct multipath *m)
> > return r;
> >  }
> > 
> > +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
> > +#define MPATH_UUID_PREFIX_LEN 7
> > +static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
> > +{
> > +   char pgpath_uuid[DM_UUID_LEN] = {0};
> > +   struct request_queue *q;
> > +   struct pgpath *pgpath;
> > +   struct scsi_device *sdev;
> > +   ssize_t count;
> > +   int r = 0;
> > +
> > +   list_for_each_entry(pgpath, &pg->pgpaths, list) {
> > +   q = bdev_get_queue(pgpath->path.dev->bdev);
> > +   sdev = scsi_device_from_queue(q);
> 
> Common dm-multipath code should never poke into scsi internals.  This
> is something for the device handler to check.  It probably also won't
> work for all older devices.

Definitely.

But that aside, userspace (multipathd) _should_ be able to do extra
validation, _before_ pushing down a new table to the kernel, rather than
forcing the kernel to do it.




Re: [dm-devel] [RFC PATCH V2 00/13] block: support bio based io polling

2021-03-19 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> Hi,
> 
> Add per-task io poll context for holding HIPRI blk-mq/underlying bios
> queued from bio based driver's io submission context, and reuse one bio
> padding field for storing 'cookie' returned from submit_bio() for these
> bios. Also explicitly end these bios in poll context by adding two
> new bio flags.
> 
> In this way, we needn't to poll all underlying hw queues any more,
> which is implemented in Jeffle's patches. And we can just poll hw queues
> in which there is HIPRI IO queued.
> 
> Usually io submission and io poll share same context, so the added io
> poll context data is just like one stack variable, and the cost for
> saving bios is cheap.
> 
> Any comments are welcome.

I really like your approach and am very encouraged by the early results
Jeffle has shared.

Please review my various nits for your next iteration of this patchset.
But I think you aren't far from these changes being ready to make the
5.13 merge, which is really pretty awesome.

Outstanding job Ming, thanks so much for taking on this line of work!

Mike




Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-19 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> Currently bio based IO poll needs to poll all hw queue blindly, this way
> is very inefficient, and the big reason is that we can't pass bio
> submission result to io poll task.

This is awkward because bio-based IO polling doesn't exist upstream yet,
so this header should be covering your approach as a clean slate, e.g.:

The complexity associated with frequent bio splitting with bio-based
devices makes it difficult to implement IO polling efficiently because
the fan-out of underlying hw queues that need to be polled (as a
side-effect of bios being split) creates a need for more easily mapping
a group of bios to the hw queues that need to be polled.

> In IO submission context, track associated underlying bios by per-task
> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> and return current->pid to caller of submit_bio() for any bio based
> driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of submission
> context, and we can find the bio from that submission context. Moving

Maybe be more precise by covering how all bios from that task's
submission context will be moved to poll queue of poll context?

> bio from submission queue to poll queue of the poll context, and keep
> polling until these bios are ended. Remove bio from poll queue if the
> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> 
> In previous version, kfifo is used to implement submission queue, and
> Jeffle Xu found that kfifo can't scale well in case of high queue depth.

Awkward to reference "previous version", maybe instead say:

It was found that kfifo doesn't scale well for a submission queue as
queue depth is increased, so a new mechanism for tracking bios is
needed.

> So far bio's size is close to 2 cacheline size, and it may not be
> accepted to add new field into bio for solving the scalability issue by
> tracking bios via linked list, switch to bio group list for tracking bio,
> the idea is to reuse .bi_end_io for linking bios into a linked list for
> all sharing same .bi_end_io(call it bio group), which is recovered before
> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> provide very limited groups, such as 32 for fixing the scalability issue.
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/bio.c   |   5 ++
>  block/blk-core.c  | 149 +++-
>  block/blk-mq.c| 173 +-
>  block/blk.h   |   9 ++
>  include/linux/blk_types.h |  16 +++-
>  5 files changed, 348 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> + /* BIO_END_BY_POLL has to be set before calling submit_bio */
> + if (bio_flagged(bio, BIO_END_BY_POLL)) {
> + bio_set_flag(bio, BIO_DONE);
> + return;
> + }
>  again:
>   if (!bio_remaining_done(bio))
>   return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index efc7a61a84b4..778d25a7e76c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>   sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> + return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> + int i;
> + struct bio_grp_list_data *grp;
> +
> + for (i = 0; i < list->nr_grps; i++) {
> + grp = &list->head[i];
> + if (grp->grp_data == bio_grp_data(bio)) {
> + __bio_grp_list_add(&grp->list, bio);
> + return true;
> + }
> + }
> +
> + if (i == list->max_nr_grps)
> + return false;
> +
> + /* create a new group */
> + grp = &list->head[i];
> + bio_list_init(&grp->list);
> + grp->grp_data = bio_grp_data(bio);
> + __bio_grp_list_add(&grp->list, bio);
> + list->nr_grps++;
> +
> + return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> + int i;
> + struct bio_grp_list_data *grp;
> +
> + for (i = 0; i < list->max_nr_grps; i++) {
> + grp = &list->head[i];
> + if (grp->grp_data == grp_data)
> + return i;
> + }
> + for (i = 0; i < list->max_nr_grps; i++) {
> +   
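
The group-list structures these helpers operate on aren't shown in the
excerpt; their layout can be inferred from bio_grp_list_size() and the
accessors above (field order is an assumption):

    struct bio_grp_list_data {
            void            *grp_data;      /* the group's shared ->bi_end_io */
            struct bio_list list;           /* bios belonging to this group */
    };

    struct bio_grp_list {
            unsigned int    max_nr_grps;    /* fixed capacity, e.g. 32 */
            unsigned int    nr_grps;        /* groups currently in use */
            struct bio_grp_list_data head[];
    };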

Re: [dm-devel] [RFC PATCH V2 06/13] block: add new field into 'struct bvec_iter'

2021-03-19 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> There is a hole at the end of 'struct bvec_iter', so put a new field
> here and we can save cookie returned from submit_bio() here for
> supporting bio based polling.
> 
> This way can avoid to extend bio unnecessarily.
> 
> Signed-off-by: Ming Lei 
> ---
>  include/linux/bvec.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index ff832e698efb..61c0f55f7165 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -43,6 +43,15 @@ struct bvec_iter {
>  
>   unsigned intbi_bvec_done;   /* number of bytes completed in
>  current bvec */
> +
> + /*
> +  * There is a hole at the end of bvec_iter, define one filed to

s/filed/field/

> +  * hold something which isn't relate with 'bvec_iter', so that we can

s/relate/related/
or
s/isn't relate with/doesn't relate to/

> +  * avoid to extend bio. So far this new field is used for bio based

s/to extend/extending/

> +  * pooling, we will store returning value of underlying queue's

s/pooling/polling/

> +  * submit_bio() here.
> +  */
> + unsigned intbi_private_data;
>  };
>  
>  struct bvec_iter_all {
> -- 
> 2.29.2
> 




Re: [dm-devel] [RFC PATCH V2 05/13] block: add req flag of REQ_TAG

2021-03-19 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> Add one req flag REQ_TAG which will be used in the following patch for
> supporting bio based IO polling.

"REQ_TAG" is so generic yet is used in such a specific way (to mark an
FS bio as having polling context)

I don't have a great suggestion for a better name, just seems "REQ_TAG"
is lacking... (especially given the potential for confusion due to
blk-mq's notion of "tag").

REQ_FS? REQ_FS_CTX? REQ_POLL? REQ_POLL_CTX? REQ_NAMING_IS_HARD :)

Maybe others have better ideas?

Mike

> This flag lets us do the following:
> 
> 1) The request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> as REQ_TAG, all bios cloned from this FS bio will be marked as REQ_TAG.
> 
> 2) Create a per-task io polling context if the bio based queue supports
> polling and the submitted bio is HIPRI. This per-task io polling context is
> created during submit_bio() before marking this HIPRI bio as REQ_TAG. We
> can then avoid creating such an io polling context if a cloned bio with
> REQ_TAG is submitted from another kernel context.
> 
> 3) For bio based io polling, we need to poll IOs from all underlying queues
> of the bio device/driver; this helps us recognize which IOs need to be
> polled in bio based style, which will be implemented in the next patch.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-core.c  | 29 +++--
>  include/linux/blk_types.h |  4 
>  2 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0b00c21cbefb..efc7a61a84b4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>  static inline void blk_bio_poll_preprocess(struct request_queue *q,
>   struct bio *bio)
>  {
> + bool mq;
> +
>   if (!(bio->bi_opf & REQ_HIPRI))
>   return;
>  
> - if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> + /*
> +  * Can't support bio based IO poll without per-task poll queue
> +  *
> +  * Now we have created per-task io poll context, and mark this
> +  * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
> +  * submitted from another kernel context, we won't create bio
> +  * poll context for it, so that bio will be completed by IRQ;
> +  * 2) If such bio is submitted from current context, we will
> +  * complete it via blk_poll(); 3) If driver knows that one
> +  * underlying bio allocated from driver is for FS bio, meantime
> +  * it is submitted in current context, driver can mark such bio
> +  * as REQ_TAG manually, so the bio can be completed via blk_poll
> +  * too.
> +  */
> + mq = queue_is_mq(q);
> + if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>   bio->bi_opf &= ~REQ_HIPRI;
> + else if (!mq)
> + bio->bi_opf |= REQ_TAG;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -893,9 +912,15 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  
>   /*
>* Created per-task io poll queue if we supports bio polling
> -  * and it is one HIPRI bio.
> +  * and it is one HIPRI bio, and this HIPRI bio has to be from
> +  * FS. If REQ_TAG isn't set for HIPRI bio, we think it originated
> +  * from FS.
> +  *
> +  * Driver may allocated bio by itself and REQ_TAG is set, but they
> +  * won't be marked as HIPRI.
>*/
>   blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> + !(bio->bi_opf & REQ_TAG) &&
>   (bio->bi_opf & REQ_HIPRI));
>  
>   blk_bio_poll_preprocess(q, bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index db026b6ec15a..a1bcade4bcc3 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_flag_bits {
>  
>   __REQ_HIPRI,
>  
> + /* for marking IOs originated from same FS bio in same context */
> + __REQ_TAG,
> +
>   /* for driver use */
>   __REQ_DRV,
>   __REQ_SWAP, /* swapping request. */
> @@ -418,6 +421,7 @@ enum req_flag_bits {
>  
>  #define REQ_NOUNMAP  (1ULL << __REQ_NOUNMAP)
>  #define REQ_HIPRI(1ULL << __REQ_HIPRI)
> +#define REQ_TAG  (1ULL << __REQ_TAG)
>  
>  #define REQ_DRV  (1ULL << __REQ_DRV)
>  #define REQ_SWAP (1ULL << __REQ_SWAP)
> -- 
> 2.29.2
> 




Re: [dm-devel] [RFC PATCH V2 04/13] block: create io poll context for submission and poll task

2021-03-19 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> Create per-task io poll context for both IO submission and poll task
> if the queue is bio based and supports polling.
> 
> This io polling context includes two queues:
1) submission queue(sq) for storing HIPRI bio submission result(cookie)
   and the bio, written by submission task and read by poll task.
2) polling queue(pq) for holding data moved from sq, only used in poll
   context for running bio polling.
 
(nit, but it just reads a bit clearer to enumerate the 2 queues)

> Following patches will support bio poll.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-core.c  | 71 ---
>  block/blk-ioc.c   |  1 +
>  block/blk-mq.c| 14 
>  block/blk.h   | 46 +
>  include/linux/iocontext.h |  2 ++
>  5 files changed, 122 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d58f8a0c80de..0b00c21cbefb 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>   return BLK_STS_OK;
>  }
>  
> -static inline void blk_create_io_context(struct request_queue *q)
> +static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
>  {
> - /*
> -  * Various block parts want %current->io_context, so allocate it up
> -  * front rather than dealing with lots of pain to allocate it only
> -  * where needed. This may fail and the block layer knows how to live
> -  * with it.
> -  */
> - if (unlikely(!current->io_context))
> - create_task_io_context(current, GFP_ATOMIC, q->node);
> + struct io_context *ioc = current->io_context;
> +
> + return ioc ? ioc->data : NULL;
> +}
> +
> +static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> +{
> + return sizeof(struct bio_grp_list) + nr_grps *
> + sizeof(struct bio_grp_list_data);
> +}
> +
> +static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> +{
> + pc->sq = (void *)pc + sizeof(*pc);
> + pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
> +
> + pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
> + pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
> +
> + spin_lock_init(&pc->sq_lock);
> + mutex_init(&pc->pq_lock);
> +}
> +
> +void bio_poll_ctx_alloc(struct io_context *ioc)
> +{
> + struct blk_bio_poll_ctx *pc;
> + unsigned int size = sizeof(*pc) +
> + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
> + bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
> +
> + pc = kzalloc(size, GFP_ATOMIC);
> + if (pc) {
> + bio_poll_ctx_init(pc);
> + if (cmpxchg(&ioc->data, NULL, (void *)pc))
> + kfree(pc);
> + }
> +}
> +
> +static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> +{
> + return !queue_is_mq(q) && blk_queue_poll(q);
> +}
> +
> +static inline void blk_bio_poll_preprocess(struct request_queue *q,
> + struct bio *bio)
> +{
> + if (!(bio->bi_opf & REQ_HIPRI))
> + return;
> +
> + if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> + bio->bi_opf &= ~REQ_HIPRI;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>   }
>   }
>  
> - blk_create_io_context(q);
> + /*
> +  * Created per-task io poll queue if we supports bio polling
> +  * and it is one HIPRI bio.
> +  */

Create per-task io poll queue if bio polling supported and HIPRI set.


> + blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> + (bio->bi_opf & REQ_HIPRI));
>  
> - if (!blk_queue_poll(q))
> - bio->bi_opf &= ~REQ_HIPRI;
> + blk_bio_poll_preprocess(q, bio);
>  
>   switch (bio_op(bio)) {
>   case REQ_OP_DISCARD:
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b0cde18c4b8c..5574c398eff6 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
>  
>  static inline void free_io_context(struct io_context *ioc)
>  {
> + kfree(ioc->data);
>   kmem_cache_free(iocontext_cachep, ioc);
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 63c81df3b8b5..c832faa52ca0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>   return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> + /*
> +  * Create poll queue for storing poll bio and its cookie from
> +  * submission queue
> +  */
> + blk_create_io_context(q, true);
> +
> + return 0;
> +}
> +
>  /**
>   * blk_poll - poll for IO completions
>   * @q:  
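
The context struct itself isn't shown in the excerpt; a sketch of the
layout implied by bio_poll_ctx_init()/bio_poll_ctx_alloc() above -- one
allocation carrying the context plus both queues (field order is an
assumption):

    struct blk_bio_poll_ctx {
            spinlock_t              sq_lock;
            struct bio_grp_list     *sq;    /* filled at submit time */
            struct mutex            pq_lock;
            struct bio_grp_list     *pq;    /* drained by the polling task */
    };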

Re: [dm-devel] [RFC PATCH V2 01/13] block: add helper of blk_queue_poll

2021-03-19 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> There has been 3 users, and will be more, so add one such helper.
> 
> Signed-off-by: Ming Lei 

Not sure if you're collecting Reviewed-by or Acked-by at this point?
Seems you dropped Chaitanya's Reviewed-by to v1:
https://listman.redhat.com/archives/dm-devel/2021-March/msg00166.html

Do you plan to iterate a lot more before you put out a non-RFC?  For
this RFC v2, I'll withhold adding any of my Reviewed-by tags and just
reply where I see things that might need folding into the next
iteration.

Mike




Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-18 Thread Mike Snitzer
On Thu, Mar 18, 2021 at 1:26 PM Mike Snitzer  wrote:
>
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei  wrote:
>
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> >
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in 
> > bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> >
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> > So far bio's size is close to 2 cacheline size, and it may not be
> > accepted to add new field into bio for solving the scalability issue by
> > tracking bios via linked list, switch to bio group list for tracking bio,
> > the idea is to reuse .bi_end_io for linking bios into a linked list for
> > all sharing same .bi_end_io(call it bio group), which is recovered before
> > really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> > Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> > provide very limited groups, such as 32 for fixing the scalability issue.
> >
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> >
> > Signed-off-by: Ming Lei 
> > ---
> >  block/bio.c   |   5 ++
> >  block/blk-core.c  | 149 +++-
> >  block/blk-mq.c| 173 +-
> >  block/blk.h   |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > + /* BIO_END_BY_POLL has to be set before calling submit_bio */
> > + if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > + bio_set_flag(bio, BIO_DONE);
> > + return;
> > + }
> >  again:
> >   if (!bio_remaining_done(bio))
> >   return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >   sizeof(struct bio_grp_list_data);
> >  }
> >
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > + return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > + int i;
> > + struct bio_grp_list_data *grp;
> > +
> > + for (i = 0; i < list->nr_grps; i++) {
> > + grp = &list->head[i];
> > + if (grp->grp_data == bio_grp_data(bio)) {
> > + __bio_grp_list_add(&grp->list, bio);
> > + return true;
> > + }
> > + }
> > +
> > + if (i == list->max_nr_grps)
> > + return false;
> > +
> > + /* create a new group */
> > + grp = &list->head[i];
> > + bio_list_init(&grp->list);
> > + grp->grp_data = bio_grp_data(bio);
> > + __bio_grp_list_add(&grp->list, bio);
> > + list->nr_grps++;
> > +
> > + return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > + int i;
> > + struct bio_grp_list_data *grp;
> > +
> > + for (i = 0; i < list->max_nr_grps; i++) {
> > + grp = &list->head[i

Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-18 Thread Mike Snitzer
On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei  wrote:

> Currently bio based IO poll needs to poll all hw queue blindly, this way
> is very inefficient, and the big reason is that we can't pass bio
> submission result to io poll task.
> 
> In IO submission context, track associated underlying bios by per-task
> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> and return current->pid to caller of submit_bio() for any bio based
> driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of submission
> context, and we can find the bio from that submission context. Moving
> bio from submission queue to poll queue of the poll context, and keep
> polling until these bios are ended. Remove bio from poll queue if the
> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> 
> In the previous version, a kfifo was used to implement the submission
> queue, and Jeffle Xu found that kfifo can't scale well at high queue
> depth. A bio is already close to two cache lines in size, so adding a
> new field to it just to track bios on a linked list is unlikely to be
> accepted. Instead, switch to a bio group list: reuse .bi_end_io to
> link together all bios sharing the same .bi_end_io (call it a bio
> group), and recover the original .bi_end_io before really ending the
> bio -- BIO_END_BY_POLL is added to make this safe. Since .bi_end_io is
> usually the same for all bios in a given layer, a very limited number
> of groups, such as 32, is enough to fix the scalability issue.
> 
> Submission usually shares its context with io poll. The per-task poll
> context behaves like a stack variable, and it is cheap to move data
> between the two per-task queues.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/bio.c   |   5 ++
>  block/blk-core.c  | 149 +++-
>  block/blk-mq.c| 173 +-
>  block/blk.h   |   9 ++
>  include/linux/blk_types.h |  16 +++-
>  5 files changed, 348 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> + /* BIO_END_BY_POLL has to be set before calling submit_bio */
> + if (bio_flagged(bio, BIO_END_BY_POLL)) {
> + bio_set_flag(bio, BIO_DONE);
> + return;
> + }
>  again:
>   if (!bio_remaining_done(bio))
>   return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index efc7a61a84b4..778d25a7e76c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned 
> int nr_grps)
>   sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> + return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> + int i;
> + struct bio_grp_list_data *grp;
> +
> + for (i = 0; i < list->nr_grps; i++) {
> + grp = &list->head[i];
> + if (grp->grp_data == bio_grp_data(bio)) {
> + __bio_grp_list_add(&grp->list, bio);
> + return true;
> + }
> + }
> +
> + if (i == list->max_nr_grps)
> + return false;
> +
> + /* create a new group */
> + grp = &list->head[i];
> + bio_list_init(&grp->list);
> + grp->grp_data = bio_grp_data(bio);
> + __bio_grp_list_add(&grp->list, bio);
> + list->nr_grps++;
> +
> + return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> + int i;
> + struct bio_grp_list_data *grp;
> +
> + for (i = 0; i < list->max_nr_grps; i++) {
> + grp = &list->head[i];
> + if (grp->grp_data == grp_data)
> + return i;
> + }
> + for (i = 0; i < list->max_nr_grps; i++) {
> + grp = &list->head[i];
> + if (bio_grp_list_grp_empty(grp))
> + return i;
> + }
> + return -1;
> +}
> +
> +/* Move as many as possible groups from 'src' to 'dst' */
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> +{
> + int i, j, cnt = 0;
> + struct bio_grp_list_data *grp;
> +
> + for (i = src->nr_grps - 1; i >= 0; i--) {
> + grp = &src->head[i];
> + j = bio_grp_list_find_grp(dst, grp->grp_data);
> + if (j < 0)
> + break;
> + if (bio_grp_list_grp_empty(&dst->head[j]))
> + dst->head[j].grp_data = grp->grp_data;
> + __bio_grp_list_merge(&dst->head[j].list, &grp->list);
> + bio_list_init(&grp->list);
> + cnt++;
> + }
> +
> 
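
For reference, the bio group list these helpers operate on has roughly
the following shape (a sketch inferred from the quoted code; the real
definitions live in block/blk.h in this series):

    /*
     * Bios are bucketed by their ->bi_end_io pointer, so a poll
     * context only has to track a handful of groups rather than
     * every in-flight bio individually.
     */
    struct bio_grp_list_data {
            void            *grp_data;      /* group key: the shared ->bi_end_io */
            struct bio_list list;           /* bios belonging to this group */
    };

    struct bio_grp_list {
            unsigned int    max_nr_grps;    /* fixed capacity, e.g. 32 */
            unsigned int    nr_grps;        /* slots currently in use */
            struct bio_grp_list_data head[];
    };

bio_grp_list_add() appends to an existing group when the key matches and
only claims a new slot when it does not, which is what keeps the
tracking cost bounded regardless of queue depth.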

Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll

2021-03-18 Thread Mike Snitzer
On Wed, Mar 17 2021 at  3:19am -0400,
Ming Lei  wrote:

> On Wed, Mar 17, 2021 at 11:49:00AM +0800, JeffleXu wrote:
> > 
> > 
> > On 3/16/21 7:00 PM, JeffleXu wrote:
> > > 
> > > 
> > > On 3/16/21 3:17 PM, Ming Lei wrote:
> > >> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> > >>> It is great progress to gather all split bios that need to be polled
> > >>> in a per-task queue. Still, some comments below.
> > >>>
> > >>>
> > >>> On 3/16/21 11:15 AM, Ming Lei wrote:
> >  Currently bio based IO poll needs to poll all hw queue blindly, this 
> >  way
> >  is very inefficient, and the big reason is that we can't pass bio
> >  submission result to io poll task.
> > 
> >  In IO submission context, store associated underlying bios into the
> >  submission queue and save 'cookie' poll data in 
> >  bio->bi_iter.bi_private_data,
> >  and return current->pid to caller of submit_bio() for any DM or bio 
> >  based
> >  driver's IO, which is submitted from FS.
> > 
> >  In IO poll context, the passed cookie tells us the PID of submission
> >  context, and we can find the bio from that submission context. Moving
> >  bio from submission queue to poll queue of the poll context, and keep
> >  polling until these bios are ended. Remove bio from poll queue if the
> >  bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > 
> >  Usually submission shares context with io poll. The per-task poll 
> >  context
> >  is just like stack variable, and it is cheap to move data between the 
> >  two
> >  per-task queues.
> > 
> >  Signed-off-by: Ming Lei 
> >  ---
> >   block/bio.c   |   5 ++
> >   block/blk-core.c  |  74 +-
> >   block/blk-mq.c| 156 +-
> >   include/linux/blk_types.h |   3 +
> >   4 files changed, 235 insertions(+), 3 deletions(-)
> > 
> >  diff --git a/block/bio.c b/block/bio.c
> >  index a1c4d2900c7a..bcf5eca0e8e3 100644
> >  --- a/block/bio.c
> >  +++ b/block/bio.c
> >  @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct 
> >  bio *bio)
> >    **/
> >   void bio_endio(struct bio *bio)
> >   {
> >  +  /* BIO_END_BY_POLL has to be set before calling submit_bio */
> >  +  if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >  +  bio_set_flag(bio, BIO_DONE);
> >  +  return;
> >  +  }
> >   again:
> > if (!bio_remaining_done(bio))
> > return;
> >  diff --git a/block/blk-core.c b/block/blk-core.c
> >  index a082bbc856fb..970b23fa2e6e 100644
> >  --- a/block/blk-core.c
> >  +++ b/block/blk-core.c
> >  @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct 
> >  request_queue *q,
> > bio->bi_opf |= REQ_TAG;
> >   }
> >   
> >  +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct 
> >  bio *bio)
> >  +{
> >  +  struct blk_bio_poll_data data = {
> >  +  .bio=   bio,
> >  +  };
> >  +  struct blk_bio_poll_ctx *pc = ioc->data;
> >  +  unsigned int queued;
> >  +
> >  +  /* lock is required if there is more than one writer */
> >  +  if (unlikely(atomic_read(&pc->nr_tasks) > 1)) {
> >  +  spin_lock(&pc->lock);
> >  +  queued = kfifo_put(&pc->sq, data);
> >  +  spin_unlock(&pc->lock);
> >  +  } else {
> >  +  queued = kfifo_put(&pc->sq, data);
> >  +  }
> >  +
> >  +  /*
> >  +   * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >  +   * so we can save cookie into this bio after submit_bio().
> >  +   */
> >  +  if (queued)
> >  +  bio_set_flag(bio, BIO_END_BY_POLL);
> >  +  else
> >  +  bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >  +
> >  +  return queued;
> >  +}
> > >>>
> > >>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> > >>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> > >>> enqueued into the default hw queue, which is IRQ driven.
> > >>
> > >> Yeah, this patch starts with 64 queue depth, and we can increase it to
> > >> 128, which should cover most of cases.
> > >>
> > >>>
> > >>>
> >  +
> >  +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >  +{
> >  +  bio->bi_iter.bi_private_data = cookie;
> >  +}
> >  +
> >   static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >   {
> > struct block_device *bdev = bio->bi_bdev;
> >  @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >    * bio_list_on_stack[1] contains bios that were submitted before the 
> 

Re: [dm-devel] [PATCH v7 2/3] block: add bdev_interposer

2021-03-17 Thread Mike Snitzer
On Wed, Mar 17 2021 at  2:14pm -0400,
Sergei Shtepa  wrote:

> The 03/17/2021 18:04, Mike Snitzer wrote:
> > On Wed, Mar 17 2021 at  8:22am -0400,
> > Sergei Shtepa  wrote:
> > 
> > > The 03/17/2021 06:03, Ming Lei wrote:
> > > > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > > > bdev_interposer allows bio requests to be redirected to another
> > > > > > > device.
> > > > > > > 
> > > > > > > Signed-off-by: Sergei Shtepa 
> > > > > > > ---
> > > > > > >  block/bio.c   |  2 ++
> > > > > > >  block/blk-core.c  | 57 
> > > > > > > +++
> > > > > > >  block/genhd.c | 54 
> > > > > > > +
> > > > > > >  include/linux/blk_types.h |  3 +++
> > > > > > >  include/linux/blkdev.h|  9 +++
> > > > > > >  5 files changed, 125 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > > > --- a/block/bio.c
> > > > > > > +++ b/block/bio.c
> > > > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct 
> > > > > > > bio *bio_src)
> > > > > > >   bio_set_flag(bio, BIO_THROTTLED);
> > > > > > >   if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > > > >   bio_set_flag(bio, BIO_REMAPPED);
> > > > > > > + if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > >   bio->bi_opf = bio_src->bi_opf;
> > > > > > >   bio->bi_ioprio = bio_src->bi_ioprio;
> > > > > > >   bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > > > --- a/block/blk-core.c
> > > > > > > +++ b/block/blk-core.c
> > > > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t 
> > > > > > > __submit_bio_noacct_mq(struct bio *bio)
> > > > > > >   return ret;
> > > > > > >  }
> > > > > > >  
> > > > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > > > +{
> > > > > > > + blk_qc_t ret = BLK_QC_T_NONE;
> > > > > > > + struct bio_list bio_list[2] = { };
> > > > > > > + struct gendisk *orig_disk;
> > > > > > > +
> > > > > > > + if (current->bio_list) {
> > > > > > > + bio_list_add(&current->bio_list[0], bio);
> > > > > > > + return BLK_QC_T_NONE;
> > > > > > > + }
> > > > > > > +
> > > > > > > + orig_disk = bio->bi_bdev->bd_disk;
> > > > > > > + if (unlikely(bio_queue_enter(bio)))
> > > > > > > + return BLK_QC_T_NONE;
> > > > > > > +
> > > > > > > + current->bio_list = bio_list;
> > > > > > > +
> > > > > > > + do {
> > > > > > > + struct block_device *interposer = 
> > > > > > > bio->bi_bdev->bd_interposer;
> > > > > > > +
> > > > > > > + if (unlikely(!interposer)) {
> > > > > > > + /* interposer was removed */
> > > > > > > + bio_list_add(&bio_list[0], bio);
> > > > > > > + break;
> > > > > > > + }
> > > > > > > + /* assign bio to interposer device */
> > > > > > > + bio_set_dev(bio, interposer);
> > > > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > > +
> > > > > > > + if (!submit_bio_checks(bio))
> > > > > > > + break;
>

Re: [dm-devel] [PATCH v7 2/3] block: add bdev_interposer

2021-03-17 Thread Mike Snitzer
On Wed, Mar 17 2021 at  8:22am -0400,
Sergei Shtepa  wrote:

> The 03/17/2021 06:03, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > bdev_interposer allows bio requests to be redirected to another device.
> > > > > 
> > > > > Signed-off-by: Sergei Shtepa 
> > > > > ---
> > > > >  block/bio.c   |  2 ++
> > > > >  block/blk-core.c  | 57 
> > > > > +++
> > > > >  block/genhd.c | 54 +
> > > > >  include/linux/blk_types.h |  3 +++
> > > > >  include/linux/blkdev.h|  9 +++
> > > > >  5 files changed, 125 insertions(+)
> > > > > 
> > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > --- a/block/bio.c
> > > > > +++ b/block/bio.c
> > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> > > > > *bio_src)
> > > > >   bio_set_flag(bio, BIO_THROTTLED);
> > > > >   if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > >   bio_set_flag(bio, BIO_REMAPPED);
> > > > > + if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > >   bio->bi_opf = bio_src->bi_opf;
> > > > >   bio->bi_ioprio = bio_src->bi_ioprio;
> > > > >   bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > --- a/block/blk-core.c
> > > > > +++ b/block/blk-core.c
> > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct 
> > > > > bio *bio)
> > > > >   return ret;
> > > > >  }
> > > > >  
> > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > +{
> > > > > + blk_qc_t ret = BLK_QC_T_NONE;
> > > > > + struct bio_list bio_list[2] = { };
> > > > > + struct gendisk *orig_disk;
> > > > > +
> > > > > + if (current->bio_list) {
> > > > > + bio_list_add(&current->bio_list[0], bio);
> > > > > + return BLK_QC_T_NONE;
> > > > > + }
> > > > > +
> > > > > + orig_disk = bio->bi_bdev->bd_disk;
> > > > > + if (unlikely(bio_queue_enter(bio)))
> > > > > + return BLK_QC_T_NONE;
> > > > > +
> > > > > + current->bio_list = bio_list;
> > > > > +
> > > > > + do {
> > > > > + struct block_device *interposer = 
> > > > > bio->bi_bdev->bd_interposer;
> > > > > +
> > > > > + if (unlikely(!interposer)) {
> > > > > + /* interposer was removed */
> > > > > + bio_list_add(&bio_list[0], bio);
> > > > > + break;
> > > > > + }
> > > > > + /* assign bio to interposer device */
> > > > > + bio_set_dev(bio, interposer);
> > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > > +
> > > > > + if (!submit_bio_checks(bio))
> > > > > + break;
> > > > > + /*
> > > > > +  * Because the current->bio_list is initialized,
> > > > > +  * the submit_bio callback will always return 
> > > > > BLK_QC_T_NONE.
> > > > > +  */
> > > > > + interposer->bd_disk->fops->submit_bio(bio);
> > > > 
> > > > Given original request queue may become live when calling attach() and
> > > > detach(), see below comment. bdev_interposer_detach() may be run
> > > > when running ->submit_bio(), meantime the interposer device is
> > > > gone during the period, then kernel oops.
> > > 
> > > I think that since the bio_queue_enter() function was called,
> > > q->q_usage_counter will not allow the critical code in the attach/detach
> > > functions to be executed, which is located between the blk_freeze_queue
> > > and blk_unfreeze_queue calls.
> > > Please correct me if I'm wrong.
> > > 
> > > > 
> > > > > + } while (false);
> > > > > +
> > > > > + current->bio_list = NULL;
> > > > > +
> > > > > + blk_queue_exit(orig_disk->queue);
> > > > > +
> > > > > + /* Resubmit remaining bios */
> > > > > + while ((bio = bio_list_pop(_list[0])))
> > > > > + ret = submit_bio_noacct(bio);
> > > > > +
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * submit_bio_noacct - re-submit a bio to the block device layer for 
> > > > > I/O
> > > > >   * @bio:  The bio describing the location in memory and on the 
> > > > > device.
> > > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct 
> > > > > bio *bio)
> > > > >   */
> > > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > > >  {
> > > > > + /*
> > > > > +  * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > > +  * created by the bdev_interposer do not get to it for 
> > > > > processing.
> > > > > +  */
> > 

Re: [dm-devel] [PATCH v7 2/3] block: add bdev_interposer

2021-03-17 Thread Mike Snitzer
On Tue, Mar 16 2021 at 11:03pm -0400,
Ming Lei  wrote:

> On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > The 03/16/2021 11:09, Ming Lei wrote:
> > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > bdev_interposer allows bio requests to be redirected to another device.
> > > > 
> > > > Signed-off-by: Sergei Shtepa 

...

> > > > +
> > > > +int bdev_interposer_attach(struct block_device *original,
> > > > +  struct block_device *interposer)
> > > > +{
> > > > +   int ret = 0;
> > > > +
> > > > +   if (WARN_ON(((!original) || (!interposer))))
> > > > +   return -EINVAL;
> > > > +   /*
> > > > +* interposer should be simple, no a multi-queue device
> > > > +*/
> > > > +   if (!interposer->bd_disk->fops->submit_bio)
> > > > +   return -EINVAL;
> > > > +
> > > > +   if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > +   return -EPERM;
> > > 
> > > The original request queue may become live now...
> > 
> > Yes.
> > I will remove the blk_mq_is_queue_frozen() function and use a different
> > approach.
> 
> Looks like what attach and detach need is for the queue to be kept in a
> frozen state, instead of merely being frozen at the beginning of the two
> functions, so you can simply call freeze/unfreeze inside the two functions.
> 
> But what if 'original' isn't an MQ queue?  The queue usage counter is just
> grabbed when calling ->submit_bio(), and queue freeze doesn't guarantee there
> isn't any io activity; is that a problem for the bdev_interposer use case?

Right, I raised the same concern here:
https://listman.redhat.com/archives/dm-devel/2021-March/msg00135.html
(toward bottom, inlined after dm_disk_{freeze,unfreeze})

Anyway, this certainly needs to be addressed.
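
To make the lifetime problem concrete, the race being discussed looks
roughly like this (a sketch of the concern, not code from the patch):

    /*
     * CPU0: submission                     CPU1: teardown
     * ----------------                     --------------
     * bio_queue_enter(bio)
     * ip = bio->bi_bdev->bd_interposer
     *                                      bdev_interposer_detach()
     *                                      (interposer torn down)
     * ip->bd_disk->fops->submit_bio(bio)   <- use after free
     *
     * Holding q_usage_counter via bio_queue_enter() only closes the
     * window if detach actually waits for the counter to drain; a
     * bare "is the queue frozen" assertion cannot guarantee that,
     * and for a non-MQ original queue there is no freeze-based
     * drain to lean on at all.
     */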

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v2] dm table: Fix zoned model check and zone sectors check

2021-03-16 Thread Mike Snitzer
On Tue, Mar 16 2021 at  2:14am -0400,
Damien Le Moal  wrote:

> On 2021/03/16 13:36, Shin'ichiro Kawasaki wrote:
> > Commit 24f6b6036c9e ("dm table: fix zoned iterate_devices based device
> > capability checks") triggered dm table load failure when dm-zoned device
> > is set up for zoned block devices and a regular device for cache.
> > 
> > The commit inverted logic of two callback functions for iterate_devices:
> > device_is_zoned_model() and device_matches_zone_sectors(). The logic of
> > device_is_zoned_model() was inverted so that all destination devices of all
> > targets in dm table are required to have the expected zoned model. This
> > is fine for dm-linear, dm-flakey and dm-crypt on zoned block devices
> > since each target has only one destination device. However, this results
> > in failure for dm-zoned with regular cache device since that target has
> > both regular block device and zoned block devices.
> > 
> > As for device_matches_zone_sectors(), the commit inverted the logic to
> > require all zoned block devices in each target have the specified
> > zone_sectors. This check also fails for regular block device which does
> > not have zones.
> > 
> > To avoid the check failures, fix the zone model check and the zone
> > sectors check. For zone model check, introduce the new feature flag
> > DM_TARGET_MIXED_ZONED_MODEL, and set it to dm-zoned target. When the
> > target has this flag, allow it to have destination devices with any
> > zoned model. For zone sectors check, skip the check if the destination
> > device is not a zoned block device. Also add comments and improve an
> > error message to clarify expectations to the two checks.
> > 
> > Fixes: 24f6b6036c9e ("dm table: fix zoned iterate_devices based device 
> > capability checks")
> > Signed-off-by: Shin'ichiro Kawasaki 
> > Signed-off-by: Damien Le Moal 
> > ---
> > Changes from v1:
> > * Added DM_TARGET_MIXED_ZONED_MODEL feature for zoned model check of 
> > dm-zoned
> > 
> >  drivers/md/dm-table.c | 34 ++
> >  drivers/md/dm-zoned-target.c  |  2 +-
> >  include/linux/device-mapper.h | 15 ++-
> >  3 files changed, 41 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index 95391f78b8d5..cc73d5b473eb 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1594,6 +1594,13 @@ static int device_not_zoned_model(struct dm_target 
> > *ti, struct dm_dev *dev,
> > return blk_queue_zoned_model(q) != *zoned_model;
> >  }
> >  
> > +/*
> > + * Check the device zoned model based on the target feature flag. If the 
> > target
> > + * has the DM_TARGET_ZONED_HM feature flag set, host-managed zoned devices 
> > are
> > + * also accepted but all devices must have the same zoned model. If the 
> > target
> > + * has the DM_TARGET_MIXED_ZONED_MODEL feature set, the devices can have 
> > any
> > + * zoned model with all zoned devices having the same zone size.
> > + */
> >  static bool dm_table_supports_zoned_model(struct dm_table *t,
> >   enum blk_zoned_model zoned_model)
> >  {
> > @@ -1603,13 +1610,16 @@ static bool dm_table_supports_zoned_model(struct 
> > dm_table *t,
> > for (i = 0; i < dm_table_get_num_targets(t); i++) {
> > ti = dm_table_get_target(t, i);
> >  
> > -   if (zoned_model == BLK_ZONED_HM &&
> > -   !dm_target_supports_zoned_hm(ti->type))
> > -   return false;
> > -
> > -   if (!ti->type->iterate_devices ||
> > -   ti->type->iterate_devices(ti, device_not_zoned_model, 
> > _model))
> > -   return false;
> > +   if (dm_target_supports_zoned_hm(ti->type)) {
> > +   if (!ti->type->iterate_devices ||
> > +   ti->type->iterate_devices(ti,
> > + device_not_zoned_model,
> > + &zoned_model))
> > +   return false;
> > +   } else if (!dm_target_supports_mixed_zoned_model(ti->type)) {
> > +   if (zoned_model == BLK_ZONED_HM)
> > +   return false;
> > +   }
> > }
> >  
> > return true;
> > @@ -1621,9 +1631,17 @@ static int device_not_matches_zone_sectors(struct 
> > dm_target *ti, struct dm_dev *
> > struct request_queue *q = bdev_get_queue(dev->bdev);
> > unsigned int *zone_sectors = data;
> >  
> > +   if (!blk_queue_is_zoned(q))
> > +   return 0;
> > +
> > return blk_queue_zone_sectors(q) != *zone_sectors;
> >  }
> >  
> > +/*
> > + * Check consistency of zoned model and zone sectors across all targets. 
> > For
> > + * zone sectors, if the destination device is a zoned block device, it 
> > shall
> > + * have the specified zone_sectors.
> > + */
> >  static int validate_hardware_zoned_model(struct dm_table *table,
> >  

Re: [dm-devel] dm table: Fix zoned model check and zone sectors check

2021-03-12 Thread Mike Snitzer
On Thu, Mar 11 2021 at  6:30pm -0500,
Damien Le Moal  wrote:

> On 2021/03/12 2:54, Mike Snitzer wrote:
> > On Wed, Mar 10 2021 at  3:25am -0500,
> > Shin'ichiro Kawasaki  wrote:
> > 
> >> Commit 24f6b6036c9e ("dm table: fix zoned iterate_devices based device
> >> capability checks") triggered dm table load failure when dm-zoned device
> >> is set up for zoned block devices and a regular device for cache.
> >>
> >> The commit inverted logic of two callback functions for iterate_devices:
> >> device_is_zoned_model() and device_matches_zone_sectors(). The logic of
> >> device_is_zoned_model() was inverted so that all destination devices of all
> >> targets in dm table are required to have the expected zoned model. This
> >> is fine for dm-linear, dm-flakey and dm-crypt on zoned block devices
> >> since each target has only one destination device. However, this results
> >> in failure for dm-zoned with regular cache device since that target has
> >> both regular block device and zoned block devices.
> >>
> >> As for device_matches_zone_sectors(), the commit inverted the logic to
> >> require all zoned block devices in each target have the specified
> >> zone_sectors. This check also fails for regular block device which does
> >> not have zones.
> >>
> >> To avoid the check failures, fix the zone model check and the zone
> >> sectors check. For zone model check, invert the device_is_zoned_model()
> >> logic again to require at least one destination device in one target has
> >> the specified zoned model. For zone sectors check, skip the check if the
> >> destination device is not a zoned block device. Also add comments and
> >> improve error messages to clarify expectations to the two checks.
> >>
> >> Signed-off-by: Shin'ichiro Kawasaki 
> >> Fixes: 24f6b6036c9e ("dm table: fix zoned iterate_devices based device 
> >> capability checks")
> >> ---
> >>  drivers/md/dm-table.c | 21 +++--
> >>  1 file changed, 15 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> >> index 95391f78b8d5..04b7a3978ef8 100644
> >> --- a/drivers/md/dm-table.c
> >> +++ b/drivers/md/dm-table.c
> >> @@ -1585,13 +1585,13 @@ bool dm_table_has_no_data_devices(struct dm_table 
> >> *table)
> >>return true;
> >>  }
> >>  
> >> -static int device_not_zoned_model(struct dm_target *ti, struct dm_dev 
> >> *dev,
> >> -sector_t start, sector_t len, void *data)
> >> +static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev,
> >> +   sector_t start, sector_t len, void *data)
> >>  {
> >>struct request_queue *q = bdev_get_queue(dev->bdev);
> >>enum blk_zoned_model *zoned_model = data;
> >>  
> >> -  return blk_queue_zoned_model(q) != *zoned_model;
> >> +  return blk_queue_zoned_model(q) == *zoned_model;
> >>  }
> >>  
> >>  static bool dm_table_supports_zoned_model(struct dm_table *t,
> >> @@ -1608,7 +1608,7 @@ static bool dm_table_supports_zoned_model(struct 
> >> dm_table *t,
> >>return false;
> >>  
> >>if (!ti->type->iterate_devices ||
> >> -  ti->type->iterate_devices(ti, device_not_zoned_model, 
> >> &zoned_model))
> >> +  !ti->type->iterate_devices(ti, device_is_zoned_model, 
> >> &zoned_model))
> >>return false;
> >>}
> > 
> > The point here is to ensure all zoned devices match the specific model,
> > right?
> > 
> > I understand commit 24f6b6036c9e wasn't correct, sorry about that.
> > But I don't think your change is correct either.  It'll allow a mix of
> > various zoned models (that might come after the first positive match for
> > the specified zoned_model)... but because the first match short-circuits
> > the loop those later mismatched zoned devices aren't checked.
> > 
> > Should device_is_zoned_model() also be trained to ignore BLK_ZONED_NONE
> > (like you did below)?
> 
> Thinking more about this, I think we may have a deeper problem here. We need 
> to
> allow the combination of BLK_ZONED_NONE and BLK_ZONED_HM for dm-zoned multi
> drive config using a regular SSD as cache. But blindly allowing such 
> combination
> of zoned and non-zoned drives will also end up allowing a setup comb

Re: [dm-devel] [PATCH v7 1/3] block: add blk_mq_is_queue_frozen()

2021-03-12 Thread Mike Snitzer
On Fri, Mar 12 2021 at 10:44am -0500,
Sergei Shtepa  wrote:

> blk_mq_is_queue_frozen() allows asserting that the queue is frozen.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  block/blk-mq.c | 13 +
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..2f188a865024 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -161,6 +161,19 @@ int blk_mq_freeze_queue_wait_timeout(struct 
> request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
>  
> +bool blk_mq_is_queue_frozen(struct request_queue *q)
> +{
> + bool frozen;
> +
> + mutex_lock(&q->mq_freeze_lock);
> + frozen = percpu_ref_is_dying(&q->q_usage_counter) &&
> +  percpu_ref_is_zero(&q->q_usage_counter);
> + mutex_unlock(&q->mq_freeze_lock);
> +
> + return frozen;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
> +

This is returning a frozen state that is immediately stale.  I don't
think any code calling this gets the guarantees you think it
does, due to the racy nature of this state once the mutex is dropped.
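
In other words, any caller pattern along these lines is inherently racy
(illustration only; the called function is hypothetical):

    if (blk_mq_is_queue_frozen(q)) {
            /*
             * The answer was only valid while mq_freeze_lock was held
             * inside the helper; by this point another context may
             * already have called blk_mq_unfreeze_queue(q), so acting
             * on "frozen" here is a time-of-check/time-of-use race.
             */
            do_work_that_requires_frozen_queue(q);
    }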

>  /*
>   * Guarantee no request is in use, so we can change any data structure of
>   * the queue afterward.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2c473c9b8990..6f01971abf7b 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -533,6 +533,7 @@ void blk_freeze_queue_start(struct request_queue *q);
>  void blk_mq_freeze_queue_wait(struct request_queue *q);
>  int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>unsigned long timeout);
> +bool blk_mq_is_queue_frozen(struct request_queue *q);
>  
>  int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int 
> nr_hw_queues);
> -- 
> 2.20.1
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG

2021-03-12 Thread Mike Snitzer
On Fri, Mar 12 2021 at 10:44am -0500,
Sergei Shtepa  wrote:

> DM_INTERPOSED_FLAG allows DM targets to be created on the fly.
> The underlying block device is opened without the FMODE_EXCL flag.
> The DM target receives bios from the original device via
> bdev_interposer.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  drivers/md/dm-core.h  |  3 ++
>  drivers/md/dm-ioctl.c | 13 
>  drivers/md/dm-table.c | 61 +--
>  drivers/md/dm.c   | 38 +++---
>  include/linux/device-mapper.h |  1 +
>  include/uapi/linux/dm-ioctl.h |  6 
>  6 files changed, 101 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 5953ff2bd260..9eae419c7b18 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -114,6 +114,9 @@ struct mapped_device {
>   bool init_tio_pdu:1;
>  
>   struct srcu_struct io_barrier;
> +
> + /* attach target via block-layer interposer */
> + bool is_interposed;
>  };

This flag is a mix of uses.  First it is used to store that
DM_INTERPOSED_FLAG was provided as input param during load.

And the same 'is_interposed' name is used in 'struct dm_dev'.

To me this state should be elevated to block core -- awkward for every
driver that might want to use blk-interposer to be sprinkling state
around its core structures.

So I'd prefer you:
1) rename 'struct mapped_device's 'is_interposed' to 'interpose' _and_ add
   it just after "bool init_tio_pdu:1;" as "bool interpose:1;" -- this
   reflects that interpose was requested during load.
2) bdev_interposer_attach() should be made to set some block core state
   that is able to be tested to check if a device is_interposed.
3) Don't add an 'is_interposed' flag to 'struct dm_dev'
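
I.e. something along these lines for (1) (a sketch):

    struct mapped_device {
            /* ... */
            bool init_tio_pdu:1;
            bool interpose:1;       /* interpose requested during table load */

            struct srcu_struct io_barrier;
            /* ... */
    };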

>  
>  void disable_discard(struct mapped_device *md);
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index 5e306bba4375..2b4c9557c283 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
>   return mode;
>  }
>  
> +static inline bool get_interposer_flag(struct dm_ioctl *param)
> +{
> + return (param->flags & DM_INTERPOSED_FLAG);
> +}
> +

As I mention at the end: rename to DM_INTERPOSE_FLAG

>  static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
>  struct dm_target_spec **spec, char **target_params)
>  {
> @@ -1293,6 +1298,10 @@ static int populate_table(struct dm_table *table,
>   DMWARN("populate_table: no targets specified");
>   return -EINVAL;
>   }
> + if (table->md->is_interposed && (param->target_count != 1)) {
> + DMWARN("%s: with interposer should be specified only one 
> target", __func__);

This error/warning reads very awkwardly. Maybe?:
"%s: interposer requires only a single target be specified"

> + return -EINVAL;
> + }
>  
>   for (i = 0; i < param->target_count; i++) {
>  
> @@ -1338,6 +1347,8 @@ static int table_load(struct file *filp, struct 
> dm_ioctl *param, size_t param_si
>   if (!md)
>   return -ENXIO;
>  
> + md->is_interposed = get_interposer_flag(param);
> +
>   r = dm_table_create(, get_mode(param), param->target_count, md);
>   if (r)
>   goto err;
> @@ -2098,6 +2109,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
>   if (r)
>   goto err_hash_remove;
>  
> + md->is_interposed = get_interposer_flag(dmi);
> +
>   /* add targets */
>   for (i = 0; i < dmi->target_count; i++) {
>   r = dm_table_add_target(t, spec_array[i]->target_type,
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 95391f78b8d5..f6e2eb3f8949 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -225,12 +225,13 @@ void dm_table_destroy(struct dm_table *t)
>  /*
>   * See if we've already got a device in the list.
>   */
> -static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
> +static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, 
> bool is_interposed)

I think it makes more sense to internalize the need to consider
md->interpose here.

So:

static struct dm_dev_internal *find_device(struct dm_table *t, dev_t dev)
{
struct list_head *l = &t->devices;
bool is_interposed = t->md->interpose;
...

>  {
>   struct dm_dev_internal *dd;
>  
>   list_for_each_entry (dd, l, list)
> - if (dd->dm_dev->bdev->bd_dev == dev)
> + if ((dd->dm_dev->bdev->bd_dev == dev) &&
> + (dd->dm_dev->is_interposed == is_interposed))
>   return dd;

But why must this extra state be used/tested?  Seems like quite a deep
embedding of interposer into dm_dev_internal.. feels unnecessary.

>  
>   return NULL;
> @@ -358,6 +359,18 @@ dev_t dm_get_dev_t(const char *path)
>  }
>  EXPORT_SYMBOL_GPL(dm_get_dev_t);
>  
> +static 

Re: [dm-devel] [PATCH v2] dm-ioctl: return UUID in DM_LIST_DEVICES_CMD result

2021-03-11 Thread Mike Snitzer
On Thu, Mar 11 2021 at  2:43pm -0500,
Mikulas Patocka  wrote:

> 
> 
> On Thu, 11 Mar 2021, Mike Snitzer wrote:
> 
> > > Index: linux-2.6/include/uapi/linux/dm-ioctl.h
> > > ===
> > > --- linux-2.6.orig/include/uapi/linux/dm-ioctl.h  2021-03-09 
> > > 12:20:23.0 +0100
> > > +++ linux-2.6/include/uapi/linux/dm-ioctl.h   2021-03-11 
> > > 18:42:14.0 +0100
> > > @@ -193,8 +193,15 @@ struct dm_name_list {
> > >   __u32 next; /* offset to the next record from
> > >  the _start_ of this */
> > >   char name[0];
> > > +
> > > + /* uint32_t event_nr; */
> > > + /* uint32_t flags; */
> > > + /* char uuid[0]; */
> > >  };
> > 
> > If extra padding is being leveraged here (from the __u32 next), why not
> > at least explicitly add the members and then pad out the balance of that
> > __u32?  I'm not liking the usage of phantom struct members.. e.g.
> > the games played with accessing them.
> > 
> > Mike
> 
> What exactly do you mean?
> 
> Do you want to create another structure that holds event_nr, flags and 
> uuid? Or something else?

Just not liking the comments you added in lieu of explicit struct
members.

Can't you remove __u32 next; and replace it with named members of
appropriate size? Adding explicit padding to the end to get you to a
32-bit offset?  I'd need to look closer at the way the code is written,
but I just feel like this patch makes the code even more fiddly.
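
E.g. keep the variable-length name, but give the trailer a named type
instead of bare comments (a sketch of the idea; the struct name is
invented, and whether the resulting offsets still match the old ABI
would need checking):

    /* placed after name[], aligned to 8 bytes */
    struct dm_name_list_extra {
            __u32 event_nr;
            __u32 flags;            /* DM_NAME_LIST_FLAG_* */
            char  uuid[0];          /* present if DM_NAME_LIST_FLAG_HAS_UUID */
    };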

But I can let it go if you don't see a way forward to make it better..

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v2] dm-ioctl: return UUID in DM_LIST_DEVICES_CMD result

2021-03-11 Thread Mike Snitzer
On Thu, Mar 11 2021 at  1:26pm -0500,
Mikulas Patocka  wrote:

> When LVM needs to find a device with a particular UUID it needs to ask for
> UUID for each device. This patch returns UUID directly in the list of
> devices, so that LVM doesn't have to query all the devices with an ioctl.
> The UUID is returned if the flag DM_UUID_FLAG is set in the parameters.
> 
> Returning UUID is done in backward-compatible way. There's one unused
> 32-bit word value after the event number. This patch sets the bit
> DM_NAME_LIST_FLAG_HAS_UUID if UUID is present and
> DM_NAME_LIST_FLAG_DOESNT_HAVE_UUID if it isn't (if none of these bits is
> set, then we have an old kernel that doesn't support returning UUIDs). The
> UUID is stored after this word. The 'next' value is updated to point after
> the UUID, so that old version of libdevmapper will skip the UUID without
> attempting to interpret it.
> 
> Signed-off-by: Mikulas Patocka 
> 
> ---
>  drivers/md/dm-ioctl.c |   20 +---
>  include/uapi/linux/dm-ioctl.h |7 +++
>  2 files changed, 24 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/drivers/md/dm-ioctl.c
> ===
> --- linux-2.6.orig/drivers/md/dm-ioctl.c  2021-03-09 21:04:07.0 
> +0100
> +++ linux-2.6/drivers/md/dm-ioctl.c   2021-03-11 18:53:58.0 +0100
> @@ -558,7 +558,9 @@ static int list_devices(struct file *fil
>   for (n = rb_first(&name_rb_tree); n; n = rb_next(n)) {
>   hc = container_of(n, struct hash_cell, name_node);
>   needed += align_val(offsetof(struct dm_name_list, name) + 
> strlen(hc->name) + 1);
> - needed += align_val(sizeof(uint32_t));
> + needed += align_val(sizeof(uint32_t) * 2);
> + if (param->flags & DM_UUID_FLAG && hc->uuid)
> + needed += align_val(strlen(hc->uuid) + 1);
>   }
>  
>   /*
> @@ -577,6 +579,7 @@ static int list_devices(struct file *fil
>* Now loop through filling out the names.
>*/
>   for (n = rb_first(&name_rb_tree); n; n = rb_next(n)) {
> + void *uuid_ptr;
>   hc = container_of(n, struct hash_cell, name_node);
>   if (old_nl)
>   old_nl->next = (uint32_t) ((void *) nl -
> @@ -588,8 +591,19 @@ static int list_devices(struct file *fil
>  
>   old_nl = nl;
>   event_nr = align_ptr(nl->name + strlen(hc->name) + 1);
> - *event_nr = dm_get_event_nr(hc->md);
> - nl = align_ptr(event_nr + 1);
> + event_nr[0] = dm_get_event_nr(hc->md);
> + event_nr[1] = 0;
> + uuid_ptr = align_ptr(event_nr + 2);
> + if (param->flags & DM_UUID_FLAG) {
> + if (hc->uuid) {
> + event_nr[1] |= DM_NAME_LIST_FLAG_HAS_UUID;
> + strcpy(uuid_ptr, hc->uuid);
> + uuid_ptr = align_ptr(uuid_ptr + 
> strlen(hc->uuid) + 1);
> + } else {
> + event_nr[1] |= 
> DM_NAME_LIST_FLAG_DOESNT_HAVE_UUID;
> + }
> + }
> + nl = uuid_ptr;
>   }
>   /*
>* If mismatch happens, security may be compromised due to buffer
> Index: linux-2.6/include/uapi/linux/dm-ioctl.h
> ===
> --- linux-2.6.orig/include/uapi/linux/dm-ioctl.h  2021-03-09 
> 12:20:23.0 +0100
> +++ linux-2.6/include/uapi/linux/dm-ioctl.h   2021-03-11 18:42:14.0 
> +0100
> @@ -193,8 +193,15 @@ struct dm_name_list {
>   __u32 next; /* offset to the next record from
>  the _start_ of this */
>   char name[0];
> +
> + /* uint32_t event_nr; */
> + /* uint32_t flags; */
> + /* char uuid[0]; */
>  };

If extra padding is being leveraged here (from the __u32 next), why not
at least explicitly add the members and then pad out the balance of that
__u32?  I'm not liking the usage of phantom struct members.. e.g.
the games played with accessing them.
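
For context, the layout being discussed is consumed roughly like this on
the userspace side (a sketch; align8() stands in for the alignment
helper, matching the kernel's 8-byte align_ptr(), and the flag macros
only exist with this patch applied):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <linux/dm-ioctl.h>

    static inline void *align8(const void *p)
    {
            return (void *)(((uintptr_t)p + 7) & ~(uintptr_t)7);
    }

    void walk_name_list(struct dm_name_list *nl)
    {
            for (;;) {
                    uint32_t *ev = align8(nl->name + strlen(nl->name) + 1);

                    printf("%s event_nr=%u", nl->name, ev[0]);
                    if (ev[1] & DM_NAME_LIST_FLAG_HAS_UUID)
                            printf(" uuid=%s", (char *)align8(ev + 2));
                    printf("\n");

                    if (!nl->next)
                            break;
                    nl = (void *)((char *)nl + nl->next);
            }
    }

An old kernel leaves ev[1] as zeroed padding, so neither flag is set and
the caller knows UUIDs are simply unavailable.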

Mike

>  
> +#define DM_NAME_LIST_FLAG_HAS_UUID   1
> +#define DM_NAME_LIST_FLAG_DOESNT_HAVE_UUID   2
> +
>  /*
>   * Used to retrieve the target versions
>   */

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 2/6] dm crypt: Handle DM_CRYPT_NO_*_WORKQUEUE more explicit.

2021-03-11 Thread Mike Snitzer
On Sat, Feb 13 2021 at  9:31am -0500,
Ignat Korchagin  wrote:

> On Sat, Feb 13, 2021 at 11:11 AM Sebastian Andrzej Siewior
>  wrote:
> >
> > By looking at the handling of DM_CRYPT_NO_*_WORKQUEUE in
> > kcryptd_queue_crypt() it appears that READ and WRITE requests might be
> > handled in the tasklet context as long as interrupts are disabled or it
> > is handled in hardirq context.
> >
> > The WRITE requests should always be fed in preemptible context. There
> > are other requirements in the write path which sleep or acquire a mutex.
> >
> > The READ requests should come from the storage driver, likely not in a
> > preemptible context. The source of the requests depends on the driver
> > and other factors like multiple queues in the block layer.
> 
> My personal opinion: I really don't like the guesswork and
> assumptions. If we want
> to remove the usage of in_*irq() and alike, we should propagate the execution
> context from the source. Storage drivers have this information and can
> pass it on to the device-mapper framework, which in turn can pass it
> on to dm modules.

I'm missing where DM core has the opportunity to convey this context in
a clean manner.

Any quick patch that shows the type of transform you'd like to see would
be appreciated.. doesn't need to be comprehensive, just enough for me or
others to carry through to completion.
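
Purely as a strawman of "propagating the execution context from the
source" (nothing like this exists today; the enum and the extra
parameter are invented, while the dm-crypt internals are real):

    enum dm_completion_ctx {
            DM_CTX_PREEMPTIBLE,     /* safe to do the crypto inline */
            DM_CTX_ATOMIC,          /* hardirq/softirq: defer to tasklet */
    };

    static void kcryptd_queue_crypt(struct dm_crypt_io *io,
                                    enum dm_completion_ctx ctx)
    {
            if (ctx == DM_CTX_ATOMIC) {
                    tasklet_init(&io->tasklet, kcryptd_crypt_tasklet,
                                 (unsigned long)&io->work);
                    tasklet_schedule(&io->tasklet);
                    return;
            }
            /* the submitter vouched for a preemptible context */
            kcryptd_crypt(&io->work);
    }

The hard part Ignat points at is the plumbing: 'ctx' has to originate in
the storage driver's completion path and travel through block core and
DM core before dm-crypt sees it, and there is currently no clean hook
for that.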
 
> Assuming WRITE requests are always in preemptible context might break with the
> addition of some new type of obscure storage hardware.
> 
> In our testing we saw a lot of cases with SATA disks, where READ requests come
> from preemptible contexts, so probably don't want to pay (no matter how small)
> tasklet setup overhead, not to mention executing it in softirq, which
> is hard later to
> attribute to a specific process in metrics.
> 
> In other words, I think we should be providing support for this in the
> device-mapper
> framework itself, not start from individual modules.

I think your concerns are valid... it does seem like this patch is
assuming too much.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm table: Fix zoned model check and zone sectors check

2021-03-11 Thread Mike Snitzer
On Wed, Mar 10 2021 at  3:25am -0500,
Shin'ichiro Kawasaki  wrote:

> Commit 24f6b6036c9e ("dm table: fix zoned iterate_devices based device
> capability checks") triggered dm table load failure when dm-zoned device
> is set up for zoned block devices and a regular device for cache.
> 
> The commit inverted logic of two callback functions for iterate_devices:
> device_is_zoned_model() and device_matches_zone_sectors(). The logic of
> device_is_zoned_model() was inverted so that all destination devices of all
> targets in dm table are required to have the expected zoned model. This
> is fine for dm-linear, dm-flakey and dm-crypt on zoned block devices
> since each target has only one destination device. However, this results
> in failure for dm-zoned with regular cache device since that target has
> both regular block device and zoned block devices.
> 
> As for device_matches_zone_sectors(), the commit inverted the logic to
> require all zoned block devices in each target have the specified
> zone_sectors. This check also fails for regular block device which does
> not have zones.
> 
> To avoid the check failures, fix the zone model check and the zone
> sectors check. For zone model check, invert the device_is_zoned_model()
> logic again to require at least one destination device in one target has
> the specified zoned model. For zone sectors check, skip the check if the
> destination device is not a zoned block device. Also add comments and
> improve error messages to clarify expectations to the two checks.
> 
> Signed-off-by: Shin'ichiro Kawasaki 
> Fixes: 24f6b6036c9e ("dm table: fix zoned iterate_devices based device 
> capability checks")
> ---
>  drivers/md/dm-table.c | 21 +++--
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 95391f78b8d5..04b7a3978ef8 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1585,13 +1585,13 @@ bool dm_table_has_no_data_devices(struct dm_table 
> *table)
>   return true;
>  }
>  
> -static int device_not_zoned_model(struct dm_target *ti, struct dm_dev *dev,
> -   sector_t start, sector_t len, void *data)
> +static int device_is_zoned_model(struct dm_target *ti, struct dm_dev *dev,
> +  sector_t start, sector_t len, void *data)
>  {
>   struct request_queue *q = bdev_get_queue(dev->bdev);
>   enum blk_zoned_model *zoned_model = data;
>  
> - return blk_queue_zoned_model(q) != *zoned_model;
> + return blk_queue_zoned_model(q) == *zoned_model;
>  }
>  
>  static bool dm_table_supports_zoned_model(struct dm_table *t,
> @@ -1608,7 +1608,7 @@ static bool dm_table_supports_zoned_model(struct 
> dm_table *t,
>   return false;
>  
>   if (!ti->type->iterate_devices ||
> - ti->type->iterate_devices(ti, device_not_zoned_model, 
> > &zoned_model))
> + !ti->type->iterate_devices(ti, device_is_zoned_model, 
> > &zoned_model))
>   return false;
>   }

The point here is to ensure all zoned devices match the specific model,
right?

I understand commit 24f6b6036c9e wasn't correct, sorry about that.
But I don't think your change is correct either.  It'll allow a mix of
various zoned models (that might come after the first positive match for
the specified zoned_model)... but because the first match short-circuits
the loop those later mismatched zoned devices aren't checked.

Should device_is_zoned_model() also be trained to ignore BLK_ZONED_NONE
(like you did below)?

But _not_ invert the logic, so keep device_not_zoned_model.. otherwise
the first positive return of a match will short-circuit checking all
other devices match.
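
The underlying subtlety is the iterate_devices convention itself: the
iterator returns non-zero as soon as any callback invocation returns
non-zero. So the two polarities answer different questions
(illustration):

    /*
     * iterate_devices(ti, device_not_zoned_model, &m) == 0
     *      => no device failed the test => ALL devices match m
     *
     * iterate_devices(ti, device_is_zoned_model, &m) != 0
     *      => some device passed the test => AT LEAST ONE matches m,
     *         and iteration stops right there, so a later mismatched
     *         device is never even examined.
     */

Which is why an "all devices match" property has to be expressed with a
negated callback.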

>  
> @@ -1621,9 +1621,18 @@ static int device_not_matches_zone_sectors(struct 
> dm_target *ti, struct dm_dev *
>   struct request_queue *q = bdev_get_queue(dev->bdev);
>   unsigned int *zone_sectors = data;
>  
> + if (blk_queue_zoned_model(q) == BLK_ZONED_NONE)
> + return 0;
> +
>   return blk_queue_zone_sectors(q) != *zone_sectors;
>  }

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 10/12] block: fastpath for bio-based polling

2021-03-10 Thread Mike Snitzer
On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu  wrote:

> Offer one fastpath for bio-based polling when bio submitted to dm
> device is not split.
> 
> In this case, there will be only one bio submitted to only one polling
> hw queue of one underlying mq device, and thus we don't need to track
> all split bios or iterate through all polling hw queues. The pointer to
> the polling hw queue the bio submitted to is returned here as the
> returned cookie. In this case, the polling routine will call
> mq_ops->poll() directly with the hw queue converted from the input
> cookie.
> 
> If the original bio submitted to dm device is split to multiple bios and
> thus submitted to multiple polling hw queues, the polling routine will
> fall back to iterating all hw queues (in polling mode) of all underlying
> mq devices.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  block/blk-core.c  | 73 +--
>  include/linux/blk_types.h | 66 +--
>  include/linux/types.h |  2 +-
>  3 files changed, 135 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6d7d53030d7c..e5cd4ff08f5c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  {
>   struct bio_list bio_list_on_stack[2];
>   blk_qc_t ret = BLK_QC_T_NONE;
> + struct request_queue *top_q;
> + bool poll_on;
>  
>   BUG_ON(bio->bi_next);
>  
>   bio_list_init(&bio_list_on_stack[0]);
>   current->bio_list = bio_list_on_stack;
>  
> + top_q = bio->bi_bdev->bd_disk->queue;
> + poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
> +   (bio->bi_opf & REQ_HIPRI);
> +
>   do {
> - struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> + blk_qc_t cookie;
> + struct block_device *bdev = bio->bi_bdev;
> + struct request_queue *q = bdev->bd_disk->queue;
>   struct bio_list lower, same;
>  
>   if (unlikely(bio_queue_enter(bio) != 0))
> @@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>   bio_list_on_stack[1] = bio_list_on_stack[0];
>   bio_list_init(&bio_list_on_stack[0]);
>  
> - ret = __submit_bio(bio);
> + cookie = __submit_bio(bio);
> +
> + if (poll_on && blk_qc_t_valid(cookie)) {
> + unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
> + unsigned int devt = bdev_whole(bdev)->bd_dev;
> +
> + cookie = blk_qc_t_get_by_devt(devt, queue_num);

The need to rebuild the cookie here is pretty awkward.  This
optimization living in block core may be worthwhile but the duality of
block core conditionally overriding the driver's returned cookie (that
is meant to be opaque to upper layer) is not great.
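
For reference, the helper names imply a cookie of roughly this shape
(the exact bit split is an assumption; it is not visible in the quoted
hunks, only that blk_qc_t is widened by the types.h change in the
diffstat):

    typedef u64 blk_qc_t;   /* sketch: widened from 32 bits */

    static inline blk_qc_t blk_qc_t_get_by_devt(unsigned int devt,
                                                unsigned int queue_num)
    {
            /* whole-disk dev_t in the high half, polling hctx in the low */
            return ((u64)devt << 32) | queue_num;
    }

plus one reserved value (BLK_QC_T_BIO_POLL_ALL, used below) that means
the bio was split across hctxs and the poller must fall back to
iterating all of them.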

> +
> + if (!blk_qc_t_valid(ret)) {
> + /* set initial value */
> + ret = cookie;
> + } else if (ret != cookie) {
> + /* bio gets split and enqueued to multi hctxs */
> + ret = BLK_QC_T_BIO_POLL_ALL;
> + poll_on = false;
> + }
> + }
>  
>   /*
>* Sort new bios into those for a lower level and those for the
> @@ -989,6 +1013,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>   } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>  
>   current->bio_list = NULL;
> +
>   return ret;
>  }

Nit: Not seeing need to alter white space here.

>  
> @@ -1119,6 +1144,44 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +static int blk_poll_bio(blk_qc_t cookie)
> +{
> + unsigned int devt = blk_qc_t_to_devt(cookie);
> + unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
> + struct block_device *bdev;
> + struct request_queue *q;
> + struct blk_mq_hw_ctx *hctx;
> + int ret;
> +
> + bdev = blkdev_get_no_open(devt);

As you pointed out to me in private, but for the benefit of others,
blkdev_get_no_open()'s need to take inode lock is not ideal here.

> +
> + /*
> +  * One such case is that dm device has reloaded table and the original
> +  * underlying device the bio submitted to has been detached. When
> +  * reloading table, dm will ensure that previously submitted IOs have
> +  * all completed, thus return directly here.
> +  */
> + if (!bdev)
> + return 1;
> +
> + q = bdev->bd_disk->queue;
> + hctx = q->queue_hw_ctx[queue_num];
> +
> + /*
> +  * Similar to the case described in the above comment, that dm device
> +  * has reloaded table and the original underlying device the bio
> +  * submitted to has been detached. Thus the dev_t stored in cookie may
> +  * be reused by 

Re: [dm-devel] [PATCH v5 04/12] block: add poll_capable method to support bio-based IO polling

2021-03-10 Thread Mike Snitzer
On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu  wrote:

> This method can be used to check if bio-based device supports IO polling
> or not. For mq devices, checking for hw queue in polling mode is
> adequate, while the sanity check shall be implementation specific for
> bio-based devices. For example, dm device needs to check if all
> underlying devices are capable of IO polling.
> 
> Though bio-based device may have done the sanity check during the
> device initialization phase, cacheing the result of this sanity check
> (such as by cacheing in the queue_flags) may not work. Because for dm

s/cacheing/caching/

> devices, users could change the state of the underlying devices through
> '/sys/block//io_poll', bypassing the dm device above. In this case,
> the cached result of the very beginning sanity check could be
> out-of-date. Thus the sanity check needs to be done every time 'io_poll'
> is to be modified.
> 
> Signed-off-by: Jeffle Xu 

Ideally QUEUE_FLAG_POLL would be authoritative.. but I appreciate the
problem you've described.  Though I do wonder if this should be solved
by bio-based's fops->poll() method clearing the request_queue's
QUEUE_FLAG_POLL if it finds an underlying device doesn't have
QUEUE_FLAG_POLL set.  Though making bio-based's fops->poll() always need
to validate the an underlying device does support polling is pretty
unfortunate.

Either way, queue_poll_store() will need to avoid blk-mq specific poll
checking for bio-based devices.
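
For dm, that ->poll_capable() check would presumably be the usual
iterate_devices pattern (a sketch, not the actual dm patch from this
series):

    static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
                                       sector_t start, sector_t len, void *data)
    {
            struct request_queue *q = bdev_get_queue(dev->bdev);

            return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
    }

i.e. every underlying request queue has to advertise QUEUE_FLAG_POLL,
re-evaluated each time io_poll is written, for exactly the reason given
in the changelog.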

Mike

> ---
>  block/blk-sysfs.c  | 14 +++---
>  include/linux/blkdev.h |  1 +
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0f4f0c8a7825..367c1d9a55c6 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, 
> const char *page,
>   unsigned long poll_on;
>   ssize_t ret;
>  
> - if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
> - !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
> - return -EINVAL;
> + if (queue_is_mq(q)) {
> + if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
> + !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
> + return -EINVAL;
> + } else {
> + struct gendisk *disk = queue_to_disk(q);
> +
> + if (!disk->fops->poll_capable ||
> + !disk->fops->poll_capable(disk))
> + return -EINVAL;
> + }
>  
>   ret = queue_var_store(&poll_on, page, count);
>   if (ret < 0)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 9dc83c30e7bc..7df40792c032 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1867,6 +1867,7 @@ static inline void blk_ksm_unregister(struct 
> request_queue *q) { }
>  struct block_device_operations {
>   blk_qc_t (*submit_bio) (struct bio *bio);
>   int (*poll)(struct request_queue *q, blk_qc_t cookie);
> + bool (*poll_capable)(struct gendisk *disk);
>   int (*open) (struct block_device *, fmode_t);
>   void (*release) (struct gendisk *, fmode_t);
>   int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned 
> int);
> -- 
> 2.27.0
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 03/12] block: add poll method to support bio-based IO polling

2021-03-10 Thread Mike Snitzer
On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu  wrote:

> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") to support bio-based queues such as
> nvme multipath, but was later removed in commit 529262d56dbe ("block:
> remove ->poll_fn").
> 
> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> block_device_operations") restore the possibility of bio-based IO
> polling support by adding an ->poll method to gendisk->fops.
> 
> Make blk_mq_poll() specific to blk-mq, while blk_bio_poll() is
> originally a copy from blk_mq_poll(), and is specific to bio-based
> polling. Currently hybrid polling is not supported by bio-based polling.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  block/blk-core.c   | 58 ++
>  block/blk-mq.c | 22 +---
>  include/linux/blk-mq.h |  1 +
>  include/linux/blkdev.h |  1 +
>  4 files changed, 61 insertions(+), 21 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..6d7d53030d7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1119,6 +1119,64 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +

Minor nit: Extra empty new line here? ^

Otherwise, looks good (I like the end result of blk-mq and bio-based
polling being decoupled like hch suggested).

Reviewed-by: Mike Snitzer 

> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> + long state;
> + struct gendisk *disk = queue_to_disk(q);
> +
> + state = current->state;
> + do {
> + int ret;
> +
> + ret = disk->fops->poll(q, cookie);
> + if (ret > 0) {
> + __set_current_state(TASK_RUNNING);
> + return ret;
> + }
> +
> + if (signal_pending_state(state, current))
> + __set_current_state(TASK_RUNNING);
> +
> + if (current->state == TASK_RUNNING)
> + return 1;
> + if (ret < 0 || !spin)
> + break;
> + cpu_relax();
> + } while (!need_resched());
> +
> + __set_current_state(TASK_RUNNING);
> + return 0;
> +}
> +
> +/**
> + * blk_poll - poll for IO completions
> + * @q:  the queue
> + * @cookie: cookie passed back at IO submission time
> + * @spin: whether to spin for completions
> + *
> + * Description:
> + *Poll for completions on the passed in queue. Returns number of
> + *completed entries found. If @spin is true, then blk_poll will continue
> + *looping until at least one completion is found, unless the task is
> + *otherwise marked running (or we need to reschedule).
> + */
> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> + if (!blk_qc_t_valid(cookie) ||
> + !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> + return 0;
> +
> + if (current->plug)
> + blk_flush_plug_list(current->plug, false);
> +
> + if (queue_is_mq(q))
> + return blk_mq_poll(q, cookie, spin);
> + else
> + return blk_bio_poll(q, cookie, spin);
> +}
> +EXPORT_SYMBOL_GPL(blk_poll);
> +
>  /**
>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>   *  for the new queue limits
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..214fa30b460a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3852,30 +3852,11 @@ static bool blk_mq_poll_hybrid(struct request_queue 
> *q,
>   return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> -/**
> - * blk_poll - poll for IO completions
> - * @q:  the queue
> - * @cookie: cookie passed back at IO submission time
> - * @spin: whether to spin for completions
> - *
> - * Description:
> - *Poll for completions on the passed in queue. Returns number of
> - *completed entries found. If @spin is true, then blk_poll will continue
> - *looping until at least one completion is found, unless the task is
> - *otherwise marked running (or we need to reschedule).
> - */
> -int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
>   struct blk_mq_hw_ctx *hctx;
>   long state;
>  
> - if (!blk_qc_t_valid(cookie) ||
> - !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> - return 0;
> -
> - if (current->plug)
> - blk_flush_plug_list(current->plug, false);

Re: [dm-devel] [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue

2021-03-10 Thread Mike Snitzer
On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu  wrote:

> There's no sense waiting for the hw queue when it is currently locked
> by another polling instance. The polling instance currently occupying
> the hw queue will help reap the completion events.
> 
> It is safe to surrender the hw queue, as long as we can reapply for
> polling later. For synchronous polling, blk_poll() will reapply for
> polling, since @spin is always true in this case. For asynchronous
> polling, io_uring itself will reapply for polling when the previous
> poll returns 0.
> 
> Besides, it shall do no harm to the polling performance of mq devices.
> 
> Signed-off-by: Jeffle Xu 

You should probably just send this to the linux-nvme list independent of
this patchset.

Mike


> ---
>  drivers/nvme/host/pci.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 38b0d694dfc9..150e56ed6d15 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1106,7 +1106,9 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx)
>   if (!nvme_cqe_pending(nvmeq))
>   return 0;
>  
> - spin_lock(&nvmeq->cq_poll_lock);
> + if (!spin_trylock(&nvmeq->cq_poll_lock))
> + return 0;
> +
>   found = nvme_process_cq(nvmeq);
>   spin_unlock(&nvmeq->cq_poll_lock);
>  
> -- 
> 2.27.0
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 00/11] dm: support IO polling

2021-03-10 Thread Mike Snitzer
On Mon, Feb 08 2021 at  3:52am -0500,
Jeffle Xu  wrote:

> 
> [Performance]
> 1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
> 
>  | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>  | --- |  | 
>   dm |  208k | 279k | ~34%
> 
> 
> 2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
> 
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> needs to compete for every polling hw queue.
> 
>  | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>  | --- |  | 
>   dm |  615k | 728k | ~18%
> nvme |  728k | 873k | ~20%
> 
> The result shows that the performance gain of bio-based polling is
> comparable as that of mq polling in the same test case.
> 
> 
> 3. Three threads (numjobs=3) randread (bs=12k, direct=1) one
> **dm-stripe** device, which is built upon 3 nvme devices, with one
> polling hw queue per nvme device.
> 
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> needs to compete for every polling hw queue.
> 
>  | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>  | --- |  | 
>   dm |  314k | 354k | ~13%
> nvme |  728k | 873k | ~20%
> 

So this "3." case is meant to illustrate effects of polling when bio is
split to 3 different underlying blk-mq devices. (think it odd that
dm-stripe across 3 nvme devices is performing so much worse than a
single nvme device)
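
As a concrete illustration (the stripe chunk size is my assumption; it
isn't stated above), a 12k bio at a chunk-aligned offset on a 3-way
stripe with 4k chunks fans out into exactly one 4k bio per nvme device:

```
# hypothetical dm-stripe table: <start> <len> striped <#stripes> <chunk_sectors> <dev> <off>...
# 8 sectors = 4k chunks, so every aligned 12k read touches all 3 polling hw queues
0 6291456 striped 3 8 /dev/nvme0n1 0 /dev/nvme1n1 0 /dev/nvme2n1 0
```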

Would be nice to see comparison of this same workload but _with_ CPU
affinity to show relative benefit of your patch 11 in this series (where
you try to leverage CPU pinning).

In general, I don't think patch 11 a worthwhile effort.  It is a
half-measure that is trying to paper over the fact that this bio-based
IO polling patchset is quite coarse grained about how it copes with the
need to poll multiple devices.

Patch 10 is a proper special case that should be accounted for (when a
bio isn't split and only gets submitted to a single blk-mq
device/hctx).  But even patch 10's approach is fragile (as we've
discussed in private, and I'll touch on it in reply to patch 10).

But I think patch 11 should be dropped and we defer optimizing bio
splitting at a later time with follow-on work.

Just so you're aware, I'm now thinking the proper way forward is to
update DM core, at the time of DM table load, to assemble an array of
underlying _data_ devices in that table (as iterated with
.iterate_devices) -- this would allow each underlying data device to be
assigned an immutable index for the lifetime of a DM table.  It'd be
hooked off the 'struct dm_table' and would share that object's
lifetime.

With that bit of infrastructure in place, we could then take steps to
make DM's cookie more dense in its description of underlying devices
that need polling.  This is where I'll get a bit handwavy.. but I
raised this idea with Joe Thornber and he is going to have a think about
it too.

But this is all to say: optimizing the complex case of bio-splitting
that is such an integral part of bio-based IO processing needs more than
what you've attempted to do (noble effort on your part but again, really
just a half-measure).

So I think it best to keep the initial implementation of bio-based
polling relatively simple by laying foundation for follow-on work.  And
it is also important to _not_ encode in block core some meaning to what
_should_ be a largely opaque cookie (blk_qc_t) that is for the
underlying driver to make sense of.
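
To make that less handwavy, a rough sketch of the idea (all names below
are hypothetical, nothing upstream):

```c
/* Sketch only: collect each underlying data device at table load time so
 * that it gets an index that is immutable for the dm_table's lifetime. */
struct dm_dev_array {
	unsigned int count;
	struct dm_dev *devs[];	/* devs[i] stable until the table is torn down */
};

static int dm_collect_data_dev(struct dm_target *ti, struct dm_dev *dev,
			       sector_t start, sector_t len, void *data)
{
	struct dm_dev_array *arr = data;

	arr->devs[arr->count++] = dev;	/* assign the next immutable index */
	return 0;
}

/* for each target: ti->type->iterate_devices(ti, dm_collect_data_dev, arr) */
```

A denser cookie could then, e.g., encode a small bitmap of these indices
instead of one opaque per-device value.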


> 
> 4. This patchset shall do no harm to the performance of the original mq
> polling. Following is the test results of one thread (numjobs=1)
> randread (bs=4k, direct=1) one nvme device.
> 
>| IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
>  | --- |  | 
> without patchset |  242k | 332k | ~39%
> with patchset|  236k | 332k | ~39%

OK, good, this needs to be the case.

> 
> [Changes since v2]
> 
> Patchset v2 caches all hw queues (in polling mode) of underlying mq
> devices in dm layer. The polling routine actually iterates through all
> these cached hw queues.
> 
> However, mq may change the queue 

Re: [dm-devel] dm: remove unneeded variable 'sz'

2021-03-09 Thread Mike Snitzer
On Tue, Mar 09 2021 at  4:32am -0500,
Yang Li  wrote:

> Fix the following coccicheck warning:
> ./drivers/md/dm-ps-service-time.c:85:10-12: Unneeded variable: "sz".
> Return "0" on line 105
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 

This type of change gets proposed regularly.  Would appreciate it if
you could fix coccicheck to not get this wrong.  The local 'sz' variable
is used by the DMEMIT macro (as the earlier reply to this email informed
you).
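
For reference, DMEMIT's definition in include/linux/device-mapper.h is
roughly:

```c
/* appends formatted output to 'result'; note that it references the local
 * variables 'sz' and 'maxlen' of the status function it is expanded in */
#define DMEMIT(x...) sz += ((sz >= maxlen) ? \
			  0 : scnprintf(result + sz, maxlen - sz, x))
```

so removing the 'sz' local breaks every DMEMIT call in st_status().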

Also, had you tried to compile the code with your patch applied you'd
have quickly realized your patch wasn't correct.

Mike


> ---
>  drivers/md/dm-ps-service-time.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-ps-service-time.c b/drivers/md/dm-ps-service-time.c
> index 9cfda66..12dd5ce 100644
> --- a/drivers/md/dm-ps-service-time.c
> +++ b/drivers/md/dm-ps-service-time.c
> @@ -82,7 +82,6 @@ static void st_destroy(struct path_selector *ps)
>  static int st_status(struct path_selector *ps, struct dm_path *path,
>status_type_t type, char *result, unsigned maxlen)
>  {
> - unsigned sz = 0;
>   struct path_info *pi;
>  
>   if (!path)
> @@ -102,7 +101,7 @@ static int st_status(struct path_selector *ps, struct 
> dm_path *path,
>   }
>   }
>  
> - return sz;
> + return 0;
>  }
>  
>  static int st_add_path(struct path_selector *ps, struct dm_path *path,
> -- 
> 1.8.3.1
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [git pull] device mapper fixes for 5.12-rc2

2021-03-05 Thread Mike Snitzer
Hi Linus,

The following changes since commit a666e5c05e7c4aaabb2c5d58117b0946803d03d2:

  dm: fix deadlock when swapping to encrypted device (2021-02-11 09:45:28 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.12/dm-fixes

for you to fetch changes up to df7b59ba9245c4a3115ebaa905e3e5719a3810da:

  dm verity: fix FEC for RS roots unaligned to block size (2021-03-04 15:08:18 
-0500)

Please pull, thanks.
Mike


Fix DM verity target's optional Forward Error Correction (FEC) for
Reed-Solomon roots that are unaligned to block size.


Mikulas Patocka (1):
  dm bufio: subtract the number of initial sectors in 
dm_bufio_get_device_size

Milan Broz (1):
  dm verity: fix FEC for RS roots unaligned to block size

 drivers/md/dm-bufio.c  |  4 
 drivers/md/dm-verity-fec.c | 23 ---
 2 files changed, 16 insertions(+), 11 deletions(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 4/4] dm: support I/O polling

2021-03-05 Thread Mike Snitzer
On Fri, Mar 05 2021 at 12:56pm -0500,
Heinz Mauelshagen  wrote:

> 
> On 3/5/21 6:46 PM, Heinz Mauelshagen wrote:
> >On 3/5/21 10:52 AM, JeffleXu wrote:
> >>
> >>On 3/3/21 6:09 PM, Mikulas Patocka wrote:
> >>>
> >>>On Wed, 3 Mar 2021, JeffleXu wrote:
> >>>
> 
> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
> 
> >Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> >cookie.
> >
> >Signed-off-by: Mikulas Patocka 
> >
> >---
> >  drivers/md/dm.c |    5 +
> >  1 file changed, 5 insertions(+)
> >
> >Index: linux-2.6/drivers/md/dm.c
> >===
> >--- linux-2.6.orig/drivers/md/dm.c    2021-03-02
> >19:26:34.0 +0100
> >+++ linux-2.6/drivers/md/dm.c    2021-03-02 19:26:34.0 +0100
> >@@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
> >  }
> >  }
> >  +    if (ci.poll_cookie != BLK_QC_T_NONE) {
> >+    while (atomic_read(&ci.io->io_count) > 1 &&
> >+   blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> >+    }
> >+
> >  /* drop the extra reference count */
> >  dec_pending(ci.io, errno_to_blk_status(error));
> >  }
> It seems that the general idea of your design is to
> 1) submit *one* split bio
> 2) blk_poll(), waiting until the previously submitted split bio completes
> >>>No, I submit all the bios and poll for the last one.
> >>>
> and then submit next split bio, repeating the above process.
> I'm afraid
> the performance may be an issue here, since the batch every time
> blk_poll() reaps may decrease.
> >>>Could you benchmark it?
> >>I only tested dm-linear.
> >>
> >>The configuration (dm table) of dm-linear is:
> >>0 1048576 linear /dev/nvme0n1 0
> >>1048576 1048576 linear /dev/nvme2n1 0
> >>2097152 1048576 linear /dev/nvme5n1 0
> >>
> >>
> >>fio script used is:
> >>```
> >>$cat fio.conf
> >>[global]
> >>name=iouring-sqpoll-iopoll-1
> >>ioengine=io_uring
> >>iodepth=128
> >>numjobs=1
> >>thread
> >>rw=randread
> >>direct=1
> >>registerfiles=1
> >>hipri=1
> >>runtime=10
> >>time_based
> >>group_reporting
> >>randrepeat=0
> >>filename=/dev/mapper/testdev
> >>bs=4k
> >>
> >>[job-1]
> >>cpus_allowed=14
> >>```
> >>
> >>IOPS (IRQ mode) | IOPS (iopoll mode (hipri=1))
> >>--- | 
> >>    213k |   19k
> >>
> >>At least, it doesn't work well with io_uring interface.
> >>
> >>
> >
> >
> >Jeffle,
> >
> >I ran your above fio test on a linear LV split across 3 NVMes to
> >second your split mapping
> >(system: 32 core Intel, 256GiB RAM) comparing io engines sync,
> >libaio and io_uring,
> >the latter w/ and w/o hipri (sync+libaio obviously w/o
> >registerfiles and hipri) which resulted ok:
> >
> >
> >
> >sync  |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
> >--|--|-|- 56.3K
> >|    290K  |    329K | 351K I can't second
> >your drastic hipri=1 drop here...
> 
> 
> Sorry, email mess.
> 
> 
> sync   |  libaio  |  IRQ mode (hipri=0) | iopoll (hipri=1)
> ---|--|-|-
> 56.3K  |    290K  |    329K | 351K
> 
> 
> 
> I can't second your drastic hipri=1 drop here...

I think your result is just showcasing your powerful system's ability to
poll every related HW queue.. whereas Jeffle's system is likely somehow
more constrained (on a cpu level, memory, whatever).

My basis for this is that Mikulas' changes simply always return an
invalid cookie (BLK_QC_T_NONE) for purposes of intelligent IO polling.

Such an implementation is completely invalid.

I discussed with Jens and he said:
"it needs to return something that f_op->iopoll() can make sense of.
otherwise you have no option but to try everything."
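
To spell that out (simplified from include/linux/blk_types.h):

```c
/* BLK_QC_T_NONE is the "no cookie" sentinel; blk_poll() treats it as
 * invalid and returns 0 without ever calling into the driver */
#define BLK_QC_T_NONE		-1U

static inline bool blk_qc_t_valid(blk_qc_t cookie)
{
	return cookie != BLK_QC_T_NONE;
}
```

So a submission path that only ever surfaces BLK_QC_T_NONE gives
f_op->iopoll() nothing to work with -- callers can only spin without
ever reaping a completion.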

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 4/4] dm: support I/O polling

2021-03-04 Thread Mike Snitzer
On Thu, Mar 04 2021 at 10:01am -0500,
Jeff Moyer  wrote:

> Hi, Mikulas,
> 
> Mikulas Patocka  writes:
> 
> > On Wed, 3 Mar 2021, JeffleXu wrote:
> >
> >> 
> >> 
> >> On 3/3/21 3:05 AM, Mikulas Patocka wrote:
> >> 
> >> > Support I/O polling if submit_bio_noacct_mq_direct returned non-empty
> >> > cookie.
> >> > 
> >> > Signed-off-by: Mikulas Patocka 
> >> > 
> >> > ---
> >> >  drivers/md/dm.c |5 +
> >> >  1 file changed, 5 insertions(+)
> >> > 
> >> > Index: linux-2.6/drivers/md/dm.c
> >> > ===
> >> > --- linux-2.6.orig/drivers/md/dm.c   2021-03-02 19:26:34.0 
> >> > +0100
> >> > +++ linux-2.6/drivers/md/dm.c2021-03-02 19:26:34.0 +0100
> >> > @@ -1682,6 +1682,11 @@ static void __split_and_process_bio(stru
> >> >  }
> >> >  }
> >> >  
> >> > +if (ci.poll_cookie != BLK_QC_T_NONE) {
> >> > +while (atomic_read(&ci.io->io_count) > 1 &&
> >> > +   blk_poll(ci.poll_queue, ci.poll_cookie, true)) ;
> >> > +}
> >> > +
> >> >  /* drop the extra reference count */
> >> >  dec_pending(ci.io, errno_to_blk_status(error));
> >> >  }
> >> 
> >> It seems that the general idea of your design is to
> >> 1) submit *one* split bio
> >> 2) blk_poll(), waiting until the previously submitted split bio completes
> >
> > No, I submit all the bios and poll for the last one.
> 
> What happens if the last bio completes first?  It looks like you will
> call blk_poll with a cookie that already completed, and I'm pretty sure
> that's invalid.

In addition, I'm concerned this approach to have DM internalize IO
polling is a non-starter.

I just don't think this approach adheres to the io_uring + IO polling
interface.. it never returns a cookie to upper layers... so there is
really no opportunity for the standard io_uring + IO polling interface to
work, is there?
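
(For context, the contract being bypassed is roughly:

```c
/* simplified sketch: submission hands back a driver-meaningful cookie... */
blk_qc_t cookie = submit_bio(bio);

/* ...which io_uring later feeds back to blk_poll() via f_op->iopoll */
int ret = blk_poll(bdev_get_queue(bdev), cookie, true);
```

Never surfacing a real cookie cuts that loop.)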

But Heinz and Mikulas are about to kick off some fio io_uring + hipri=1
(io polling) testing of Jeffle's latest v5 patchset:
https://patchwork.kernel.org/project/dm-devel/list/?series=442075

compared to Mikulas' patchset:
https://patchwork.kernel.org/project/dm-devel/list/?series=440719

We should have definitive answers soon enough, just using Jeffle's fio
config (with hipri=1 for IO polling) that was documented here:
https://listman.redhat.com/archives/dm-devel/2020-October/msg00129.html

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v4 00/12] dm: support IO polling

2021-02-23 Thread Mike Snitzer
On Mon, Feb 22 2021 at 10:55pm -0500,
JeffleXu  wrote:

> 
> 
> On 2/20/21 7:06 PM, Jeffle Xu wrote:
> > [Changes since v3]
> > - newly add patch 7 and patch 11, as a new optimization improving
> > performance of multiple polling processes. Now performance of multiple
> > polling processes can be as scalable as single polling process (~30%).
> > Refer to the following [Performance] chapter for more details.
> > 
> 
> Hi Mike, would you please evaluate this new version of the patch set? I think this
> mechanism is near maturity, since multi-thread performance is as
> scalable as single-thread (~30%) now.

OK, can do. But first I think you need to repost with a v5 that
addresses Mikulas' v3 feedback:

https://listman.redhat.com/archives/dm-devel/2021-February/msg00254.html
https://listman.redhat.com/archives/dm-devel/2021-February/msg00255.html

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [git pull] device mapper changes for 5.12

2021-02-17 Thread Mike Snitzer
Hi Linus,

These DM changes happen to be based on linux-block from a few weeks ago
(but an expected DM dependency on block turned out to not be needed). 
And the few block/keyslot-manager changes are accompanied by Jens'
Acked-by.

The following changes since commit 8358c28a5d44bf0223a55a2334086c3707bb4185:

  block: fix memory leak of bvec (2021-02-02 08:57:56 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.12/dm-changes

for you to fetch changes up to a666e5c05e7c4aaabb2c5d58117b0946803d03d2:

  dm: fix deadlock when swapping to encrypted device (2021-02-11 09:45:28 -0500)

Please pull, thanks.
Mike


- Fix DM integrity's HMAC support to provide enhanced security of
  internal_hash and journal_mac capabilities.

- Various DM writecache fixes to address performance, fix table output
  to match what was provided at table creation, fix writing beyond end
  of device when shrinking underlying data device, and a couple other
  small cleanups.

- Add DM crypt support for using trusted keys.

- Fix deadlock when swapping to DM crypt device by throttling number
  of in-flight REQ_SWAP bios. Implemented in DM core so that other
  bio-based targets can opt-in by setting ti->limit_swap_bios.

- Fix various inverted logic bugs in the .iterate_devices callout
  functions that are used to assess if specific feature or capability
  is supported across all devices being combined/stacked by DM.

- Fix DM era target bugs that exposed users to lost writes or memory
  leaks.

- Add DM core support for passing through inline crypto support of
  underlying devices. Includes block/keyslot-manager changes that
  enable extending this support to DM.

- Various small fixes and cleanups (spelling fixes, front padding
  calculation cleanup, cleanup conditional zoned support in targets,
  etc).


Ahmad Fatoum (2):
  dm crypt: replaced #if defined with IS_ENABLED
  dm crypt: support using trusted keys

Colin Ian King (1):
  dm integrity: fix spelling mistake "flusing" -> "flushing"

Geert Uytterhoeven (1):
  dm crypt: Spelling s/cihper/cipher/

Jeffle Xu (5):
  dm: cleanup of front padding calculation
  dm table: fix iterate_devices based device capability checks
  dm table: fix DAX iterate_devices based device capability checks
  dm table: fix zoned iterate_devices based device capability checks
  dm table: remove needless request_queue NULL pointer checks

Jinoh Kang (1):
  dm persistent data: fix return type of shadow_root()

Mike Snitzer (2):
  dm writecache: use bdev_nr_sectors() instead of open-coded equivalent
  dm: simplify target code conditional on CONFIG_BLK_DEV_ZONED

Mikulas Patocka (5):
  dm integrity: introduce the "fix_hmac" argument
  dm writecache: fix performance degradation in ssd mode
  dm writecache: return the exact table values that were set
  dm writecache: fix writing beyond end of underlying device when shrinking
  dm: fix deadlock when swapping to encrypted device

Nikos Tsironis (7):
  dm era: Recover committed writeset after crash
  dm era: Update in-core bitset after committing the metadata
  dm era: Reinitialize bitset cache before digesting a new writeset
  dm era: Verify the data block size hasn't changed
  dm era: Fix bitset memory leaks
  dm era: Use correct value size in equality function of writeset tree
  dm era: only resize metadata in preresume

Satya Tangirala (5):
  block/keyslot-manager: Introduce passthrough keyslot manager
  block/keyslot-manager: Introduce functions for device mapper support
  dm: add support for passing through inline crypto support
  dm: support key eviction from keyslot managers of underlying devices
  dm: set DM_TARGET_PASSES_CRYPTO feature for some targets

Tian Tao (1):
  dm writecache: fix unnecessary NULL check warnings

Tom Rix (1):
  dm dust: remove h from printk format specifier

 .../admin-guide/device-mapper/dm-crypt.rst |   2 +-
 .../admin-guide/device-mapper/dm-integrity.rst |  11 +
 block/blk-crypto.c |   1 +
 block/keyslot-manager.c| 146 
 drivers/md/Kconfig |   1 +
 drivers/md/dm-core.h   |   9 +
 drivers/md/dm-crypt.c  |  39 +-
 drivers/md/dm-dust.c   |   2 +-
 drivers/md/dm-era-target.c |  93 +++--
 drivers/md/dm-flakey.c |   6 +-
 drivers/md/dm-integrity.c  | 140 +++-
 drivers/md/dm-linear.c |   8 +-
 drivers/md/dm-table.c  | 399 +

Re: [dm-devel] [PATCH 12/78] dm: use set_capacity_and_notify

2021-02-17 Thread Mike Snitzer
On Mon, Nov 16, 2020 at 10:05 AM Christoph Hellwig  wrote:
>
> Use set_capacity_and_notify to set the size of both the disk and block
> device.  This also gets the uevent notifications for the resize for free.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Hannes Reinecke 
> ---
>  drivers/md/dm.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index c18fc25485186d..62ad44925e73ec 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1971,8 +1971,7 @@ static struct dm_table *__bind(struct mapped_device 
> *md, struct dm_table *t,
> if (size != dm_get_size(md))
> memset(&md->geometry, 0, sizeof(md->geometry));
>
> -   set_capacity(md->disk, size);
> -   bd_set_nr_sectors(md->bdev, size);
> +   set_capacity_and_notify(md->disk, size);
>
> dm_table_event_callback(t, event_callback, md);
>

Not yet pinned down _why_ DM is calling set_capacity_and_notify() with
a size of 0 but, when running various DM regression tests, I'm seeing
a lot of noise like:

[  689.240037] dm-2: detected capacity change from 2097152 to 0

Is this pr_info really useful?  Should it be moved below the
if (!capacity || !size) check so that it only prints if a uevent is sent?
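
IOW, something like this (a sketch of the suggestion, not a tested patch):

```c
bool set_capacity_and_notify(struct gendisk *disk, sector_t size)
{
	sector_t capacity = get_capacity(disk);

	set_capacity(disk, size);

	/* no uevent for changes to/from an empty device, so don't log either */
	if (size == capacity || !capacity || !size)
		return false;

	pr_info("%s: detected capacity change from %lld to %lld\n",
		disk->disk_name, (long long)capacity, (long long)size);
	kobject_uevent(&disk_to_dev(disk)->kobj, KOBJ_CHANGE);
	return true;
}
```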

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 5/6] dm: add 'noexcl' option for dm-linear

2021-02-15 Thread Mike Snitzer
On Mon, Feb 15 2021 at  5:34am -0500,
Sergei Shtepa  wrote:

> The 02/12/2021 19:06, Mike Snitzer wrote:
> > On Fri, Feb 12 2021 at  6:34am -0500,
> > Sergei Shtepa  wrote:
> > 
> > > The 02/11/2021 20:51, Mike Snitzer wrote:
> > > > On Tue, Feb 09 2021 at  9:30am -0500,
> > > > Sergei Shtepa  wrote:
> > > > 
> > > > > The 'noexcl' option allow to open underlying block-device
> > > > > without FMODE_EXCL.
> > > > > 
> > > > > Signed-off-by: Sergei Shtepa 
> > > > > ---
> > > > >  drivers/md/dm-linear.c| 14 +-
> > > > >  drivers/md/dm-table.c | 14 --
> > > > >  drivers/md/dm.c   | 26 +++---
> > > > >  drivers/md/dm.h   |  2 +-
> > > > >  include/linux/device-mapper.h |  7 +++
> > > > >  5 files changed, 48 insertions(+), 15 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > > > > index 00774b5d7668..b16d89802b9d 100644
> > > > > --- a/drivers/md/dm-linear.c
> > > > > +++ b/drivers/md/dm-linear.c
> > > > > @@ -33,7 +33,7 @@ static int linear_ctr(struct dm_target *ti, 
> > > > > unsigned int argc, char **argv)
> > > > >   char dummy;
> > > > >   int ret;
> > > > >  
> > > > > - if (argc != 2) {
> > > > > + if ((argc < 2) || (argc > 3)) {
> > > > >   ti->error = "Invalid argument count";
> > > > >   return -EINVAL;
> > > > >   }
> > > > > @@ -51,6 +51,18 @@ static int linear_ctr(struct dm_target *ti, 
> > > > > unsigned int argc, char **argv)
> > > > >   }
> > > > >   lc->start = tmp;
> > > > >  
> > > > > + ti->non_exclusive = false;
> > > > > + if (argc > 2) {
> > > > > + if (strcmp("noexcl", argv[2]) == 0)
> > > > > + ti->non_exclusive = true;
> > > > > + else if (strcmp("excl", argv[2]) == 0)
> > > > > + ti->non_exclusive = false;
> > > > > + else {
> > > > > + ti->error = "Invalid exclusive option";
> > > > > + return -EINVAL;
> > > > > + }
> > > > > + }
> > > > > +
> > > > >   ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), 
> > > > > &lc->dev);
> > > > >   if (ret) {
> > > > >   ti->error = "Device lookup failed";
> > > > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > > > > index 4acf2342f7ad..f020459465bd 100644
> > > > > --- a/drivers/md/dm-table.c
> > > > > +++ b/drivers/md/dm-table.c
> > > > > @@ -322,7 +322,7 @@ static int device_area_is_invalid(struct 
> > > > > dm_target *ti, struct dm_dev *dev,
> > > > >   * device and not to touch the existing bdev field in case
> > > > >   * it is accessed concurrently.
> > > > >   */
> > > > > -static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
> > > > > +static int upgrade_mode(struct dm_dev_internal *dd, fmode_t 
> > > > > new_mode, bool non_exclusive,
> > > > >   struct mapped_device *md)
> > > > >  {
> > > > >   int r;
> > > > > @@ -330,8 +330,8 @@ static int upgrade_mode(struct dm_dev_internal 
> > > > > *dd, fmode_t new_mode,
> > > > >  
> > > > >   old_dev = dd->dm_dev;
> > > > >  
> > > > > - r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
> > > > > - dd->dm_dev->mode | new_mode, &new_dev);
> > > > > + r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, 
> > > > > dd->dm_dev->mode | new_mode,
> > > > > + non_exclusive, &new_dev);
> > > > >   if (r)
> > > > >   return r;
> > > > >  
> > > > > @@ -387,7 +387,8 @@ int dm_get_device(struct dm_target *t

Re: [dm-devel] [PATCH v5 4/6] dm: new ioctl DM_DEV_REMAP_CMD

2021-02-12 Thread Mike Snitzer
On Tue, Feb 09 2021 at  9:30am -0500,
Sergei Shtepa  wrote:

> New ioctl DM_DEV_REMAP_CMD allow to remap bio requests
> from regular block device to dm device.

I really dislike the (ab)use of "REMAP" for this. DM is and always has
been about remapping IO.  Would prefer DM_DEV_INTERPOSE_CMD

Similarly, all places documenting "remap" or variables with "remap"
changed to "interpose".

Also, any chance you'd be open to putting all these interposer specific
changes in dm-interposer.[ch] ?
(the various internal structs for DM core _should_ be available via dm-core.h)

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v5 5/6] dm: add 'noexcl' option for dm-linear

2021-02-12 Thread Mike Snitzer
On Fri, Feb 12 2021 at  6:34am -0500,
Sergei Shtepa  wrote:

> The 02/11/2021 20:51, Mike Snitzer wrote:
> > On Tue, Feb 09 2021 at  9:30am -0500,
> > Sergei Shtepa  wrote:
> > 
> > > The 'noexcl' option allow to open underlying block-device
> > > without FMODE_EXCL.
> > > 
> > > Signed-off-by: Sergei Shtepa 
> > > ---
> > >  drivers/md/dm-linear.c| 14 +-
> > >  drivers/md/dm-table.c | 14 --
> > >  drivers/md/dm.c   | 26 +++---
> > >  drivers/md/dm.h   |  2 +-
> > >  include/linux/device-mapper.h |  7 +++
> > >  5 files changed, 48 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > > index 00774b5d7668..b16d89802b9d 100644
> > > --- a/drivers/md/dm-linear.c
> > > +++ b/drivers/md/dm-linear.c
> > > @@ -33,7 +33,7 @@ static int linear_ctr(struct dm_target *ti, unsigned 
> > > int argc, char **argv)
> > >   char dummy;
> > >   int ret;
> > >  
> > > - if (argc != 2) {
> > > + if ((argc < 2) || (argc > 3)) {
> > >   ti->error = "Invalid argument count";
> > >   return -EINVAL;
> > >   }
> > > @@ -51,6 +51,18 @@ static int linear_ctr(struct dm_target *ti, unsigned 
> > > int argc, char **argv)
> > >   }
> > >   lc->start = tmp;
> > >  
> > > + ti->non_exclusive = false;
> > > + if (argc > 2) {
> > > + if (strcmp("noexcl", argv[2]) == 0)
> > > + ti->non_exclusive = true;
> > > + else if (strcmp("excl", argv[2]) == 0)
> > > + ti->non_exclusive = false;
> > > + else {
> > > + ti->error = "Invalid exclusive option";
> > > + return -EINVAL;
> > > + }
> > > + }
> > > +
> > >   ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), 
> > > &lc->dev);
> > >   if (ret) {
> > >   ti->error = "Device lookup failed";
> > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > > index 4acf2342f7ad..f020459465bd 100644
> > > --- a/drivers/md/dm-table.c
> > > +++ b/drivers/md/dm-table.c
> > > @@ -322,7 +322,7 @@ static int device_area_is_invalid(struct dm_target 
> > > *ti, struct dm_dev *dev,
> > >   * device and not to touch the existing bdev field in case
> > >   * it is accessed concurrently.
> > >   */
> > > -static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
> > > +static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode, 
> > > bool non_exclusive,
> > >   struct mapped_device *md)
> > >  {
> > >   int r;
> > > @@ -330,8 +330,8 @@ static int upgrade_mode(struct dm_dev_internal *dd, 
> > > fmode_t new_mode,
> > >  
> > >   old_dev = dd->dm_dev;
> > >  
> > > - r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
> > > - dd->dm_dev->mode | new_mode, &new_dev);
> > > + r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, dd->dm_dev->mode 
> > > | new_mode,
> > > + non_exclusive, &new_dev);
> > >   if (r)
> > >   return r;
> > >  
> > > @@ -387,7 +387,8 @@ int dm_get_device(struct dm_target *ti, const char 
> > > *path, fmode_t mode,
> > >   if (!dd)
> > >   return -ENOMEM;
> > >  
> > > - if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
> > > + r = dm_get_table_device(t->md, dev, mode, ti->non_exclusive, 
> > > &dd->dm_dev);
> > > + if (r) {
> > >   kfree(dd);
> > >   return r;
> > >   }
> > > @@ -396,8 +397,9 @@ int dm_get_device(struct dm_target *ti, const char 
> > > *path, fmode_t mode,
> > >   list_add(&dd->list, &t->devices);
> > >   goto out;
> > >  
> > > - } else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
> > > - r = upgrade_mode(dd, mode, t->md);
> > > + } else if ((dd->dm_dev->mode != (mode | dd->dm_dev->mode)) &&
> > >+(dd->dm_dev->non_exclusive != ti->non_exclusive)) {

Re: [dm-devel] [PATCH v4 0/5] add support for inline encryption to device mapper

2021-02-11 Thread Mike Snitzer
On Thu, Feb 11 2021 at  6:01pm -0500,
Satya Tangirala  wrote:

> On Wed, Feb 10, 2021 at 12:59:59PM -0700, Jens Axboe wrote:
> > On 2/10/21 12:33 PM, Mike Snitzer wrote:
> > > On Mon, Feb 01 2021 at 12:10am -0500,
> > > Satya Tangirala  wrote:
> > > 
> > >> This patch series adds support for inline encryption to the device 
> > >> mapper.
> > >>
> > >> Patch 1 introduces the "passthrough" keyslot manager.
> > >>
> > >> The regular keyslot manager is designed for inline encryption hardware 
> > >> that
> > >> have only a small fixed number of keyslots. A DM device itself does not
> > >> actually have only a small fixed number of keyslots - it doesn't actually
> > >> have any keyslots in the first place, and programming an encryption 
> > >> context
> > >> into a DM device doesn't make much semantic sense. It is possible for a 
> > >> DM
> > >> device to set up a keyslot manager with some "sufficiently large" number 
> > >> of
> > >> keyslots in its request queue, so that upper layers can use the inline
> > >> encryption capabilities of the DM device's underlying devices, but the
> > >> memory being allocated for the DM device's keyslots is a waste since they
> > >> won't actually be used by the DM device.
> > >>
> > >> The passthrough keyslot manager solves this issue - when the block layer
> > >> sees that a request queue has a passthrough keyslot manager, it doesn't
> > >> attempt to program any encryption context into the keyslot manager. The
> > >> passthrough keyslot manager only allows the device to expose its inline
> > >> encryption capabilities, and a way for upper layers to evict keys if
> > >> necessary.
> > >>
> > >> There also exist inline encryption hardware that can handle encryption
> > >> contexts directly, and allow users to pass them a data request along with
> > >> the encryption context (as opposed to inline encryption hardware that
> > >> require users to first program a keyslot with an encryption context, and
> > >> then require the users to pass the keyslot index with the data request).
> > >> Such devices can also make use of the passthrough keyslot manager.
> > >>
> > >> Patch 2 introduces some keyslot manager functions useful for the device
> > >> mapper.
> > >>
> > >> Patch 3 introduces the changes for inline encryption support for the 
> > >> device
> > >> mapper. A DM device only exposes the intersection of the crypto
> > >> capabilities of its underlying devices. This is so that in case a bio 
> > >> with
> > >> an encryption context is eventually mapped to an underlying device that
> > >> doesn't support that encryption context, the blk-crypto-fallback's cipher
> > >> tfms are allocated ahead of time by the call to 
> > >> blk_crypto_start_using_key.
> > >>
> > >> Each DM target can now also specify the "DM_TARGET_PASSES_CRYPTO" flag in
> > >> the target type features to opt-in to supporting passing through the
> > >> underlying inline encryption capabilities.  This flag is needed because 
> > >> it
> > >> doesn't make much semantic sense for certain targets like dm-crypt to
> > >> expose the underlying inline encryption capabilities to the upper layers.
> > >> Again, the DM exposes inline encryption capabilities of the underlying
> > >> devices only if all of them opt-in to passing through inline encryption
> > >> support.
> > >>
> > >> A keyslot manager is created for a table when it is loaded. However, the
> > >> mapped device's exposed capabilities *only* updated once the table is
> > >> swapped in (until the new table is swapped in, the mapped device 
> > >> continues
> > >> to expose the old table's crypto capabilities).
> > >>
> > >> This patch only allows the keyslot manager's capabilities to *expand*
> > >> because of table changes. Any attempt to load a new table that doesn't
> > >> support a crypto capability that the old table did is rejected.
> > >>
> > >> This patch also only exposes the intersection of the underlying device's
> > >> capabilities, which has the effect of causing en/decryption of a bio to
> > >> fall back to the kernel crypto API (if the fallback is enable

Re: [dm-devel] [PATCH v5 5/6] dm: add 'noexcl' option for dm-linear

2021-02-11 Thread Mike Snitzer
On Tue, Feb 09 2021 at  9:30am -0500,
Sergei Shtepa  wrote:

> The 'noexcl' option allow to open underlying block-device
> without FMODE_EXCL.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  drivers/md/dm-linear.c| 14 +-
>  drivers/md/dm-table.c | 14 --
>  drivers/md/dm.c   | 26 +++---
>  drivers/md/dm.h   |  2 +-
>  include/linux/device-mapper.h |  7 +++
>  5 files changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 00774b5d7668..b16d89802b9d 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -33,7 +33,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>   char dummy;
>   int ret;
>  
> - if (argc != 2) {
> + if ((argc < 2) || (argc > 3)) {
>   ti->error = "Invalid argument count";
>   return -EINVAL;
>   }
> @@ -51,6 +51,18 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>   }
>   lc->start = tmp;
>  
> + ti->non_exclusive = false;
> + if (argc > 2) {
> + if (strcmp("noexcl", argv[2]) == 0)
> + ti->non_exclusive = true;
> + else if (strcmp("excl", argv[2]) == 0)
> + ti->non_exclusive = false;
> + else {
> + ti->error = "Invalid exclusive option";
> + return -EINVAL;
> + }
> + }
> +
>   ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), 
> &lc->dev);
>   if (ret) {
>   ti->error = "Device lookup failed";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 4acf2342f7ad..f020459465bd 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -322,7 +322,7 @@ static int device_area_is_invalid(struct dm_target *ti, 
> struct dm_dev *dev,
>   * device and not to touch the existing bdev field in case
>   * it is accessed concurrently.
>   */
> -static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
> +static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode, bool 
> non_exclusive,
>   struct mapped_device *md)
>  {
>   int r;
> @@ -330,8 +330,8 @@ static int upgrade_mode(struct dm_dev_internal *dd, 
> fmode_t new_mode,
>  
>   old_dev = dd->dm_dev;
>  
> - r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
> - dd->dm_dev->mode | new_mode, &new_dev);
> + r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, dd->dm_dev->mode 
> | new_mode,
> + non_exclusive, &new_dev);
>   if (r)
>   return r;
>  
> @@ -387,7 +387,8 @@ int dm_get_device(struct dm_target *ti, const char *path, 
> fmode_t mode,
>   if (!dd)
>   return -ENOMEM;
>  
> - if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
> + r = dm_get_table_device(t->md, dev, mode, ti->non_exclusive, 
> &dd->dm_dev);
> + if (r) {
>   kfree(dd);
>   return r;
>   }
> @@ -396,8 +397,9 @@ int dm_get_device(struct dm_target *ti, const char *path, 
> fmode_t mode,
>   list_add(&dd->list, &t->devices);
>   goto out;
>  
> - } else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
> - r = upgrade_mode(dd, mode, t->md);
> + } else if ((dd->dm_dev->mode != (mode | dd->dm_dev->mode)) &&
> +(dd->dm_dev->non_exclusive != ti->non_exclusive)) {
> + r = upgrade_mode(dd, mode, ti->non_exclusive, t->md);
>   if (r)
>   return r;
>   }
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 00c41aa6d092..c25dcc2fdb89 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1117,33 +1117,44 @@ static void close_table_device(struct table_device 
> *td, struct mapped_device *md
>   if (!td->dm_dev.bdev)
>   return;
>  
> - bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> - blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> + if (td->dm_dev.non_exclusive)
> + blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
> + else {
> + bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> + blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> + }
> +
> +
> + blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
> +
>   put_dax(td->dm_dev.dax_dev);
>   td->dm_dev.bdev = NULL;
>   td->dm_dev.dax_dev = NULL;
> + td->dm_dev.non_exclusive = false;
>  }
>  
>  static struct table_device *find_table_device(struct list_head *l, dev_t dev,
> -   fmode_t mode)
> +   fmode_t mode, bool non_exclusive)
>  {
>   struct table_device *td;
>  
>   list_for_each_entry(td, l, list)
> - 

Re: [dm-devel] [PATCH v2] dm era: only resize metadata in preresume

2021-02-11 Thread Mike Snitzer
On Thu, Feb 11 2021 at  9:22am -0500,
Nikos Tsironis  wrote:

> Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
> (inactive) table that will only become active upon resume. That is why
> resize should always be done in terms of resume. Otherwise a load (ctr)
> whose inactive table never becomes active will incorrectly resize the
> metadata.
> 
> Also, perform the resize directly in preresume, instead of using the
> worker to do it.
> 
> The worker might run other metadata operations, e.g., it could start
> digestion, before resizing the metadata. These operations will end up
> using the old size.
> 
> This could lead to errors, like:
> 
>   device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value 
> failed
>   device-mapper: era: process_old_eras: digest step failed, stopping digestion
> 
> The reason of the above error is that the worker started the digestion
> of the archived writeset using the old, larger size.
> 
> As a result, metadata_digest_transcribe_writeset tried to write beyond
> the end of the era array.
> 
> Fixes: eec40579d84873 ("dm: add era target")
> Cc: sta...@vger.kernel.org # v3.15+
> Signed-off-by: Nikos Tsironis 
> ---
>  drivers/md/dm-era-target.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)

Thanks, I replaced the patch I created (that used a worker to resize
from preresume) with this.

Now staged for 5.12.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v4 0/5] add support for inline encryption to device mapper

2021-02-10 Thread Mike Snitzer
On Mon, Feb 01 2021 at 12:10am -0500,
Satya Tangirala  wrote:

> This patch series adds support for inline encryption to the device mapper.
> 
> Patch 1 introduces the "passthrough" keyslot manager.
> 
> The regular keyslot manager is designed for inline encryption hardware that
> have only a small fixed number of keyslots. A DM device itself does not
> actually have only a small fixed number of keyslots - it doesn't actually
> have any keyslots in the first place, and programming an encryption context
> into a DM device doesn't make much semantic sense. It is possible for a DM
> device to set up a keyslot manager with some "sufficiently large" number of
> keyslots in its request queue, so that upper layers can use the inline
> encryption capabilities of the DM device's underlying devices, but the
> memory being allocated for the DM device's keyslots is a waste since they
> won't actually be used by the DM device.
> 
> The passthrough keyslot manager solves this issue - when the block layer
> sees that a request queue has a passthrough keyslot manager, it doesn't
> attempt to program any encryption context into the keyslot manager. The
> passthrough keyslot manager only allows the device to expose its inline
> encryption capabilities, and a way for upper layers to evict keys if
> necessary.
> 
> There also exist inline encryption hardware that can handle encryption
> contexts directly, and allow users to pass them a data request along with
> the encryption context (as opposed to inline encryption hardware that
> require users to first program a keyslot with an encryption context, and
> then require the users to pass the keyslot index with the data request).
> Such devices can also make use of the passthrough keyslot manager.
> 
> Patch 2 introduces some keyslot manager functions useful for the device
> mapper.
> 
> Patch 3 introduces the changes for inline encryption support for the device
> mapper. A DM device only exposes the intersection of the crypto
> capabilities of its underlying devices. This is so that in case a bio with
> an encryption context is eventually mapped to an underlying device that
> doesn't support that encryption context, the blk-crypto-fallback's cipher
> tfms are allocated ahead of time by the call to blk_crypto_start_using_key.
> 
> Each DM target can now also specify the "DM_TARGET_PASSES_CRYPTO" flag in
> the target type features to opt-in to supporting passing through the
> underlying inline encryption capabilities.  This flag is needed because it
> doesn't make much semantic sense for certain targets like dm-crypt to
> expose the underlying inline encryption capabilities to the upper layers.
> Again, the DM exposes inline encryption capabilities of the underlying
> devices only if all of them opt-in to passing through inline encryption
> support.
> 
> A keyslot manager is created for a table when it is loaded. However, the
> mapped device's exposed capabilities are *only* updated once the table is
> swapped in (until the new table is swapped in, the mapped device continues
> to expose the old table's crypto capabilities).
> 
> This patch only allows the keyslot manager's capabilities to *expand*
> because of table changes. Any attempt to load a new table that doesn't
> support a crypto capability that the old table did is rejected.
> 
> This patch also only exposes the intersection of the underlying device's
> capabilities, which has the effect of causing en/decryption of a bio to
> fall back to the kernel crypto API (if the fallback is enabled) whenever
> any of the underlying devices doesn't support the encryption context of the
> bio - it might be possible to make the bio only fall back to the kernel
> crypto API if the bio's target underlying device doesn't support the bio's
> encryption context, but the use case may be uncommon enough in the first
> place not to warrant worrying about it right now.
> 
> Patch 4 makes DM evict a key from all its underlying devices when asked to
> evict a key.
> 
> Patch 5 makes some DM targets opt-in to passing through inline encryption
> support. It does not (yet) try to enable this option with dm-raid, since
> users can "hot add" disks to a raid device, which makes this not completely
> straightforward (we'll need to ensure that any "hot added" disks must have
> a superset of the inline encryption capabilities of the rest of the disks
> in the raid device, due to the way Patch 2 of this series works).
> 
> Changes v3 => v4:
>  - Allocate the memory for the ksm of the mapped device in
>dm_table_complete(), and install the ksm in the md queue in __bind()
>(as suggested by Mike). Also drop patch 5 from v3 since it's no longer
>needed.
>  - Some cleanups
> 
> Changes v2 => v3:
>  - Split up the main DM patch into 4 separate patches
>  - Removed the priv variable added to struct keyslot manager in v2
>  - Use a flag in target type features for opting-in to inline encryption
>support, instead of using 

Re: [dm-devel] dm: fix deadlock when swapping to encrypted device

2021-02-10 Thread Mike Snitzer
On Wed, Feb 10 2021 at 11:50am -0500,
Mikulas Patocka  wrote:

> Hi
> 
> Here I'm sending the patch that fixes swapping to dm-crypt.
> 
> The logic that limits the number of in-progress I/Os was moved to generic 
> device mapper. A dm target can activate it by setting ti->limit_swap. The 
> actual limit can be set in /sys/module/dm_mod/parameters/swap_ios.
> 
> This patch only limits swap bios (those with REQ_SWAP set). I don't limit 
> other bios, because limiting them causes performance degradation due to 
> cache line bouncing when taking the semaphore - and there are no reports 
> that non-swap I/O on dm crypt causes deadlocks.
> 
> Mikulas
> 
> 
> 
> From: Mikulas Patocka 
> 
> The system would deadlock when swapping to a dm-crypt device. The reason
> is that for each incoming write bio, dm-crypt allocates memory that holds
> encrypted data. These excessive allocations exhaust all the memory and the
> result is either deadlock or OOM trigger.
> 
> This patch limits the number of in-flight swap bios, so that the memory
> consumed by dm-crypt is limited. The limit is enforced if the target set
> the "limit_swap" variable and if the bio has REQ_SWAP set.
> 
> Non-swap bios are not affected because taking the semaphore would cause
> performance degradation.
> 
> This is similar to request-based drivers - they will also block when the
> number of requests is over the limit.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> ---
>  drivers/md/dm-core.h  |4 ++
>  drivers/md/dm-crypt.c |1 
>  drivers/md/dm.c   |   61 
> ++
>  include/linux/device-mapper.h |5 +++
>  4 files changed, 71 insertions(+)
> 
> Index: linux-2.6/drivers/md/dm.c
> ===
> --- linux-2.6.orig/drivers/md/dm.c2021-02-10 15:04:53.0 +0100
> +++ linux-2.6/drivers/md/dm.c 2021-02-10 16:29:04.0 +0100

> @@ -1271,6 +1307,15 @@ static blk_qc_t __map_bio(struct dm_targ
>   atomic_inc(&io->io_count);
>   sector = clone->bi_iter.bi_sector;
>  
> + if (unlikely(swap_io_limit(ti, clone))) {
> + struct mapped_device *md = io->md;
> + int latch = get_swap_ios();
> + if (unlikely(latch != md->swap_ios)) {
> + __set_swap_io_limit(md, latch);
> + }

Don't need these curly braces...

> + down(&md->swap_ios_semaphore);
> + }
> +
>   r = ti->type->map(ti, clone);
>   switch (r) {
>   case DM_MAPIO_SUBMITTED:

> @@ -1814,6 +1868,10 @@ static struct mapped_device *alloc_dev(i
>   init_waitqueue_head(&md->eventq);
>   init_completion(&md->kobj_holder.completion);
>  
> + md->swap_ios = get_swap_ios();
> + sema_init(&md->swap_ios_semaphore, md->swap_ios);
> + mutex_init(&md->swap_ios_lock);
> +
>   md->disk->major = _major;
>   md->disk->first_minor = minor;
>   md->disk->fops = _blk_dops;

This is only applicable for bio-based DM.  But probably not worth
avoiding the setup for request-based...

> @@ -3097,6 +3155,9 @@ MODULE_PARM_DESC(reserved_bio_based_ios,
>  module_param(dm_numa_node, int, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(dm_numa_node, "NUMA node for DM device memory allocations");
>  
> +module_param(swap_ios, int, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(swap_ios, "The number of swap I/Os in flight");
> +

Can you please rename this modparam to "swap_bios"?  And rename other
variables/members, etc (e.g. "swap_bios_semaphore", "swap_bios_lock",
etc)?

> Index: linux-2.6/include/linux/device-mapper.h
> ===
> --- linux-2.6.orig/include/linux/device-mapper.h  2020-11-25 
> 13:40:44.0 +0100
> +++ linux-2.6/include/linux/device-mapper.h   2021-02-10 15:52:54.0 
> +0100
> @@ -325,6 +325,11 @@ struct dm_target {
>* whether or not its underlying devices have support.
>*/
>   bool discards_supported:1;
> +
> + /*
> +  * Set if we need to limit the number of in-flight bios when swapping.
> +  */
> + bool limit_swap:1;
>  };
>  
>  void *dm_per_bio_data(struct bio *bio, size_t data_size);

Please rename to "limit_swap_bios".

Other than these nits this looks good to me.
Once you send v2 I can get it staged for 5.12.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 4/4] dm era: Remove unreachable resize operation in pre-resume function

2021-02-10 Thread Mike Snitzer
On Wed, Feb 10 2021 at  1:12pm -0500,
Mike Snitzer  wrote:

> On Fri, Jan 22 2021 at 10:25am -0500,
> Nikos Tsironis  wrote:
> 
> > The device metadata are resized in era_ctr(), so the metadata resize
> > operation in era_preresume() never runs.
> > 
> > Also, note, that if the operation did ever run it would deadlock, since
> > the worker has not been started at this point.

It wouldn't have deadlocked, it'd have queued the work (see wake_worker)

> > 
> > Fixes: eec40579d84873 ("dm: add era target")
> > Cc: sta...@vger.kernel.org # v3.15+
> > Signed-off-by: Nikos Tsironis 
> > ---
> >  drivers/md/dm-era-target.c | 9 -
> >  1 file changed, 9 deletions(-)
> > 
> > diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
> > index 104fb110cd4e..c40e132e50cd 100644
> > --- a/drivers/md/dm-era-target.c
> > +++ b/drivers/md/dm-era-target.c
> > @@ -1567,15 +1567,6 @@ static int era_preresume(struct dm_target *ti)
> >  {
> > int r;
> > struct era *era = ti->private;
> > -   dm_block_t new_size = calc_nr_blocks(era);
> > -
> > -   if (era->nr_blocks != new_size) {
> > -   r = in_worker1(era, metadata_resize, &new_size);
> > -   if (r)
> > -   return r;
> > -
> > -   era->nr_blocks = new_size;
> > -   }
> >  
> > start_worker(era);
> >  
> > -- 
> > 2.11.0
> > 
> 
> Resize shouldn't actually happen in the ctr.  The ctr loads a temporary
> (inactive) table that will only become active upon resume.  That is why
> resize should always be done in terms of resume.
> 
> I'll look closer but ctr shouldn't do the actual resize, and the
> start_worker() should be moved above the resize code you've removed
> above.

Does this work for you?  If so I'll get it staged (like I've just
staged all your other dm-era fixes for 5.12).

 drivers/md/dm-era-target.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index d0e75fd31c1e..ec198e9cdafb 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1501,15 +1501,6 @@ static int era_ctr(struct dm_target *ti, unsigned argc, 
char **argv)
}
era->md = md;
 
-   era->nr_blocks = calc_nr_blocks(era);
-
-   r = metadata_resize(era->md, &era->nr_blocks);
-   if (r) {
-   ti->error = "couldn't resize metadata";
-   era_destroy(era);
-   return -ENOMEM;
-   }
-
era->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
if (!era->wq) {
ti->error = "could not create workqueue for metadata object";
@@ -1583,6 +1574,8 @@ static int era_preresume(struct dm_target *ti)
struct era *era = ti->private;
dm_block_t new_size = calc_nr_blocks(era);
 
+   start_worker(era);
+
if (era->nr_blocks != new_size) {
r = in_worker1(era, metadata_resize, &new_size);
if (r)
@@ -1591,8 +1584,6 @@ static int era_preresume(struct dm_target *ti)
era->nr_blocks = new_size;
}
 
-   start_worker(era);
-
r = in_worker0(era, metadata_era_rollover);
if (r) {
DMERR("%s: metadata_era_rollover failed", __func__);

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/4] dm era: Various minor fixes

2021-02-10 Thread Mike Snitzer
On Wed, Feb 10 2021 at 12:56pm -0500,
Ming Hung Tsai  wrote:

> On Fri, Jan 22, 2021 at 11:30 PM Nikos Tsironis  wrote:
> >
> > While working on fixing the bugs that cause lost writes, for which I
> > have sent separate emails, I bumped into several other minor issues that
> > I fix in this patch set.
> >
> > In particular, this series of commits introduces the following fixes:
> >
> > 1. Add explicit check that the data block size hasn't changed
> > 2. Fix bitset memory leaks. The in-core bitmaps were never freed.
> > 3. Fix the writeset tree equality test function to use the right value
> >size.
> > 4. Remove unreachable resize operation in pre-resume function.
> >
> > More information about the fixes can be found in their commit messages.
> >
> > Nikos Tsironis (4):
> >   dm era: Verify the data block size hasn't changed
> >   dm era: Fix bitset memory leaks
> >   dm era: Use correct value size in equality function of writeset tree
> >   dm era: Remove unreachable resize operation in pre-resume function
> >
> >  drivers/md/dm-era-target.c | 27 ---
> >  1 file changed, 16 insertions(+), 11 deletions(-)
> 
> For the series, except 4/4 where I haven't tried other solutions.
> 
> Reviewed-by: Ming-Hung Tsai 

patchwork doesn't parse this, so it falls on me to backfill tags like
this.  In the future, please reply to each patch with your desired tag.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 4/4] dm era: Remove unreachable resize operation in pre-resume function

2021-02-10 Thread Mike Snitzer
On Fri, Jan 22 2021 at 10:25am -0500,
Nikos Tsironis  wrote:

> The device metadata are resized in era_ctr(), so the metadata resize
> operation in era_preresume() never runs.
> 
> Also, note, that if the operation did ever run it would deadlock, since
> the worker has not been started at this point.
> 
> Fixes: eec40579d84873 ("dm: add era target")
> Cc: sta...@vger.kernel.org # v3.15+
> Signed-off-by: Nikos Tsironis 
> ---
>  drivers/md/dm-era-target.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
> index 104fb110cd4e..c40e132e50cd 100644
> --- a/drivers/md/dm-era-target.c
> +++ b/drivers/md/dm-era-target.c
> @@ -1567,15 +1567,6 @@ static int era_preresume(struct dm_target *ti)
>  {
>   int r;
>   struct era *era = ti->private;
> - dm_block_t new_size = calc_nr_blocks(era);
> -
> - if (era->nr_blocks != new_size) {
> - r = in_worker1(era, metadata_resize, &new_size);
> - if (r)
> - return r;
> -
> - era->nr_blocks = new_size;
> - }
>  
>   start_worker(era);
>  
> -- 
> 2.11.0
> 

Resize shouldn't actually happen in the ctr.  The ctr loads a temporary
(inactive) table that will only become active upon resume.  That is why
resize should always be done in terms of resume.
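
(To wit, in dmsetup terms -- an illustrative flow, device names made up:

```
dmsetup create era-dev --notable
dmsetup load era-dev --table "0 204800 era /dev/meta /dev/data 4096"
# the table is now loaded but inactive; ctr has already run
dmsetup resume era-dev   # only now do the preresume/resume hooks run
```
)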

I'll look closer but ctr shouldn't do the actual resize, and the
start_worker() should be moved above the resize code you've removed
above.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-writecache: allow the underlying device to be shrunk

2021-02-09 Thread Mike Snitzer
On Tue, Feb 09 2021 at 10:56am -0500,
Mikulas Patocka  wrote:

> Allow shrinking the underlying data device (dm-writecache must be
> suspended when the device is shrunk).
> 
> This patch modifies dm-writecache, so that it doesn't attempt to write any
> data beyond the end of the data device.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> ---
>  drivers/md/dm-writecache.c |   18 ++
>  1 file changed, 18 insertions(+)
> 
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===
> --- linux-2.6.orig/drivers/md/dm-writecache.c 2021-02-05 20:30:35.0 
> +0100
> +++ linux-2.6/drivers/md/dm-writecache.c  2021-02-09 16:50:36.0 
> +0100
> @@ -148,6 +148,7 @@ struct dm_writecache {
>   size_t metadata_sectors;
>   size_t n_blocks;
>   uint64_t seq_count;
> + uint64_t data_device_sectors;
>   void *block_start;
>   struct wc_entry *entries;
>   unsigned block_size;
> @@ -969,6 +970,8 @@ static void writecache_resume(struct dm_
>  
>   wc_lock(wc);
>  
> + wc->data_device_sectors = i_size_read(wc->dev->bdev->bd_inode) >> 
> SECTOR_SHIFT;
> +

I switched it to using bdev_nr_sectors() (and sector_t instead of uint64_t) 
instead.

>   if (WC_MODE_PMEM(wc)) {
>   persistent_memory_invalidate_cache(wc->memory_map, 
> wc->memory_map_size);
>   } else {
> @@ -1638,6 +1641,10 @@ static bool wc_add_block(struct writebac
>   void *address = memory_data(wc, e);
>  
>   persistent_memory_flush_cache(address, block_size);
> +
> + if (unlikely(wb->bio.bi_iter.bi_sector + bio_sectors(&wb->bio) >= 
> wc->data_device_sectors))
> + return true;
> +

I updated it to use bio_end_sector()

Otherwise looks good.

diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index d9c0d41e29f3..844c4be11768 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -148,7 +148,7 @@ struct dm_writecache {
size_t metadata_sectors;
size_t n_blocks;
uint64_t seq_count;
-   uint64_t data_device_sectors;
+   sector_t data_device_sectors;
void *block_start;
struct wc_entry *entries;
unsigned block_size;
@@ -978,7 +978,7 @@ static void writecache_resume(struct dm_target *ti)

wc_lock(wc);

-   wc->data_device_sectors = i_size_read(wc->dev->bdev->bd_inode) >> 
SECTOR_SHIFT;
+   wc->data_device_sectors = bdev_nr_sectors(wc->dev->bdev);

if (WC_MODE_PMEM(wc)) {
persistent_memory_invalidate_cache(wc->memory_map, 
wc->memory_map_size);
@@ -1650,7 +1650,7 @@ static bool wc_add_block(struct writeback_struct *wb, 
struct wc_entry *e, gfp_t

persistent_memory_flush_cache(address, block_size);

-   if (unlikely(wb->bio.bi_iter.bi_sector + bio_sectors(&wb->bio) >= 
wc->data_device_sectors))
+   if (unlikely(bio_end_sector(&wb->bio) >= wc->data_device_sectors))
return true;

return bio_add_page(&wb->bio, persistent_memory_page(address),


BUT, it should be noted that bdev_nr_sectors() is pretty recent,
introduced with commit a782483cc1f8 ("block: remove the nr_sects field
in struct hd_struct").  So using bdev_nr_sectors() creates problems for
stable@... given that, I'll revert back to open-coding it in the stable@
patch and then apply a quick change to use bdev_nr_sectors().
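(For the stable@ backport, the open-coded form from Mikulas' original
patch above is the equivalent on kernels that predate that commit:)

	/* open-coded bdev_nr_sectors(wc->dev->bdev) for older kernels */
	wc->data_device_sectors =
		i_size_read(wc->dev->bdev->bd_inode) >> SECTOR_SHIFT;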

Mike




Re: [dm-devel] dm: fix iterate_device sanity check

2021-02-09 Thread Mike Snitzer
On Tue, Feb 09 2021 at  2:06am -0500,
JeffleXu  wrote:

> 
> 
> On 2/9/21 1:29 PM, Mike Snitzer wrote:
> > 
> > Hi, please see these commits that I've staged in linux-next via:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
> > 
> > 1141b9133777 dm table: fix iterate_devices based device capability checks
> > 0224c5e6fd07 dm table: fix DAX iterate_devices based device capability 
> > checks
> > 76b0e14be03f dm table: fix zoned iterate_devices based device capability 
> > checks
> > 55cdd7435e97 dm table: remove needless request_queue NULL pointer checks
> > 
> 
> Thanks. This series looks good to me.
> 
> I suddenly noticed that the semantics of patch 1 (1141b9133777 dm table:
> fix iterate_devices based device capability checks) are a little
> different from the original code.
> 
> - if (blk_queue_add_random(q) && dm_table_all_devices_attribute(t,
> device_is_not_random))
> + if (dm_table_any_dev_attr(t, device_is_not_random))
>   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
> + else
> + blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
> 
> In the original code, QUEUE_FLAG_ADD_RANDOM would only ever be cleared,
> never set, while it can be set after patch 1. But I can see no harm in
> setting the QUEUE_FLAG_ADD_RANDOM flag, though.
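For reference, a variant preserving the old clear-only behaviour would
look like this (sketch only, using dm_table_any_dev_attr() as quoted
above):

	/* clear-only: never set QUEUE_FLAG_ADD_RANDOM, as before */
	if (blk_queue_add_random(q) &&
	    dm_table_any_dev_attr(t, device_is_not_random))
		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);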
> 
> FYI. Currently only scsi devices are still using QUEUE_FLAG_ADD_RANDOM
> flag, as all non-rotational devices should not set this flag since
> commit b277da0a8a59 ("block: disable entropy contributions for nonrot
> devices").

I fixed it, thanks.

Mike




Re: [dm-devel] dm: fix iterate_device sanity check

2021-02-08 Thread Mike Snitzer
On Mon, Feb 08 2021 at 10:17am -0500,
Mike Snitzer  wrote:

> On Fri, Feb 05 2021 at  9:03pm -0500,
> JeffleXu  wrote:
> 
> > 
> > 
> > On 2/6/21 2:39 AM, Mike Snitzer wrote:
> > > On Mon, Feb 01 2021 at 10:35pm -0500,
> > > Jeffle Xu  wrote:
> > > 
> > >> According to the definition of dm_iterate_devices_fn:
> > >>  * This function must iterate through each section of device used by the
> > >>  * target until it encounters a non-zero return code, which it then 
> > >> returns.
> > >>  * Returns zero if no callout returned non-zero.
> > >>
> > >> For some target type (e.g., dm-stripe), one call of iterate_devices() may
> > >> iterate multiple underlying devices internally, in which case a non-zero
> > >> return code returned by iterate_devices_callout_fn will stop the 
> > >> iteration
> > >> in advance.
> > >>
> > >> Thus if we want to ensure that _all_ underlying devices support some
> > >> kind of attribute, an iteration structure like
> > >> dm_table_supports_nowait() should be used, while the input
> > >> iterate_devices_callout_fn should handle the 'not supported'
> > >> semantics. Conversely, an iteration structure like
> > >> dm_table_any_device_attribute() should be used if _any_ underlying
> > >> device supporting this attribute is sufficient. In this case, the
> > >> input iterate_devices_callout_fn should handle the 'supported'
> > >> semantics.
> > >>
> > >> Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support")
> > >> Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have 
> > >> it set")
> > >> Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
> > >> Cc: sta...@vger.kernel.org
> > >> Signed-off-by: Jeffle Xu 
> > > 
> > > Thanks for auditing and fixing this up.  It has been on my todo so
> > > you've really helped me out -- your changes look correct to me.
> > > 
> > > I've staged it for 5.12, the stable fix will likely need manual fixups
> > > depending on the stable tree... we'll just need to assist with
> > > backport(s) as needed.
> > 
> > I'm glad to help offer the stable backport. But I don't know which
> > kernel version the stable kernel is still being maintained. Also which
> > mailing list I should send to when I finished backporting?
> 
> All your v2 changes speak to needing more discipline in crafting
> individual stable@ fixes that are applicable to various kernels: when
> all applied to mainline, they'd be the equivalent of your single
> monolithic patch.
> 
> But without splitting the changes into separate patches, for stable@'s
> benefit, we'll have a much more difficult time of shepherding the
> applicable changes into the disparate stable@ kernels.
> 
> I'll have a look at splitting your v2 up accordingly.

Hi, please see these commits that I've staged in linux-next via:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

1141b9133777 dm table: fix iterate_devices based device capability checks
0224c5e6fd07 dm table: fix DAX iterate_devices based device capability checks
76b0e14be03f dm table: fix zoned iterate_devices based device capability checks
55cdd7435e97 dm table: remove needless request_queue NULL pointer checks




Re: [dm-devel] dm: fix iterate_device sanity check

2021-02-08 Thread Mike Snitzer
On Fri, Feb 05 2021 at  9:03pm -0500,
JeffleXu  wrote:

> 
> 
> On 2/6/21 2:39 AM, Mike Snitzer wrote:
> > On Mon, Feb 01 2021 at 10:35pm -0500,
> > Jeffle Xu  wrote:
> > 
> >> According to the definition of dm_iterate_devices_fn:
> >>  * This function must iterate through each section of device used by the
> >>  * target until it encounters a non-zero return code, which it then 
> >> returns.
> >>  * Returns zero if no callout returned non-zero.
> >>
> >> For some target type (e.g., dm-stripe), one call of iterate_devices() may
> >> iterate multiple underlying devices internally, in which case a non-zero
> >> return code returned by iterate_devices_callout_fn will stop the iteration
> >> in advance.
> >>
> >> Thus if we want to ensure that _all_ underlying devices support some
> >> kind of attribute, an iteration structure like
> >> dm_table_supports_nowait() should be used, while the input
> >> iterate_devices_callout_fn should handle the 'not supported'
> >> semantics. Conversely, an iteration structure like
> >> dm_table_any_device_attribute() should be used if _any_ underlying
> >> device supporting this attribute is sufficient. In this case, the
> >> input iterate_devices_callout_fn should handle the 'supported'
> >> semantics.
> >>
> >> Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support")
> >> Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have 
> >> it set")
> >> Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
> >> Cc: sta...@vger.kernel.org
> >> Signed-off-by: Jeffle Xu 
> > 
> > Thanks for auditing and fixing this up.  It has been on my todo so
> > you've really helped me out -- your changes look correct to me.
> > 
> > I've staged it for 5.12, the stable fix will likely need manual fixups
> > depending on the stable tree... we'll just need to assist with
> > backport(s) as needed.
> 
> I'm glad to help offer the stable backport. But I don't know which
> kernel version the stable kernel is still being maintained. Also which
> mailing list I should send to when I finished backporting?

All your v2 changes speak to needing more discipline in crafting
individual stable@ fixes that are applicable to various kernels: when
all applied to mainline, they'd be the equivalent of your single
monolithic patch.

But without splitting the changes into separate patches, for stable@'s
benefit, we'll have a much more difficult time of shepherding the
applicable changes into the disparate stable@ kernels.

I'll have a look at splitting your v2 up accordingly.

Mike




Re: [dm-devel] dm: fix iterate_device sanity check

2021-02-05 Thread Mike Snitzer
On Mon, Feb 01 2021 at 10:35pm -0500,
Jeffle Xu  wrote:

> According to the definition of dm_iterate_devices_fn:
>  * This function must iterate through each section of device used by the
>  * target until it encounters a non-zero return code, which it then returns.
>  * Returns zero if no callout returned non-zero.
> 
> For some target type (e.g., dm-stripe), one call of iterate_devices() may
> iterate multiple underlying devices internally, in which case a non-zero
> return code returned by iterate_devices_callout_fn will stop the iteration
> in advance.
> 
> Thus if we want to ensure that _all_ underlying devices support some
> kind of attribute, an iteration structure like
> dm_table_supports_nowait() should be used, while the input
> iterate_devices_callout_fn should handle the 'not supported'
> semantics. Conversely, an iteration structure like
> dm_table_any_device_attribute() should be used if _any_ underlying
> device supporting this attribute is sufficient. In this case, the input
> iterate_devices_callout_fn should handle the 'supported' semantics.
> 
> Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support")
> Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it 
> set")
> Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jeffle Xu 

Thanks for auditing and fixing this up.  It has been on my todo so
you've really helped me out -- your changes look correct to me.

I've staged it for 5.12, the stable fix will likely need manual fixups
depending on the stable tree... we'll just need to assist with
backport(s) as needed.

Thanks again,
Mike




Re: [dm-devel] [PATCH v4 2/6] block: add blk_interposer

2021-02-03 Thread Mike Snitzer
On Wed, Feb 03 2021 at 10:53am -0500,
Sergei Shtepa  wrote:

> blk_interposer allows intercepting bio requests, remapping bios to other
> devices, or adding new bios.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  block/bio.c   |  2 +
>  block/blk-core.c  | 33 
>  block/genhd.c | 82 +++
>  include/linux/blk_types.h |  6 ++-
>  include/linux/genhd.h | 18 +
>  5 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..f6f135eb84b5 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -684,6 +684,8 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>   bio_set_flag(bio, BIO_CLONED);
>   if (bio_flagged(bio_src, BIO_THROTTLED))
>   bio_set_flag(bio, BIO_THROTTLED);
> + if (bio_flagged(bio_src, BIO_INTERPOSED))
> + bio_set_flag(bio, BIO_INTERPOSED);
>   bio->bi_opf = bio_src->bi_opf;
>   bio->bi_ioprio = bio_src->bi_ioprio;
>   bio->bi_write_hint = bio_src->bi_write_hint;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7663a9b94b80..c84bc42ba88b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1032,6 +1032,32 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>   return ret;
>  }
>  
> +static blk_qc_t __submit_bio_interposed(struct bio *bio)
> +{
> + struct bio_list bio_list[2] = { };
> + blk_qc_t ret = BLK_QC_T_NONE;
> +
> + current->bio_list = bio_list;
> + if (likely(bio_queue_enter(bio) == 0)) {
> + struct gendisk *disk = bio->bi_disk;
> +
> + if (likely(blk_has_interposer(disk))) {
> + bio_set_flag(bio, BIO_INTERPOSED);
> + disk->interposer->ip_submit_bio(bio);
> + } else /* interposer was removed */
> + bio_list_add(&current->bio_list[0], bio);

style nit:

} else {
/* interposer was removed */
bio_list_add(&current->bio_list[0], bio);
}

> +
> + blk_queue_exit(disk->queue);
> + }
> + current->bio_list = NULL;
> +
> + /* Resubmit remaining bios */
> + while ((bio = bio_list_pop(&bio_list[0])))
> + ret = submit_bio_noacct(bio);
> +
> + return ret;
> +}
> +
>  /**
>   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
>   * @bio:  The bio describing the location in memory and on the device.
> @@ -1057,6 +1083,13 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
>   return BLK_QC_T_NONE;
>   }
>  
> + /*
> +  * Checking the BIO_INTERPOSED flag is necessary so that the bio
> +  * created by the blk_interposer do not get to it for processing.
> +  */
> + if (blk_has_interposer(bio->bi_disk) &&
> + !bio_flagged(bio, BIO_INTERPOSED))
> + return __submit_bio_interposed(bio);
>   if (!bio->bi_disk->fops->submit_bio)
>   return __submit_bio_noacct_mq(bio);
>   return __submit_bio_noacct(bio);
> diff --git a/block/genhd.c b/block/genhd.c
> index 419548e92d82..39785a3ef703 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -30,6 +30,7 @@
>  static struct kobject *block_depr;
>  
>  DECLARE_RWSEM(bdev_lookup_sem);
> +DEFINE_MUTEX(bdev_interposer_mutex);

Seems you're using this mutex to protect access to disk->interposer in
attach/detach.  This is to prevent attach/detach races on the same
device?

Thankfully attach/detach isn't in the bio submission fast path, but it'd
be helpful to document what this mutex is protecting.

A storm of attach or detach will all hit this global mutex though...
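Something like the following is the sort of attach path that comment
should spell out (illustrative only -- helper name and shape are
hypothetical, not from the posted patch):

/* bdev_interposer_mutex serializes attach/detach of disk->interposer
 * so two racing attaches can't both succeed. */
int blk_interposer_attach(struct gendisk *disk, struct blk_interposer *ip)
{
	int ret = 0;

	mutex_lock(&bdev_interposer_mutex);
	if (disk->interposer)
		ret = -EBUSY;
	else
		disk->interposer = ip;
	mutex_unlock(&bdev_interposer_mutex);

	return ret;
}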

Mike




Re: [dm-devel] [PATCH v4 3/6] block: add blk_mq_is_queue_frozen()

2021-02-03 Thread Mike Snitzer
On Wed, Feb 03 2021 at 10:53am -0500,
Sergei Shtepa  wrote:

> blk_mq_is_queue_frozen() allows asserting that the queue is frozen.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  block/blk-mq.c | 13 +
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f285a9123a8b..924ec26fae5f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -161,6 +161,19 @@ int blk_mq_freeze_queue_wait_timeout(struct 
> request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
>  
> +
> +bool blk_mq_is_queue_frozen(struct request_queue *q)
> +{
> + bool ret;
> +
> + mutex_lock(&q->mq_freeze_lock);
> + ret = percpu_ref_is_dying(&q->q_usage_counter) && 
> percpu_ref_is_zero(&q->q_usage_counter);
> + mutex_unlock(&q->mq_freeze_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
> +
>  /*
>   * Guarantee no request is in use, so we can change any data structure of
>   * the queue afterward.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index d705b174d346..9d1e8c4e922e 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -525,6 +525,7 @@ void blk_freeze_queue_start(struct request_queue *q);
>  void blk_mq_freeze_queue_wait(struct request_queue *q);
>  int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>unsigned long timeout);
> +bool blk_mq_is_queue_frozen(struct request_queue *q);
>  
>  int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int 
> nr_hw_queues);
> -- 
> 2.20.1
> 

This needs to come before patch 2 (since patch 2 uses it).

Mike




Re: [dm-devel] [PATCH 1/2] dm crypt: replaced #if defined with IS_ENABLED

2021-02-02 Thread Mike Snitzer
On Fri, Jan 22 2021 at  3:43am -0500,
Ahmad Fatoum  wrote:

> IS_ENABLED(CONFIG_ENCRYPTED_KEYS) is true whether the option is built-in
> or a module, so use it instead of #if defined checking for each
> separately.
> 
> The other #if was to avoid a "defined but unused" warning for a static
> function. As we now always build the callsite when the function
> is defined, we can remove that first #if guard.
> 
> Suggested-by: Arnd Bergmann 
> Signed-off-by: Ahmad Fatoum 
> ---
> Cc: Dmitry Baryshkov 
> ---
>  drivers/md/dm-crypt.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 8c874710f0bc..7eeb9248eda5 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2436,7 +2436,6 @@ static int set_key_user(struct crypt_config *cc, struct 
> key *key)
>   return 0;
>  }
>  
> -#if defined(CONFIG_ENCRYPTED_KEYS) || defined(CONFIG_ENCRYPTED_KEYS_MODULE)
>  static int set_key_encrypted(struct crypt_config *cc, struct key *key)
>  {
>   const struct encrypted_key_payload *ekp;
> @@ -2452,7 +2451,6 @@ static int set_key_encrypted(struct crypt_config *cc, 
> struct key *key)
>  
>   return 0;
>  }
> -#endif /* CONFIG_ENCRYPTED_KEYS */
>  
>  static int crypt_set_keyring_key(struct crypt_config *cc, const char 
> *key_string)
>  {
> @@ -2482,11 +2480,10 @@ static int crypt_set_keyring_key(struct crypt_config 
> *cc, const char *key_string
>   } else if (!strncmp(key_string, "user:", key_desc - key_string + 1)) {
>   type = &key_type_user;
>   set_key = set_key_user;
> -#if defined(CONFIG_ENCRYPTED_KEYS) || defined(CONFIG_ENCRYPTED_KEYS_MODULE)
> - } else if (!strncmp(key_string, "encrypted:", key_desc - key_string + 
> 1)) {
> + } else if (IS_ENABLED(CONFIG_ENCRYPTED_KEYS) &&
> +!strncmp(key_string, "encrypted:", key_desc - key_string + 
> 1)) {
>   type = &key_type_encrypted;
>   set_key = set_key_encrypted;
> -#endif
>   } else {
>   return -EINVAL;
>   }
> -- 
> 2.30.0
> 

I could be mistaken, but the point of the previous #if approach was to
not compile the code at all.  The way you have it, the branch just isn't
taken but the compiled code is left to bloat dm-crypt.

Why not leave this as is and follow same pattern in your next patch?

Mike




Re: [dm-devel] [PATCH 0/2] block: blk_interposer v3

2021-02-01 Thread Mike Snitzer
On Mon, Feb 01 2021 at  1:18pm -0500,
Sergei Shtepa  wrote:

> The 02/01/2021 18:45, Bart Van Assche wrote:
> > On 1/28/21 9:12 AM, Sergei Shtepa wrote:
> > > I'm ready to suggest the blk_interposer again.
> > > blk_interposer allows to intercept bio requests, remap bio to
> > > another devices or add new bios.
> > > 
> > > This version has support from device mapper.
> > > 
> > > For the dm-linear device creation command, the `noexcl` parameter
> > > has been added, which allows opening block devices without
> > > FMODE_EXCL mode. It allows creating a dm-linear device on a block
> > > device with an already mounted file system.
> > > The new ioctl DM_DEV_REMAP allows to enable and disable bio
> > > interception.
> > > 
> > > Thus, it is possible to add the dm-device to the block layer stack
> > > without reconfiguring and rebooting.
> > 
> > What functionality does this driver provide that is not yet available in 
> > a RAID level 1 (mirroring) driver + a custom dm driver? My understanding 
> > is that there are already two RAID level 1 drivers in the kernel tree 
> > and that both drivers support sending bio's to two different block devices.
> > 
> > Thanks,
> > 
> > Bart.
> 
> Hi Bart.
> 
> The proposed patch is not really aimed at RAID1.
> 
> Creating a new dm device in the non-FMODE_EXCL mode and then remapping bio
> requests from the regular block device to the new DM device using
> the blk_interposer will allow to use device mapper for regular devices.
> For dm-linear, there is not much benefit from using blk_interposer.
> This is a good and illustrative example. Later, using blk-interposer,
> it will be possible to connect the dm-cache "on the fly" without having
> to reboot and/or reconfigure.
> My intention is to let users use dm-snap to create snapshots of any device.
> blk-interposer will allow to add new features to Device Mapper.
> 
> As per Daniel's advice I want to add a documentation, I'm working on it now.
> The documentation will also contain a description of new features that
> blk_interposer will add to Device Mapper

More Documentation is fine, but the code needs to be improved and fully
formed before you start trying to polish with Documentation --
definitely don't put time into Documentation that is speculative!

You'd do well to focus on an implementation that doesn't require an
extra clone if the interposed device will use DM (DM core already
handles cloning all incoming bios).

Mike




Re: [dm-devel] [PATCH v2 0/6] dm: support IO polling for bio-based dm device

2021-01-31 Thread Mike Snitzer
On Wed, Jan 27 2021 at 10:06pm -0500,
JeffleXu  wrote:

> 
> 
> On 1/28/21 1:19 AM, Mike Snitzer wrote:
> > On Mon, Jan 25 2021 at  7:13am -0500,
> > Jeffle Xu  wrote:
> > 
> >> Since currently we have no simple but efficient way to implement the
> >> bio-based IO polling in the split-bio tracking style, this patch set
> >> turns to the original implementation mechanism that iterates and
> >> polls all underlying hw queues in polling mode. One optimization is
> >> introduced to mitigate the race of one hw queue among multiple polling
> >> instances.
> >>
> >> I'm still open to the split bio tracking mechanism, if there's
> >> reasonable way to implement it.
> >>
> >>
> >> [Performance Test]
> >> The performance is tested by fio (engine=io_uring) 4k randread on
> >> dm-linear device. The dm-linear device is built upon nvme devices,
> >> and every nvme device has one polling hw queue (nvme.poll_queues=1).
> >>
> >> Test Case                   | IOPS in IRQ mode | IOPS in polling mode | Diff
> >>                             | (hipri=0)        | (hipri=1)            |
> >> --------------------------- | ---------------- | -------------------- | ----
> >> 3 target nvme, num_jobs = 1 | 198k             | 276k                 | ~40%
> >> 3 target nvme, num_jobs = 3 | 608k             | 705k                 | ~16%
> >> 6 target nvme, num_jobs = 6 | 1197k            | 1347k                | ~13%
> >> 3 target nvme, num_jobs = 6 | 1285k            | 1293k                | ~0%
> >>
> >> As the number of polling instances (num_jobs) increases, the
> >> performance improvement decreases, though it's still positive
> >> compared to the IRQ mode.
> > 
> > I think there is serious room for improvement for DM's implementation;
> > but the block changes for this are all we'd need for DM in the longrun
> > anyway (famous last words).
> 
> Agreed.
> 
> 
> > So on a block interface level I'm OK with
> > block patches 1-3.
> > 
> > I don't see why patch 5 is needed (said the same in reply to it; but I
> > just saw your reason below..).
> > 
> > Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
> > up patches 1-3. Jens, what do you think?
> 
> cc Jens.
> 
> Also I will send a new version later, maybe some refactor on patch5 and
> some typo modifications.

Thinking further, there is no benefit to Jens picking up the block core
changes until the DM changes are ready.  While I think the refactoring
of blk_poll() (in patch 3) to support both blk-mq and bio-based devices
is reasonable, Christoph correctly points out there is extra branching
that blk-mq must tolerate as implemented.  So even that needs followup
work as suggested here:
https://www.redhat.com/archives/dm-devel/2021-January/msg00397.html

Also, your followup about oversights in the latest bio-based DM io
polling implementation speaks to all of this needing more time:
https://www.redhat.com/archives/dm-devel/2021-January/msg00436.html

That you're advocating going back to what is effectively the first RFC
patchset you proposed (with its underwhelming bio-based polling
performance) isn't a strong indication these changes are ready, or that
we even have a path forward for how to make bio-based IO polling
worthwhile.

So: I retract my question to Jens about whether he'd pick up the block
core changes (while I think those are close, the corresponding DM
changes aren't).

Mike




Re: [dm-devel] [PATCH v2 3/6] block: add iopoll method to support bio-based IO polling

2021-01-27 Thread Mike Snitzer
On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu  wrote:

> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") to support bio-based queues such as
> nvme multipath, but was later removed in commit 529262d56dbe ("block:
> remove ->poll_fn").
> 
> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> block_device_operations") restore the possibility of bio-based IO
> polling support by adding an ->iopoll method to gendisk->fops.
> Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
> blk_mq_poll() to blk-mq specific code that is called from blk_poll().
> 
> Signed-off-by: Jeffle Xu 
> Suggested-by: Mike Snitzer 

Reviewed-by: Mike Snitzer 




Re: [dm-devel] [PATCH v2 5/6] block: add QUEUE_FLAG_POLL_CAP flag

2021-01-27 Thread Mike Snitzer
On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu  wrote:

> Introduce a QUEUE_FLAG_POLL_CAP flag representing whether the request
> queue is capable of polling or not.
> 
> Signed-off-by: Jeffle Xu 

Why are you adding QUEUE_FLAG_POLL_CAP?  Doesn't seem as though DM or
anything else actually needs it.

Mike




Re: [dm-devel] [PATCH v2 2/6] block: add queue_to_disk() to get gendisk from request_queue

2021-01-27 Thread Mike Snitzer
On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu  wrote:

> Sometimes we need to get the corresponding gendisk from request_queue.
> 
> It is preferred that block drivers store private data in
> gendisk->private_data rather than request_queue->queuedata, e.g. see:
> commit c4a59c4e5db3 ("dm: stop using ->queuedata").
> 
> So if only request_queue is given, we need to get its corresponding
> gendisk to get the private data stored in that gendisk.
> 
> Signed-off-by: Jeffle Xu 
> Review-by: Mike Snitzer 

^typo

Reviewed-by: Mike Snitzer 
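For reference, a helper of this shape would do it (the patch body isn't
quoted here, so treat this as a sketch): the request_queue's kobject
parent is the gendisk's device, so

#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))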




Re: [dm-devel] [PATCH v2 0/6] dm: support IO polling for bio-based dm device

2021-01-27 Thread Mike Snitzer
On Mon, Jan 25 2021 at  7:13am -0500,
Jeffle Xu  wrote:

> Since currently we have no simple but efficient way to implement the
> bio-based IO polling in the split-bio tracking style, this patch set
> turns to the original implementation mechanism that iterates and
> polls all underlying hw queues in polling mode. One optimization is
> introduced to mitigate the race of one hw queue among multiple polling
> instances.
> 
> I'm still open to the split bio tracking mechanism, if there's
> reasonable way to implement it.
> 
> 
> [Performance Test]
> The performance is tested by fio (engine=io_uring) 4k randread on
> dm-linear device. The dm-linear device is built upon nvme devices,
> and every nvme device has one polling hw queue (nvme.poll_queues=1).
> 
> Test Case                   | IOPS in IRQ mode | IOPS in polling mode | Diff
>                             | (hipri=0)        | (hipri=1)            |
> --------------------------- | ---------------- | -------------------- | ----
> 3 target nvme, num_jobs = 1 | 198k             | 276k                 | ~40%
> 3 target nvme, num_jobs = 3 | 608k             | 705k                 | ~16%
> 6 target nvme, num_jobs = 6 | 1197k            | 1347k                | ~13%
> 3 target nvme, num_jobs = 6 | 1285k            | 1293k                | ~0%
> 
> As the number of polling instances (num_jobs) increases, the
> performance improvement decreases, though it's still positive
> compared to the IRQ mode.

I think there is serious room for improvement for DM's implementation;
but the block changes for this are all we'd need for DM in the long run
anyway (famous last words). So on a block interface level I'm OK with
block patches 1-3.

I don't see why patch 5 is needed (said the same in reply to it; but I
just saw your reason below..).

Anyway, I can pick up DM patches 4 and 6 via linux-dm.git if Jens picks
up patches 1-3. Jens, what do you think?

> [Optimization]
> To mitigate the race when iterating all the underlying hw queues, one
> flag is maintained on a per-hw-queue basis. This flag is used to
> indicate whether this polling hw queue is currently being polled on or
> not. Every polling hw queue is exclusive to one polling instance, i.e.,
> the polling instance will skip this polling hw queue if this hw queue
> currently is being polled by another polling instance, and start
> polling on the next hw queue.
> 
> This per-hw-queue flag map is currently maintained in the dm layer. In
> the table load phase, a table describing all underlying polling hw
> queues is built and stored in 'struct dm_table'. It is safe when
> reloading the mapping table.
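A minimal sketch of that skip-if-busy idea (helper and flag names are
hypothetical; only the test_and_set_bit() pattern is the point):

static int dm_poll_one_hw_queue(struct blk_mq_hw_ctx *hctx,
				unsigned long *busy_flag)
{
	int ret;

	/* another polling instance owns this hw queue; skip it */
	if (test_and_set_bit(0, busy_flag))
		return 0;

	ret = poll_underlying_hw_queue(hctx);	/* hypothetical poll call */

	clear_bit(0, busy_flag);
	return ret;
}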
> 
> 
> changes since v1:
> - patch 1,2,4 is the same as v1 and have already been reviewed
> - patch 3 is refactored a bit on the basis of suggestions from
> Mike Snitzer.
> - patch 5 is newly added and introduces one new queue flag
> representing if the queue is capable of IO polling. This mainly
> simplifies the logic in queue_poll_store().

Ah OK, don't see why we want to eat a queue flag for that though!

> - patch 6 implements the core mechanism supporting IO polling.
> The sanity check of whether the dm device supports IO polling is
> also folded into this patch, and the queue flag will be cleared if
> it doesn't, in the case of table reloading.

Thanks,
Mike




[dm-devel] [git pull] device mapper fixes for 5.11-rc5

2021-01-22 Thread Mike Snitzer
Hi Linus,

The following changes since commit 19c329f6808995b142b3966301f217c831e7cf31:

  Linux 5.11-rc4 (2021-01-17 16:37:05 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.11/dm-fixes-2

for you to fetch changes up to 809b1e4945774c9ec5619a8f4e2189b7b3833c0c:

  dm: avoid filesystem lookup in dm_get_dev_t() (2021-01-21 15:06:45 -0500)

Please pull, thanks.
Mike


- Fix DM integrity crash if "recalculate" used without "internal_hash"

- Fix DM integrity "recalculate" support to prevent recalculating
  checksums if we use internal_hash or journal_hash with a key
  (e.g. HMAC). Use of crypto as a means to prevent malicious
  corruption requires further changes and was never a design goal for
dm-integrity's primary use case of detecting accidental corruption.

- Fix a benign dm-crypt copy-and-paste bug introduced as part of a
  fix that was merged for 5.11-rc4.

- Fix DM core's dm_get_device() to avoid filesystem lookup to get
  block device (if possible).


Hannes Reinecke (1):
  dm: avoid filesystem lookup in dm_get_dev_t()

Ignat Korchagin (1):
  dm crypt: fix copy and paste bug in crypt_alloc_req_aead

Mikulas Patocka (2):
  dm integrity: fix a crash if "recalculate" used without "internal_hash"
  dm integrity: conditionally disable "recalculate" feature

 .../admin-guide/device-mapper/dm-integrity.rst | 12 ++--
 drivers/md/dm-crypt.c  |  6 ++--
 drivers/md/dm-integrity.c  | 32 --
 drivers/md/dm-table.c  | 15 --
 4 files changed, 54 insertions(+), 11 deletions(-)




Re: [dm-devel] [PATCH 5/5] crypto: remove Salsa20 stream cipher algorithm

2021-01-21 Thread Mike Snitzer
On Thu, Jan 21 2021 at  1:09pm -0500,
Ard Biesheuvel  wrote:

> On Thu, 21 Jan 2021 at 19:05, Eric Biggers  wrote:
> >
> > On Thu, Jan 21, 2021 at 02:07:33PM +0100, Ard Biesheuvel wrote:
> > > Salsa20 is not used anywhere in the kernel, is not suitable for disk
> > > encryption, and widely considered to have been superseded by ChaCha20.
> > > So let's remove it.
> > >
> > > Signed-off-by: Ard Biesheuvel 
> > > ---
> > >  Documentation/admin-guide/device-mapper/dm-integrity.rst |4 +-
> > >  crypto/Kconfig   |   12 -
> > >  crypto/Makefile  |1 -
> > >  crypto/salsa20_generic.c |  212 
> > >  crypto/tcrypt.c  |   11 +-
> > >  crypto/testmgr.c |6 -
> > >  crypto/testmgr.h | 1162 
> > > 
> > >  7 files changed, 3 insertions(+), 1405 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst 
> > > b/Documentation/admin-guide/device-mapper/dm-integrity.rst
> > > index 4e6f504474ac..d56112e2e354 100644
> > > --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst
> > > +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst
> > > @@ -143,8 +143,8 @@ recalculate
> > >  journal_crypt:algorithm(:key)(the key is optional)
> > >   Encrypt the journal using given algorithm to make sure that the
> > >   attacker can't read the journal. You can use a block cipher here
> > > - (such as "cbc(aes)") or a stream cipher (for example "chacha20",
> > > - "salsa20" or "ctr(aes)").
> > > + (such as "cbc(aes)") or a stream cipher (for example "chacha20"
> > > + or "ctr(aes)").
> >
> > You should check with the dm-integrity maintainers how likely it is that 
> > people
> > are using salsa20 with dm-integrity.  It's possible that people are using 
> > it,
> > especially since the documentation says that dm-integrity can use a stream
> > cipher and specifically gives salsa20 as an example.
> >
> 
> Good point - cc'ed them now.
> 

No problem here; if others don't find utility in salsa20 then
dm-integrity certainly isn't the hold-out.

Acked-by:  Mike Snitzer 

Mike




Re: [dm-devel] [PATCH v2] dm: avoid filesystem lookup in dm_get_dev_t()

2021-01-21 Thread Mike Snitzer
On Thu, Jan 21 2021 at 12:53pm -0500,
Christoph Hellwig  wrote:

> Looks good,
> 
> Reviewed-by: Christoph Hellwig 
> 
> Mike, Jens - can we make sure this goes in before branching off the
> block branch for 5.12?  I have some work pending that would otherwise
> conflict.

Sure, I'll do my part to get this fix staged now and sent to Linus
(likely tomorrow) for 5.11-rc5.

Thanks,
Mike




Re: [dm-devel] dm: avoid filesystem lookup in dm_get_dev_t()

2021-01-21 Thread Mike Snitzer
On Thu, Jan 21 2021 at 10:02am -0500,
Martin Wilck  wrote:

> On Thu, 2020-12-10 at 18:11 +0100, Martin Wilck wrote:
> > On Thu, 2020-12-10 at 10:24 +0100, Hannes Reinecke wrote:
> > > dm_get_dev_t() is just used to convert an arbitrary 'path' string
> > > into a dev_t. It doesn't presume that the device is present; that
> > > check will be done later, as the only caller is dm_get_device(),
> > > which does a dm_get_table_device() later on, which will properly
> > > open the device.
> > > So if the path string already _is_ in major:minor representation
> > > we can convert it directly, avoiding a recursion into the
> > > filesystem
> > > to lookup the block device.
> > > This avoids a hang in multipath_message() when the filesystem is
> > > inaccessible.
> > > 
> > > Signed-off-by: Hannes Reinecke 
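The idea reduces to something like this sketch (not the exact patch):

static dev_t dm_parse_dev_t(const char *path)
{
	unsigned int maj, min;
	char dummy;

	/* accept "major:minor" directly, before any filesystem lookup */
	if (sscanf(path, "%u:%u%c", &maj, &min, &dummy) == 2)
		return MKDEV(maj, min);

	return 0;	/* caller falls back to filesystem lookup */
}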
> > 
> > Ack, although I believe the scsi/__GENKSYMS__ part doesn't belong
> > here.
> > 
> > Note that this is essentially a revert/fix of 644bda6f3460 ("dm
> > table:
> > fall back to getting device using name_to_dev_t()"). Added the author
> > of that patch to CC.
> 
> Mike, do you need anything more to apply this one? Do you want a
> cleaned-up resend?

It got hung up on Christoph correctly requesting more discussion; the
last linux-block/lkml mail on the associated thread I kicked off is here:
https://lkml.org/lkml/2020/12/23/76

Basically, if Hannes or yourself would like to review that thread and
send a relevant v2, it'd really help move this forward.  I'm bogged down
with too many competing tasks.  You guys may be able to act on this line
of development/fixing quicker than I'll get to it.

Thanks,
Mike




Re: [dm-devel] [RFC PATCH 00/37] block: introduce bio_init_fields()

2021-01-19 Thread Mike Snitzer
On Tue, Jan 19 2021 at 12:05am -0500,
Chaitanya Kulkarni  wrote:

> Hi,
> 
> This is a *compile only RFC* which adds a generic helper to initialize
> the various fields of the bio that is repeated all the places in
> file-systems, block layer, and drivers.
> 
> The new helper allows callers to initialize various members such as
> bdev, sector, private, end io callback, io priority, and write hints.
> 
> The objective of this RFC is only to start a discussion; it is not
> completely tested at all.
> 
> The following diff shows the code-level benefits of this helper:
>  38 files changed, 124 insertions(+), 236 deletions(-)


Please no... this is just obfuscation.

Adding yet another field to set would create a cascade of churn
throughout the kernel (and invariably many callers won't need the new
field initialized, so you keep passing 0 for more and more fields).
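To illustrate the churn (signature inferred from the cover letter's
field list, so purely hypothetical):

	bio_init_fields(bio, bdev, sector, private, end_io, ioprio, write_hint);

	/* after the next field gets added, every caller becomes: */
	bio_init_fields(bio, bdev, sector, private, end_io, ioprio, write_hint, 0);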

Nacked-by: Mike Snitzer 




[dm-devel] [git pull] device mapper fixes for 5.11-rc4

2021-01-15 Thread Mike Snitzer
Hi Linus,

The following changes since commit e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62:

  Linux 5.11-rc2 (2021-01-03 15:55:30 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.11/dm-fixes-1

for you to fetch changes up to c87a95dc28b1431c7e77e2c0c983cf37698089d2:

  dm crypt: defer decryption to a tasklet if interrupts disabled
  (2021-01-14 09:54:37 -0500)

Please pull, thanks.
Mike


- Fix DM-raid's raid1 discard limits so discards work.

- Select missing Kconfig dependencies for DM integrity and zoned
  targets.

- 4 fixes for DM crypt target's support to optionally bypass
  kcryptd workqueues.

- Fix DM snapshot merge supports missing data flushes before
  committing metadata.

- Fix DM integrity data device flushing when external metadata is
  used.

- Fix DM integrity's maximum number of supported constructor arguments
  that user can request when creating an integrity device.

- Eliminate DM core ioctl logging noise when an ioctl is issued
  without required CAP_SYS_RAWIO permission.


Akilesh Kailash (1):
  dm snapshot: flush merged data before committing metadata

Anthony Iliopoulos (1):
  dm integrity: select CRYPTO_SKCIPHER

Arnd Bergmann (1):
  dm zoned: select CONFIG_CRC32

Ignat Korchagin (4):
  dm crypt: do not wait for backlogged crypto request completion in softirq
  dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
  dm crypt: do not call bio_endio() from the dm-crypt tasklet
  dm crypt: defer decryption to a tasklet if interrupts disabled

Mike Snitzer (2):
  dm raid: fix discard limits for raid1
  dm: eliminate potential source of excessive kernel log noise

Mikulas Patocka (2):
  dm integrity: fix flush with external metadata device
  dm integrity: fix the maximum number of arguments

 drivers/md/Kconfig|   2 +
 drivers/md/dm-bufio.c |   6 ++
 drivers/md/dm-crypt.c | 170 +-
 drivers/md/dm-integrity.c |  62 +
 drivers/md/dm-raid.c  |   6 +-
 drivers/md/dm-snap.c  |  24 +++
 drivers/md/dm.c   |   2 +-
 include/linux/dm-bufio.h  |   1 +
 8 files changed, 239 insertions(+), 34 deletions(-)




Re: [dm-devel] [PATCH v3 5/6] dm: Verify inline encryption capabilities of new table when it is loaded

2021-01-14 Thread Mike Snitzer
On Tue, Dec 29 2020 at  3:55am -0500,
Satya Tangirala  wrote:

> DM only allows the table to be swapped if the new table's inline encryption
> capabilities are a superset of the old table's. We only check that this
> constraint is true when the table is actually swapped in (in
> dm_swap_table()). But this allows a user to load an unacceptable table
> without any complaint from DM, only for DM to throw an error when the
> device is resumed, and the table is swapped in.
> 
> This patch makes DM verify the inline encryption capabilities of the new
> table when the table is loaded. DM continues to verify and use the
> capabilities at the time of table swap, since the capabilities of
> underlying child devices can expand during the time between the table load
> and table swap (which in turn can cause the capabilities of this parent
> device to expand as well).
> 
> Signed-off-by: Satya Tangirala 
> ---
>  drivers/md/dm-ioctl.c |  8 
>  drivers/md/dm.c   | 25 +
>  drivers/md/dm.h   | 19 +++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index 5e306bba4375..055a3c745243 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1358,6 +1358,10 @@ static int table_load(struct file *filp, struct 
> dm_ioctl *param, size_t param_si
>   goto err_unlock_md_type;
>   }
>  
> + r = dm_verify_inline_encryption(md, t);
> + if (r)
> + goto err_unlock_md_type;
> +
>   if (dm_get_md_type(md) == DM_TYPE_NONE) {
>   /* Initial table load: acquire type of table. */
>   dm_set_md_type(md, dm_table_get_type(t));
> @@ -2115,6 +2119,10 @@ int __init dm_early_create(struct dm_ioctl *dmi,
>   if (r)
>   goto err_destroy_table;
>  
> + r = dm_verify_inline_encryption(md, t);
> + if (r)
> + goto err_destroy_table;
> +
>   md->type = dm_table_get_type(t);
>   /* setup md->queue to reflect md's type (may block) */
>   r = dm_setup_md_queue(md, t);
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index b8844171d8e4..04322de34d29 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2094,6 +2094,31 @@ dm_construct_keyslot_manager(struct mapped_device *md, 
> struct dm_table *t)
>   return ksm;
>  }
>  
> +/**
> + * dm_verify_inline_encryption() - Verifies that the current keyslot manager 
> of
> + *  the mapped_device can be replaced by the
> + *  keyslot manager of a given dm_table.
> + * @md: The mapped_device
> + * @t: The dm_table
> + *
> + * In particular, this function checks that the keyslot manager that will be
> + * constructed for the dm_table will support a superset of the capabilities 
> that
> + * the current keyslot manager of the mapped_device supports.
> + *
> + * Return: 0 if the table's keyslot_manager can replace the current keyslot
> + *  manager of the mapped_device. Negative value otherwise.
> + */
> +int dm_verify_inline_encryption(struct mapped_device *md, struct dm_table *t)
> +{
> + struct blk_keyslot_manager *ksm = dm_construct_keyslot_manager(md, t);
> +
> + if (IS_ERR(ksm))
> + return PTR_ERR(ksm);
> + dm_destroy_keyslot_manager(ksm);
> +
> + return 0;
> +}
> +
>  static void dm_update_keyslot_manager(struct request_queue *q,
> struct blk_keyslot_manager *ksm)
>  {


There shouldn't be any need to bolt on ksm verification using a
temporary ksm.  If you run with the suggestions I just provided in
review of patch 3, dm_table_complete()'s setup of the ksm should also
implicitly validate it.

So this patch, and the extra dm_verify_inline_encryption() interface,
shouldn't be needed.
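I.e. something of this shape (sketch only; the t->ksm field and call
placement are hypothetical):

int dm_table_complete(struct dm_table *t)
{
	struct blk_keyslot_manager *ksm;

	/* ... existing dm_table_complete() work ... */

	ksm = dm_construct_keyslot_manager(t->md, t);
	if (IS_ERR(ksm))
		return PTR_ERR(ksm);	/* capability shrink fails the load */
	t->ksm = ksm;

	return 0;
}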

Mike




Re: [dm-devel] [PATCH v3 3/6] dm: add support for passing through inline crypto support

2021-01-14 Thread Mike Snitzer
On Tue, Dec 29 2020 at  3:55am -0500,
Satya Tangirala  wrote:

> Update the device-mapper core to support exposing the inline crypto
> support of the underlying device(s) through the device-mapper device.
> 
> This works by creating a "passthrough keyslot manager" for the dm
> device, which declares support for encryption settings which all
> underlying devices support.  When a supported setting is used, the bio
> cloning code handles cloning the crypto context to the bios for all the
> underlying devices.  When an unsupported setting is used, the blk-crypto
> fallback is used as usual.
> 
> Crypto support on each underlying device is ignored unless the
> corresponding dm target opts into exposing it.  This is needed because
> for inline crypto to semantically operate on the original bio, the data
> must not be transformed by the dm target.  Thus, targets like dm-linear
> can expose crypto support of the underlying device, but targets like
> dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> 
> A DM device's table can only be changed if the "new" inline encryption
> capabilities are a (*not* necessarily strict) superset of the "old" inline
> encryption capabilities.  Attempts to make changes to the table that result
> in some inline encryption capability becoming no longer supported will be
> rejected.
> 
> For the sake of clarity, key eviction from underlying devices will be
> handled in a future patch.
> 
> Co-developed-by: Eric Biggers 
> Signed-off-by: Eric Biggers 
> Signed-off-by: Satya Tangirala 
> ---
>  drivers/md/dm.c | 164 +++-
>  include/linux/device-mapper.h   |   6 ++
>  include/linux/keyslot-manager.h |   8 ++
>  3 files changed, 177 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index b3c3c8b4cb42..13b9c8e2e21b 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DM_MSG_PREFIX "core"
>  
> @@ -1718,6 +1719,8 @@ static const struct dax_operations dm_dax_ops;
>  
>  static void dm_wq_work(struct work_struct *work);
>  
> +static void dm_destroy_inline_encryption(struct request_queue *q);
> +
>  static void cleanup_mapped_device(struct mapped_device *md)
>  {
>   if (md->wq)
> @@ -1739,8 +1742,10 @@ static void cleanup_mapped_device(struct mapped_device 
> *md)
>   put_disk(md->disk);
>   }
>  
> - if (md->queue)
> + if (md->queue) {
> + dm_destroy_inline_encryption(md->queue);
>   blk_cleanup_queue(md->queue);
> + }
>  
>   cleanup_srcu_struct(>io_barrier);
>  
> @@ -1937,6 +1942,150 @@ static void event_callback(void *context)
>   dm_issue_global_event();
>  }
>  
> +#ifdef CONFIG_BLK_INLINE_ENCRYPTION
> +
> +struct dm_keyslot_manager {
> + struct blk_keyslot_manager ksm;
> + struct mapped_device *md;
> +};
> +
> +static int device_intersect_crypto_modes(struct dm_target *ti,
> +  struct dm_dev *dev, sector_t start,
> +  sector_t len, void *data)
> +{
> + struct blk_keyslot_manager *parent = data;
> + struct blk_keyslot_manager *child = bdev_get_queue(dev->bdev)->ksm;
> +
> + blk_ksm_intersect_modes(parent, child);
> + return 0;
> +}
> +
> +static void dm_destroy_keyslot_manager(struct blk_keyslot_manager *ksm)
> +{
> + struct dm_keyslot_manager *dksm = container_of(ksm,
> +struct 
> dm_keyslot_manager,
> +ksm);
> +
> + if (!ksm)
> + return;
> +
> + blk_ksm_destroy(ksm);
> + kfree(dksm);
> +}
> +
> +/*
> + * Constructs and returns a keyslot manager that represents the crypto
> + * capabilities of the devices described by the dm_table. However, if the
> + * constructed keyslot manager does not support a superset of the crypto
> + * capabilities supported by the current keyslot manager of the 
> mapped_device,
> + * it returns an error instead, since we don't support restricting crypto
> + * capabilities on table changes. Finally, if the constructed keyslot manager
> + * doesn't actually support any crypto modes at all, it just returns NULL.
> + */
> +static struct blk_keyslot_manager *
> +dm_construct_keyslot_manager(struct mapped_device *md, struct dm_table *t)
> +{
> + struct dm_keyslot_manager *dksm;
> + struct blk_keyslot_manager *ksm;
> + struct dm_target *ti;
> + unsigned int i;
> + bool ksm_is_empty = true;
> +
> + dksm = kmalloc(sizeof(*dksm), GFP_KERNEL);
> + if (!dksm)
> + return ERR_PTR(-ENOMEM);
> + dksm->md = md;
> +
> + ksm = &dksm->ksm;
> + blk_ksm_init_passthrough(ksm);
> + ksm->max_dun_bytes_supported = UINT_MAX;
> + memset(ksm->crypto_modes_supported, 0xFF,
> +sizeof(ksm->crypto_modes_supported));
> +
> + for (i = 0; i < 

Re: [dm-devel] [PATCH v3 2/6] block: keyslot-manager: Introduce functions for device mapper support

2021-01-14 Thread Mike Snitzer
On Tue, Dec 29 2020 at  3:55am -0500,
Satya Tangirala  wrote:

> Introduce blk_ksm_update_capabilities() to update the capabilities of
> a keyslot manager (ksm) in-place. The pointer to a ksm in a device's
> request queue may not be easily replaced, because upper layers like
> the filesystem might access it (e.g. for programming keys/checking
> capabilities) at the same time the device wants to replace that
> request queue's ksm (and free the old ksm's memory). This function
> allows the device to update the capabilities of the ksm in its request
> queue directly.
> 
> Also introduce blk_ksm_is_superset() which checks whether one ksm's
> capabilities are a (not necessarily strict) superset of another ksm's.
> The blk-crypto framework requires that crypto capabilities that were
> advertised when a bio was created continue to be supported by the
> device until that bio is ended - in practice this probably means that
> a device's advertised crypto capabilities can *never* "shrink" (since
> there's no synchronization between bio creation and when a device may
> want to change its advertised capabilities) - so a previously
> advertised crypto capability must always continue to be supported.
> This function can be used to check that a new ksm is a valid
> replacement for an old ksm.
> 
> Signed-off-by: Satya Tangirala 
> ---
>  block/keyslot-manager.c | 91 +
>  include/linux/keyslot-manager.h |  9 
>  2 files changed, 100 insertions(+)
> 
> diff --git a/block/keyslot-manager.c b/block/keyslot-manager.c
> index ac7ce83a76e8..f13ab7410eca 100644
> --- a/block/keyslot-manager.c
> +++ b/block/keyslot-manager.c
> @@ -424,6 +424,97 @@ void blk_ksm_unregister(struct request_queue *q)
>   q->ksm = NULL;
>  }
>  
> +/**
> + * blk_ksm_intersect_modes() - restrict supported modes by child device
> + * @parent: The keyslot manager for parent device
> + * @child: The keyslot manager for child device, or NULL
> + *
> + * Clear any crypto mode support bits in @parent that aren't set in @child.
> + * If @child is NULL, then all parent bits are cleared.
> + *
> + * Only use this when setting up the keyslot manager for a layered device,
> + * before it's been exposed yet.
> + */
> +void blk_ksm_intersect_modes(struct blk_keyslot_manager *parent,
> +  const struct blk_keyslot_manager *child)
> +{
> + if (child) {
> + unsigned int i;
> +
> + parent->max_dun_bytes_supported =
> + min(parent->max_dun_bytes_supported,
> + child->max_dun_bytes_supported);
> + for (i = 0; i < ARRAY_SIZE(child->crypto_modes_supported);
> +  i++) {
> + parent->crypto_modes_supported[i] &=
> + child->crypto_modes_supported[i];
> + }
> + } else {
> + parent->max_dun_bytes_supported = 0;
> + memset(parent->crypto_modes_supported, 0,
> +sizeof(parent->crypto_modes_supported));
> + }
> +}
> +EXPORT_SYMBOL_GPL(blk_ksm_intersect_modes);
> +
> +/**
> + * blk_ksm_is_superset() - Check if a KSM supports a superset of crypto modes
> + *  and DUN bytes that another KSM supports. Here,
> + *  "superset" refers to the mathematical meaning of the
> + *  word - i.e. if two KSMs have the *same* capabilities,
> + *  they *are* considered supersets of each other.
> + * @ksm_superset: The KSM that we want to verify is a superset
> + * @ksm_subset: The KSM that we want to verify is a subset
> + *
> + * Return: True if @ksm_superset supports a superset of the crypto modes and 
> DUN
> + *  bytes that @ksm_subset supports.
> + */
> +bool blk_ksm_is_superset(struct blk_keyslot_manager *ksm_superset,
> +  struct blk_keyslot_manager *ksm_subset)
> +{
> + int i;
> +
> + if (!ksm_subset)
> + return true;
> +
> + if (!ksm_superset)
> + return false;
> +
> + for (i = 0; i < ARRAY_SIZE(ksm_superset->crypto_modes_supported); i++) {
> + if (ksm_subset->crypto_modes_supported[i] &
> + (~ksm_superset->crypto_modes_supported[i])) {
> + return false;
> + }
> + }
> +
> + if (ksm_subset->max_dun_bytes_supported >
> + ksm_superset->max_dun_bytes_supported) {
> + return false;
> + }
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(blk_ksm_is_superset);
> +
> +/**
> + * blk_ksm_update_capabilities() - Update the restrictions of a KSM to those 
> of
> + *  another KSM
> + * @target_ksm: The KSM whose restrictions to update.
> + * @reference_ksm: The KSM to whose restrictions this function will update
> + *  @target_ksm's restrictions to,
> + */
> +void blk_ksm_update_capabilities(struct blk_keyslot_manager *target_ksm,
> +

Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device

2021-01-14 Thread Mike Snitzer
On Thu, Jan 14 2021 at  4:16am -0500,
JeffleXu  wrote:

> 
> 
> On 1/13/21 12:13 AM, Mike Snitzer wrote:
> > On Tue, Jan 12 2021 at 12:46am -0500,
> > JeffleXu  wrote:
> > 
> >>
> >>
> >> On 1/9/21 1:26 AM, Mike Snitzer wrote:
> >>> On Thu, Jan 07 2021 at 10:08pm -0500,
> >>> JeffleXu  wrote:
> >>>
> >>>> Thanks for reviewing.
> >>>>
> >>>>
> >>>> On 1/8/21 6:18 AM, Mike Snitzer wrote:
> >>>>> On Wed, Dec 23 2020 at  6:26am -0500,
> >>>>> Jeffle Xu  wrote:
> >>>>>
> >>>>>> This is actually the core when supporting iopoll for bio-based device.
> >>>>>>
> >>>>>> A list is maintained in the top bio (the original bio submitted to dm
> >>>>>> device), which is used to maintain all valid cookies of split bios. The
> >>>>>> IO polling routine will actually iterate this list and poll on
> >>>>>> corresponding hardware queues of the underlying mq devices.
> >>>>>>
> >>>>>> Signed-off-by: Jeffle Xu 
> >>>>>
> >>>>> Like I said in response to patch 4 in this series: please fold patch 4
> >>>>> into this patch and _really_ improve this patch header.
> >>>>>
> >>>>> In particular, the (ab)use of bio_inc_remaining() needs be documented in
> >>>>> this patch header very well.
> >>>>>
> >>>>> But its use could easily be why you're seeing a performance hit (coupled
> >>>>> with the extra spinlock locking and list management used).  Just added
> >>>>> latency and contention across CPUs.
> >>>>
> >>>> Indeed bio_inc_remaining() is abused here and the code seems quite
> >>>> hacky.
> >>>>
> >>>> Actually I'm considering implementing the split bio tracking mechanism
> >>>> in the recursive way you had suggested. That is, the split bios could be
> >>>> maintained in an array, which is allocated with 'struct dm_io'. This way
> >>>> the overhead of the spinlock protecting the &bio->bi_plist may be omitted
> >>>> here. Also the lifetime management may be simplified somehow. But the
> >>>> block core needs to fetch the per-bio private data now, just like what
> >>>> you had suggested before.
> >>>>
> >>>> How do you think, Mike?
> >>>
> >>> Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').
> >>
> >> Agreed. Then MD will need some refactoring to support IO polling, if
> >> possible, since, just like I mentioned in patch 0 before, MD doesn't
> >> allocate an extra clone bio, and just re-uses the original bio structure.
> >>
> >>
> >>>
> >>> As for using an array, how would you index the array?  
> >>
> >> The 'array' here is not an array of 'struct blk_mq_hw_ctx *' maintained
> >> in struct dm_table as you mentioned. Actually what I mean is to maintain
> >> an array of struct dm_poll_data (or something like that, e.g. just
> >> struct blk_mq_hw_ctx *) in per-bio private data. The size of the array
> >> just equals the number of the target devices.
> >>
> >> For example, for the following device stack,
> >>
> >>>>
> >>>> Suppose we have the following device stack hierarchy, that is, dm0 is
> >>>> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
> >>>>
> >>>> dm0
> >>>> dm1
> >>>> nvme0  nvme1
> >>>>
> >>>>
> >>>> Then the bio graph is like:
> >>>>
> >>>>
> >>>>++
> >>>>|bio0(to dm0)|
> >>>>++
> >>>>  ^
> >>>>  | orig_bio
> >>>>++
> >>>>|struct dm_io A  |
> >>>> ++ bi_private  --
> >>>> |bio3(to dm1)|>|bio1(to dm1)|
> >>>> ++ ++

Re: [dm-devel] dm-integrity: fix the maximum number of arguments

2021-01-12 Thread Mike Snitzer
On Tue, Jan 12 2021 at  2:54pm -0500,
Mikulas Patocka  wrote:

> Increase the maximum number of arguments to 15.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org # v4.19+
> 
> ---
>  drivers/md/dm-integrity.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6/drivers/md/dm-integrity.c
> ===
> --- linux-2.6.orig/drivers/md/dm-integrity.c  2021-01-12 20:45:23.0 
> +0100
> +++ linux-2.6/drivers/md/dm-integrity.c   2021-01-12 20:46:15.0 
> +0100
> @@ -3792,7 +3792,7 @@ static int dm_integrity_ctr(struct dm_ta
>   unsigned extra_args;
>   struct dm_arg_set as;
>   static const struct dm_arg _args[] = {
> - {0, 9, "Invalid number of feature args"},
> + {0, 15, "Invalid number of feature args"},
>   };
>   unsigned journal_sectors, interleave_sectors, buffer_sectors, 
> journal_watermark, sync_msec;
>   bool should_write_sb;

Can you please expand on which args weren't accounted for?
Which commit introduced the problem? (I'd like a "Fixes:" reference)

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device

2021-01-12 Thread Mike Snitzer
On Tue, Jan 12 2021 at 12:46am -0500,
JeffleXu  wrote:

> 
> 
> On 1/9/21 1:26 AM, Mike Snitzer wrote:
> > On Thu, Jan 07 2021 at 10:08pm -0500,
> > JeffleXu  wrote:
> > 
> >> Thanks for reviewing.
> >>
> >>
> >> On 1/8/21 6:18 AM, Mike Snitzer wrote:
> >>> On Wed, Dec 23 2020 at  6:26am -0500,
> >>> Jeffle Xu  wrote:
> >>>
> >>>> This is actually the core of supporting iopoll for bio-based devices.
> >>>>
> >>>> A list is maintained in the top bio (the original bio submitted to dm
> >>>> device), which is used to maintain all valid cookies of split bios. The
> >>>> IO polling routine will actually iterate this list and poll on
> >>>> corresponding hardware queues of the underlying mq devices.
> >>>>
> >>>> Signed-off-by: Jeffle Xu 
> >>>
> >>> Like I said in response to patch 4 in this series: please fold patch 4
> >>> into this patch and _really_ improve this patch header.
> >>>
> >>> In particular, the (ab)use of bio_inc_remaining() needs to be documented in
> >>> this patch header very well.
> >>>
> >>> But its use could easily be why you're seeing a performance hit (coupled
> >>> with the extra spinlock locking and list management used).  Just added
> >>> latency and contention across CPUs.
> >>
> >> Indeed bio_inc_remaining() is abused here and the code seems quite hacky
> >> here.
> >>
> >> Actually I'm considering implementing the split bio tracking mechanism in
> >> the recursive way you once suggested. That is, the split bios could be
> >> maintained in an array, which is allocated with 'struct dm_io'. This way
> >> the overhead of the spinlock protecting &bio->bi_plist may be omitted
> >> here. Also the lifetime management may be simplified somehow. But the
> >> block core needs to fetch the per-bio private data now, just like what
> >> you suggested before.
> >>
> >> What do you think, Mike?
> > 
> > Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').
> 
> Agreed. Then MD will need some refactoring to support IO polling, if
> possible, since just like I mentioned in patch 0 before, MD doesn't
> allocate extra clone bio, and just re-uses the original bio structure.
> 
> 
> > 
> > As for using an array, how would you index the array?  
> 
> The 'array' here is not an array of 'struct blk_mq_hw_ctx *' maintained
> in struct dm_table as you mentioned. Actually what I mean is to maintain
> an array of struct dm_poll_data (or something like that, e.g. just
> struct blk_mq_hw_ctx *) in per-bio private data. The size of the array
> just equals the number of the target devices.
> 
> For example, for the following device stack,
> 
> >>
> >> Suppose we have the following device stack hierarchy, that is, dm0 is
> >> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
> >>
> >> dm0
> >> dm1
> >> nvme0  nvme1
> >>
> >>
> >> Then the bio graph is like:
> >>
> >>
> >>++
> >>|bio0(to dm0)|
> >>++
> >>  ^
> >>  | orig_bio
> >>++
> >>|struct dm_io A  |
> >> ++ bi_private  --
> >> |bio3(to dm1)|>|bio1(to dm1)|
> >> ++ ++
> >> ^^
> >> | ->orig_bio | ->orig_bio
> >> ++ ++
> >> |struct dm_io| |struct dm_io B  |
> >> -- --
> >> |bio2(to nvme0)  | |bio4(to nvme1)  |
> >> ++ ++
> >>
> 
> An array of struct blk_mq_hw_ctx * is maintained in struct dm_io B.
> 
> 
> struct blk_mq_hw_ctx * hctxs[2];
> 
> The array size is two since dm1 maps to two target devices (i.e. nvme0
> and nvme1). Then hctxs[0] points to the hw queue of nvme0, while
> hctxs[1] points to the hw queue of nvme1.

Both

Re: [dm-devel] dm-integrity: Fix flush with external metadata device

2021-01-08 Thread Mike Snitzer
On Fri, Jan 08 2021 at 11:12am -0500,
Mikulas Patocka  wrote:

> 
> 
> On Mon, 4 Jan 2021, Mike Snitzer wrote:
> 
> > On Sun, Dec 20 2020 at  8:02am -0500,
> > Lukas Straub  wrote:
> > 
> > > With an external metadata device, flush requests aren't passed down
> > > to the data device.
> > > 
> > > Fix this by issuing flush in the right places: In integrity_commit
> > > when not in journal mode, in do_journal_write after writing the
> > > contents of the journal to the disk and in dm_integrity_postsuspend.
> > > 
> > > Signed-off-by: Lukas Straub 
> > > ---
> > >  drivers/md/dm-integrity.c | 8 
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
> > > index 5a7a1b90e671..a26ed65869f6 100644
> > > --- a/drivers/md/dm-integrity.c
> > > +++ b/drivers/md/dm-integrity.c
> > > @@ -2196,6 +2196,8 @@ static void integrity_commit(struct work_struct *w)
> > >   if (unlikely(ic->mode != 'J')) {
> > >   spin_unlock_irq(&ic->endio_wait.lock);
> > >   dm_integrity_flush_buffers(ic);
> > > + if (ic->meta_dev)
> > > + blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
> > >   goto release_flush_bios;
> > >   }
> > >  
> > > @@ -2410,6 +2412,9 @@ static void do_journal_write(struct dm_integrity_c 
> > > *ic, unsigned write_start,
> > >   wait_for_completion_io(&comp.comp);
> > >  
> > >   dm_integrity_flush_buffers(ic);
> > > + if (ic->meta_dev)
> > > + blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
> > > +
> > >  }
> > >  
> > >  static void integrity_writer(struct work_struct *w)
> > > @@ -2949,6 +2954,9 @@ static void dm_integrity_postsuspend(struct 
> > > dm_target *ti)
> > >  #endif
> > >   }
> > >  
> > > + if (ic->meta_dev)
> > > + blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
> > > +
> > >   BUG_ON(!RB_EMPTY_ROOT(&ic->in_progress));
> > >  
> > >   ic->journal_uptodate = true;
> > > -- 
> > > 2.20.1
> > 
> > 
> > Seems like a pretty bad oversight... but shouldn't you also make sure to
> > flush the data device _before_ the metadata is flushed?
> > 
> > Mike
> 
> I think, ordering is not a problem.
> 
> A disk may flush its cache spontaneously anytime, so it doesn't matter in
> which order we flush them. Similarly a dm-bufio buffer may be flushed
> anytime - if the machine is running out of memory and a dm-bufio shrinker 
> is called.
> 
> I'll send another patch for this - I've created a patch that flushes the 
> metadata device cache and data device cache in parallel, so that 
> performance degradation is reduced.
> 
> My patch also doesn't use GFP_NOIO allocation - which can in theory 
> deadlock if we are swapping on dm-integrity device.

OK, I see your patch, but my concern about ordering was more to do with
crash consistency.  What if metadata is updated but data isn't?

On the surface, your approach of issuing the flushes in parallel seems
to expose us to inconsistency upon recovery from a crash.
If that isn't the case please explain why not.
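
For the ordering argued for above, a minimal sketch reusing the helpers
from the patch (an illustration of the concern, not a proposed fix):

/*
 * Sketch: make the data durable before the metadata that describes it,
 * so a crash between the two flushes can only leave metadata pointing
 * at already-flushed data.
 */
static void flush_data_then_metadata(struct dm_integrity_c *ic)
{
	if (ic->meta_dev)
		blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);	/* data device */
	dm_integrity_flush_buffers(ic);				/* metadata */
}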

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH RFC 3/7] block: add iopoll method for non-mq device

2021-01-08 Thread Mike Snitzer
On Thu, Jan 07 2021 at 10:24pm -0500,
JeffleXu  wrote:

> 
> 
> On 1/8/21 5:47 AM, Mike Snitzer wrote:
> > On Wed, Dec 23 2020 at  6:26am -0500,
> > Jeffle Xu  wrote:
> > 
> >> ->poll_fn is introduced in commit ea435e1b9392 ("block: add a poll_fn
> >> callback to struct request_queue") for supporting non-mq queues such as
> >> nvme multipath, but removed in commit 529262d56dbe ("block: remove
> >> ->poll_fn").
> >>
> >> To add support for IO polling for non-mq devices, this method needs to
> >> be brought back. Since commit c62b37d96b6e ("block: move ->make_request_fn to
> >> struct block_device_operations") has moved all callbacks into struct
> >> block_device_operations in gendisk, we also add the new method named
> >> ->iopoll in block_device_operations.
> > 
> > Please update patch subject and header to:
> > 
> > block: add iopoll method to support bio-based IO polling
> > 
> > ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> > callback to struct request_queue") to support bio-based queues such as
> > nvme multipath, but was later removed in commit 529262d56dbe ("block:
> > remove ->poll_fn").
> > 
> > Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> > block_device_operations") restore the possibility of bio-based IO
> > polling support by adding an ->iopoll method to gendisk->fops.
> > Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
> > blk_mq_poll() to blk-mq specific code that is called from blk_poll().
> > 
> >> Signed-off-by: Jeffle Xu 
> >> ---
> >>  block/blk-core.c   | 79 ++
> >>  block/blk-mq.c | 70 +
> >>  include/linux/blk-mq.h |  3 ++
> >>  include/linux/blkdev.h |  1 +
> >>  4 files changed, 92 insertions(+), 61 deletions(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 96e5fcd7f071..2f5c51ce32e3 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
> >>  }
> >>  EXPORT_SYMBOL(submit_bio);
> >>  
> >> +static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
> >> +{
> >> +  struct blk_mq_hw_ctx *hctx;
> >> +
> >> +  /* TODO: bio-based device doesn't support hybrid poll. */
> >> +  if (!queue_is_mq(q))
> >> +  return false;
> >> +
> >> +  hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >> +  if (blk_mq_poll_hybrid(q, hctx, cookie))
> >> +  return true;
> >> +
> >> +  hctx->poll_considered++;
> >> +  return false;
> >> +}
> > 
> > I don't see where you ever backfill bio-based hybrid support in
> > the following patches in this series, so it is a lingering TODO.
> 
> Yes the hybrid polling is not implemented and thus bypassed for
> bio-based device currently.
> 
> > 
> >> +
> >> +/**
> >> + * blk_poll - poll for IO completions
> >> + * @q:  the queue
> >> + * @cookie: cookie passed back at IO submission time
> >> + * @spin: whether to spin for completions
> >> + *
> >> + * Description:
> >> + *Poll for completions on the passed in queue. Returns number of
> >> + *completed entries found. If @spin is true, then blk_poll will 
> >> continue
> >> + *looping until at least one completion is found, unless the task is
> >> + *otherwise marked running (or we need to reschedule).
> >> + */
> >> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >> +{
> >> +  long state;
> >> +
> >> +  if (!blk_qc_t_valid(cookie) ||
> >> +  !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> >> +  return 0;
> >> +
> >> +  if (current->plug)
> >> +  blk_flush_plug_list(current->plug, false);
> >> +
> >> +  /*
> >> +   * If we sleep, have the caller restart the poll loop to reset
> >> +   * the state. Like for the other success return cases, the
> >> +   * caller is responsible for checking if the IO completed. If
> >> +   * the IO isn't complete, we'll get called again and will go
> >> +   * straight to the busy poll loop. If specified not to spin,
> >> +   * we also should n

Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device

2021-01-08 Thread Mike Snitzer
On Thu, Jan 07 2021 at 10:08pm -0500,
JeffleXu  wrote:

> Thanks for reviewing.
> 
> 
> On 1/8/21 6:18 AM, Mike Snitzer wrote:
> > On Wed, Dec 23 2020 at  6:26am -0500,
> > Jeffle Xu  wrote:
> > 
> >> This is actually the core of supporting iopoll for bio-based devices.
> >>
> >> A list is maintained in the top bio (the original bio submitted to dm
> >> device), which is used to maintain all valid cookies of split bios. The
> >> IO polling routine will actually iterate this list and poll on
> >> corresponding hardware queues of the underlying mq devices.
> >>
> >> Signed-off-by: Jeffle Xu 
> > 
> > Like I said in response to patch 4 in this series: please fold patch 4
> > into this patch and _really_ improve this patch header.
> > 
> > In particular, the (ab)use of bio_inc_remaining() needs to be documented in
> > this patch header very well.
> > 
> > But its use could easily be why you're seeing a performance hit (coupled
> > with the extra spinlock locking and list management used).  Just added
> > latency and contention across CPUs.
> 
> Indeed bio_inc_remaining() is abused here and the code seems quite hacky
> here.
> 
> Actually I'm considering implementing the split bio tracking mechanism in
> the recursive way you once suggested. That is, the split bios could be
> maintained in an array, which is allocated with 'struct dm_io'. This way
> the overhead of the spinlock protecting &bio->bi_plist may be omitted
> here. Also the lifetime management may be simplified somehow. But the
> block core needs to fetch the per-bio private data now, just like what
> you suggested before.
> 
> What do you think, Mike?

Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').

As for using an array, how would you index the array?  blk-mq is able to
use an array (with cookie to hctx index translation) because there are a
finite number of fixed hctx for the life of the device.  But with
stacked bio-based DM devices, each DM table associated with a DM device
can change via table reload.  Any reloads should flush outstanding IO,
but there are cases where no flushing occurs (e.g. dm-multipath when no
paths are available, _but_ in that instance, there wouldn't be any
mapping that results in a blk-mq hctx endpoint).

All the DM edge cases aside, you need to ensure that the lifetime of the
per-bio-data that holds the 'struct node' (that you correctly detailed
needing below) doesn't somehow get used _after_ the hctx and/or cookie
are no longer valid.  So to start we'll need some BUG_ON() to validate
the lifetime is correct.

> Besides, the lifetime management is quite annoying to me. As long as the
> tracking object (representing a valid split bio) is dynamically
> allocated, no matter it's embedded directly in 'struct bio' (in this
> patch), or allocated with 'struct dm_io', the lifetime management of the
> tracking object comes in. Here the so called tracking object is
> something like
> 
> struct node {
> struct blk_mq_hw_ctx *hctx;
> blk_qc_t cookie;
> };

Needs a better name; I think I had 'struct dm_poll_data'
 
> Actually currently the tracking objects are all allocated with 'struct
> bio', then the lifetime management of the tracking objects is actually
> equivalent to lifetime management of bio. Since the returned cookie is
> actually a pointer to the bio, the refcount of this bio must be
> incremented, since we release a reference to this bio through the
> returned cookie, in which case the abuse of the refcount trick seems
> unavoidable? Unless we allocate the tracking object individually, then
> the returned cookie is actually pointing to the tracking object, and the
> refcount is individually maintained for the tracking object.

The refcounting and lifetime of the per-bio-data should all work as is.
Would hope you can avoid extra bio_inc_remaining().. that infrastructure
is way too tightly coupled to bio_chain()'ing, etc.

The challenge you have is the array that would point at these various
per-bio-data needs to be rooted somewhere (you put it in the topmost
original bio with the current patchset).  But why not manage that as
part of 'struct mapped_device'?  It'd need proper management at DM table
reload boundaries and such but it seems like the most logical place to
put the array.  But again, this array needs to be dynamic.. so thinking
further, maybe a better model would be to have a fixed array in 'struct
dm_table' for each hctx associated with a blk_mq _data_ device directly
used/managed by that dm_table?

And ideally, access to these arrays should be as lockless as possible
(rcu, or whatever) so that scaling to many cpus isn't a problem.
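
To make the shape of this concrete, a rough sketch of the per-bio-data
entry plus the per-table array discussed above ('dm_poll_data' was only
a suggested name; everything here is illustrative, not code from the
series):

/* One entry per split bio that reached a blk-mq bottom device. */
struct dm_poll_data {
	struct blk_mq_hw_ctx	*hctx;
	blk_qc_t		cookie;
};

/*
 * Rooted in the dm_table rather than in the topmost bio: one slot per
 * blk-mq data device the table directly maps to, sized at table load
 * time and only valid while that table is live (hence the lifetime
 * checks suggested above).
 */
struct dm_table_poll {
	unsigned int		nr_entries;
	struct dm_poll_data	entries[];
};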

> >> ---
> >>  block/bio.c   |  8 
> >>  blo

Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device

2021-01-07 Thread Mike Snitzer
On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu  wrote:

> This is actually the core of supporting iopoll for bio-based devices.
> 
> A list is maintained in the top bio (the original bio submitted to dm
> device), which is used to maintain all valid cookies of split bios. The
> IO polling routine will actually iterate this list and poll on
> corresponding hardware queues of the underlying mq devices.
> 
> Signed-off-by: Jeffle Xu 

Like I said in response to patch 4 in this series: please fold patch 4
into this patch and _really_ improve this patch header.

In particular, the (ab)use of bio_inc_remaining() needs to be documented in
this patch header very well.

But its use could easily be why you're seeing a performance hit (coupled
with the extra spinlock locking and list management used).  Just added
latency and contention across CPUs.

> ---
>  block/bio.c   |  8 
>  block/blk-core.c  | 84 ++-
>  include/linux/blk_types.h | 39 ++
>  3 files changed, 129 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..ca6d1a7ee196 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -284,6 +284,10 @@ void bio_init(struct bio *bio, struct bio_vec *table,
>  
>   bio->bi_io_vec = table;
>   bio->bi_max_vecs = max_vecs;
> +
> + INIT_LIST_HEAD(&bio->bi_plist);
> + INIT_LIST_HEAD(&bio->bi_pnode);
> + spin_lock_init(&bio->bi_plock);
>  }
>  EXPORT_SYMBOL(bio_init);
>  
> @@ -689,6 +693,7 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>   bio->bi_write_hint = bio_src->bi_write_hint;
>   bio->bi_iter = bio_src->bi_iter;
>   bio->bi_io_vec = bio_src->bi_io_vec;
> + bio->bi_root = bio_src->bi_root;
>  
>   bio_clone_blkg_association(bio, bio_src);
>   blkcg_bio_issue_init(bio);
> @@ -1425,6 +1430,8 @@ void bio_endio(struct bio *bio)
>   if (bio->bi_disk)
>   rq_qos_done_bio(bio->bi_disk->queue, bio);
>  
> + bio_del_poll_list(bio);
> +
>   /*
>* Need to have a real endio function for chained bios, otherwise
>* various corner cases will break (like stacking block devices that
> @@ -1446,6 +1453,7 @@ void bio_endio(struct bio *bio)
>   blk_throtl_bio_endio(bio);
>   /* release cgroup info */
>   bio_uninit(bio);
> +
>   if (bio->bi_end_io)
>   bio->bi_end_io(bio);
>  }
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2f5c51ce32e3..5a332af01939 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -960,12 +960,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  {
>   struct bio_list bio_list_on_stack[2];
>   blk_qc_t ret = BLK_QC_T_NONE;
> + bool iopoll;
> + struct bio *root;
>  
>   BUG_ON(bio->bi_next);
>  
>   bio_list_init(&bio_list_on_stack[0]);
>   current->bio_list = bio_list_on_stack;
>  
> + iopoll = test_bit(QUEUE_FLAG_POLL, &bio->bi_disk->queue->queue_flags);
> + iopoll = iopoll && (bio->bi_opf & REQ_HIPRI);
> +
> + if (iopoll) {
> + bio->bi_root = root = bio;
> + /*
> +  * We need to pin root bio here since there's a reference from
> +  * the returned cookie. bio_get() is not enough since the whole
> +  * bio and the corresponding kiocb/dio may have already
> +  * completed and thus won't call blk_poll() at all, in which
> +  * case the pairing bio_put() in blk_bio_poll() won't be called.
> +  * The side effect of bio_inc_remaining() is that, the whole bio
> +  * won't complete until blk_poll() called.
> +  */
> + bio_inc_remaining(root);
> + }
> +
>   do {
>   struct request_queue *q = bio->bi_disk->queue;
>   struct bio_list lower, same;
> @@ -979,7 +998,18 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>   bio_list_on_stack[1] = bio_list_on_stack[0];
>   bio_list_init(&bio_list_on_stack[0]);
>  
> - ret = __submit_bio(bio);
> + if (iopoll) {
> + /* See the comments of above bio_inc_remaining(). */
> + bio_inc_remaining(bio);
> + bio->bi_cookie = __submit_bio(bio);
> +
> + if (blk_qc_t_valid(bio->bi_cookie))
> + bio_add_poll_list(bio);
> +
> + bio_endio(bio);
> + } else {
> + ret = __submit_bio(bio);
> + }
>  
>   /*
>* Sort new bios into those for a lower level and those for the
> @@ -1002,7 +1032,11 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>   } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>  
>   current->bio_list = NULL;
> - return ret;
> +
> + if (iopoll)
> + return (blk_qc_t)root;
> +
> + return BLK_QC_T_NONE;
>  }
>  
>  static blk_qc_t 

Re: [dm-devel] [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device

2021-01-07 Thread Mike Snitzer
On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu  wrote:

> Currently the returned cookie of bio-based device is not used at all.
> 
> In the following patches, bio-based device will actually return a
> pointer to a specific object as the returned cookie.

In the following patch, ...

> Signed-off-by: Jeffle Xu 

Reviewed-by: Mike Snitzer 

> ---
>  drivers/md/dm.c | 26 ++
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 5b2f371ec4bb..03c2b867acaa 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1252,14 +1252,13 @@ void dm_accept_partial_bio(struct bio *bio, unsigned 
> n_sectors)
>  }
>  EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
>  
> -static blk_qc_t __map_bio(struct dm_target_io *tio)
> +static void __map_bio(struct dm_target_io *tio)
>  {
>   int r;
>   sector_t sector;
>   struct bio *clone = >clone;
>   struct dm_io *io = tio->io;
>   struct dm_target *ti = tio->ti;
> - blk_qc_t ret = BLK_QC_T_NONE;
>  
>   clone->bi_end_io = clone_endio;
>  
> @@ -1278,7 +1277,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
>   case DM_MAPIO_REMAPPED:
>   /* the bio has been remapped so dispatch it */
>   trace_block_bio_remap(clone, bio_dev(io->orig_bio), sector);
> - ret = submit_bio_noacct(clone);
> + submit_bio_noacct(clone);
>   break;
>   case DM_MAPIO_KILL:
>   free_tio(tio);
> @@ -1292,8 +1291,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
>   DMWARN("unimplemented target map return value: %d", r);
>   BUG();
>   }
> -
> - return ret;
>  }
>  
>  static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
> @@ -1380,7 +1377,7 @@ static void alloc_multiple_bios(struct bio_list *blist, 
> struct clone_info *ci,
>   }
>  }
>  
> -static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
> +static void __clone_and_map_simple_bio(struct clone_info *ci,
>  struct dm_target_io *tio, unsigned 
> *len)
>  {
>   struct bio *clone = >clone;
> @@ -1391,7 +1388,7 @@ static blk_qc_t __clone_and_map_simple_bio(struct 
> clone_info *ci,
>   if (len)
>   bio_setup_sector(clone, ci->sector, *len);
>  
> - return __map_bio(tio);
> + __map_bio(tio);
>  }
>  
>  static void __send_duplicate_bios(struct clone_info *ci, struct dm_target 
> *ti,
> @@ -1405,7 +1402,7 @@ static void __send_duplicate_bios(struct clone_info 
> *ci, struct dm_target *ti,
>  
>   while ((bio = bio_list_pop(&blist))) {
>   tio = container_of(bio, struct dm_target_io, clone);
> - (void) __clone_and_map_simple_bio(ci, tio, len);
> + __clone_and_map_simple_bio(ci, tio, len);
>   }
>  }
>  
> @@ -1450,7 +1447,7 @@ static int __clone_and_map_data_bio(struct clone_info 
> *ci, struct dm_target *ti,
>   free_tio(tio);
>   return r;
>   }
> - (void) __map_bio(tio);
> + __map_bio(tio);
>  
>   return 0;
>  }
> @@ -1565,11 +1562,10 @@ static void init_clone_info(struct clone_info *ci, 
> struct mapped_device *md,
>  /*
>   * Entry point to split a bio into clones and submit them to the targets.
>   */
> -static blk_qc_t __split_and_process_bio(struct mapped_device *md,
> +static void __split_and_process_bio(struct mapped_device *md,
>   struct dm_table *map, struct bio *bio)
>  {
>   struct clone_info ci;
> - blk_qc_t ret = BLK_QC_T_NONE;
>   int error = 0;
>  
>   init_clone_info(&ci, md, map, bio);
> @@ -1613,7 +1609,7 @@ static blk_qc_t __split_and_process_bio(struct 
> mapped_device *md,
>  
>   bio_chain(b, bio);
>   trace_block_split(b, bio->bi_iter.bi_sector);
> - ret = submit_bio_noacct(bio);
> + submit_bio_noacct(bio);
>   break;
>   }
>   }
> @@ -1621,13 +1617,11 @@ static blk_qc_t __split_and_process_bio(struct 
> mapped_device *md,
>  
>   /* drop the extra reference count */
>   dec_pending(ci.io, errno_to_blk_status(error));
> - return ret;
>  }
>  
>  static blk_qc_t dm_submit_bio(struct bio *bio)
>  {
>   struct mapped_device *md = bio->bi_disk->private_data;
> - blk_qc_t ret = BLK_QC_T_NONE;
>   int srcu_idx;
>   struct dm_table *map;
>  
> @@ -16

Re: [dm-devel] [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t

2021-01-07 Thread Mike Snitzer
On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu  wrote:

> To support iopoll for bio-based device, the returned cookie is actually
> a pointer to an implementation specific object, e.g. an object
> maintaining all split bios.
> 
> In such case, blk_qc_t should be large enough to contain one pointer,
> for which uintptr_t is used here. Since uintptr_t is actually an integer
> type in essence, there's no need for type casting in the original mq
> path, while type casting is indeed needed in the bio-based routine.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  include/linux/types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/types.h b/include/linux/types.h
> index da5ca7e1bea9..f6301014a459 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -126,7 +126,7 @@ typedef u64 sector_t;
>  typedef u64 blkcnt_t;
>  
>  /* cookie used for IO polling */
> -typedef unsigned int blk_qc_t;
> +typedef uintptr_t blk_qc_t;
>  
>  /*
>   * The type of an index into the pagecache.

I'd just fold this into patch 6.. not seeing benefit to having this be
separate.

Patch 6's header needs a lot more detail anyway so..
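
For illustration, the round trip that a pointer-sized blk_qc_t enables
in the bio-based path (a sketch of the series' convention, with
hypothetical helper names):

/* Submission side: hand the root bio back as the polling cookie. */
static inline blk_qc_t bio_to_qc(struct bio *root)
{
	return (blk_qc_t)(uintptr_t)root;
}

/* Poll side: recover the root bio from the cookie. */
static inline struct bio *qc_to_bio(blk_qc_t cookie)
{
	return (struct bio *)(uintptr_t)cookie;
}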

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH RFC 3/7] block: add iopoll method for non-mq device

2021-01-07 Thread Mike Snitzer
On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu  wrote:

> ->poll_fn is introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") for supporting non-mq queues such as
> nvme multipath, but removed in commit 529262d56dbe ("block: remove
> ->poll_fn").
> 
> To add support for IO polling for non-mq devices, this method needs to
> be brought back. Since commit c62b37d96b6e ("block: move ->make_request_fn to
> struct block_device_operations") has moved all callbacks into struct
> block_device_operations in gendisk, we also add the new method named
> ->iopoll in block_device_operations.

Please update patch subject and header to:

block: add iopoll method to support bio-based IO polling

->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
callback to struct request_queue") to support bio-based queues such as
nvme multipath, but was later removed in commit 529262d56dbe ("block:
remove ->poll_fn").

Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
block_device_operations") restore the possibility of bio-based IO
polling support by adding an ->iopoll method to gendisk->fops.
Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
blk_mq_poll() to blk-mq specific code that is called from blk_poll().

> Signed-off-by: Jeffle Xu 
> ---
>  block/blk-core.c   | 79 ++
>  block/blk-mq.c | 70 +
>  include/linux/blk-mq.h |  3 ++
>  include/linux/blkdev.h |  1 +
>  4 files changed, 92 insertions(+), 61 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 96e5fcd7f071..2f5c51ce32e3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
> +{
> + struct blk_mq_hw_ctx *hctx;
> +
> + /* TODO: bio-based device doesn't support hybrid poll. */
> + if (!queue_is_mq(q))
> + return false;
> +
> + hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> + if (blk_mq_poll_hybrid(q, hctx, cookie))
> + return true;
> +
> + hctx->poll_considered++;
> + return false;
> +}

I don't see where you ever backfill bio-based hybrid support in
the following patches in this series, so it is a lingering TODO.

> +
> +/**
> + * blk_poll - poll for IO completions
> + * @q:  the queue
> + * @cookie: cookie passed back at IO submission time
> + * @spin: whether to spin for completions
> + *
> + * Description:
> + *Poll for completions on the passed in queue. Returns number of
> + *completed entries found. If @spin is true, then blk_poll will continue
> + *looping until at least one completion is found, unless the task is
> + *otherwise marked running (or we need to reschedule).
> + */
> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> + long state;
> +
> + if (!blk_qc_t_valid(cookie) ||
> + !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> + return 0;
> +
> + if (current->plug)
> + blk_flush_plug_list(current->plug, false);
> +
> + /*
> +  * If we sleep, have the caller restart the poll loop to reset
> +  * the state. Like for the other success return cases, the
> +  * caller is responsible for checking if the IO completed. If
> +  * the IO isn't complete, we'll get called again and will go
> +  * straight to the busy poll loop. If specified not to spin,
> +  * we also should not sleep.
> +  */
> + if (spin && blk_poll_hybrid(q, cookie))
> + return 1;
> +
> + state = current->state;
> + do {
> + int ret;
> + struct gendisk *disk = queue_to_disk(q);
> +
> + if (disk->fops->iopoll)
> + ret = disk->fops->iopoll(q, cookie);
> + else
> + ret = blk_mq_poll(q, cookie);

Really don't like that blk-mq is needlessly getting gendisk and checking
disk->fops->iopoll.

This is just to give an idea, whitespace damaged due to coding in mail
client, but why not remove above blk_poll_hybrid() and do:

struct blk_mq_hw_ctx *hctx = NULL;
struct gendisk *disk = NULL;
...

if (queue_is_mq(q)) {
/*
 * If we sleep, have the caller restart the poll loop to reset
 * the state. Like for the other success return cases, the
 * caller is responsible for checking if the IO completed. If
 * the IO isn't complete, we'll get called again and will go
 * straight to the busy poll loop. If specified not to spin,
 * we also should not sleep.
 */
hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
return 1;
hctx->poll_considered++;   
} else {
disk = queue_to_disk(q);
}

do {
int ret;

if 

Re: [dm-devel] [PATCH RFC 2/7] block: add helper function fetching gendisk from queue

2021-01-07 Thread Mike Snitzer
On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu  wrote:

> Sometimes we need to get the corresponding gendisk from request_queue.
> 
> One such use case is that, the block device driver had ever stored the
> same private data both in queue->queuedata and gendisk->private_data,
> while nowadays gendisk->private_data is more preferable in such case,
> e.g. commit c4a59c4e5db3 ("dm: stop using ->queuedata"). So if only
> request_queue given, we need to get the corresponding gendisk from
> queue, to get the private data stored in gendisk.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  include/linux/blkdev.h   | 2 ++
>  include/trace/events/kyber.h | 6 +++---
>  2 files changed, 5 insertions(+), 3 deletions(-)

Looks good, but please update the patch subject and header to be:

block: add queue_to_disk() to get gendisk from request_queue

Sometimes we need to get the corresponding gendisk from request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only request_queue is given, we need to get its corresponding
gendisk to get the private data stored in that gendisk.

Signed-off-by: Jeffle Xu 
Reviewed-by: Mike Snitzer 
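
For reference, a sketch of what such a helper can look like, assuming
the request_queue's kobject parent is the gendisk's device (which is
what the patch relies on):

/* Get the gendisk that owns a request_queue. */
#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))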

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h

2021-01-07 Thread Mike Snitzer
On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu  wrote:

> So that kiocb.ki_cookie can be defined as blk_qc_t, which will enforce
> the encapsulation.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  include/linux/blk_types.h | 2 +-
>  include/linux/fs.h| 2 +-
>  include/linux/types.h | 3 +++
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..2e05244fc16d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -532,7 +532,7 @@ static inline int op_stat_group(unsigned int op)
>   return op_is_write(op);
>  }
>  
> -typedef unsigned int blk_qc_t;
> +/* Macros for blk_qc_t */
> #define BLK_QC_T_NONE	-1U
> #define BLK_QC_T_SHIFT	16
> #define BLK_QC_T_INTERNAL	(1U << 31)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ad4cf1bae586..58db714c4834 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -330,7 +330,7 @@ struct kiocb {
>   u16 ki_hint;
>   u16 ki_ioprio; /* See linux/ioprio.h */
>   union {
> - unsigned int	ki_cookie; /* for ->iopoll */
> + blk_qc_t	ki_cookie; /* for ->iopoll */
>   struct wait_page_queue  *ki_waitq; /* for async buffered IO */
>   };
>  
> diff --git a/include/linux/types.h b/include/linux/types.h
> index a147977602b5..da5ca7e1bea9 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -125,6 +125,9 @@ typedef s64   int64_t;
>  typedef u64 sector_t;
>  typedef u64 blkcnt_t;
>  
> +/* cookie used for IO polling */
> +typedef unsigned int blk_qc_t;
> +
>  /*
>   * The type of an index into the pagecache.
>   */
> -- 
> 2.27.0
> 

Unfortunate that you cannot just include blk_types.h in fs.h; but
vma_is_dax() ruins that for us since commit baabda2614245 ("mm: always
enable thp for dax mappings").

Reviewed-by: Mike Snitzer 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-integrity: Fix flush with external metadata device

2021-01-04 Thread Mike Snitzer
On Sun, Dec 20 2020 at  8:02am -0500,
Lukas Straub  wrote:

> With an external metadata device, flush requests aren't passed down
> to the data device.
> 
> Fix this by issuing flush in the right places: In integrity_commit
> when not in journal mode, in do_journal_write after writing the
> contents of the journal to the disk and in dm_integrity_postsuspend.
> 
> Signed-off-by: Lukas Straub 
> ---
>  drivers/md/dm-integrity.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
> index 5a7a1b90e671..a26ed65869f6 100644
> --- a/drivers/md/dm-integrity.c
> +++ b/drivers/md/dm-integrity.c
> @@ -2196,6 +2196,8 @@ static void integrity_commit(struct work_struct *w)
>   if (unlikely(ic->mode != 'J')) {
> >   spin_unlock_irq(&ic->endio_wait.lock);
>   dm_integrity_flush_buffers(ic);
> + if (ic->meta_dev)
> + blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
>   goto release_flush_bios;
>   }
>  
> @@ -2410,6 +2412,9 @@ static void do_journal_write(struct dm_integrity_c *ic, 
> unsigned write_start,
> >   wait_for_completion_io(&comp.comp);
>  
>   dm_integrity_flush_buffers(ic);
> + if (ic->meta_dev)
> + blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
> +
>  }
>  
>  static void integrity_writer(struct work_struct *w)
> @@ -2949,6 +2954,9 @@ static void dm_integrity_postsuspend(struct dm_target 
> *ti)
>  #endif
>   }
>  
> + if (ic->meta_dev)
> + blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
> +
> >   BUG_ON(!RB_EMPTY_ROOT(&ic->in_progress));
>  
>   ic->journal_uptodate = true;
> -- 
> 2.20.1


Seems like a pretty bad oversight... but shouldn't you also make sure to
flush the data device _before_ the metadata is flushed?

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm snap : add sanity checks to snapshot_ctr

2021-01-04 Thread Mike Snitzer
On Fri, Dec 25 2020 at  1:48am -0500,
Defang Bo  wrote:

> Similar to commit 70de2cbd, there should be a check for argc and argv to
> prevent NULL pointer dereferencing
> when dm_get_device is invoked twice on the same device path with different
> modes.
> 
> Signed-off-by: Defang Bo 
> ---
>  drivers/md/dm-snap.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 4668b2c..dccce8b 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1258,6 +1258,13 @@ static int snapshot_ctr(struct dm_target *ti, unsigned 
> int argc, char **argv)
>  
>   as.argc = argc;
>   as.argv = argv;
> +
> + if (!strcmp(argv[0], argv[1])) {
> + ti->error = "Error setting metadata or data device";
> + r = -EINVAL;
> + goto bad;
> + }
> +
>   dm_consume_args(&as, 4);
>   r = parse_snapshot_features(&as, s, ti);
>   if (r)
> -- 
> 2.7.4
> 

We already have this later in snapshot_ctr:

if (cow_dev && cow_dev == origin_dev) {
ti->error = "COW device cannot be the same as origin device";
r = -EINVAL;
goto bad_cow;
}

Which happens before the 2nd dm_get_device() for the cow device.  So
I'm not seeing how you could experience the NULL pointer you say is
possible.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-raid: set discard_granularity non-zero if possible

2021-01-04 Thread Mike Snitzer
On Wed, Dec 16 2020 at  2:53pm -0500,
Stephan Bärwolf  wrote:

> Hi
> 
> I hope this address is the right place for this patch.
> It is supposed to fix the triggering of the block/blk-lib.c:51 WARN_ON_ONCE(..)
> when using LVM2 raid1 with SSD-PVs.
> Since commit b35fd7422c2f8e04496f5a770bd4e1a205414b3f and without this
> patch there are tons of printks logging "Error: discard_granularity is 0." to
> kmsg.
> Also there is no discard/TRIM happening anymore...
> 
> This is a rough patch for WARNING-issue
> 
> "block/blk-lib.c:51 __blkdev_issue_discard+0x1f6/0x250"
> [...] "Error: discard_granularity is 0." [...]
> introduced in commit b35fd7422c2f8e04496f5a770bd4e1a205414b3f
> ("block: check queue's limits.discard_granularity in 
> __blkdev_issue_discard()")
> 
> in conjunction with LVM2 raid1 volumes on discardable (SSD) backing.
> It seems until now, LVM-raid1 reported "discard_granularity" as 0,
> as well as "max_discard_sectors" as 0. (see "lsblk --discard").
> 
> The idea here is to fix the issue by calculating "max_discard_sectors"
> as the minimum over all involved block devices. (We use the meta-data
> for this to work here.)
> For calculating the "discard_granularity" we would have to calculate the
> lcm (least common multiple) of all discard_granularities of all involved
> block devices and finally round up to the next power of 2.
> 
> However, since all "discard_granularity" values are powers of 2, this
> algorithm simplifies to just determining the maximum and filtering out the
> "0" cases.
> 
> Signed-off-by: Stephan Baerwolf 
> ---
> drivers/md/dm-raid.c | 32 ++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
> 
> 
> 

> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 8d2b835d7a10..4c769fd93ced 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3734,8 +3734,36 @@ static void raid_io_hints(struct dm_target *ti, struct 
> queue_limits *limits)
>* RAID0/4/5/6 don't and process large discard bios properly.
>*/
>   if (rs_is_raid1(rs) || rs_is_raid10(rs)) {
> - limits->discard_granularity = chunk_size_bytes;
> - limits->max_discard_sectors = rs->md.chunk_sectors;

The above should be: if (rs_is_raid0(rs) || rs_is_raid10(rs)) {

And this was previously fixed with commit e0910c8e4f87bb9 but later
reverted due to various late MD discard reverts at the end of the 5.10
release.

So all said, I think the proper fix (without all sorts of
open-coding to get limits to properly stack) is to change
raid_io_hints()'s rs_is_raid1() call to rs_is_raid0().

I'll get a fix queued up.
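
The one-liner described above would look like this (illustrative diff,
not the queued commit):

--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
-	if (rs_is_raid1(rs) || rs_is_raid10(rs)) {
+	if (rs_is_raid0(rs) || rs_is_raid10(rs)) {
 		limits->discard_granularity = chunk_size_bytes;
 		limits->max_discard_sectors = rs->md.chunk_sectors;
 	}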

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/4] Fix order when split bio and send remaining back to itself

2020-12-30 Thread Mike Snitzer
On Tue, Dec 29 2020 at  4:18am -0500,
dannyshih  wrote:

> From: Danny Shih 
> 
> We found that split bios might be handled out of order when a big bio
> has been split by blk_queue_split() and then split again in a stacking
> block device, such as an md device, because of the chunk size boundary limit.
> 
> Stacking block devices normally use submit_bio_noacct() to add the remaining
> bio to the tail of current->bio_list after they split the original bio.
> Therefore, when the bio is split the first time, the last part of the bio is
> added to bio_list. Then, when the bio is split a second time, the middle part
> of the bio is added to bio_list. As a result, the middle part now sits behind
> the last part of the bio.
> 
> For example:
>   There is a RAID0 md device, with max_sectors_kb = 2 KB,
>   and chunk_size = 1 KB
> 
>   1. a read bio come to md device wants to read 0-7 KB
>   2. In blk_queue_split(), bio split into (0-1), (2-7),
>  and send (2-7) back to md device
> 
>  current->bio_list = bio_list_on_stack[0]: (md 2-7)
>   3. RAID0 split bio (0-1) into (0) and (1), since chunk size is 1 KB
>  and send (1) back to md device
> 
>  bio_list_on_stack[0]: (md 2-7) -> (md 1)
>   4. remap and send (0) to lower layer device
> 
>  bio_list_on_stack[0]: (md 2-7) -> (md 1) -> (lower 0)
>   5. __submit_bio_noacct() sorts bios so that the lower-level bio is handled first
>  bio_list_on_stack[0]: (lower 0) -> (md 2-7) -> (md 1)
>  pop (lower 0)
>  move bio_list_on_stack[0] to bio_list_on_stack[1]
> 
>  bio_list_on_stack[1]: (md 2-7) -> (md 1)
>   6. after handling the lower bio, it handles (md 2-7) first, which splits
>  in blk_queue_split() into (2-3), (4-7), sending (4-7) back
> 
>  bio_list_on_stack[0]: (md 4-7)
>  bio_list_on_stack[1]: (md 1)
>   7. RAID0 split bio (2-3) into (2) and (3) and send (3) back
> 
>  bio_list_on_stack[0]: (md 4-7) -> (md 3)
>  bio_list_on_stack[1]: (md 1)
>   ...
>   In the end, the handling order of the split bios becomes
>   0 -> 2 -> 4 -> 6 -> 7 -> 5 -> 3 -> 1
> 
> Reversing the order of same-queue bios when sorting bios in
> __submit_bio_noacct() could solve this issue, but it might have too broad
> an impact. So we provide an alternative version of submit_bio_noacct(),
> named submit_bio_noacct_add_head(), for the cases that need to add a bio
> to the head of current->bio_list, and replace submit_bio_noacct() with
> submit_bio_noacct_add_head() in the block device layer when we want to
> split a bio and send the remainder back to itself.

Ordering aside, you cannot split more than once.  So your proposed fix
to insert at head isn't valid because you're still implicitly allocating
more than one bio from the bioset which could cause deadlock in a low
memory situation.
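
The forward-progress rule behind that deadlock: a bioset's mempool
reserve guarantees that only one allocation can always succeed, so a
single submission holding two un-completed bios from the same bioset
can wedge under memory pressure. Roughly (an illustration, not an
actual kernel code path; 'sectors_a'/'sectors_b' are hypothetical):

/*
 * &q->bio_split reserves a single bio for forward progress.  The first
 * bio_split() may consume that reserve; if it has not completed (and
 * freed its bio) by the time the second bio_split() runs, the second
 * call can sleep forever waiting for a free mempool element.
 */
struct bio *first = bio_split(bio, sectors_a, GFP_NOIO, &q->bio_split);
struct bio *second = bio_split(bio, sectors_b, GFP_NOIO, &q->bio_split);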

I had to deal with a comparable issue with DM core not too long ago, see
this commit:

commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a
Author: Mike Snitzer 
Date:   Mon Sep 14 13:04:19 2020 -0400

dm: fix bio splitting and its bio completion order for regular IO

dm_queue_split() is removed because __split_and_process_bio() _must_
handle splitting bios to ensure proper bio submission and completion
ordering as a bio is split.

Otherwise, multiple recursive calls to ->submit_bio will cause multiple
split bios to be allocated from the same ->bio_split mempool at the same
time. This would result in deadlock in low memory conditions because no
progress could be made (only one bio is available in ->bio_split
mempool).

This fix has been verified to still fix the loss of performance, due
to excess splitting, that commit 120c9257f5f1 provided.

Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in 
dm_process_bio()"")
Cc: sta...@vger.kernel.org # 5.0+, requires custom backport due to 5.9 
changes
Reported-by: Ming Lei 
Signed-off-by: Mike Snitzer 

Basically you cannot split the same bio more than once without
recursing.  Your elaborate documentation shows things going wrong quite
early in step 3.  That additional split and recursing back to MD
shouldn't happen before the first bio split completes.

Seems the proper fix is to disallow max_sectors_kb to be imposed, via
blk_queue_split(), if MD has further splitting constraints, via
chunk_sectors, that negate max_sectors_kb anyway.
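
A sketch of that idea (helper name and placement are assumptions, not a
patch): keep the stacked queue's max_sectors from introducing a split
point that chunk_sectors would not already impose.

/*
 * Sketch: if the device already splits on chunk_sectors, only expose a
 * max_sectors that is chunk-aligned, so blk_queue_split() never adds a
 * second, differently-aligned split of the same bio.
 */
static unsigned int chunk_aligned_max_sectors(unsigned int max_sectors,
					      unsigned int chunk_sectors)
{
	if (!chunk_sectors || max_sectors < chunk_sectors)
		return max_sectors;

	return rounddown(max_sectors, chunk_sectors);
}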

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [git pull] device mapper fix for 5.11-rc2

2020-12-28 Thread Mike Snitzer
Hi Linus,

The following changes since commit b77709237e72d6467fb27bfbad163f7221ecd648:

  dm cache: simplify the return expression of load_mapping() (2020-12-22 
09:54:48 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.11/dm-fix

for you to fetch changes up to 48b0777cd93dbd800d3966b6f5c34714aad5c203:

  Revert "dm crypt: export sysfs of kcryptd workqueue" (2020-12-28 16:13:52 
-0500)

Please pull, thanks.
Mike


Revert WQ_SYSFS change that broke reencryption (and all other
functionality that requires reloading a dm-crypt DM table).

----
Mike Snitzer (1):
  Revert "dm crypt: export sysfs of kcryptd workqueue"

 drivers/md/dm-crypt.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [git pull] device mapper changes for 5.11

2020-12-22 Thread Mike Snitzer
Hi Linus,

The following changes since commit 65f33b35722952fa076811d5686bfd8a611a80fa:

  block: fix incorrect branching in blk_max_size_offset() (2020-12-04 17:27:42 
-0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.11/dm-changes

for you to fetch changes up to b77709237e72d6467fb27bfbad163f7221ecd648:

  dm cache: simplify the return expression of load_mapping() (2020-12-22 
09:54:48 -0500)

Please pull, thanks.
Mike


- Add DM verity support for signature verification with 2nd keyring.

- Fix DM verity to skip verity work if IO completes with error while
  system is shutting down.

- Add new DM multipath "IO affinity" path selector that maps IO
  destined to a given path to a specific CPU based on user provided
  mapping.

- Rename DM multipath path selector source files to have "dm-ps"
  prefix.

- Add REQ_NOWAIT support to some other simple DM targets that don't
  block in more elaborate ways waiting for IO.

- Export DM crypt's kcryptd workqueue via sysfs (WQ_SYSFS).

- Fix error return code in DM's target_message() if empty message is
  received.

- A handful of other small cleanups.


Antonio Quartulli (1):
  dm ebs: avoid double unlikely() notation when using IS_ERR()

Hyeongseok Kim (1):
  dm verity: skip verity work if I/O error when system is shutting down

Jeffle Xu (3):
  dm: remove unnecessary current->bio_list check when submitting split bio
  dm: add support for REQ_NOWAIT to various targets
  dm crypt: export sysfs of kcryptd workqueue

Mickaël Salaün (1):
  dm verity: Add support for signature verification with 2nd keyring

Mike Christie (1):
  dm mpath: add IO affinity path selector

Mike Snitzer (1):
  dm: rename multipath path selector source files to have "dm-ps" prefix

Qinglang Miao (1):
  dm ioctl: fix error return code in target_message

Rikard Falkeborn (1):
  dm crypt: Constify static crypt_iv_operations

Zheng Yongjun (1):
  dm cache: simplify the return expression of load_mapping()

 Documentation/admin-guide/device-mapper/verity.rst |   7 +-
 drivers/md/Kconfig |  22 +-
 drivers/md/Makefile|  20 +-
 drivers/md/dm-cache-target.c   |   7 +-
 drivers/md/dm-crypt.c  |  13 +-
 drivers/md/dm-ebs-target.c |   2 +-
 drivers/md/dm-ioctl.c  |   1 +
 ...vice-time.c => dm-ps-historical-service-time.c} |   0
 drivers/md/dm-ps-io-affinity.c | 272 +
 .../md/{dm-queue-length.c => dm-ps-queue-length.c} |   0
 .../md/{dm-round-robin.c => dm-ps-round-robin.c}   |   0
 .../md/{dm-service-time.c => dm-ps-service-time.c} |   0
 drivers/md/dm-stripe.c |   2 +-
 drivers/md/dm-switch.c |   1 +
 drivers/md/dm-unstripe.c   |   1 +
 drivers/md/dm-verity-target.c  |  12 +-
 drivers/md/dm-verity-verify-sig.c  |   9 +-
 drivers/md/dm-zero.c   |   1 +
 drivers/md/dm.c|   2 +-
 19 files changed, 345 insertions(+), 27 deletions(-)
 rename drivers/md/{dm-historical-service-time.c => 
dm-ps-historical-service-time.c} (100%)
 create mode 100644 drivers/md/dm-ps-io-affinity.c
 rename drivers/md/{dm-queue-length.c => dm-ps-queue-length.c} (100%)
 rename drivers/md/{dm-round-robin.c => dm-ps-round-robin.c} (100%)
 rename drivers/md/{dm-service-time.c => dm-ps-service-time.c} (100%)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] DM's filesystem lookup in dm_get_dev_t() [was: Re: linux-next: manual merge of the device-mapper tree with Linus' tree]

2020-12-22 Thread Mike Snitzer
[added linux-block and dm-devel, if someone replies to this email to
continue "proper discussion" _please_ at least drop sfr and linux-next
from Cc]

On Tue, Dec 22 2020 at  8:15am -0500,
Christoph Hellwig  wrote:

> Mike, Hannes,
> 
> I think this patch is rather harmful.  Why does device mapper even
> mix file system path with a dev_t and all the other weird forms
> parsed by name_to_dev_t, which was supposed to be be for the early
> init code where no file system is available.

OK, I'll need to revisit (unless someone beats me to it) because this
could've easily been a blind-spot for me when the dm-init code went in.
Any dm-init specific enabling interface shouldn't be used by more
traditional DM interfaces.  So Hannes' change might be treating the symptom
rather than the core problem (which would be better treated by factoring
out dm-init requirements for a name_to_dev_t()-like interface?).

DM has supported passing maj:min and blockdev names on DM table lines
forever... so we'll need to be very specific about where/why things
regressed.

> Can we please kick off a proper discussion for this on the linux-block
> list?

Sure, done. But I won't drive that discussion in the near-term. I need
to take some time off for a few weeks.

In the meantime I'll drop Hannes' patch for 5.11; I'm open to an
alternative fix that I'd pickup during 5.11-rcX.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

2020-12-22 Thread Mike Snitzer
On Tue, Dec 22 2020 at  8:32am -0500,
Christoph Hellwig  wrote:

> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have 
> > some
> > reasonable performance numbers before bothering with a v2.
> 
> FYI, my other main worry beside duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block device.
> Turning this into something else that just pipes data to userspace
> seems very strange.

I agree.  Only way I'd be interested is if it somehow tackled enabling
much more efficient IO.  Earlier discussion in this thread mentioned
that zero-copy and low overhead wasn't a priority (because it is hard,
etc).  But the hard work has already been done with io_uring.  If
dm-user had a prereq of leaning heavily on io_uring and also enabled IO
polling for bio-based then there may be a win to supporting it.

But unless lower latency (or some other more significant win) is made
possible I just don't care to prop up an unnatural DM bolt-on.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [git pull] 2 reverts for 5.11 to fix v5.10 MD regression

2020-12-14 Thread Mike Snitzer
Hi Linus,

The following changes since commit 2c85ebc57b3e1817b6ce1a6b703928e113a90442:

  Linux 5.10 (2020-12-13 14:41:30 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-5.11/revert-problem-v5.10-raid-changes

for you to fetch changes up to 0941e3b0653fef1ea68287f6a948c6c68a45c9ba:

  Revert "dm raid: fix discard limits for raid1 and raid10" (2020-12-14 
12:12:08 -0500)

Please pull, thanks.
Mike


A cascade of MD reverts occurred late in the v5.10-rcX cycle due to MD
raid10 discard optimizations having introduced potential for
corruption. Those reverts exposed a dm-raid.c compiler warning that
wasn't ever knowingly introduced. That min_not_zero() type mismatch
warning was thought to be safely fixed simply by changing 'struct
mddev' to use 'unsigned int' rather than int for chunk_sectors members
in that struct. I proposed either using a cast local to dm-raid.c but
thought changing the type to 'unsigned int' more correct. While that
may be, not enough testing was paired with code review associated with
making that change. As such we were left exposed and the result was a
report that with v5.10 btrfs on MD RAID6 failed to mount:
https://lkml.org/lkml/2020/12/14/7

Given that report, it is justified to simply revert these offending
commits. stable@ has already taken steps to revert these for 5.10.1 -
this pull request just makes sure mainline does so too.

----
Mike Snitzer (2):
  Revert "md: change mddev 'chunk_sectors' from int to unsigned"
  Revert "dm raid: fix discard limits for raid1 and raid10"

 drivers/md/dm-raid.c | 12 +---
 drivers/md/md.h  |  4 ++--
 2 files changed, 7 insertions(+), 9 deletions(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] Linux 5.10

2020-12-14 Thread Mike Snitzer
On Mon, Dec 14 2020 at 11:02am -0500,
Mike Snitzer  wrote:

> On Mon, Dec 14 2020 at 12:52am -0500,
> Greg KH  wrote:
> 
> > On Mon, Dec 14, 2020 at 12:31:47AM -0500, Dave Jones wrote:
> > > On Sun, Dec 13, 2020 at 03:03:29PM -0800, Linus Torvalds wrote:
> > >  > Ok, here it is - 5.10 is tagged and pushed out.
> > >  > 
> > >  > I pretty much always wish that the last week was even calmer than it
> > >  > was, and that's true here too. There's a fair amount of fixes in here,
> > >  > including a few last-minute reverts for things that didn't get fixed,
> > >  > but nothing makes me go "we need another week".
> > > 
> > > ...
> > > 
> > >  > Mike Snitzer (1):
> > >  >   md: change mddev 'chunk_sectors' from int to unsigned
> > > 
> > > Seems to be broken.  This breaks mounting my raid6 partition:
> > > 
> > > [   87.290698] attempt to access beyond end of device
> > >md0: rw=4096, want=13996467328, limit=6261202944
> > > [   87.293371] attempt to access beyond end of device
> > >md0: rw=4096, want=13998564480, limit=6261202944
> > > [   87.296045] BTRFS warning (device md0): couldn't read tree root
> > > [   87.300056] BTRFS error (device md0): open_ctree failed
> > > 
> > > Reverting it goes back to the -rc7 behaviour where it mounts fine.
> > 
> > If the developer/maintainer(s) agree, I can revert this and push out a
> > 5.10.1, just let me know.
> 
> Yes, these should be reverted from 5.10 via 5.10.1:
> 
> e0910c8e4f87 dm raid: fix discard limits for raid1 and raid10
> f075cfb1dc59 md: change mddev 'chunk_sectors' from int to unsigned

Sorry, f075cfb1dc59 was my local commit id, the corresponding upstream
commit as staged by Jens is:

6ffeb1c3f82 md: change mddev 'chunk_sectors' from int to unsigned

So please revert:
6ffeb1c3f822 md: change mddev 'chunk_sectors' from int to unsigned
and then revert:
e0910c8e4f87 dm raid: fix discard limits for raid1 and raid10

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


