Re: [PATCH 077/141] dm raid: Fix fall-through warnings for Clang

2021-04-20 Thread Mike Snitzer
On Tue, Apr 20 2021 at  4:15pm -0400,
Gustavo A. R. Silva  wrote:

> Hi all,
> 
> Friendly ping: who can take this, please?
> 
> Thanks
> --
> Gustavo
> 
> On 11/20/20 12:35, Gustavo A. R. Silva wrote:
> > In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> > by explicitly adding a break statement instead of letting the code fall
> > through to the next case.
> > 
> > Link: https://github.com/KSPP/linux/issues/115
> > Signed-off-by: Gustavo A. R. Silva 
> > ---
> >  drivers/md/dm-raid.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> > index 9c1f7c4de65b..e98af0b9d00c 100644
> > --- a/drivers/md/dm-raid.c
> > +++ b/drivers/md/dm-raid.c
> > @@ -1854,6 +1854,7 @@ static int rs_check_takeover(struct raid_set *rs)
> > ((mddev->layout == ALGORITHM_PARITY_N && mddev->new_layout == ALGORITHM_PARITY_N) ||
> >  __within_range(mddev->new_layout, ALGORITHM_LEFT_ASYMMETRIC, ALGORITHM_RIGHT_SYMMETRIC)))
> > return 0;
> > +   break;
> >  
> > default:
> > break;
> > 
> 

I've picked it up for 5.13, thanks.

Mike



Re: [PATCH v8 0/4] block device interposer

2021-04-09 Thread Mike Snitzer
On Fri, Apr 09 2021 at  7:48am -0400,
Sergei Shtepa  wrote:

> I think I'm ready to propose the next version of the block device interposer
> (blk_interposer). It allows bio requests to be redirected to other block
> devices.
> 
> In this series of patches, I reworked the process of attaching and
> detaching device mapper via blk_interposer.
> 
> Now the dm-target is attached to the interposed block device when the
> interposer dm-target is fully ready to accept requests, and the interposed
> block device queue is locked, and the file system on it is frozen.
> The detaching is also performed when the file system on the interposed
> block device is in a frozen state, the queue is locked, and the interposer
> dm-target is suspended.
> 
> To make it possible to lock the receipt of new bio requests without locking
> the processing of bio requests that the interposer creates, I had to change
> the submit_bio_noacct() function and add a lock. To minimize the impact of
> locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
> afraid it's impossible.
> 
> While checking the operation of the interposer, I did not limit myself to
> a simple dm-linear. When I experimented with dm-era, I noticed that it
> accepts two block devices. Since Mike was against changing the logic in
> the dm-targets themselves to support the interposer, I decided to add the
> [interpose] option to the block device path.
> 
>  echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
>   dmsetup create dm-era --interpose
> 
> I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
> we could assume that if the device cannot be opened with FMODE_EXCL,
> then it is to be treated as an interposed device, but that algorithm
> seems unsafe to me. I hope to get Mike's opinion on this.
> 
> I have successfully tried taking snapshots. But I ran into a problem
> when I removed origin-target:
> [   49.031156] [ cut here ]
> [   49.031180] kernel BUG at block/bio.c:1476!
> [   49.031198] invalid opcode:  [#1] SMP NOPTI
> [   49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: GE 5.12.0-rc6-ip+ #52
> [   49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [   49.031257] RIP: 0010:bio_split+0x74/0x80
> [   49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
> [   49.031322] RSP: 0018:9a6100993ab0 EFLAGS: 00010246
> [   49.031337] RAX: 0008 RBX:  RCX: 8e26938f96d8
> [   49.031357] RDX: 0c00 RSI:  RDI: 8e26937d1300
> [   49.031375] RBP: 8e2692ddc000 R08:  R09: 
> [   49.031394] R10: 8e2692b1de00 R11: 8e2692b1de58 R12: 8e26937d1300
> [   49.031413] R13: 8e2692ddcd18 R14: 8e2691d22140 R15: 8e26937d1300
> [   49.031432] FS:  7efffa6e7800() GS:8e269bc8() knlGS:
> [   49.031453] CS:  0010 DS:  ES:  CR0: 80050033
> [   49.031470] CR2: 7efffa96cda0 CR3: 000114bd CR4: 000506e0
> [   49.031490] Call Trace:
> [   49.031501]  dm_submit_bio+0x383/0x500 [dm_mod]
> [   49.031522]  submit_bio_noacct+0x370/0x770
> [   49.031537]  submit_bh_wbc+0x160/0x190
> [   49.031550]  __sync_dirty_buffer+0x65/0x130
> [   49.031564]  ext4_commit_super+0xbc/0x120 [ext4]
> [   49.031602]  ext4_freeze+0x54/0x80 [ext4]
> [   49.031631]  freeze_super+0xc8/0x160
> [   49.031643]  freeze_bdev+0xb2/0xc0
> [   49.031654]  lock_bdev_fs+0x1c/0x30 [dm_mod]
> [   49.031671]  __dm_suspend+0x2b9/0x3b0 [dm_mod]
> [   49.032095]  dm_suspend+0xed/0x160 [dm_mod]
> [   49.032496]  ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
> [   49.032897]  ? remove_all+0x30/0x30 [dm_mod]
> [   49.033299]  dev_remove+0x4c/0x1c0 [dm_mod]
> [   49.033679]  ctl_ioctl+0x1a5/0x470 [dm_mod]
> [   49.034067]  dm_ctl_ioctl+0xa/0x10 [dm_mod]
> [   49.034432]  __x64_sys_ioctl+0x83/0xb0
> [   49.034785]  do_syscall_64+0x33/0x80
> [   49.035139]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> When suspend is executed for the origin target before the interposer is
> detached, the value of o->split_boundary in the origin_map() function is
> zero, since no snapshots were connected to it.
> I think that if no snapshots are connected, it does not make sense
> to split the bio into parts.

The dm-snapshot code requires a careful order of operations.  You say you
removed the origin target: please show exactly what you did.  Your 4th
patch shouldn't be tied to this patchset; it can be dealt with
independently.

> Changes summary for this patchset v7:
>   * The attaching and detaching to interposed device moved to
> __dm_suspend() and __dm_resume() functions.

Why? Those hooks are inherently more constrained.  And in the 

Re: md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-25 Thread Mike Snitzer
On Wed, Mar 24 2021 at  9:21pm -0400,
Zhiqiang Liu  wrote:

> 
> 
> On 2021/3/22 22:22, Mike Snitzer wrote:
> > On Mon, Mar 22 2021 at  4:11am -0400,
> > Christoph Hellwig  wrote:
> > 
> >> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
> >>> From: Zhiqiang Liu 
> >>>
> >>> When we run an IO stress test on a multipath device, a
> >>> metadata error occurs because of a wrong path. In the test, we
> >>> concurrently execute 'iscsi device login|logout' and the
> >>> 'multipath -r' command with IO stress on the multipath device.
> >>> In some cases, systemd-udevd may not have had time to process the
> >>> uevents of iscsi device logout|login, and then the 'multipath -r'
> >>> command triggers the multipathd daemon to call ioctl to load a table
> >>> with incorrect old device info from systemd-udevd.
> >>> Then, one iscsi path may be incorrectly attached to another
> >>> multipath device which has a different uuid. Finally, the metadata
> >>> error occurs when unmounting the filesystem to write metadata down
> >>> on the iscsi device, which is actually not owned by the multipath
> >>> device.
> >>>
> >>> So we need to check whether all pgpaths of one multipath device have
> >>> the same uuid; if not, we should throw an error.
> >>>
> >>> Signed-off-by: Zhiqiang Liu 
> >>> Signed-off-by: lixiaokeng 
> >>> Signed-off-by: linfeilong 
> >>> Signed-off-by: Wubo 
> >>> ---
> >>>  drivers/md/dm-mpath.c   | 52 +
> >>>  drivers/scsi/scsi_lib.c |  1 +
> >>>  2 files changed, 53 insertions(+)
> >>>
> >>> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> >>> index bced42f082b0..f0b995784b53 100644
> >>> --- a/drivers/md/dm-mpath.c
> >>> +++ b/drivers/md/dm-mpath.c
> >>> @@ -24,6 +24,7 @@
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> +#include 
> >>>  #include 
> >>>  #include 
> >>>
> >>> @@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
> >>>   return r;
> >>>  }
> >>>
> >>> +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
> >>> +#define MPATH_UUID_PREFIX_LEN 7
> >>> +static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
> >>> +{
> >>> + char pgpath_uuid[DM_UUID_LEN] = {0};
> >>> + struct request_queue *q;
> >>> + struct pgpath *pgpath;
> >>> + struct scsi_device *sdev;
> >>> + ssize_t count;
> >>> + int r = 0;
> >>> +
> >>> + list_for_each_entry(pgpath, &pg->pgpaths, list) {
> >>> + q = bdev_get_queue(pgpath->path.dev->bdev);
> >>> + sdev = scsi_device_from_queue(q);
> >>
> >> Common dm-multipath code should never poke into scsi internals.  This
> >> is something for the device handler to check.  It probably also won't
> >> work for all older devices.
> > 
> > Definitely.
> > 
> > But that aside, userspace (multipathd) _should_ be able to do extra
> > validation, _before_ pushing down a new table to the kernel, rather than
> > forcing the kernel to do it.
> 
> As you said, it is better to do extra validation in userspace (multipathd).
> However, in some cases userspace cannot see real-time info about the present
> devices, as Martin (a committer of multipath-tools) said.
> In addition, the kernel can see the right device info in the table at any
> time, so the uuid check in the kernel can ensure one multipath device is
> composed of paths mapped to the same device.
> 
> Considering the severity of a wrong path in multipath, I think it is worth
> the extra checking.

As already said: this should be fixable in userspace.  Please work with
multipath-tools developers to address this.

Mike



Re: md/dm-mpath: check whether all pgpaths have same uuid in multipath_ctr()

2021-03-22 Thread Mike Snitzer
On Mon, Mar 22 2021 at  4:11am -0400,
Christoph Hellwig  wrote:

> On Sat, Mar 20, 2021 at 03:19:23PM +0800, Zhiqiang Liu wrote:
> > From: Zhiqiang Liu 
> > 
> > When we run an IO stress test on a multipath device, a
> > metadata error occurs because of a wrong path. In the test, we
> > concurrently execute 'iscsi device login|logout' and the
> > 'multipath -r' command with IO stress on the multipath device.
> > In some cases, systemd-udevd may not have had time to process the
> > uevents of iscsi device logout|login, and then the 'multipath -r'
> > command triggers the multipathd daemon to call ioctl to load a table
> > with incorrect old device info from systemd-udevd.
> > Then, one iscsi path may be incorrectly attached to another
> > multipath device which has a different uuid. Finally, the metadata
> > error occurs when unmounting the filesystem to write metadata down
> > on the iscsi device, which is actually not owned by the multipath
> > device.
> > 
> > So we need to check whether all pgpaths of one multipath device have
> > the same uuid; if not, we should throw an error.
> > 
> > Signed-off-by: Zhiqiang Liu 
> > Signed-off-by: lixiaokeng 
> > Signed-off-by: linfeilong 
> > Signed-off-by: Wubo 
> > ---
> >  drivers/md/dm-mpath.c   | 52 +
> >  drivers/scsi/scsi_lib.c |  1 +
> >  2 files changed, 53 insertions(+)
> > 
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index bced42f082b0..f0b995784b53 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -24,6 +24,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> > 
> > @@ -1169,6 +1170,45 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
> > return r;
> >  }
> > 
> > +#define SCSI_VPD_LUN_ID_PREFIX_LEN 4
> > +#define MPATH_UUID_PREFIX_LEN 7
> > +static int check_pg_uuid(struct priority_group *pg, char *md_uuid)
> > +{
> > +   char pgpath_uuid[DM_UUID_LEN] = {0};
> > +   struct request_queue *q;
> > +   struct pgpath *pgpath;
> > +   struct scsi_device *sdev;
> > +   ssize_t count;
> > +   int r = 0;
> > +
> > +   list_for_each_entry(pgpath, &pg->pgpaths, list) {
> > +   q = bdev_get_queue(pgpath->path.dev->bdev);
> > +   sdev = scsi_device_from_queue(q);
> 
> Common dm-multipath code should never poke into scsi internals.  This
> is something for the device handler to check.  It probably also won't
> work for all older devices.

Definitely.

But that aside, userspace (multipathd) _should_ be able to do extra
validation, _before_ pushing down a new table to the kernel, rather than
forcing the kernel to do it.



Re: [PATCH v7 2/3] block: add bdev_interposer

2021-03-17 Thread Mike Snitzer
On Wed, Mar 17 2021 at  2:14pm -0400,
Sergei Shtepa  wrote:

> The 03/17/2021 18:04, Mike Snitzer wrote:
> > On Wed, Mar 17 2021 at  8:22am -0400,
> > Sergei Shtepa  wrote:
> > 
> > > The 03/17/2021 06:03, Ming Lei wrote:
> > > > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > > > bdev_interposer allows bio requests to be redirected to other devices.
> > > > > > > 
> > > > > > > Signed-off-by: Sergei Shtepa 
> > > > > > > ---
> > > > > > >  block/bio.c   |  2 ++
> > > > > > >  block/blk-core.c  | 57 +++
> > > > > > >  block/genhd.c | 54 +
> > > > > > >  include/linux/blk_types.h |  3 +++
> > > > > > >  include/linux/blkdev.h|  9 +++
> > > > > > >  5 files changed, 125 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > > > --- a/block/bio.c
> > > > > > > +++ b/block/bio.c
> > > > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct 
> > > > > > > bio *bio_src)
> > > > > > >   bio_set_flag(bio, BIO_THROTTLED);
> > > > > > >   if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > > > >   bio_set_flag(bio, BIO_REMAPPED);
> > > > > > > + if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > >   bio->bi_opf = bio_src->bi_opf;
> > > > > > >   bio->bi_ioprio = bio_src->bi_ioprio;
> > > > > > >   bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > > > --- a/block/blk-core.c
> > > > > > > +++ b/block/blk-core.c
> > > > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > > > >   return ret;
> > > > > > >  }
> > > > > > >  
> > > > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > > > +{
> > > > > > > + blk_qc_t ret = BLK_QC_T_NONE;
> > > > > > > + struct bio_list bio_list[2] = { };
> > > > > > > + struct gendisk *orig_disk;
> > > > > > > +
> > > > > > > + if (current->bio_list) {
> > > > > > > + bio_list_add(&current->bio_list[0], bio);
> > > > > > > + return BLK_QC_T_NONE;
> > > > > > > + }
> > > > > > > +
> > > > > > > + orig_disk = bio->bi_bdev->bd_disk;
> > > > > > > + if (unlikely(bio_queue_enter(bio)))
> > > > > > > + return BLK_QC_T_NONE;
> > > > > > > +
> > > > > > > + current->bio_list = bio_list;
> > > > > > > +
> > > > > > > + do {
> > > > > > > + struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > > > > +
> > > > > > > + if (unlikely(!interposer)) {
> > > > > > > + /* interposer was removed */
> > > > > > > + bio_list_add(&current->bio_list[0], bio);
> > > > > > > + break;
> > > > > > > + }
> > > > > > > + /* assign bio to interposer device */
> > > > > > > + bio_set_dev(bio, interposer);
> > > > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > > > > +
> > > > > > > + if (!submit_bio_checks(bio))
> > > > > > > + break;
>

Re: [PATCH v7 2/3] block: add bdev_interposer

2021-03-17 Thread Mike Snitzer
On Tue, Mar 16 2021 at 11:03pm -0400,
Ming Lei  wrote:

> On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > The 03/16/2021 11:09, Ming Lei wrote:
> > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > bdev_interposer allows to redirect bio requests to another devices.
> > > > 
> > > > Signed-off-by: Sergei Shtepa 

...

> > > > +
> > > > +int bdev_interposer_attach(struct block_device *original,
> > > > +  struct block_device *interposer)
> > > > +{
> > > > +   int ret = 0;
> > > > +
> > > > +   if (WARN_ON((!original) || (!interposer)))
> > > > +   return -EINVAL;
> > > > +   /*
> > > > +* interposer should be simple, not a multi-queue device
> > > > +*/
> > > > +   if (!interposer->bd_disk->fops->submit_bio)
> > > > +   return -EINVAL;
> > > > +
> > > > +   if (WARN_ON(!blk_mq_is_queue_frozen(original->bd_disk->queue)))
> > > > +   return -EPERM;
> > > 
> > > The original request queue may become live now...
> > 
> > Yes.
> > I will remove the blk_mq_is_queue_frozen() function and use a different
> > approach.
> 
> It looks like what attach and detach need is for the queue to be kept in a
> frozen state, rather than simply being frozen at the beginning of the two
> functions, so you can simply call freeze/unfreeze inside the two functions.
> 
> But what if 'original' isn't an MQ queue?  The queue usage counter is only
> grabbed when calling ->submit_bio(), and a queue freeze doesn't guarantee
> there isn't any io activity; is that a problem for the bdev_interposer use
> case?

Right, I raised the same concern here:
https://listman.redhat.com/archives/dm-devel/2021-March/msg00135.html
(toward the bottom, inlined after dm_disk_{freeze,unfreeze})

Anyway, this certainly needs to be addressed.

Mike



Re: [PATCH v7 2/3] block: add bdev_interposer

2021-03-17 Thread Mike Snitzer
On Wed, Mar 17 2021 at  8:22am -0400,
Sergei Shtepa  wrote:

> The 03/17/2021 06:03, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 07:35:44PM +0300, Sergei Shtepa wrote:
> > > The 03/16/2021 11:09, Ming Lei wrote:
> > > > On Fri, Mar 12, 2021 at 06:44:54PM +0300, Sergei Shtepa wrote:
> > > > > bdev_interposer allows bio requests to be redirected to other devices.
> > > > > 
> > > > > Signed-off-by: Sergei Shtepa 
> > > > > ---
> > > > >  block/bio.c   |  2 ++
> > > > >  block/blk-core.c  | 57 +++
> > > > >  block/genhd.c | 54 +
> > > > >  include/linux/blk_types.h |  3 +++
> > > > >  include/linux/blkdev.h|  9 +++
> > > > >  5 files changed, 125 insertions(+)
> > > > > 
> > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > index a1c4d2900c7a..0bfbf06475ee 100644
> > > > > --- a/block/bio.c
> > > > > +++ b/block/bio.c
> > > > > @@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> > > > > *bio_src)
> > > > >   bio_set_flag(bio, BIO_THROTTLED);
> > > > >   if (bio_flagged(bio_src, BIO_REMAPPED))
> > > > >   bio_set_flag(bio, BIO_REMAPPED);
> > > > > + if (bio_flagged(bio_src, BIO_INTERPOSED))
> > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > >   bio->bi_opf = bio_src->bi_opf;
> > > > >   bio->bi_ioprio = bio_src->bi_ioprio;
> > > > >   bio->bi_write_hint = bio_src->bi_write_hint;
> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > index fc60ff208497..da1abc4c27a9 100644
> > > > > --- a/block/blk-core.c
> > > > > +++ b/block/blk-core.c
> > > > > @@ -1018,6 +1018,55 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > >   return ret;
> > > > >  }
> > > > >  
> > > > > +static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> > > > > +{
> > > > > + blk_qc_t ret = BLK_QC_T_NONE;
> > > > > + struct bio_list bio_list[2] = { };
> > > > > + struct gendisk *orig_disk;
> > > > > +
> > > > > + if (current->bio_list) {
> > > > > + bio_list_add(&current->bio_list[0], bio);
> > > > > + return BLK_QC_T_NONE;
> > > > > + }
> > > > > +
> > > > > + orig_disk = bio->bi_bdev->bd_disk;
> > > > > + if (unlikely(bio_queue_enter(bio)))
> > > > > + return BLK_QC_T_NONE;
> > > > > +
> > > > > + current->bio_list = bio_list;
> > > > > +
> > > > > + do {
> > > > > + struct block_device *interposer = bio->bi_bdev->bd_interposer;
> > > > > +
> > > > > + if (unlikely(!interposer)) {
> > > > > + /* interposer was removed */
> > > > > + bio_list_add(&current->bio_list[0], bio);
> > > > > + break;
> > > > > + }
> > > > > + /* assign bio to interposer device */
> > > > > + bio_set_dev(bio, interposer);
> > > > > + bio_set_flag(bio, BIO_INTERPOSED);
> > > > > +
> > > > > + if (!submit_bio_checks(bio))
> > > > > + break;
> > > > > + /*
> > > > > +  * Because the current->bio_list is initialized,
> > > > > +  * the submit_bio callback will always return BLK_QC_T_NONE.
> > > > > +  */
> > > > > + interposer->bd_disk->fops->submit_bio(bio);
> > > > 
> > > > Given original request queue may become live when calling attach() and
> > > > detach(), see below comment. bdev_interposer_detach() may be run
> > > > when running ->submit_bio(), meantime the interposer device is
> > > > gone during the period, then kernel oops.
> > > 
> > > I think that since the bio_queue_enter() function was called,
> > > q->q_usage_counter will not allow the critical code in the attach/detach
> > > functions to be executed, which is located between the blk_freeze_queue
> > > and blk_unfreeze_queue calls.
> > > Please correct me if I'm wrong.
> > > 
> > > > 
> > > > > + } while (false);
> > > > > +
> > > > > + current->bio_list = NULL;
> > > > > +
> > > > > + blk_queue_exit(orig_disk->queue);
> > > > > +
> > > > > + /* Resubmit remaining bios */
> > > > > + while ((bio = bio_list_pop(&bio_list[0])))
> > > > > + ret = submit_bio_noacct(bio);
> > > > > +
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * submit_bio_noacct - re-submit a bio to the block device layer for 
> > > > > I/O
> > > > >   * @bio:  The bio describing the location in memory and on the 
> > > > > device.
> > > > > @@ -1029,6 +1078,14 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > > > >   */
> > > > >  blk_qc_t submit_bio_noacct(struct bio *bio)
> > > > >  {
> > > > > + /*
> > > > > +  * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > > > > +  * created by the bdev_interposer does not get to it for processing.
> > > > > +  */
> > 

Re: [PATCH v7 1/3] block: add blk_mq_is_queue_frozen()

2021-03-12 Thread Mike Snitzer
On Fri, Mar 12 2021 at 10:44am -0500,
Sergei Shtepa  wrote:

> blk_mq_is_queue_frozen() allows asserting that the queue is frozen.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  block/blk-mq.c | 13 +
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..2f188a865024 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -161,6 +161,19 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
>  
> +bool blk_mq_is_queue_frozen(struct request_queue *q)
> +{
> + bool frozen;
> +
> + mutex_lock(&q->mq_freeze_lock);
> + frozen = percpu_ref_is_dying(&q->q_usage_counter) &&
> +  percpu_ref_is_zero(&q->q_usage_counter);
> + mutex_unlock(&q->mq_freeze_lock);
> +
> + return frozen;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
> +

This is returning a frozen state that is immediately stale.  I don't
think any code calling this is providing the guarantees you think it
does, due to the racy nature of this state once the mutex is dropped.

>  /*
>   * Guarantee no request is in use, so we can change any data structure of
>   * the queue afterward.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2c473c9b8990..6f01971abf7b 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -533,6 +533,7 @@ void blk_freeze_queue_start(struct request_queue *q);
>  void blk_mq_freeze_queue_wait(struct request_queue *q);
>  int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>unsigned long timeout);
> +bool blk_mq_is_queue_frozen(struct request_queue *q);
>  
>  int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int 
> nr_hw_queues);
> -- 
> 2.20.1
> 



Re: [PATCH v7 3/3] dm: add DM_INTERPOSED_FLAG

2021-03-12 Thread Mike Snitzer
On Fri, Mar 12 2021 at 10:44am -0500,
Sergei Shtepa  wrote:

> DM_INTERPOSED_FLAG allows creating DM targets on the fly.
> The underlying block device is opened without the FMODE_EXCL flag.
> The DM target receives bios from the original device via
> bdev_interposer.
> Signed-off-by: Sergei Shtepa 
> ---
>  drivers/md/dm-core.h  |  3 ++
>  drivers/md/dm-ioctl.c | 13 
>  drivers/md/dm-table.c | 61 +--
>  drivers/md/dm.c   | 38 +++---
>  include/linux/device-mapper.h |  1 +
>  include/uapi/linux/dm-ioctl.h |  6 
>  6 files changed, 101 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 5953ff2bd260..9eae419c7b18 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -114,6 +114,9 @@ struct mapped_device {
>   bool init_tio_pdu:1;
>  
>   struct srcu_struct io_barrier;
> +
> + /* attach target via block-layer interposer */
> + bool is_interposed;
>  };

This flag is a mix of uses.  First it is used to store that
DM_INTERPOSED_FLAG was provided as input param during load.

And the same 'is_interposed' name is used in 'struct dm_dev'.

To me this state should be elevated to block core -- it is awkward for
every driver that might want to use blk-interposer to be sprinkling state
around its core structures.

So I'd prefer you:
1) rename the flag in 'struct mapped_device' to 'interpose' _and_ add it
   just after "bool init_tio_pdu:1;" as "bool interpose:1;" -- this
   reflects that interpose was requested during load.
2) bdev_interposer_attach() should be made to set some block core state
   that is able to be tested to check if a device is_interposed.
3) Don't add an 'is_interposed' flag to 'struct dm_dev'

>  
>  void disable_discard(struct mapped_device *md);
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index 5e306bba4375..2b4c9557c283 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
>   return mode;
>  }
>  
> +static inline bool get_interposer_flag(struct dm_ioctl *param)
> +{
> + return (param->flags & DM_INTERPOSED_FLAG);
> +}
> +

As I mention at the end: rename to DM_INTERPOSE_FLAG

>  static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
>  struct dm_target_spec **spec, char **target_params)
>  {
> @@ -1293,6 +1298,10 @@ static int populate_table(struct dm_table *table,
>   DMWARN("populate_table: no targets specified");
>   return -EINVAL;
>   }
> + if (table->md->is_interposed && (param->target_count != 1)) {
> > + DMWARN("%s: with interposer should be specified only one target", __func__);

This error/warning reads very awkwardly. Maybe?:
"%s: interposer requires only a single target be specified"

> + return -EINVAL;
> + }
>  
>   for (i = 0; i < param->target_count; i++) {
>  
> @@ -1338,6 +1347,8 @@ static int table_load(struct file *filp, struct 
> dm_ioctl *param, size_t param_si
>   if (!md)
>   return -ENXIO;
>  
> + md->is_interposed = get_interposer_flag(param);
> +
> >   r = dm_table_create(&t, get_mode(param), param->target_count, md);
>   if (r)
>   goto err;
> @@ -2098,6 +2109,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
>   if (r)
>   goto err_hash_remove;
>  
> + md->is_interposed = get_interposer_flag(dmi);
> +
>   /* add targets */
>   for (i = 0; i < dmi->target_count; i++) {
>   r = dm_table_add_target(t, spec_array[i]->target_type,
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 95391f78b8d5..f6e2eb3f8949 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -225,12 +225,13 @@ void dm_table_destroy(struct dm_table *t)
>  /*
>   * See if we've already got a device in the list.
>   */
> -static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
> > +static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)

I think it makes more sense to internalize the need to consider
md->interpose here.

So:

static struct dm_dev_internal *find_device(struct dm_table *t, dev_t dev)
{
struct list_head *l = &t->devices;
bool is_interposed = t->md->interpose;
...

>  {
>   struct dm_dev_internal *dd;
>  
>   list_for_each_entry (dd, l, list)
> - if (dd->dm_dev->bdev->bd_dev == dev)
> + if ((dd->dm_dev->bdev->bd_dev == dev) &&
> + (dd->dm_dev->is_interposed == is_interposed))
>   return dd;

But why must this extra state be used/tested?  Seems like quite a deep
embedding of interposer into dm_dev_internal... feels unnecessary.

>  
>   return NULL;
> @@ -358,6 +359,18 @@ dev_t dm_get_dev_t(const char *path)
>  }
>  EXPORT_SYMBOL_GPL(dm_get_dev_t);
>  
> +static 

Re: dm: remove unneeded variable 'sz'

2021-03-09 Thread Mike Snitzer
On Tue, Mar 09 2021 at  4:32am -0500,
Yang Li  wrote:

> Fix the following coccicheck warning:
> ./drivers/md/dm-ps-service-time.c:85:10-12: Unneeded variable: "sz".
> Return "0" on line 105
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 

This type of change gets proposed regularly.  I would appreciate it if
you could fix coccicheck to not get this wrong.  The local 'sz' variable
is used by the DMEMIT macro (as the earlier reply to this email informed
you).

Also, had you tried to compile the code with your patch applied you'd
have quickly realized your patch wasn't correct.

Mike


> ---
>  drivers/md/dm-ps-service-time.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-ps-service-time.c b/drivers/md/dm-ps-service-time.c
> index 9cfda66..12dd5ce 100644
> --- a/drivers/md/dm-ps-service-time.c
> +++ b/drivers/md/dm-ps-service-time.c
> @@ -82,7 +82,6 @@ static void st_destroy(struct path_selector *ps)
>  static int st_status(struct path_selector *ps, struct dm_path *path,
>status_type_t type, char *result, unsigned maxlen)
>  {
> - unsigned sz = 0;
>   struct path_info *pi;
>  
>   if (!path)
> @@ -102,7 +101,7 @@ static int st_status(struct path_selector *ps, struct dm_path *path,
>   }
>   }
>  
> - return sz;
> + return 0;
>  }
>  
>  static int st_add_path(struct path_selector *ps, struct dm_path *path,
> -- 
> 1.8.3.1
> 



Re: [PATCH v5 5/6] dm: add 'noexcl' option for dm-linear

2021-02-15 Thread Mike Snitzer
On Mon, Feb 15 2021 at  5:34am -0500,
Sergei Shtepa  wrote:

> The 02/12/2021 19:06, Mike Snitzer wrote:
> > On Fri, Feb 12 2021 at  6:34am -0500,
> > Sergei Shtepa  wrote:
> > 
> > > The 02/11/2021 20:51, Mike Snitzer wrote:
> > > > On Tue, Feb 09 2021 at  9:30am -0500,
> > > > Sergei Shtepa  wrote:
> > > > 
> > > > > The 'noexcl' option allow to open underlying block-device
> > > > > without FMODE_EXCL.
> > > > > 
> > > > > Signed-off-by: Sergei Shtepa 
> > > > > ---
> > > > >  drivers/md/dm-linear.c| 14 +-
> > > > >  drivers/md/dm-table.c | 14 --
> > > > >  drivers/md/dm.c   | 26 +++---
> > > > >  drivers/md/dm.h   |  2 +-
> > > > >  include/linux/device-mapper.h |  7 +++
> > > > >  5 files changed, 48 insertions(+), 15 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > > > > index 00774b5d7668..b16d89802b9d 100644
> > > > > --- a/drivers/md/dm-linear.c
> > > > > +++ b/drivers/md/dm-linear.c
> > > > > @@ -33,7 +33,7 @@ static int linear_ctr(struct dm_target *ti, 
> > > > > unsigned int argc, char **argv)
> > > > >   char dummy;
> > > > >   int ret;
> > > > >  
> > > > > - if (argc != 2) {
> > > > > + if ((argc < 2) || (argc > 3)) {
> > > > >   ti->error = "Invalid argument count";
> > > > >   return -EINVAL;
> > > > >   }
> > > > > @@ -51,6 +51,18 @@ static int linear_ctr(struct dm_target *ti, 
> > > > > unsigned int argc, char **argv)
> > > > >   }
> > > > >   lc->start = tmp;
> > > > >  
> > > > > + ti->non_exclusive = false;
> > > > > + if (argc > 2) {
> > > > > + if (strcmp("noexcl", argv[2]) == 0)
> > > > > + ti->non_exclusive = true;
> > > > > + else if (strcmp("excl", argv[2]) == 0)
> > > > > + ti->non_exclusive = false;
> > > > > + else {
> > > > > + ti->error = "Invalid exclusive option";
> > > > > + return -EINVAL;
> > > > > + }
> > > > > + }
> > > > > +
> > > > >   ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &lc->dev);
> > > > >   if (ret) {
> > > > >   ti->error = "Device lookup failed";
> > > > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > > > > index 4acf2342f7ad..f020459465bd 100644
> > > > > --- a/drivers/md/dm-table.c
> > > > > +++ b/drivers/md/dm-table.c
> > > > > @@ -322,7 +322,7 @@ static int device_area_is_invalid(struct 
> > > > > dm_target *ti, struct dm_dev *dev,
> > > > >   * device and not to touch the existing bdev field in case
> > > > >   * it is accessed concurrently.
> > > > >   */
> > > > > -static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
> > > > > +static int upgrade_mode(struct dm_dev_internal *dd, fmode_t 
> > > > > new_mode, bool non_exclusive,
> > > > >   struct mapped_device *md)
> > > > >  {
> > > > >   int r;
> > > > > @@ -330,8 +330,8 @@ static int upgrade_mode(struct dm_dev_internal 
> > > > > *dd, fmode_t new_mode,
> > > > >  
> > > > >   old_dev = dd->dm_dev;
> > > > >  
> > > > > - r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
> > > > > - dd->dm_dev->mode | new_mode, &new_dev);
> > > > > + r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, 
> > > > > dd->dm_dev->mode | new_mode,
> > > > > + non_exclusive, &new_dev);
> > > > >   if (r)
> > > > >   return r;
> > > > >  
> > > > > @@ -387,7 +387,8 @@ int dm_get_device(struct dm_target *t

Re: [PATCH v5 4/6] dm: new ioctl DM_DEV_REMAP_CMD

2021-02-12 Thread Mike Snitzer
On Tue, Feb 09 2021 at  9:30am -0500,
Sergei Shtepa  wrote:

> The new ioctl DM_DEV_REMAP_CMD allows remapping bio requests
> from a regular block device to a dm device.

I really dislike the (ab)use of "REMAP" for this. DM is and always has
been about remapping IO.  Would prefer DM_DEV_INTERPOSE_CMD

Similarly, all places documenting "remap" or variables with "remap"
changed to "interpose".

Also, any chance you'd be open to putting all these interposer-specific
changes in dm-interposer.[ch]?
(the various internal structs for DM core _should_ be available via dm-core.h)

Mike



Re: [PATCH v5 5/6] dm: add 'noexcl' option for dm-linear

2021-02-12 Thread Mike Snitzer
On Fri, Feb 12 2021 at  6:34am -0500,
Sergei Shtepa  wrote:

> The 02/11/2021 20:51, Mike Snitzer wrote:
> > On Tue, Feb 09 2021 at  9:30am -0500,
> > Sergei Shtepa  wrote:
> > 
> > > The 'noexcl' option allows opening the underlying block device
> > > without FMODE_EXCL.
> > > 
> > > Signed-off-by: Sergei Shtepa 
> > > ---
> > >  drivers/md/dm-linear.c| 14 +-
> > >  drivers/md/dm-table.c | 14 --
> > >  drivers/md/dm.c   | 26 +++---
> > >  drivers/md/dm.h   |  2 +-
> > >  include/linux/device-mapper.h |  7 +++
> > >  5 files changed, 48 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > > index 00774b5d7668..b16d89802b9d 100644
> > > --- a/drivers/md/dm-linear.c
> > > +++ b/drivers/md/dm-linear.c
> > > @@ -33,7 +33,7 @@ static int linear_ctr(struct dm_target *ti, unsigned 
> > > int argc, char **argv)
> > >   char dummy;
> > >   int ret;
> > >  
> > > - if (argc != 2) {
> > > + if ((argc < 2) || (argc > 3)) {
> > >   ti->error = "Invalid argument count";
> > >   return -EINVAL;
> > >   }
> > > @@ -51,6 +51,18 @@ static int linear_ctr(struct dm_target *ti, unsigned 
> > > int argc, char **argv)
> > >   }
> > >   lc->start = tmp;
> > >  
> > > + ti->non_exclusive = false;
> > > + if (argc > 2) {
> > > + if (strcmp("noexcl", argv[2]) == 0)
> > > + ti->non_exclusive = true;
> > > + else if (strcmp("excl", argv[2]) == 0)
> > > + ti->non_exclusive = false;
> > > + else {
> > > + ti->error = "Invalid exclusive option";
> > > + return -EINVAL;
> > > + }
> > > + }
> > > +
> > >   ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), 
> > > &lc->dev);
> > >   if (ret) {
> > >   ti->error = "Device lookup failed";
> > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > > index 4acf2342f7ad..f020459465bd 100644
> > > --- a/drivers/md/dm-table.c
> > > +++ b/drivers/md/dm-table.c
> > > @@ -322,7 +322,7 @@ static int device_area_is_invalid(struct dm_target 
> > > *ti, struct dm_dev *dev,
> > >   * device and not to touch the existing bdev field in case
> > >   * it is accessed concurrently.
> > >   */
> > > -static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
> > > +static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode, 
> > > bool non_exclusive,
> > >   struct mapped_device *md)
> > >  {
> > >   int r;
> > > @@ -330,8 +330,8 @@ static int upgrade_mode(struct dm_dev_internal *dd, 
> > > fmode_t new_mode,
> > >  
> > >   old_dev = dd->dm_dev;
> > >  
> > > - r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
> > > - dd->dm_dev->mode | new_mode, &new_dev);
> > > + r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, dd->dm_dev->mode 
> > > | new_mode,
> > > + non_exclusive, &new_dev);
> > >   if (r)
> > >   return r;
> > >  
> > > @@ -387,7 +387,8 @@ int dm_get_device(struct dm_target *ti, const char 
> > > *path, fmode_t mode,
> > >   if (!dd)
> > >   return -ENOMEM;
> > >  
> > > - if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
> > > + r = dm_get_table_device(t->md, dev, mode, ti->non_exclusive, 
> > > &dd->dm_dev);
> > > + if (r) {
> > >   kfree(dd);
> > >   return r;
> > >   }
> > > @@ -396,8 +397,9 @@ int dm_get_device(struct dm_target *ti, const char 
> > > *path, fmode_t mode,
> > >   list_add(&dd->list, &t->devices);
> > >   goto out;
> > >  
> > > - } else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
> > > - r = upgrade_mode(dd, mode, t->md);
> > > + } else if ((dd->dm_dev->mode != (mode | dd->dm_dev->mode)) &&
> > > +(dd->dm_dev->non_exclusive != ti->non_exclusive)) {

Re: [PATCH v4 0/5] add support for inline encryption to device mapper

2021-02-11 Thread Mike Snitzer
On Thu, Feb 11 2021 at  6:01pm -0500,
Satya Tangirala  wrote:

> On Wed, Feb 10, 2021 at 12:59:59PM -0700, Jens Axboe wrote:
> > On 2/10/21 12:33 PM, Mike Snitzer wrote:
> > > On Mon, Feb 01 2021 at 12:10am -0500,
> > > Satya Tangirala  wrote:
> > > 
> > >> This patch series adds support for inline encryption to the device 
> > >> mapper.
> > >>
> > >> Patch 1 introduces the "passthrough" keyslot manager.
> > >>
> > >> The regular keyslot manager is designed for inline encryption hardware 
> > >> that
> > >> have only a small fixed number of keyslots. A DM device itself does not
> > >> actually have only a small fixed number of keyslots - it doesn't actually
> > >> have any keyslots in the first place, and programming an encryption 
> > >> context
> > >> into a DM device doesn't make much semantic sense. It is possible for a 
> > >> DM
> > >> device to set up a keyslot manager with some "sufficiently large" number 
> > >> of
> > >> keyslots in its request queue, so that upper layers can use the inline
> > >> encryption capabilities of the DM device's underlying devices, but the
> > >> memory being allocated for the DM device's keyslots is a waste since they
> > >> won't actually be used by the DM device.
> > >>
> > >> The passthrough keyslot manager solves this issue - when the block layer
> > >> sees that a request queue has a passthrough keyslot manager, it doesn't
> > >> attempt to program any encryption context into the keyslot manager. The
> > >> passthrough keyslot manager only allows the device to expose its inline
> > >> encryption capabilities, and a way for upper layers to evict keys if
> > >> necessary.
> > >>
> > >> There also exist inline encryption hardware that can handle encryption
> > >> contexts directly, and allow users to pass them a data request along with
> > >> the encryption context (as opposed to inline encryption hardware that
> > >> require users to first program a keyslot with an encryption context, and
> > >> then require the users to pass the keyslot index with the data request).
> > >> Such devices can also make use of the passthrough keyslot manager.
> > >>
> > >> Patch 2 introduces some keyslot manager functions useful for the device
> > >> mapper.
> > >>
> > >> Patch 3 introduces the changes for inline encryption support for the 
> > >> device
> > >> mapper. A DM device only exposes the intersection of the crypto
> > >> capabilities of its underlying devices. This is so that in case a bio 
> > >> with
> > >> an encryption context is eventually mapped to an underlying device that
> > >> doesn't support that encryption context, the blk-crypto-fallback's cipher
> > >> tfms are allocated ahead of time by the call to 
> > >> blk_crypto_start_using_key.
> > >>
> > >> Each DM target can now also specify the "DM_TARGET_PASSES_CRYPTO" flag in
> > >> the target type features to opt-in to supporting passing through the
> > >> underlying inline encryption capabilities.  This flag is needed because 
> > >> it
> > >> doesn't make much semantic sense for certain targets like dm-crypt to
> > >> expose the underlying inline encryption capabilities to the upper layers.
> > >> Again, the DM exposes inline encryption capabilities of the underlying
> > >> devices only if all of them opt-in to passing through inline encryption
> > >> support.
> > >>
> > >> A keyslot manager is created for a table when it is loaded. However, the
> > >> mapped device's exposed capabilities are *only* updated once the table is
> > >> swapped in (until the new table is swapped in, the mapped device 
> > >> continues
> > >> to expose the old table's crypto capabilities).
> > >>
> > >> This patch only allows the keyslot manager's capabilities to *expand*
> > >> because of table changes. Any attempt to load a new table that doesn't
> > >> support a crypto capability that the old table did is rejected.
> > >>
> > >> This patch also only exposes the intersection of the underlying device's
> > >> capabilities, which has the effect of causing en/decryption of a bio to
> > >> fall back to the kernel crypto API (if the fallback is enable

Re: [PATCH v5 5/6] dm: add 'noexcl' option for dm-linear

2021-02-11 Thread Mike Snitzer
On Tue, Feb 09 2021 at  9:30am -0500,
Sergei Shtepa  wrote:

> The 'noexcl' option allows opening the underlying block device
> without FMODE_EXCL.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  drivers/md/dm-linear.c| 14 +-
>  drivers/md/dm-table.c | 14 --
>  drivers/md/dm.c   | 26 +++---
>  drivers/md/dm.h   |  2 +-
>  include/linux/device-mapper.h |  7 +++
>  5 files changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 00774b5d7668..b16d89802b9d 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -33,7 +33,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>   char dummy;
>   int ret;
>  
> - if (argc != 2) {
> + if ((argc < 2) || (argc > 3)) {
>   ti->error = "Invalid argument count";
>   return -EINVAL;
>   }
> @@ -51,6 +51,18 @@ static int linear_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>   }
>   lc->start = tmp;
>  
> + ti->non_exclusive = false;
> + if (argc > 2) {
> + if (strcmp("noexcl", argv[2]) == 0)
> + ti->non_exclusive = true;
> + else if (strcmp("excl", argv[2]) == 0)
> + ti->non_exclusive = false;
> + else {
> + ti->error = "Invalid exclusive option";
> + return -EINVAL;
> + }
> + }
> +
>   ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), 
> &lc->dev);
>   if (ret) {
>   ti->error = "Device lookup failed";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 4acf2342f7ad..f020459465bd 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -322,7 +322,7 @@ static int device_area_is_invalid(struct dm_target *ti, 
> struct dm_dev *dev,
>   * device and not to touch the existing bdev field in case
>   * it is accessed concurrently.
>   */
> -static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
> +static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode, bool 
> non_exclusive,
>   struct mapped_device *md)
>  {
>   int r;
> @@ -330,8 +330,8 @@ static int upgrade_mode(struct dm_dev_internal *dd, 
> fmode_t new_mode,
>  
>   old_dev = dd->dm_dev;
>  
> - r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
> - dd->dm_dev->mode | new_mode, &new_dev);
> + r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, dd->dm_dev->mode 
> | new_mode,
> + non_exclusive, &new_dev);
>   if (r)
>   return r;
>  
> @@ -387,7 +387,8 @@ int dm_get_device(struct dm_target *ti, const char *path, 
> fmode_t mode,
>   if (!dd)
>   return -ENOMEM;
>  
> - if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
> + r = dm_get_table_device(t->md, dev, mode, ti->non_exclusive, 
> &dd->dm_dev);
> + if (r) {
>   kfree(dd);
>   return r;
>   }
> @@ -396,8 +397,9 @@ int dm_get_device(struct dm_target *ti, const char *path, 
> fmode_t mode,
>   list_add(&dd->list, &t->devices);
>   goto out;
>  
> - } else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
> - r = upgrade_mode(dd, mode, t->md);
> + } else if ((dd->dm_dev->mode != (mode | dd->dm_dev->mode)) &&
> +(dd->dm_dev->non_exclusive != ti->non_exclusive)) {
> + r = upgrade_mode(dd, mode, ti->non_exclusive, t->md);
>   if (r)
>   return r;
>   }
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 00c41aa6d092..c25dcc2fdb89 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1117,33 +1117,44 @@ static void close_table_device(struct table_device 
> *td, struct mapped_device *md
>   if (!td->dm_dev.bdev)
>   return;
>  
> - bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> - blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> + if (td->dm_dev.non_exclusive)
> + blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
> + else {
> + bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
> + blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
> + }
> +
> +
> + blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
> +
>   put_dax(td->dm_dev.dax_dev);
>   td->dm_dev.bdev = NULL;
>   td->dm_dev.dax_dev = NULL;
> + td->dm_dev.non_exclusive = false;
>  }
>  
>  static struct table_device *find_table_device(struct list_head *l, dev_t dev,
> -   fmode_t mode)
> +   fmode_t mode, bool non_exclusive)
>  {
>   struct table_device *td;
>  
>   list_for_each_entry(td, l, list)
> - 

Re: linux-next: build failure after merge of the device-mapper tree

2021-02-11 Thread Mike Snitzer
On Wed, Feb 10 2021 at 10:36pm -0500,
Stephen Rothwell  wrote:

> Hi all,
> 
> After merging the device-mapper tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> drivers/md/dm-linear.c:234:3: error: 'struct target_type' has no member named 
> 'report_zones'
>   234 |  .report_zones = linear_report_zones,
>   |   ^~~~
> drivers/md/dm-crypt.c:3585:3: error: 'struct target_type' has no member named 
> 'report_zones'
>  3585 |  .report_zones = crypt_report_zones,
>   |   ^~~~
> 
> Caused by commit
> 
>   7defd0da9dd2 ("dm: simplify target code conditional on 
> CONFIG_BLK_DEV_ZONED")
> 
> The report_zones members existence is guarded by CONFIG_BLK_DEV_ZONED.
> 
> I have used the device-mapper tree from next-20210210 for today.

Sorry, I didn't test with CONFIG_BLK_DEV_ZONED disabled, now fixed.

Thanks,
Mike



Re: [PATCH v4 0/5] add support for inline encryption to device mapper

2021-02-10 Thread Mike Snitzer
On Mon, Feb 01 2021 at 12:10am -0500,
Satya Tangirala  wrote:

> This patch series adds support for inline encryption to the device mapper.
> 
> Patch 1 introduces the "passthrough" keyslot manager.
> 
> The regular keyslot manager is designed for inline encryption hardware that
> have only a small fixed number of keyslots. A DM device itself does not
> actually have only a small fixed number of keyslots - it doesn't actually
> have any keyslots in the first place, and programming an encryption context
> into a DM device doesn't make much semantic sense. It is possible for a DM
> device to set up a keyslot manager with some "sufficiently large" number of
> keyslots in its request queue, so that upper layers can use the inline
> encryption capabilities of the DM device's underlying devices, but the
> memory being allocated for the DM device's keyslots is a waste since they
> won't actually be used by the DM device.
> 
> The passthrough keyslot manager solves this issue - when the block layer
> sees that a request queue has a passthrough keyslot manager, it doesn't
> attempt to program any encryption context into the keyslot manager. The
> passthrough keyslot manager only allows the device to expose its inline
> encryption capabilities, and a way for upper layers to evict keys if
> necessary.
> 
> There also exist inline encryption hardware that can handle encryption
> contexts directly, and allow users to pass them a data request along with
> the encryption context (as opposed to inline encryption hardware that
> require users to first program a keyslot with an encryption context, and
> then require the users to pass the keyslot index with the data request).
> Such devices can also make use of the passthrough keyslot manager.
> 
> Patch 2 introduces some keyslot manager functions useful for the device
> mapper.
> 
> Patch 3 introduces the changes for inline encryption support for the device
> mapper. A DM device only exposes the intersection of the crypto
> capabilities of its underlying devices. This is so that in case a bio with
> an encryption context is eventually mapped to an underlying device that
> doesn't support that encryption context, the blk-crypto-fallback's cipher
> tfms are allocated ahead of time by the call to blk_crypto_start_using_key.
> 
> Each DM target can now also specify the "DM_TARGET_PASSES_CRYPTO" flag in
> the target type features to opt-in to supporting passing through the
> underlying inline encryption capabilities.  This flag is needed because it
> doesn't make much semantic sense for certain targets like dm-crypt to
> expose the underlying inline encryption capabilities to the upper layers.
> Again, the DM exposes inline encryption capabilities of the underlying
> devices only if all of them opt-in to passing through inline encryption
> support.
> 
> A keyslot manager is created for a table when it is loaded. However, the
> mapped device's exposed capabilities are *only* updated once the table is
> swapped in (until the new table is swapped in, the mapped device continues
> to expose the old table's crypto capabilities).
> 
> This patch only allows the keyslot manager's capabilities to *expand*
> because of table changes. Any attempt to load a new table that doesn't
> support a crypto capability that the old table did is rejected.
> 
> This patch also only exposes the intersection of the underlying device's
> capabilities, which has the effect of causing en/decryption of a bio to
> fall back to the kernel crypto API (if the fallback is enabled) whenever
> any of the underlying devices doesn't support the encryption context of the
> bio - it might be possible to make the bio only fall back to the kernel
> crypto API if the bio's target underlying device doesn't support the bio's
> encryption context, but the use case may be uncommon enough in the first
> place not to warrant worrying about it right now.
> 
> Patch 4 makes DM evict a key from all its underlying devices when asked to
> evict a key.
> 
> Patch 5 makes some DM targets opt-in to passing through inline encryption
> support. It does not (yet) try to enable this option with dm-raid, since
> users can "hot add" disks to a raid device, which makes this not completely
> straightforward (we'll need to ensure that any "hot added" disks must have
> a superset of the inline encryption capabilities of the rest of the disks
> in the raid device, due to the way Patch 2 of this series works).
> 
> Changes v3 => v4:
>  - Allocate the memory for the ksm of the mapped device in
>dm_table_complete(), and install the ksm in the md queue in __bind()
>(as suggested by Mike). Also drop patch 5 from v3 since it's no longer
>needed.
>  - Some cleanups
> 
> Changes v2 => v3:
>  - Split up the main DM patch into 4 separate patches
>  - Removed the priv variable added to struct keyslot manager in v2
>  - Use a flag in target type features for opting-in to inline encryption
>support, instead of using 

Re: [PATCH v4 2/6] block: add blk_interposer

2021-02-03 Thread Mike Snitzer
On Wed, Feb 03 2021 at 10:53am -0500,
Sergei Shtepa  wrote:

> blk_interposer allows intercepting bio requests, remapping bios to other
> devices, or adding new bios.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  block/bio.c   |  2 +
>  block/blk-core.c  | 33 
>  block/genhd.c | 82 +++
>  include/linux/blk_types.h |  6 ++-
>  include/linux/genhd.h | 18 +
>  5 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..f6f135eb84b5 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -684,6 +684,8 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>   bio_set_flag(bio, BIO_CLONED);
>   if (bio_flagged(bio_src, BIO_THROTTLED))
>   bio_set_flag(bio, BIO_THROTTLED);
> + if (bio_flagged(bio_src, BIO_INTERPOSED))
> + bio_set_flag(bio, BIO_INTERPOSED);
>   bio->bi_opf = bio_src->bi_opf;
>   bio->bi_ioprio = bio_src->bi_ioprio;
>   bio->bi_write_hint = bio_src->bi_write_hint;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7663a9b94b80..c84bc42ba88b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1032,6 +1032,32 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>   return ret;
>  }
>  
> +static blk_qc_t __submit_bio_interposed(struct bio *bio)
> +{
> + struct bio_list bio_list[2] = { };
> + blk_qc_t ret = BLK_QC_T_NONE;
> +
> + current->bio_list = bio_list;
> + if (likely(bio_queue_enter(bio) == 0)) {
> + struct gendisk *disk = bio->bi_disk;
> +
> + if (likely(blk_has_interposer(disk))) {
> + bio_set_flag(bio, BIO_INTERPOSED);
> + disk->interposer->ip_submit_bio(bio);
> + } else /* interposer was removed */
> + bio_list_add(&current->bio_list[0], bio);

style nit:

} else {
/* interposer was removed */
bio_list_add(&current->bio_list[0], bio);
}

> +
> + blk_queue_exit(disk->queue);
> + }
> + current->bio_list = NULL;
> +
> + /* Resubmit remaining bios */
> + while ((bio = bio_list_pop(&bio_list[0])))
> + ret = submit_bio_noacct(bio);
> +
> + return ret;
> +}
> +
>  /**
>   * submit_bio_noacct - re-submit a bio to the block device layer for I/O
>   * @bio:  The bio describing the location in memory and on the device.
> @@ -1057,6 +1083,13 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
>   return BLK_QC_T_NONE;
>   }
>  
> + /*
> +  * Checking the BIO_INTERPOSED flag is necessary so that the bio
> +  * created by the blk_interposer do not get to it for processing.
> +  */
> + if (blk_has_interposer(bio->bi_disk) &&
> + !bio_flagged(bio, BIO_INTERPOSED))
> + return __submit_bio_interposed(bio);
>   if (!bio->bi_disk->fops->submit_bio)
>   return __submit_bio_noacct_mq(bio);
>   return __submit_bio_noacct(bio);
> diff --git a/block/genhd.c b/block/genhd.c
> index 419548e92d82..39785a3ef703 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -30,6 +30,7 @@
>  static struct kobject *block_depr;
>  
>  DECLARE_RWSEM(bdev_lookup_sem);
> +DEFINE_MUTEX(bdev_interposer_mutex);

Seems you're using this mutex to protect access to disk->interposer in
attach/detach.  Is this to prevent attach/detach races on the same device?

Thankfully attach/detach isn't in the bio submission fast path, but it'd
be helpful to document what this mutex is protecting.

A storm of attaches or detaches will all hit this global mutex, though...

Mike



Re: [PATCH v4 3/6] block: add blk_mq_is_queue_frozen()

2021-02-03 Thread Mike Snitzer
On Wed, Feb 03 2021 at 10:53am -0500,
Sergei Shtepa  wrote:

> blk_mq_is_queue_frozen() allows asserting that the queue is frozen.
> 
> Signed-off-by: Sergei Shtepa 
> ---
>  block/blk-mq.c | 13 +
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f285a9123a8b..924ec26fae5f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -161,6 +161,19 @@ int blk_mq_freeze_queue_wait_timeout(struct 
> request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
>  
> +
> +bool blk_mq_is_queue_frozen(struct request_queue *q)
> +{
> + bool ret;
> +
> + mutex_lock(&q->mq_freeze_lock);
> + ret = percpu_ref_is_dying(&q->q_usage_counter) && 
> percpu_ref_is_zero(&q->q_usage_counter);
> + mutex_unlock(&q->mq_freeze_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
> +
>  /*
>   * Guarantee no request is in use, so we can change any data structure of
>   * the queue afterward.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index d705b174d346..9d1e8c4e922e 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -525,6 +525,7 @@ void blk_freeze_queue_start(struct request_queue *q);
>  void blk_mq_freeze_queue_wait(struct request_queue *q);
>  int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
>unsigned long timeout);
> +bool blk_mq_is_queue_frozen(struct request_queue *q);
>  
>  int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
>  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int 
> nr_hw_queues);
> -- 
> 2.20.1
> 

This needs to come before patch 2 (since patch 2 uses it).

Mike



Re: [PATCH 1/2] dm crypt: replaced #if defined with IS_ENABLED

2021-02-02 Thread Mike Snitzer
On Fri, Jan 22 2021 at  3:43am -0500,
Ahmad Fatoum  wrote:

> IS_ENABLED(CONFIG_ENCRYPTED_KEYS) is true whether the option is built-in
> or a module, so use it instead of #if defined checking for each
> separately.
> 
> The other #if was to avoid a static function defined, but unused
> warning. As we now always build the callsite when the function
> is defined, we can remove that first #if guard.
> 
> Suggested-by: Arnd Bergmann 
> Signed-off-by: Ahmad Fatoum 
> ---
> Cc: Dmitry Baryshkov 
> ---
>  drivers/md/dm-crypt.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 8c874710f0bc..7eeb9248eda5 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2436,7 +2436,6 @@ static int set_key_user(struct crypt_config *cc, struct 
> key *key)
>   return 0;
>  }
>  
> -#if defined(CONFIG_ENCRYPTED_KEYS) || defined(CONFIG_ENCRYPTED_KEYS_MODULE)
>  static int set_key_encrypted(struct crypt_config *cc, struct key *key)
>  {
>   const struct encrypted_key_payload *ekp;
> @@ -2452,7 +2451,6 @@ static int set_key_encrypted(struct crypt_config *cc, 
> struct key *key)
>  
>   return 0;
>  }
> -#endif /* CONFIG_ENCRYPTED_KEYS */
>  
>  static int crypt_set_keyring_key(struct crypt_config *cc, const char 
> *key_string)
>  {
> @@ -2482,11 +2480,10 @@ static int crypt_set_keyring_key(struct crypt_config 
> *cc, const char *key_string
>   } else if (!strncmp(key_string, "user:", key_desc - key_string + 1)) {
>   type = &key_type_user;
>   set_key = set_key_user;
> -#if defined(CONFIG_ENCRYPTED_KEYS) || defined(CONFIG_ENCRYPTED_KEYS_MODULE)
> - } else if (!strncmp(key_string, "encrypted:", key_desc - key_string + 
> 1)) {
> + } else if (IS_ENABLED(CONFIG_ENCRYPTED_KEYS) &&
> +!strncmp(key_string, "encrypted:", key_desc - key_string + 
> 1)) {
>   type = &key_type_encrypted;
>   set_key = set_key_encrypted;
> -#endif
>   } else {
>   return -EINVAL;
>   }
> -- 
> 2.30.0
> 

I could be mistaken, but the point of the previous way of enabling this
code was to not compile it at all.  As you have it, the branch just
isn't taken, but the compiled code is left to bloat dm-crypt.

Why not leave this as is and follow the same pattern in your next patch?

Mike



Re: [RFC PATCH 00/37] block: introduce bio_init_fields()

2021-01-19 Thread Mike Snitzer
On Tue, Jan 19 2021 at 12:05am -0500,
Chaitanya Kulkarni  wrote:

> Hi,
> 
> This is a *compile only RFC* which adds a generic helper to initialize
> the various fields of the bio that is repeated all the places in
> file-systems, block layer, and drivers.
> 
> The new helper allows callers to initialize various members such as
> bdev, sector, private, end io callback, io priority, and write hints.
> 
> The objective of this RFC is only to start a discussion; it is not
> completely tested at all.
> Following diff shows code level benefits of this helper :-
>  38 files changed, 124 insertions(+), 236 deletions(-)


Please no... this is just obfuscation.

Adding yet another field to set would create a cascade of churn
throughout kernel (and invariably many callers won't need the new field
initialized, so you keep passing 0 for more and more fields).

Nacked-by: Mike Snitzer 



Re: [PATCH v3 5/6] dm: Verify inline encryption capabilities of new table when it is loaded

2021-01-14 Thread Mike Snitzer
On Tue, Dec 29 2020 at  3:55am -0500,
Satya Tangirala  wrote:

> DM only allows the table to be swapped if the new table's inline encryption
> capabilities are a superset of the old table's. We only check that this
> constraint is true when the table is actually swapped in (in
> dm_swap_table()). But this allows a user to load an unacceptable table
> without any complaint from DM, only for DM to throw an error when the
> device is resumed, and the table is swapped in.
> 
> This patch makes DM verify the inline encryption capabilities of the new
> table when the table is loaded. DM continues to verify and use the
> capabilities at the time of table swap, since the capabilities of
> underlying child devices can expand during the time between the table load
> and table swap (which in turn can cause the capabilities of this parent
> device to expand as well).
> 
> Signed-off-by: Satya Tangirala 
> ---
>  drivers/md/dm-ioctl.c |  8 
>  drivers/md/dm.c   | 25 +
>  drivers/md/dm.h   | 19 +++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index 5e306bba4375..055a3c745243 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1358,6 +1358,10 @@ static int table_load(struct file *filp, struct 
> dm_ioctl *param, size_t param_size)
>   goto err_unlock_md_type;
>   }
>  
> + r = dm_verify_inline_encryption(md, t);
> + if (r)
> + goto err_unlock_md_type;
> +
>   if (dm_get_md_type(md) == DM_TYPE_NONE) {
>   /* Initial table load: acquire type of table. */
>   dm_set_md_type(md, dm_table_get_type(t));
> @@ -2115,6 +2119,10 @@ int __init dm_early_create(struct dm_ioctl *dmi,
>   if (r)
>   goto err_destroy_table;
>  
> + r = dm_verify_inline_encryption(md, t);
> + if (r)
> + goto err_destroy_table;
> +
>   md->type = dm_table_get_type(t);
>   /* setup md->queue to reflect md's type (may block) */
>   r = dm_setup_md_queue(md, t);
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index b8844171d8e4..04322de34d29 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2094,6 +2094,31 @@ dm_construct_keyslot_manager(struct mapped_device *md, 
> struct dm_table *t)
>   return ksm;
>  }
>  
> +/**
> + * dm_verify_inline_encryption() - Verifies that the current keyslot manager 
> of
> + *  the mapped_device can be replaced by the
> + *  keyslot manager of a given dm_table.
> + * @md: The mapped_device
> + * @t: The dm_table
> + *
> + * In particular, this function checks that the keyslot manager that will be
> + * constructed for the dm_table will support a superset of the capabilities 
> that
> + * the current keyslot manager of the mapped_device supports.
> + *
> + * Return: 0 if the table's keyslot_manager can replace the current keyslot
> + *  manager of the mapped_device. Negative value otherwise.
> + */
> +int dm_verify_inline_encryption(struct mapped_device *md, struct dm_table *t)
> +{
> + struct blk_keyslot_manager *ksm = dm_construct_keyslot_manager(md, t);
> +
> + if (IS_ERR(ksm))
> + return PTR_ERR(ksm);
> + dm_destroy_keyslot_manager(ksm);
> +
> + return 0;
> +}
> +
>  static void dm_update_keyslot_manager(struct request_queue *q,
> struct blk_keyslot_manager *ksm)
>  {


There shouldn't be any need to bolt on ksm verification in terms of a
temporary ksm.  If you run with my suggestions I just provided in review
of patch 3: dm_table_complete()'s setup of the ksm should also
implicitly validate it.

So this patch, and extra dm_verify_inline_encryption() interface,
shouldn't be needed.

Mike



Re: [PATCH v3 3/6] dm: add support for passing through inline crypto support

2021-01-14 Thread Mike Snitzer
On Tue, Dec 29 2020 at  3:55am -0500,
Satya Tangirala  wrote:

> Update the device-mapper core to support exposing the inline crypto
> support of the underlying device(s) through the device-mapper device.
> 
> This works by creating a "passthrough keyslot manager" for the dm
> device, which declares support for encryption settings which all
> underlying devices support.  When a supported setting is used, the bio
> cloning code handles cloning the crypto context to the bios for all the
> underlying devices.  When an unsupported setting is used, the blk-crypto
> fallback is used as usual.
> 
> Crypto support on each underlying device is ignored unless the
> corresponding dm target opts into exposing it.  This is needed because
> for inline crypto to semantically operate on the original bio, the data
> must not be transformed by the dm target.  Thus, targets like dm-linear
> can expose crypto support of the underlying device, but targets like
> dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> 
> A DM device's table can only be changed if the "new" inline encryption
> capabilities are a (*not* necessarily strict) superset of the "old" inline
> encryption capabilities.  Attempts to make changes to the table that result
> in some inline encryption capability becoming no longer supported will be
> rejected.
> 
> For the sake of clarity, key eviction from underlying devices will be
> handled in a future patch.
> 
> Co-developed-by: Eric Biggers 
> Signed-off-by: Eric Biggers 
> Signed-off-by: Satya Tangirala 
> ---
>  drivers/md/dm.c | 164 +++-
>  include/linux/device-mapper.h   |   6 ++
>  include/linux/keyslot-manager.h |   8 ++
>  3 files changed, 177 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index b3c3c8b4cb42..13b9c8e2e21b 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/keyslot-manager.h>
>  
>  #define DM_MSG_PREFIX "core"
>  
> @@ -1718,6 +1719,8 @@ static const struct dax_operations dm_dax_ops;
>  
>  static void dm_wq_work(struct work_struct *work);
>  
> +static void dm_destroy_inline_encryption(struct request_queue *q);
> +
>  static void cleanup_mapped_device(struct mapped_device *md)
>  {
>   if (md->wq)
> @@ -1739,8 +1742,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
>   put_disk(md->disk);
>   }
>  
> - if (md->queue)
> + if (md->queue) {
> + dm_destroy_inline_encryption(md->queue);
>   blk_cleanup_queue(md->queue);
> + }
>  
> 	cleanup_srcu_struct(&md->io_barrier);
>  
> @@ -1937,6 +1942,150 @@ static void event_callback(void *context)
>   dm_issue_global_event();
>  }
>  
> +#ifdef CONFIG_BLK_INLINE_ENCRYPTION
> +
> +struct dm_keyslot_manager {
> + struct blk_keyslot_manager ksm;
> + struct mapped_device *md;
> +};
> +
> +static int device_intersect_crypto_modes(struct dm_target *ti,
> +  struct dm_dev *dev, sector_t start,
> +  sector_t len, void *data)
> +{
> + struct blk_keyslot_manager *parent = data;
> + struct blk_keyslot_manager *child = bdev_get_queue(dev->bdev)->ksm;
> +
> + blk_ksm_intersect_modes(parent, child);
> + return 0;
> +}
> +
> +static void dm_destroy_keyslot_manager(struct blk_keyslot_manager *ksm)
> +{
> +	struct dm_keyslot_manager *dksm = container_of(ksm,
> +						       struct dm_keyslot_manager,
> +						       ksm);
> +
> + if (!ksm)
> + return;
> +
> + blk_ksm_destroy(ksm);
> + kfree(dksm);
> +}
> +
> +/*
> + * Constructs and returns a keyslot manager that represents the crypto
> + * capabilities of the devices described by the dm_table. However, if the
> + * constructed keyslot manager does not support a superset of the crypto
> + * capabilities supported by the current keyslot manager of the mapped_device,
> + * it returns an error instead, since we don't support restricting crypto
> + * capabilities on table changes. Finally, if the constructed keyslot manager
> + * doesn't actually support any crypto modes at all, it just returns NULL.
> + */
> +static struct blk_keyslot_manager *
> +dm_construct_keyslot_manager(struct mapped_device *md, struct dm_table *t)
> +{
> + struct dm_keyslot_manager *dksm;
> + struct blk_keyslot_manager *ksm;
> + struct dm_target *ti;
> + unsigned int i;
> + bool ksm_is_empty = true;
> +
> + dksm = kmalloc(sizeof(*dksm), GFP_KERNEL);
> + if (!dksm)
> + return ERR_PTR(-ENOMEM);
> + dksm->md = md;
> +
> +	ksm = &dksm->ksm;
> + blk_ksm_init_passthrough(ksm);
> + ksm->max_dun_bytes_supported = UINT_MAX;
> + memset(ksm->crypto_modes_supported, 0xFF,
> +sizeof(ksm->crypto_modes_supported));
> +
> + for (i = 0; i < 

Re: [PATCH v3 2/6] block: keyslot-manager: Introduce functions for device mapper support

2021-01-14 Thread Mike Snitzer
On Tue, Dec 29 2020 at  3:55am -0500,
Satya Tangirala  wrote:

> Introduce blk_ksm_update_capabilities() to update the capabilities of
> a keyslot manager (ksm) in-place. The pointer to a ksm in a device's
> request queue may not be easily replaced, because upper layers like
> the filesystem might access it (e.g. for programming keys/checking
> capabilities) at the same time the device wants to replace that
> request queue's ksm (and free the old ksm's memory). This function
> allows the device to update the capabilities of the ksm in its request
> queue directly.
> 
> Also introduce blk_ksm_is_superset() which checks whether one ksm's
> capabilities are a (not necessarily strict) superset of another ksm's.
> The blk-crypto framework requires that crypto capabilities that were
> advertised when a bio was created continue to be supported by the
> device until that bio is ended - in practice this probably means that
> a device's advertised crypto capabilities can *never* "shrink" (since
> there's no synchronization between bio creation and when a device may
> want to change its advertised capabilities) - so a previously
> advertised crypto capability must always continue to be supported.
> This function can be used to check that a new ksm is a valid
> replacement for an old ksm.
> 
> Signed-off-by: Satya Tangirala 
> ---
>  block/keyslot-manager.c | 91 +
>  include/linux/keyslot-manager.h |  9 
>  2 files changed, 100 insertions(+)
> 
> diff --git a/block/keyslot-manager.c b/block/keyslot-manager.c
> index ac7ce83a76e8..f13ab7410eca 100644
> --- a/block/keyslot-manager.c
> +++ b/block/keyslot-manager.c
> @@ -424,6 +424,97 @@ void blk_ksm_unregister(struct request_queue *q)
>   q->ksm = NULL;
>  }
>  
> +/**
> + * blk_ksm_intersect_modes() - restrict supported modes by child device
> + * @parent: The keyslot manager for parent device
> + * @child: The keyslot manager for child device, or NULL
> + *
> + * Clear any crypto mode support bits in @parent that aren't set in @child.
> + * If @child is NULL, then all parent bits are cleared.
> + *
> + * Only use this when setting up the keyslot manager for a layered device,
> + * before it's been exposed yet.
> + */
> +void blk_ksm_intersect_modes(struct blk_keyslot_manager *parent,
> +  const struct blk_keyslot_manager *child)
> +{
> + if (child) {
> + unsigned int i;
> +
> + parent->max_dun_bytes_supported =
> + min(parent->max_dun_bytes_supported,
> + child->max_dun_bytes_supported);
> + for (i = 0; i < ARRAY_SIZE(child->crypto_modes_supported);
> +  i++) {
> + parent->crypto_modes_supported[i] &=
> + child->crypto_modes_supported[i];
> + }
> + } else {
> + parent->max_dun_bytes_supported = 0;
> + memset(parent->crypto_modes_supported, 0,
> +sizeof(parent->crypto_modes_supported));
> + }
> +}
> +EXPORT_SYMBOL_GPL(blk_ksm_intersect_modes);
> +
> +/**
> + * blk_ksm_is_superset() - Check if a KSM supports a superset of crypto modes
> + *  and DUN bytes that another KSM supports. Here,
> + *  "superset" refers to the mathematical meaning of the
> + *  word - i.e. if two KSMs have the *same* capabilities,
> + *  they *are* considered supersets of each other.
> + * @ksm_superset: The KSM that we want to verify is a superset
> + * @ksm_subset: The KSM that we want to verify is a subset
> + *
> + * Return: True if @ksm_superset supports a superset of the crypto modes and DUN
> + *  bytes that @ksm_subset supports.
> + */
> +bool blk_ksm_is_superset(struct blk_keyslot_manager *ksm_superset,
> +  struct blk_keyslot_manager *ksm_subset)
> +{
> + int i;
> +
> + if (!ksm_subset)
> + return true;
> +
> + if (!ksm_superset)
> + return false;
> +
> + for (i = 0; i < ARRAY_SIZE(ksm_superset->crypto_modes_supported); i++) {
> + if (ksm_subset->crypto_modes_supported[i] &
> + (~ksm_superset->crypto_modes_supported[i])) {
> + return false;
> + }
> + }
> +
> + if (ksm_subset->max_dun_bytes_supported >
> + ksm_superset->max_dun_bytes_supported) {
> + return false;
> + }
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(blk_ksm_is_superset);
> +
> +/**
> + * blk_ksm_update_capabilities() - Update the restrictions of a KSM to those of
> + *  another KSM
> + * @target_ksm: The KSM whose restrictions to update.
> + * @reference_ksm: The KSM to whose restrictions this function will update
> + *  @target_ksm's restrictions to,
> + */
> +void blk_ksm_update_capabilities(struct blk_keyslot_manager *target_ksm,
> +

Re: dm snap : add sanity checks to snapshot_ctr

2021-01-04 Thread Mike Snitzer
On Fri, Dec 25 2020 at  1:48am -0500,
Defang Bo  wrote:

> Similar to commit <70de2cbd>, there should be a check for argc and argv to
> prevent a NULL pointer dereference when dm_get_device() is invoked twice on
> the same device path with different modes.
> 
> Signed-off-by: Defang Bo 
> ---
>  drivers/md/dm-snap.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 4668b2c..dccce8b 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1258,6 +1258,13 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>  
>   as.argc = argc;
>   as.argv = argv;
> +
> + if (!strcmp(argv[0], argv[1])) {
> + ti->error = "Error setting metadata or data device";
> + r = -EINVAL;
> + goto bad;
> + }
> +
> 	dm_consume_args(&as, 4);
> 	r = parse_snapshot_features(&as, s, ti);
>   if (r)
> -- 
> 2.7.4
> 

We already have this later in snapshot_ctr:

if (cow_dev && cow_dev == origin_dev) {
ti->error = "COW device cannot be the same as origin device";
r = -EINVAL;
goto bad_cow;
}

Which happens before the 2nd dm_get_device() for the cow device.  So
I'm not seeing how you could experience the NULL pointer you say is
possible.

Mike



DM's filesystem lookup in dm_get_dev_t() [was: Re: linux-next: manual merge of the device-mapper tree with Linus' tree]

2020-12-22 Thread Mike Snitzer
[added linux-block and dm-devel, if someone replies to this email to
continue "proper discussion" _please_ at least drop sfr and linux-next
from Cc]

On Tue, Dec 22 2020 at  8:15am -0500,
Christoph Hellwig  wrote:

> Mike, Hannes,
> 
> I think this patch is rather harmful.  Why does device mapper even
> mix file system path with a dev_t and all the other weird forms
> parsed by name_to_dev_t, which was supposed to be be for the early
> init code where no file system is available.

OK, I'll need to revisit (unless someone beats me to it) because this
could've easily been a blind-spot for me when the dm-init code went in.
Any dm-init specific enabling interface shouldn't be used by more
traditional DM interfaces.  So Hannes' change might be treating the symptom
rather than the core problem (which would be better treated by factoring
out dm-init requirements for a name_to_dev_t()-like interface?).

DM has supported passing maj:min and blockdev names on DM table lines
forever... so we'll need to be very specific about where/why things
regressed.

> Can we please kick off a proper discussion for this on the linux-block
> list?

Sure, done. But I won't drive that discussion in the near-term. I need
to take some time off for a few weeks.

In the meantime I'll drop Hannes' patch for 5.11; I'm open to an
alternative fix that I'd pickup during 5.11-rcX.

Thanks,
Mike



Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace

2020-12-22 Thread Mike Snitzer
On Tue, Dec 22 2020 at  8:32am -0500,
Christoph Hellwig  wrote:

> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have 
> > some
> > reasonable performance numbers before bothering with a v2.
> 
> FYI, my other main worry beside duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block device.
> Turning this into something else that just pipes data to userspace
> seems very strange.

I agree.  Only way I'd be interested is if it somehow tackled enabling
much more efficient IO.  Earlier discussion in this thread mentioned
that zero-copy and low overhead wasn't a priority (because it is hard,
etc).  But the hard work has already been done with io_uring.  If
dm-user had a prereq of leaning heavily on io_uring and also enabled IO
polling for bio-based then there may be a win to supporting it.

But unless lower latency (or some other more significant win) is made
possible I just don't care to prop up an unnatural DM bolt-on.

Mike



Re: Linux 5.10

2020-12-14 Thread Mike Snitzer
On Mon, Dec 14 2020 at 11:44am -0500,
Greg KH  wrote:

> On Mon, Dec 14, 2020 at 11:02:47AM -0500, Mike Snitzer wrote:
> > On Mon, Dec 14 2020 at 12:52am -0500,
> > Greg KH  wrote:
> > 
> > > On Mon, Dec 14, 2020 at 12:31:47AM -0500, Dave Jones wrote:
> > > > On Sun, Dec 13, 2020 at 03:03:29PM -0800, Linus Torvalds wrote:
> > > >  > Ok, here it is - 5.10 is tagged and pushed out.
> > > >  > 
> > > >  > I pretty much always wish that the last week was even calmer than it
> > > >  > was, and that's true here too. There's a fair amount of fixes in 
> > > > here,
> > > >  > including a few last-minute reverts for things that didn't get fixed,
> > > >  > but nothing makes me go "we need another week".
> > > > 
> > > > ...
> > > > 
> > > >  > Mike Snitzer (1):
> > > >  >   md: change mddev 'chunk_sectors' from int to unsigned
> > > > 
> > > > Seems to be broken.  This breaks mounting my raid6 partition:
> > > > 
> > > > [   87.290698] attempt to access beyond end of device
> > > >md0: rw=4096, want=13996467328, limit=6261202944
> > > > [   87.293371] attempt to access beyond end of device
> > > >md0: rw=4096, want=13998564480, limit=6261202944
> > > > [   87.296045] BTRFS warning (device md0): couldn't read tree root
> > > > [   87.300056] BTRFS error (device md0): open_ctree failed
> > > > 
> > > > Reverting it goes back to the -rc7 behaviour where it mounts fine.
> > > 
> > > If the developer/maintainer(s) agree, I can revert this and push out a
> > > 5.10.1, just let me know.
> > 
> > Yes, these should be reverted from 5.10 via 5.10.1:
> > 
> > e0910c8e4f87 dm raid: fix discard limits for raid1 and raid10
> > f075cfb1dc59 md: change mddev 'chunk_sectors' from int to unsigned
> > 
> > They were also both marked for stable@, and I just got email stating
> > that you've staged them both for 5.4 and 5.9, but they shouldn't go to
> > stable@.  We need to reassess and fix during 5.11.
> 
> Ok, will go push out a 5.10.1 with those reverted, but please, get the
> reverts into Linus as soon as possible as well.

Yes, will do, working on it now!

Thanks,
Mike



Re: Linux 5.10

2020-12-14 Thread Mike Snitzer
On Mon, Dec 14 2020 at 11:02am -0500,
Mike Snitzer  wrote:

> On Mon, Dec 14 2020 at 12:52am -0500,
> Greg KH  wrote:
> 
> > On Mon, Dec 14, 2020 at 12:31:47AM -0500, Dave Jones wrote:
> > > On Sun, Dec 13, 2020 at 03:03:29PM -0800, Linus Torvalds wrote:
> > >  > Ok, here it is - 5.10 is tagged and pushed out.
> > >  > 
> > >  > I pretty much always wish that the last week was even calmer than it
> > >  > was, and that's true here too. There's a fair amount of fixes in here,
> > >  > including a few last-minute reverts for things that didn't get fixed,
> > >  > but nothing makes me go "we need another week".
> > > 
> > > ...
> > > 
> > >  > Mike Snitzer (1):
> > >  >   md: change mddev 'chunk_sectors' from int to unsigned
> > > 
> > > Seems to be broken.  This breaks mounting my raid6 partition:
> > > 
> > > [   87.290698] attempt to access beyond end of device
> > >md0: rw=4096, want=13996467328, limit=6261202944
> > > [   87.293371] attempt to access beyond end of device
> > >md0: rw=4096, want=13998564480, limit=6261202944
> > > [   87.296045] BTRFS warning (device md0): couldn't read tree root
> > > [   87.300056] BTRFS error (device md0): open_ctree failed
> > > 
> > > Reverting it goes back to the -rc7 behaviour where it mounts fine.
> > 
> > If the developer/maintainer(s) agree, I can revert this and push out a
> > 5.10.1, just let me know.
> 
> Yes, these should be reverted from 5.10 via 5.10.1:
> 
> e0910c8e4f87 dm raid: fix discard limits for raid1 and raid10
> f075cfb1dc59 md: change mddev 'chunk_sectors' from int to unsigned

Sorry, f075cfb1dc59 was my local commit id, the corresponding upstream
commit as staged by Jens is:

6ffeb1c3f82 md: change mddev 'chunk_sectors' from int to unsigned

So please revert:
6ffeb1c3f822 md: change mddev 'chunk_sectors' from int to unsigned
and then revert:
e0910c8e4f87 dm raid: fix discard limits for raid1 and raid10

Thanks,
Mike



Re: Linux 5.10

2020-12-14 Thread Mike Snitzer
On Mon, Dec 14 2020 at 12:52am -0500,
Greg KH  wrote:

> On Mon, Dec 14, 2020 at 12:31:47AM -0500, Dave Jones wrote:
> > On Sun, Dec 13, 2020 at 03:03:29PM -0800, Linus Torvalds wrote:
> >  > Ok, here it is - 5.10 is tagged and pushed out.
> >  > 
> >  > I pretty much always wish that the last week was even calmer than it
> >  > was, and that's true here too. There's a fair amount of fixes in here,
> >  > including a few last-minute reverts for things that didn't get fixed,
> >  > but nothing makes me go "we need another week".
> > 
> > ...
> > 
> >  > Mike Snitzer (1):
> >  >   md: change mddev 'chunk_sectors' from int to unsigned
> > 
> > Seems to be broken.  This breaks mounting my raid6 partition:
> > 
> > [   87.290698] attempt to access beyond end of device
> >md0: rw=4096, want=13996467328, limit=6261202944
> > [   87.293371] attempt to access beyond end of device
> >md0: rw=4096, want=13998564480, limit=6261202944
> > [   87.296045] BTRFS warning (device md0): couldn't read tree root
> > [   87.300056] BTRFS error (device md0): open_ctree failed
> > 
> > Reverting it goes back to the -rc7 behaviour where it mounts fine.
> 
> If the developer/maintainer(s) agree, I can revert this and push out a
> 5.10.1, just let me know.

Yes, these should be reverted from 5.10 via 5.10.1:

e0910c8e4f87 dm raid: fix discard limits for raid1 and raid10
f075cfb1dc59 md: change mddev 'chunk_sectors' from int to unsigned

They were also both marked for stable@, and I just got email stating
that you've staged them both for 5.4 and 5.9, but they shouldn't go to
stable@.  We need to reassess and fix during 5.11.

I'll now respond with my Nacked-by to each patch email relative to these
commits.

Thanks,
Mike



Re: [PATCH 0/3] block: blk_interposer - Block Layer Interposer

2020-12-11 Thread Mike Snitzer
On Thu, Dec 10 2020 at 11:32am -0500,
Mike Snitzer  wrote:

> On Thu, Dec 10 2020 at  9:58am -0500,
> Sergei Shtepa  wrote:
> 
> > The 12/09/2020 16:51, Mike Snitzer wrote:
> > > On Wed, Dec 09 2020 at  8:01am -0500,
> > > Sergei Shtepa  wrote:
> > > 
> > > > Hi all.
> > > > 
> > > > I would like to propose the Block Layer Interposer (blk_interposer) again.
> > > > It allows intercepting bio requests, remapping bios to other devices,
> > > > or adding new bios.
> > > > 
> > > > Initially, blk_interposer was designed to be compatible with
> > > > device mapper. Our (my and Hannes) previous attempt to offer
> > > > blk_interposer integrated with device mapper did not receive
> > > > any comments from the dm-devel team, and without their help
> > > > I will not be able to make a full implementation. I hope later
> > > > they will have time to support blk_interposer in device mapper.
> > > 
> > > Excuse me?  I gave you quite a bit of early feedback!  I then went on
> > > PTO for ~10 days, when I returned last week I had to deal with fixing
> > > some 5.10 dm/block bio splitting regressions that only got resolved this
> > > past Saturday in time for 5.10-rc7.
> > 
> > Mike,
> > 
> > I would like to clarify some points that I've made, and also try 
> > to repair the damage from the misunderstandings that I think have occurred.
> > 
> > First of all, I actually meant the feedback on Hannes's patch which was
> > sent on the 19th of November:
> > https://lore.kernel.org/linux-block/20201119164924.74401-1-h...@suse.de/
> > 
> > Your feedback from the 18th of November ("[PATCH 4/4] dm_interposer - 
> > Try to use blk_interpose in dm") is very valuable, but I am not sure that
> > I am currently capable of implementing the proposed DM changes.
> > The overall architecture of DM is still not clear to me, and I am studying
> > it right now.
> > 
> > This new patch (the one that Hannes sent on the 19th of November) is also
> > compatibile with DM and should not pose any problems - the architecture is
> > the same. There are some changes that make blk_interposer simpler and 
> > better,
> > plus the sample is added.
> > 
> > > 
> > > blk_interposer was/is on my short list to review closer (Hannes' version
> > > that refined yours a bit more).. primarily to see if I could avoid the
> > > extra clone and endio hooking.
> > 
> > Glad to hear that! In order to avoid the additional copying one can only
> > change an intercepted bio, which might be dangerous.
> > 
> > > 
> > > The development window for 5.11 is past, so you pushing this without
> > > using the approach discussed (in terms of DM) doesn't help your cause.
> > > 
> > > > And my blk-snap module requires an architecture change to
> > > > support blk_interposer.
> > > > 
> > > > This time I offer it along with a sample.
> > > > Despite the fact that blk_interposer is quite simple,
> > > > there are a few non-obvious points that I would like to clarify.
> > > > 
> > > > However, I suggest the blk_interposer now so that people
> > > > could discuss it and use it in their projects as soon as possible.
> > > 
> > > So you weren't willing to put the work in on something DM oriented
> > > because you expected me to do the work for you?  And now you're looking
> > > to side-step the consensus that was built because I didn't contribute
> > > quick enough?  That's pretty messed up.
> > 
> > I just think that no one can implement integration of DM with
> > blk_interposer better than dm-devel team. I will certainly try my best,
> > but I am afraid that such efforts will only produce incongruous
> > DM patches that will be a waste of time (yours and mine).
> > 
> > > 
> > > In the time-scale that is Linux kernel development.. you've really
> > > jumped the gun and undercut my enthusiasm to work through the details.
> > > I'd rather not read into your actions more than I already have here, but
> > > I'm not liking what you've done here.  Kind of left in dismay with how
> > > you decided to go down this path without a followup note to me or others
> > > (that I'm aware of).
> > 
> > I am very sorry that I undercut your enthusiasm, but, as you rightly
> > pointed out, another development windows is closing, and my product
> > is still not able to work on newer Lin

Re: [PATCH 0/3] block: blk_interposer - Block Layer Interposer

2020-12-10 Thread Mike Snitzer
On Thu, Dec 10 2020 at  9:58am -0500,
Sergei Shtepa  wrote:

> The 12/09/2020 16:51, Mike Snitzer wrote:
> > On Wed, Dec 09 2020 at  8:01am -0500,
> > Sergei Shtepa  wrote:
> > 
> > > Hi all.
> > > 
> > > I would like to propose the Block Layer Interposer (blk_interposer) again.
> > > It allows intercepting bio requests, remapping bios to other devices,
> > > or adding new bios.
> > > 
> > > Initially, blk_interposer was designed to be compatible with
> > > device mapper. Our (my and Hannes) previous attempt to offer
> > > blk_interposer integrated with device mapper did not receive
> > > any comments from the dm-devel team, and without their help
> > > I will not be able to make a full implementation. I hope later
> > > they will have time to support blk_interposer in device mapper.
> > 
> > Excuse me?  I gave you quite a bit of early feedback!  I then went on
> > PTO for ~10 days, when I returned last week I had to deal with fixing
> > some 5.10 dm/block bio splitting regressions that only got resolved this
> > past Saturday in time for 5.10-rc7.
> 
> Mike,
> 
> I would like to clarify some points that I've made, and also try 
> to repair the damage from the misunderstandings that I think have occurred.
> 
> First of all, I actually meant the feedback on Hannes's patch which was
> sent on the 19th of November:
> https://lore.kernel.org/linux-block/20201119164924.74401-1-h...@suse.de/
> 
> Your feedback from the 18th of November ("[PATCH 4/4] dm_interposer - 
> Try to use blk_interpose in dm") is very valuable, but I am not sure that
> I am currently capable of implementing the proposed DM changes.
> The overall architecture of DM is still not clear to me, and I am studying
> it right now.
> 
> This new patch (the one that Hannes sent on the 19th of November) is also
> compatible with DM and should not pose any problems - the architecture is
> the same. There are some changes that make blk_interposer simpler and better,
> plus the sample is added.
> 
> > 
> > blk_interposer was/is on my short list to review closer (Hannes' version
> > that refined yours a bit more).. primarily to see if I could avoid the
> > extra clone and endio hooking.
> 
> Glad to hear that! In order to avoid the additional copying one can only
> change an intercepted bio, which might be dangerous.
> 
> > 
> > The development window for 5.11 is past, so you pushing this without
> > using the approach discussed (in terms of DM) doesn't help your cause.
> > 
> > > And my blk-snap module requires an architecture change to
> > > support blk_interposer.
> > > 
> > > This time I offer it along with a sample.
> > > Despite the fact that blk_interposer is quite simple,
> > > there are a few non-obvious points that I would like to clarify.
> > > 
> > > However, I suggest the blk_interposer now so that people
> > > could discuss it and use it in their projects as soon as possible.
> > 
> > So you weren't willing to put the work in on something DM oriented
> > because you expected me to do the work for you?  And now you're looking
> > to side-step the consensus that was built because I didn't contribute
> > quick enough?  That's pretty messed up.
> 
> I just think that no one can implement integration of DM with
> blk_interposer better than dm-devel team. I will certainly try my best,
> but I am afraid that such efforts will only produce incongruous
> DM patches that will be a waste of time (yours and mine).
> 
> > 
> > In the time-scale that is Linux kernel development.. you've really
> > jumped the gun and undercut my enthusiasm to work through the details.
> > I'd rather not read into your actions more than I already have here, but
> > I'm not liking what you've done here.  Kind of left in dismay with how
> > you decided to go down this path without a followup note to me or others
> > (that I'm aware of).
> 
> I am very sorry that I undercut your enthusiasm, but, as you rightly
> pointed out, another development windows is closing, and my product
> is still not able to work on newer Linux versions starting from 5.8.
> That fact makes me particularly sad and forces me to search for different
> means to draw some attention to blk_interposer. I've seen an RHEL 8.4
> alpha recently, all looks very cool there but made me even more sad ...

Made you more sad because you don't have a working solution for upstream
or RHEL 8.4?

I may have missed it in your past emails but how were you able to
provide blk-snap support for kernels prior to 5.8?

> Also I certainly remember

Re: [PATCH 2/3] block: blk_interposer - sample

2020-12-10 Thread Mike Snitzer
On Thu, Dec 10 2020 at 10:54am -0500,
Sergei Shtepa  wrote:

> The 12/09/2020 17:36, Mike Snitzer wrote:
> > On Wed, Dec 09 2020 at  8:01am -0500,
> > Sergei Shtepa  wrote:
> > 
> > > This sample demonstrates how to use blk_interposer.
> > > It shows how to properly connect the interposer module to the kernel,
> > > and perform the simplest monitoring of the number of bios.
> > > 
> > > Signed-off-by: Sergei Shtepa 
> > > ---
> > >  samples/blk_interposer/Makefile |   2 +
> > >  samples/blk_interposer/blk-interposer.c | 276 
> > >  2 files changed, 278 insertions(+)
> > >  create mode 100644 samples/blk_interposer/Makefile
> > >  create mode 100644 samples/blk_interposer/blk-interposer.c
> > > 
> > > diff --git a/samples/blk_interposer/Makefile b/samples/blk_interposer/Makefile
> > > new file mode 100644
> > > index ..b11aefde2b1c
> > > --- /dev/null
> > > +++ b/samples/blk_interposer/Makefile
> > > @@ -0,0 +1,2 @@
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +obj-$(CONFIG_SAMPLE_BLK_INTERPOSER) += blk-interposer.o
> > > diff --git a/samples/blk_interposer/blk-interposer.c b/samples/blk_interposer/blk-interposer.c
> > > new file mode 100644
> > > index ..92b4c1fcf8f7
> > > --- /dev/null
> > > +++ b/samples/blk_interposer/blk-interposer.c
> > > @@ -0,0 +1,276 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +/*
> > > + * Block layer interposer allows interposing bio requests from a kernel module.
> > > + * This allows you to monitor requests, modify requests, add new requests,
> > > + * or even redirect requests to other devices.
> > > + *
> > > + * This sample demonstrates how to use blk_interposer.
> > > + * It shows how to properly connect the interposer module to the kernel,
> > > + * and perform the simplest monitoring of the number of bios.
> > > + */
> > > +
> > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +int device_major = 8;
> > > +int device_minor;
> > > +int fmode = FMODE_READ | FMODE_WRITE;
> > > +
> > > +/*
> > > + * Each interposer must have a common part in the form of the blk_interposer structure,
> > > + * as well as its own unique data.
> > > + */
> > > +struct my_interposer {
> > > + /*
> > > +  * Common part of block device interposer.
> > > +  */
> > > + struct blk_interposer interposer;
> > > + /*
> > > +  * Specific part for our interposer data.
> > > +  */
> > > + atomic_t counter;
> > > +};
> > > +
> > > +struct my_interposer my_ip;
> > > +
> > > +/**
> > > + * blk_interposer_attach - Attach interposer to disk
> > > + * @disk: target disk
> > > + * @interposer: block device interposer
> > > + */
> > > +static int blk_interposer_attach(struct gendisk *disk, struct blk_interposer *interposer)
> > > +{
> > > + int ret = 0;
> > > +
> > > + /*
> > > +  * Stop disks queue processing.
> > > +  */
> > > + blk_mq_freeze_queue(disk->queue);
> > > + blk_mq_quiesce_queue(disk->queue);
> > > +
> > > + /*
> > > +  * Check if the interposer is already busy.
> > > +  * The interposer will only connect if it is not busy.
> > > +  */
> > > + if (blk_has_interposer(disk)) {
> > > + pr_info("The interposer is already busy.\n");
> > > + ret = -EBUSY;
> > > + goto out;
> > > + }
> > > +
> > > + /*
> > > +  * Attach the interposer.
> > > +  */
> > > + disk->interposer = interposer;
> > > + /*
> > > +	 * And while the queue is stopped, we can do something specific for our module.
> > > +  */
> > > + pr_info("Block device interposer attached successfully.\n");
> > > +
> > > +out:
> > > + /*
> > > +  * Resume disks queue processing
> > > +  */
> > > + blk_mq_unquiesce_queue(disk->queue);
> > > + blk_mq_unfreeze_queue(disk->queue);
> > > +
> > > + return ret;
&

Re: Revert "dm raid: remove unnecessary discard limits for raid10"

2020-12-09 Thread Mike Snitzer
On Wed, Dec 09 2020 at  4:58pm -0500,
Song Liu  wrote:

> This reverts commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28.
> 
> Matthew Ruffell reported data corruption in raid10 due to the changes
> in discard handling [1]. Revert these changes before we find a proper fix.
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
> Cc: Matthew Ruffell 
> Cc: Xiao Ni 
> Cc: Mike Snitzer 
> Signed-off-by: Song Liu 
> ---
>  drivers/md/dm-raid.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 9c1f7c4de65b3..dc8568ab96f24 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3728,6 +3728,17 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
>  
>   blk_limits_io_min(limits, chunk_size_bytes);
>   blk_limits_io_opt(limits, chunk_size_bytes * mddev_data_stripes(rs));
> +
> + /*
> +  * RAID10 personality requires bio splitting,
> +  * RAID0/1/4/5/6 don't and process large discard bios properly.
> +  */
> + if (rs_is_raid10(rs)) {
> + limits->discard_granularity = max(chunk_size_bytes,
> +   limits->discard_granularity);
> + limits->max_discard_sectors = min_not_zero(rs->md.chunk_sectors,
> +   limits->max_discard_sectors);
> + }
>  }
>  
>  static void raid_postsuspend(struct dm_target *ti)
> -- 
> 2.24.1
> 

Short of you sending a v2 pull request to Jens...

Jens please pick this up once you pull Song's MD pull that reverts all
the MD raid10 discard changes.

Thanks!

Acked-by: Mike Snitzer 



Re: Revert "dm raid: remove unnecessary discard limits for raid10"

2020-12-09 Thread Mike Snitzer
On Wed, Dec 09 2020 at  4:58pm -0500,
Song Liu  wrote:

> This reverts commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28.
> 
> Matthew Ruffell reported data corruption in raid10 due to the changes
> in discard handling [1]. Revert these changes before we find a proper fix.
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
> Cc: Matthew Ruffell 
> Cc: Xiao Ni 
> Cc: Mike Snitzer 
> Signed-off-by: Song Liu 

If you're reverting all the MD code that enabled this DM change, then
obviously the DM change must be reverted too.  But please do _not_
separate the DM revert from the MD reverts.

Please include this in a v2 pull request to Jens.

Mike

> ---
>  drivers/md/dm-raid.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 9c1f7c4de65b3..dc8568ab96f24 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3728,6 +3728,17 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
>  
>   blk_limits_io_min(limits, chunk_size_bytes);
>   blk_limits_io_opt(limits, chunk_size_bytes * mddev_data_stripes(rs));
> +
> + /*
> +  * RAID10 personality requires bio splitting,
> +  * RAID0/1/4/5/6 don't and process large discard bios properly.
> +  */
> + if (rs_is_raid10(rs)) {
> + limits->discard_granularity = max(chunk_size_bytes,
> +   limits->discard_granularity);
> + limits->max_discard_sectors = min_not_zero(rs->md.chunk_sectors,
> +   limits->max_discard_sectors);
> + }
>  }
>  
>  static void raid_postsuspend(struct dm_target *ti)
> -- 
> 2.24.1
> 



Re: [PATCH 2/3] block: blk_interposer - sample

2020-12-09 Thread Mike Snitzer
> + /*
> +  * Detach interposer.
> +  */
> + disk->interposer = NULL;
> + /*
> +  * And while the queue is stopped, we can do something specific for our module.
> +  */
> + pr_info("Block device interposer detached successfully.\n");
> +
> +out:
> + /*
> +  * Resume disks queue processing.
> +  */
> + blk_mq_unquiesce_queue(disk->queue);
> + blk_mq_unfreeze_queue(disk->queue);
> +
> + return ret;
> +}

This attach and detach code needs to be elevated out of samples so that
any future consumer of blk_interposer doesn't reinvent it.  It is far
too fundamental.

The way you've proposed this be merged is very much unacceptable.

Nacked-by: Mike Snitzer 



Re: [PATCH 0/3] block: blk_interposer - Block Layer Interposer

2020-12-09 Thread Mike Snitzer
On Wed, Dec 09 2020 at  8:01am -0500,
Sergei Shtepa  wrote:

> Hi all.
> 
> I try to suggest the Block Layer Interposer (blk_interposer) again.
> It allows intercepting bio requests, remapping bios to other devices,
> or adding new bios.
> 
> Initially, blk_interposer was designed to be compatible with
> device mapper. Our (my and Hannes) previous attempt to offer
> blk_interposer integrated with device mapper did not receive
> any comments from the dm-devel team, and without their help
> I will not be able to make a full implementation. I hope later
> they will have time to support blk_interposer in device mapper.

Excuse me?  I gave you quite a bit of early feedback!  I then went on
PTO for ~10 days, when I returned last week I had to deal with fixing
some 5.10 dm/block bio splitting regressions that only got resolved this
past Saturday in time for 5.10-rc7.

blk_interposer was/is on my short list to review closer (Hannes' version
that refined yours a bit more).. primarily to see if I could avoid the
extra clone and endio hooking.

The development window for 5.11 is past, so you pushing this without
using the approach discussed (in terms of DM) doesn't help your cause.

> And my blk-snap module requires an architecture change to
> support blk_interposer.
> 
> This time I offer it along with a sample.
> Despite the fact that blk_interposer is quite simple,
> there are a few non-obvious points that I would like to clarify.
> 
> However, I suggest the blk_interposer now so that people
> could discuss it and use it in their projects as soon as possible.

So you weren't willing to put the work in on something DM oriented
because you expected me to do the work for you?  And now you're looking
to side-step the consensus that was built because I didn't contribute
quick enough?  That's pretty messed up.

In the time-scale that is Linux kernel development.. you've really
jumped the gun and undercut my enthusiasm to work through the details.
I'd rather not read into your actions more than I already have here, but
I'm not liking what you've done here.  Kind of left in dismay with how
you decided to go down this path without a followup note to me or others
(that I'm aware of).

But I'm still willing to make blk_interposer work as we all discussed
(in terms of DM).

Mike

> Sergei Shtepa (3):
>   block: blk_interposer - Block Layer Interposer
>   block: blk_interposer - sample
>   block: blk_interposer - sample config
> 
>  block/blk-core.c|  32 +++
>  include/linux/blk_types.h   |   6 +-
>  include/linux/genhd.h   |  12 +-
>  samples/Kconfig |   7 +
>  samples/Makefile|   1 +
>  samples/blk_interposer/Makefile |   2 +
>  samples/blk_interposer/blk-interposer.c | 276 
>  7 files changed, 333 insertions(+), 3 deletions(-)
>  create mode 100644 samples/blk_interposer/Makefile
>  create mode 100644 samples/blk_interposer/blk-interposer.c
> 
> --
> 2.20.1
> 



Re: md: dm-writeback: add __noreturn to BUG-ging function

2020-11-20 Thread Mike Snitzer
On Wed, Nov 18 2020 at  4:24pm -0500,
Mikulas Patocka  wrote:

> 
> 
> On Wed, 18 Nov 2020, Mike Snitzer wrote:
> 
> > On Wed, Nov 18 2020 at 10:49am -0500,
> > Mike Snitzer  wrote:
> > 
> > > I don't think my suggestion will help.. given it'd still leave
> > > persistent_memory_claim() without a return statement.
> > > 
> > > Think it worthwhile to just add a dummy 'return 0;' after the BUG().
> > 
> > Decided to go with this, now staged for 5.11:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.11=a1e4865b4dda7071f3707f7e551289ead66e38b1
> 
> Hi
> 
> I would just use "return -EOPNOTSUPP;" and drop the "#ifdef 
> DM_WRITECACHE_HAS_PMEM" that you added.
> 
> That BUG/return -EOPNOTSUPP code can't happen at all - if 
> DM_WRITECACHE_HAS_PMEM is not defined, WC_MODE_PMEM(wc) always returns 
> false - so persistent_memory_claim and BUG() can't ever be called. And if 
> it can't be called, you don't need to add a code that prints an error in 
> that case.
> 
> If we don't have DM_WRITECACHE_HAS_PMEM, the compiler optimizer will 
> remove all the code guarded with if (WC_MODE_PMEM(wc)) as unreachable.
> 
> Mikulas

Fair enough.



Re: linux-next: Signed-off-by missing for commit in the device-mapper tree

2020-11-18 Thread Mike Snitzer
On Wed, Nov 18 2020 at  3:47pm -0500,
Stephen Rothwell  wrote:

> Hi all,
> 
> Commit
> 
>   a1e4865b4dda ("dm writecache: remove BUG() and fail gracefully instead")
> 
> is missing a Signed-off-by from its author.

Thanks, because I went in a different direction on the code changes I
adjusted author via rebase in my local branch to be me.. but when I
pushed it, it reverted back to being Randy as author on the remote
branch. Turns out the rebase needed to be --continue'd.

Anyway, it should be all set now.

Thanks,
Mike



Re: md: dm-writeback: add __noreturn to BUG-ging function

2020-11-18 Thread Mike Snitzer
On Wed, Nov 18 2020 at 10:49am -0500,
Mike Snitzer  wrote:

> I don't think my suggestion will help.. given it'd still leave
> persistent_memory_claim() without a return statement.
> 
> Think it worthwhile to just add a dummy 'return 0;' after the BUG().

Decided to go with this, now staged for 5.11:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.11=a1e4865b4dda7071f3707f7e551289ead66e38b1



Re: md: dm-writeback: add __noreturn to BUG-ging function

2020-11-18 Thread Mike Snitzer
On Tue, Nov 17 2020 at 11:31am -0500,
Mike Snitzer  wrote:

> On Mon, Nov 16 2020 at  6:00pm -0500,
> Randy Dunlap  wrote:
> 
> > On 11/15/20 11:30 PM, Christian Borntraeger wrote:
> > > 
> > > 
> > > On 13.11.20 23:52, Randy Dunlap wrote:
> > >> Building on arch/s390/ flags this as an error, so add the
> > >> __noreturn attribute modifier to prevent the build error.
> > >>
> > >> cc1: some warnings being treated as errors
> > >> ../drivers/md/dm-writecache.c: In function 'persistent_memory_claim':
> > >> ../drivers/md/dm-writecache.c:323:1: error: no return statement in function returning non-void [-Werror=return-type]
> > > 
> > > ok with me, but I am asking why
> > > 
> > > the unreachable macro is not good enough. For x86 it obviously is.
> > > 
> > > form arch/s390/include/asm/bug.h
> > > #define BUG() do {  \
> > > __EMIT_BUG(0);  \
> > > unreachable();  \
> > > } while (0)
> > > 
> > 
> > Hi Christian,
> > 
> > Good question.
> > I don't see any guidance about when to use one or the other etc.
> > 
> > I see __noreturn being used 109 times and unreachable();
> > being used 33 times, but only now that I look at them.
> > That had nothing to do with why I used __noreturn in the patch.
> 
> But doesn't that speak to the proper fix being needed in unreachable()?
> Or at a minimum the fix is needed to arch/s390/include/asm/bug.h's BUG.
> 
> I really don't think we should be papering over that by sprinkling
> __noreturn around the kernel's BUG() callers.
> 
> Maybe switch arch/s390/include/asm/bug.h's BUG to be like
> arch/mips/include/asm/bug.h?  It itself uses __noreturn with a 'static
> inline' function definition rather than #define.
> 
> Does that fix the issue?
> 
> Thanks,
> Mike
> 
p.s. you modified dm-writecache.c (not dm-writeback, which doesn't
exist).

I don't think my suggestion will help.. given it'd still leave
persistent_memory_claim() without a return statement.

Think it worthwhile to just add a dummy 'return 0;' after the BUG().

Mike



Re: md: dm-writeback: add __noreturn to BUG-ging function

2020-11-17 Thread Mike Snitzer
On Mon, Nov 16 2020 at  6:00pm -0500,
Randy Dunlap  wrote:

> On 11/15/20 11:30 PM, Christian Borntraeger wrote:
> > 
> > 
> > On 13.11.20 23:52, Randy Dunlap wrote:
> >> Building on arch/s390/ flags this as an error, so add the
> >> __noreturn attribute modifier to prevent the build error.
> >>
> >> cc1: some warnings being treated as errors
> >> ../drivers/md/dm-writecache.c: In function 'persistent_memory_claim':
> >> ../drivers/md/dm-writecache.c:323:1: error: no return statement in function returning non-void [-Werror=return-type]
> > 
> > ok with me, but I am asking why
> > 
> > the unreachable macro is not good enough. For x86 it obviously is.
> > 
> > form arch/s390/include/asm/bug.h
> > #define BUG() do {  \
> > __EMIT_BUG(0);  \
> > unreachable();  \
> > } while (0)
> > 
> 
> Hi Christian,
> 
> Good question.
> I don't see any guidance about when to use one or the other etc.
> 
> I see __noreturn being used 109 times and unreachable();
> being used 33 times, but only now that I look at them.
> That had nothing to do with why I used __noreturn in the patch.

But doesn't that speak to the proper fix being needed in unreachable()?
Or at a minimum the fix is needed to arch/s390/include/asm/bug.h's BUG.

I really don't think we should be papering over that by sprinkling
__noreturn around the kernel's BUG() callers.

Maybe switch arch/s390/include/asm/bug.h's BUG to be like
arch/mips/include/asm/bug.h?  It itself uses __noreturn with a 'static
inline' function definition rather than #define.

Does that fix the issue?

Thanks,
Mike

p.s. you modified dm-writecache.c (not dm-writeback, which doesn't
exist).



Re: [PATCH AUTOSEL 5.9 089/147] dm: change max_io_len() to use blk_max_size_offset()

2020-10-27 Thread Mike Snitzer
On Mon, Oct 26 2020 at  7:48pm -0400,
Sasha Levin  wrote:

> From: Mike Snitzer 
> 
> [ Upstream commit 5091cdec56faeaefa79de4b6cb3c3c55e50d1ac3 ]
> 
> Using blk_max_size_offset() enables DM core's splitting to impose
> ti->max_io_len (via q->limits.chunk_sectors) and also fallback to
> respecting q->limits.max_sectors if chunk_sectors isn't set.
> 
> Signed-off-by: Mike Snitzer 
> Signed-off-by: Sasha Levin 

Not sure why this commit was elevated for stable@ to pick it up; please
explain.

But you cannot take this commit standalone. These commits are prereqs:

22ada802ede8 block: use lcm_not_zero() when stacking chunk_sectors
07d098e6bbad block: allow 'chunk_sectors' to be non-power-of-2
882ec4e609c1 dm table: stack 'chunk_sectors' limit to account for target-specific splitting

This goes for all stable@ trees you AUTOSEL'd commit 5091cdec56f for.

Mike

> ---
>  drivers/md/dm.c | 20 
>  1 file changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 6ed05ca65a0f8..3982012b1309c 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1051,22 +1051,18 @@ static sector_t max_io_len_target_boundary(sector_t sector, struct dm_target *ti
>  static sector_t max_io_len(sector_t sector, struct dm_target *ti)
>  {
>   sector_t len = max_io_len_target_boundary(sector, ti);
> - sector_t offset, max_len;
> + sector_t max_len;
>  
>   /*
>* Does the target need to split even further?
> +  * - q->limits.chunk_sectors reflects ti->max_io_len so
> +  *   blk_max_size_offset() provides required splitting.
> +  * - blk_max_size_offset() also respects q->limits.max_sectors
>*/
> - if (ti->max_io_len) {
> - offset = dm_target_offset(ti, sector);
> - if (unlikely(ti->max_io_len & (ti->max_io_len - 1)))
> - max_len = sector_div(offset, ti->max_io_len);
> - else
> - max_len = offset & (ti->max_io_len - 1);
> - max_len = ti->max_io_len - max_len;
> -
> - if (len > max_len)
> - len = max_len;
> - }
> + max_len = blk_max_size_offset(dm_table_get_md(ti->table)->queue,
> +   dm_target_offset(ti, sector));
> + if (len > max_len)
> + len = max_len;
>  
>   return len;
>  }
> -- 
> 2.25.1
> 



Re: [PATCH v2] dm verity: Add support for signature verification with 2nd keyring

2020-10-23 Thread Mike Snitzer
On Fri, Oct 23 2020 at  6:20am -0400,
Mickaël Salaün  wrote:

> It seems that there is no more question. Mike, Alasdair, could you
> please consider to merge this into the tree?
> 
> On 16/10/2020 14:19, Mickaël Salaün wrote:
> > 
> > On 16/10/2020 13:08, Milan Broz wrote:
> >> On 16/10/2020 10:49, Mickaël Salaün wrote:
> >>> On 16/10/2020 10:29, Mickaël Salaün wrote:
> >>>>
> >>>> On 15/10/2020 18:52, Mike Snitzer wrote:
> >>>>> Can you please explain why you've decided to make this a Kconfig CONFIG
> >>>>> knob?  Why not either add: a dm-verity table argument? A dm-verity
> >>>>> kernel module parameter? or both (to allow a particular default but then
> >>>>> per-device override)?
> >>>>
> >>>> The purpose of signed dm-verity images is to authenticate files, or said
> >>>> in another way, to enable the kernel to trust disk images in a flexible
> >>>> way (i.e. thanks to certificate's chain of trust). Being able to update
> >>>> such chain at run time requires to use the second trusted keyring. This
> >>>> keyring automatically includes the certificate authorities from the
> >>>> builtin trusted keyring, which are required to dynamically populate the
> >>>> secondary trusted keyring with certificates signed by an already trusted
> >>>> authority. The roots of trust must then be included at build time in the
> >>>> builtin trusted keyring.
> >>>>
> >>>> To be meaningful, using dm-verity signatures implies to have a
> >>>> restricted user space, i.e. even the root user has limited power over
> >>>> the kernel and the rest of the system. Blindly trusting data provided by
> >>>> user space (e.g. dm-verity table argument, kernel module parameter)
> >>>> defeat the purpose of (mandatory) authenticated images.
> >>>>
> >>>>>
> >>>>> Otherwise, _all_ DM verity devices will be configured to use secondary
> >>>>> keyring fallback.  Is that really desirable?
> >>>>
> >>>> That is already the current state (on purpose).
> >>>
> >>> I meant that when DM_VERITY_VERIFY_ROOTHASH_SIG is set, dm-verity
> >>> signature becomes mandatory. This new configuration
> >>> DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING extend this trust to the
> >>> secondary trusted keyring, which contains certificates signed (directly
> >>> or indirectly) by CA from the builtin trusted keyring.
> >>>
> >>> So yes, this new (optional) configuration *extends* the source of trust
> >>> for all dm-verity devices, and yes, it is desirable. I think it should
> >>> have been this way from the beginning (as for other authentication
> >>> mechanisms) but it wasn't necessary at that time.
> >>
> >> Well, I understand why you need a config option here.
> >> And using the secondary keyring actually makes much more sense to me than
> >> the original approach.
> >>
> >> But please do not forget that dm-verity is sometimes used in different
> >> contexts where such strict in-kernel certificate trust is unnecessary.
> >> With your configure options set, you deliberately remove the possibility
> >> to configure such devices.
> > It doesn't make sense to set DM_VERITY_VERIFY_ROOTHASH_SIG in generic
> > distro because such policy is configured at build time in the kernel
> > with hardcoded CAs. If the new option is not set then nothing change. I
> > don't see why it could be an issue for use cases we previously defined
> > (with DM_VERITY_VERIFY_ROOTHASH_SIG).
> > 
> >> I understand that it is needed for "trusted" systems, but we should be clear
> >> in the documentation.
> >> Maybe also add note to
> >> /Documentation/admin-guide/device-mapper/verity.rst ?
> >> We already mention DM_VERITY_VERIFY_ROOTHASH_SIG there.
> > 
> > The current documentation remains true.
> > DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING depends on
> > DM_VERITY_VERIFY_ROOTHASH_SIG.

Yes, while true that doesn't change the fact that documenting
DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING is useful to potential
consumers of baseline DM_VERITY_VERIFY_ROOTHASH_SIG.

Please update Documentation and post v3, I'll get it merged for 5.11.

Thanks,
Mike



Re: [PATCH 0/2] block layer filter and block device snapshot module

2020-10-22 Thread Mike Snitzer
On Wed, Oct 21, 2020 at 5:04 AM Sergei Shtepa  wrote:
>
> Hello everyone! Requesting for your comments and suggestions.
>
> # blk-filter
>
> Block layer filter allows to intercept BIO requests to a block device.
>
> Interception is performed at the very beginning of the BIO request
> processing, and therefore does not affect the operation of the request
> processing queue. This also makes it possible to intercept requests from
> a specific block device, rather than from the entire disk.
>
> The logic of the submit_bio function has been changed - since the
> function execution results are not processed anywhere (except for swap
> and direct-io) the function won't return a value anymore.

Your desire to switch to a void return comes exactly when I've noticed
we need it.

->submit_bio's blk_qc_t return is the cookie assigned by blk-mq.  Up
to this point we haven't actually used it for bio-based devices but it
seems clear we'll soon need for bio-based IO polling support.

Just today, I've been auditing drivers/md/dm.c with an eye toward
properly handling the blk_qc_t return (or lack thereof) from various
DM methods.

It could easily be that __submit_bio_noacct and __submit_bio_noacct_mq
will be updated to do something meaningful with the returned cookie
(or that DM will) to facilitate proper IO polling.

Mike


Re: [PATCH 0/2] block layer filter and block device snapshot module

2020-10-22 Thread Mike Snitzer
On Thu, Oct 22, 2020 at 11:14 AM Darrick J. Wong  wrote:
>
> On Thu, Oct 22, 2020 at 04:52:13PM +0300, Sergei Shtepa wrote:
> > The 10/22/2020 13:28, Damien Le Moal wrote:
> > > On 2020/10/22 18:43, Sergei Shtepa wrote:
> > > >
> > > > Maybe, but the problem is that I can't imagine how to implement
> > > > dm-intercept yet.
> > > > How to use dm to implement interception without changing the stack
> > > > of block devices. We'll have to make a hook somewhere, isn`t it?
> > >
> > > Once your dm-intercept target driver is inserted with "dmsetup" or any user land
> > > tool you implement using libdevicemapper, the "hooks" will naturally be in place
> > > since the dm infrastructure already does that: all submitted BIOs will be passed
> > > to dm-intercept through the "map" operation defined in the target_type
> > > descriptor. It is then that driver's job to execute the BIOs as it sees fit.
> > >
> > > Look at simple device mappers like dm-linear or dm-flakey for hints of how
> > > things work (driver/md/dm-linear.c). More complex dm drivers like dm-crypt,
> > > dm-writecache or dm-thin can give you hints about more features of device mapper.
> > > Functions such as __map_bio() in drivers/md/dm.c are the core of DM and show
> > > what happens to BIOs depending on the return value of the map operation.
> > > dm_submit_bio() and __split_and_process_bio() are the entry points for BIO
> > > processing in DM.
> > >
> >
> > Is there something I don't understand? Please correct me.
> >
> > Let me remind that by the condition of the problem, we can't change
> > the configuration of the block device stack.
> >
> > Let's imagine this configuration: /root mount point on ext filesystem
> > on /dev/sda1.
> > +---+
> > |   |
> > |  /root|
> > |   |
> > +---+
> > |   |
> > | EXT FS|
> > |   |
> > +---+
> > |   |
> > | block layer   |
> > |   |
> > | sda queue |
> > |   |
> > +---+
> > |   |
> > | scsi driver   |
> > |   |
> > +---+
> >
> > We need to add change block tracking (CBT) and snapshot functionality for
> > incremental backup.
> >
> > With the DM we need to change the block device stack. Add device /dev/sda1
> > to LVM Volume group, create logical volume, change /etc/fstab and reboot.
> >
> > The new scheme will look like this:
> > +---+
> > |   |
> > |  /root|
> > |   |
> > +---+
> > |   |
> > | EXT FS|
> > |   |
> > +---+
> > |   |
> > | LV-root   |
> > |   |
> > +--+
> > |  |
> > | dm-cbt & dm-snap |
> > |  |
> > +--+
> > |   |
> > | sda queue |
> > |   |
> > +---+
> > |   |
> > | scsi driver   |
> > |   |
> > +---+
> >
> > But I cannot change block device stack. And so I propose a scheme with
> > interception.
> > +---+
> > |   |
> > |  /root|
> > |   |
> > +---+
> > |   |
> > | EXT FS|
> > |   |
> > +---+   +-+
> > |  ||   | |
> > |  | blk-filter |-> | cbt & snapshot  |
> > |  ||<- | |
> > |  ++   +-+
> > |   |
> > | sda blk queue |
> > |   |
> > +---+
> > |   |
> > | scsi driver   |
> > |   |
> > +---+
> >
> > Perhaps I can make "cbt & snapshot" inside the DM, but without interception
> > in any case, it will not work. Isn't that right?
>
> Stupid question: Why don't you change the block layer to make it
> possible to insert device mapper devices after the blockdev has been set
> up?

Not a stupid question.  Definitely something that us DM developers
have wanted to do for a while.  Devil is in the details but it is the
right way forward.

Otherwise, this intercept is really just a DM-lite remapping layer
without any of DM's well established capabilities.  Encouragingly, all
of the replies have effectively echoed this point.  (amusingly, seems
every mailing list under the sun is on the cc except dm-devel... now
rectified)

Alasdair has some concrete ideas on this line of work; I'm trying to
encourage him to reply ;)

Mike


Re: [EXT] Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)

2020-10-21 Thread Mike Snitzer
Hey Dan,

On Fri, Oct 16, 2020 at 6:38 PM Dan Williams  wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide)  wrote:
>
> > (5) Representing an mpool as a /dev/mpool/ device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling.

Was this discussion done publicly or private?  If public please share
a pointer to the thread.

I'd really like to understand the problem statement that is leading to
pursuing a pmem native alternative to existing DM.

Thanks,
Mike


Re: [PATCH v2] dm verity: Add support for signature verification with 2nd keyring

2020-10-15 Thread Mike Snitzer
On Thu, Oct 15 2020 at 11:05am -0400,
Mickaël Salaün  wrote:

> From: Mickaël Salaün 
> 
> Add a new configuration DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING
> to enable dm-verity signatures to be verified against the secondary
> trusted keyring.  Instead of relying on the builtin trusted keyring
> (with hard-coded certificates), the second trusted keyring can include
> certificate authorities from the builtin trusted keyring and child
> certificates loaded at run time.  Using the secondary trusted keyring
> enables to use dm-verity disks (e.g. loop devices) signed by keys which
> did not exist at kernel build time, leveraging the certificate chain of
> trust model.  In practice, this makes it possible to update certificates
> without kernel update and reboot, aligning with module and kernel
> (kexec) signature verification which already use the secondary trusted
> keyring.
> 
> Cc: Alasdair Kergon 
> Cc: Andrew Morton 
> Cc: Jarkko Sakkinen 
> Cc: Jaskaran Khurana 
> Cc: Mike Snitzer 
> Cc: Milan Broz 
> Signed-off-by: Mickaël Salaün 
> ---
> 
> Previous version:
> https://lore.kernel.org/lkml/20201002071802.535023-1-...@digikod.net/
> 
> Changes since v1:
> * Extend the commit message (asked by Jarkko Sakkinen).
> * Rename the Kconfig "help" keyword according to commit 84af7a6194e4
>   ("checkpatch: kconfig: prefer 'help' over '---help---'").

Can you please explain why you've decided to make this a Kconfig CONFIG
knob?  Why not either add: a dm-verity table argument? A dm-verity
kernel module parameter? or both (to allow a particular default but then
per-device override)?

Otherwise, _all_ DM verity devices will be configured to use secondary
keyring fallback.  Is that really desirable?

Regardless, I really don't see why a Kconfig knob is appropriate.

Mike


> ---
>  drivers/md/Kconfig| 13 -
>  drivers/md/dm-verity-verify-sig.c |  9 +++--
>  2 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 30ba3573626c..1d68935e45ef 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -530,11 +530,22 @@ config DM_VERITY_VERIFY_ROOTHASH_SIG
>   bool "Verity data device root hash signature verification support"
>   depends on DM_VERITY
>   select SYSTEM_DATA_VERIFICATION
> -   help
> + help
> Add ability for dm-verity device to be validated if the
> pre-generated tree of cryptographic checksums passed has a pkcs#7
> signature file that can validate the roothash of the tree.
>  
> +   By default, rely on the builtin trusted keyring.
> +
> +   If unsure, say N.
> +
> +config DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING
> + bool "Verity data device root hash signature verification with secondary keyring"
> + depends on DM_VERITY_VERIFY_ROOTHASH_SIG
> + depends on SECONDARY_TRUSTED_KEYRING
> + help
> +   Rely on the secondary trusted keyring to verify dm-verity signatures.
> +
> If unsure, say N.
>  
>  config DM_VERITY_FEC
> diff --git a/drivers/md/dm-verity-verify-sig.c b/drivers/md/dm-verity-verify-sig.c
> index 614e43db93aa..29385dc470d5 100644
> --- a/drivers/md/dm-verity-verify-sig.c
> +++ b/drivers/md/dm-verity-verify-sig.c
> @@ -119,8 +119,13 @@ int verity_verify_root_hash(const void *root_hash, size_t root_hash_len,
>   }
>  
>   ret = verify_pkcs7_signature(root_hash, root_hash_len, sig_data,
> - sig_len, NULL, VERIFYING_UNSPECIFIED_SIGNATURE,
> - NULL, NULL);
> + sig_len,
> +#ifdef CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING
> + VERIFY_USE_SECONDARY_KEYRING,
> +#else
> + NULL,
> +#endif
> + VERIFYING_UNSPECIFIED_SIGNATURE, NULL, NULL);
>  
>   return ret;
>  }
> 
> base-commit: bbf5c979011a099af5dc76498918ed7df445635b
> -- 
> 2.28.0
> 



Re: [PATCH 2/3] dm: add support for passing through inline crypto support

2020-09-24 Thread Mike Snitzer
On Thu, Sep 24 2020 at 11:45am -0400,
Eric Biggers  wrote:

> On Thu, Sep 24, 2020 at 09:46:49AM -0400, Mike Snitzer wrote:
> > > > Can you help me better understand the expected consumer of this code?
> > > > If you have something _real_ please be explicit.  It makes justifying
> > > > supporting niche code like this more tolerable.
> > >
> > > So the motivation for this code was that Android currently uses a device
> > > mapper target on top of a phone's disk for user data. On many phones,
> > > that disk has inline encryption support, and it'd be great to be able to
> > > make use of that. The DM device configuration isn't changed at runtime.
> > 
> > OK, which device mapper target is used?
> 
> There are several device-mapper targets that Android can require for the
> "userdata" partition -- potentially in a stack of more than one:
> 
> dm-linear: required for Dynamic System Updates
> (https://developer.android.com/topic/dsu)
> 
> dm-bow: required for User Data Checkpoints on ext4
> (https://source.android.com/devices/tech/ota/user-data-checkpoint)
> (https://patchwork.kernel.org/patch/10838743/)
> 
> dm-default-key: required for metadata encryption
> (https://source.android.com/security/encryption/metadata)

Please work with all google stakeholders to post the latest code for the
dm-bow and dm-default-key targets and I'll work through them.

I think the more code we have to inform DM core's implementation the
better off we'll be in the long run.  Could also help improve these
targets as a side-effect of additional review.

I know I largely ignored dm-bow before but that was more to do with
competing tasks, etc.  I'll try my best...

> We're already carrying this patchset in the Android common kernels since late
> last year, as it's required for inline encryption to work when any of the above
> is used.  So this is something that is needed and is already being used.
> 
> Now, you don't have to "count" dm-bow and dm-default-key since they aren't
> upstream; that leaves dm-linear.  But hopefully the others at least show that
> architecturally, dm-linear is just the initial use case, and this patchset 
> also
> makes it easy to pass through inline crypto on any other target that can 
> support
> it (basically, anything that doesn't change the data itself as it goes 
> through).

Sure, that context really helps.

About "basically, anything that doesn't change the data itself as it
goes through": could you be a bit more precise?  Very few DM targets
actually change the data as associated bios are remapped.

I'm just wondering if your definition of "doesn't change the data"
includes things more subtle than the data itself?

Thanks,
Mike



Re: [PATCH 07/13] block: lift setting the readahead size into the block layer

2020-09-24 Thread Mike Snitzer
On Thu, Sep 24 2020 at  2:51am -0400,
Christoph Hellwig  wrote:

> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept.  Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk.  Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors.  To ensure the limits work well for stacking drivers a
> new helper is added to update the readahead limits from the block
> limits, which is also called from disk_stack_limits.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Coly Li 
> Reviewed-by: Johannes Thumshirn 

Thanks for adding blk_queue_update_readahead()

Reviewed-by: Mike Snitzer 



Re: [PATCH 2/3] dm: add support for passing through inline crypto support

2020-09-24 Thread Mike Snitzer
On Thu, Sep 24 2020 at  3:38am -0400,
Satya Tangirala  wrote:

> On Wed, Sep 23, 2020 at 09:21:03PM -0400, Mike Snitzer wrote:
> > On Wed, Sep 09 2020 at  7:44pm -0400,
> > Satya Tangirala  wrote:
> > 
> > > From: Eric Biggers 
> > > 
> > > Update the device-mapper core to support exposing the inline crypto
> > > support of the underlying device(s) through the device-mapper device.
> > > 
> > > This works by creating a "passthrough keyslot manager" for the dm
> > > device, which declares support for encryption settings which all
> > > underlying devices support.  When a supported setting is used, the bio
> > > cloning code handles cloning the crypto context to the bios for all the
> > > underlying devices.  When an unsupported setting is used, the blk-crypto
> > > fallback is used as usual.
> > > 
> > > Crypto support on each underlying device is ignored unless the
> > > corresponding dm target opts into exposing it.  This is needed because
> > > for inline crypto to semantically operate on the original bio, the data
> > > must not be transformed by the dm target.  Thus, targets like dm-linear
> > > can expose crypto support of the underlying device, but targets like
> > > dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> > > 
> > > When a key is evicted from the dm device, it is evicted from all
> > > underlying devices.
> > > 
> > > Signed-off-by: Eric Biggers 
> > > Co-developed-by: Satya Tangirala 
> > > Signed-off-by: Satya Tangirala 
> > > ---
> > >  block/blk-crypto.c  |  1 +
> > >  block/keyslot-manager.c | 34 
> > >  drivers/md/dm-core.h|  4 ++
> > >  drivers/md/dm-table.c   | 52 +++
> > >  drivers/md/dm.c | 92 -
> > >  include/linux/device-mapper.h   |  6 +++
> > >  include/linux/keyslot-manager.h |  7 +++
> > >  7 files changed, 195 insertions(+), 1 deletion(-)
> > > 

> > > diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> > > index c4ef1fceead6..4542050eebfc 100644
> > > --- a/drivers/md/dm-core.h
> > > +++ b/drivers/md/dm-core.h
> > > @@ -12,6 +12,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  
> > > @@ -49,6 +50,9 @@ struct mapped_device {
> > >  
> > >   int numa_node_id;
> > >   struct request_queue *queue;
> > > +#ifdef CONFIG_BLK_INLINE_ENCRYPTION
> > > + struct blk_keyslot_manager ksm;
> > > +#endif
> > >  
> > >   atomic_t holders;
> > >   atomic_t open_count;
> > 
> > Any reason you placed the ksm member where you did?
> > 
> > Looking at 'struct blk_keyslot_manager' I'm really hating adding that
> > bloat to every DM device for a feature that really won't see much broad
> > use (AFAIK).
> > 
> > Any chance you could allocate 'struct blk_keyslot_manager' as needed so
> > that most users of DM would only be carrying 1 extra pointer (set to
> > NULL)?
>
> I don't think there's any technical problem with doing that - the only
> other thing that would need addressing is that the patch uses
> "container_of" on that blk_keyslot_manager in dm_keyslot_evict() to get
> a pointer to the struct mapped_device. I could try adding a "private"
> field to struct blk_keyslot_manager and store a pointer to the struct
> mapped_device there).

Yes, that'd be ideal.

As for the lifetime of the struct blk_keyslot_manager pointer DM would
manage (in your future code revision): you mentioned in one reply that
the request_queue takes care of setting up the ksm... but the ksm
is tied to the queue at a later phase using blk_ksm_register(). 

In any case, I think my feature request (to have DM allocate the ksm
struct only as needed) is a bit challenging because of how DM allocates
the request_queue upfront in alloc_dev() and then later completes the
request_queue initialization based on DM_TYPE* in dm_setup_md_queue().

It _could_ be that you'll need to add a new DM_TYPE_KSM_BIO_BASED or
something.  But you have a catch-22 in that the dm-table.c code to
establish the intersection of supported modes assumes ksm is already
allocated.  So something needs to give by reasoning through: _what_ is
the invariant that will trigger the delayed allocation of the ksm
struct?  I don't yet see how you can make that informed decision that
the target(s) in the DM table _will_ use the ksm if it exists.

But then once the ksm is allocated, it never gets allocated again
because md->queue->ksm is already set, and it inherits the lifetime that
is used when destroying the mapped_device (md->queue, etc).
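A userspace sketch of what that delayed allocation plus a private
back-pointer (instead of container_of()) could look like -- all names here
are hypothetical stand-ins, not the real kernel structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for struct blk_keyslot_manager with a private owner pointer. */
struct fake_ksm {
	void *priv;		/* back-pointer to the owning device */
	unsigned int modes;	/* stand-in for capability bits */
};

/* Stand-in for struct mapped_device: one pointer instead of an
 * embedded struct, so non-users only carry a NULL. */
struct fake_mapped_device {
	struct fake_ksm *ksm;	/* NULL until first needed */
};

/* Allocate the ksm on demand and wire up the back-pointer. */
static struct fake_ksm *md_get_ksm(struct fake_mapped_device *md)
{
	if (!md->ksm) {
		md->ksm = calloc(1, sizeof(*md->ksm));
		if (md->ksm)
			md->ksm->priv = md;
	}
	return md->ksm;
}

/* Eviction-style callback: recover the owner without container_of(). */
static struct fake_mapped_device *ksm_to_md(struct fake_ksm *ksm)
{
	return ksm->priv;
}
```

The point of the sketch is only the ownership shape: a NULL pointer for
devices that never use inline crypto, one allocation on first use, and a
priv field so the eviction callback no longer needs the struct embedded.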

Mike



Re: [PATCH 2/3] dm: add support for passing through inline crypto support

2020-09-24 Thread Mike Snitzer
On Thu, Sep 24 2020 at  3:17am -0400,
Satya Tangirala  wrote:

> On Wed, Sep 23, 2020 at 09:14:39PM -0400, Mike Snitzer wrote:
> > On Mon, Sep 21 2020 at  8:32pm -0400,
> > Eric Biggers  wrote:
> > 
> > > On Wed, Sep 09, 2020 at 11:44:21PM +, Satya Tangirala wrote:
> > > > From: Eric Biggers 
> > > > 
> > > > Update the device-mapper core to support exposing the inline crypto
> > > > support of the underlying device(s) through the device-mapper device.
> > > > 
> > > > This works by creating a "passthrough keyslot manager" for the dm
> > > > device, which declares support for encryption settings which all
> > > > underlying devices support.  When a supported setting is used, the bio
> > > > cloning code handles cloning the crypto context to the bios for all the
> > > > underlying devices.  When an unsupported setting is used, the blk-crypto
> > > > fallback is used as usual.
> > > > 
> > > > Crypto support on each underlying device is ignored unless the
> > > > corresponding dm target opts into exposing it.  This is needed because
> > > > for inline crypto to semantically operate on the original bio, the data
> > > > must not be transformed by the dm target.  Thus, targets like dm-linear
> > > > can expose crypto support of the underlying device, but targets like
> > > > dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> > > > 
> > > > When a key is evicted from the dm device, it is evicted from all
> > > > underlying devices.
> > > > 
> > > > Signed-off-by: Eric Biggers 
> > > > Co-developed-by: Satya Tangirala 
> > > > Signed-off-by: Satya Tangirala 
> > > 
> > > Looks good as far as Satya's changes from my original patch are concerned.
> > > 
> > > Can the device-mapper maintainers take a look at this?
> > 
> > In general it looks like these changes were implemented very carefully
> > and are reasonable if we _really_ want to enable passing through inline
> > crypto.
> > 
> > I do have concerns about the inability to handle changes at runtime (due
> > to a table reload that introduces new devices without the encryption
> > settings the existing devices in the table are using).  But the fallback
> > mechanism saves it from being a complete non-starter.
>
> Unfortunately, the fallback doesn't completely handle that situation
> right now. The DM device could be suspended while an upper layer like
> fscrypt is doing something like "checking if encryption algorithm 'A'
> is supported by the DM device". It's possible that fscrypt thinks
> the DM device supports 'A' even though the DM device is suspended, and
> the table is about to be reloaded to introduce a new device that doesn't
> support 'A'. Before the DM device is resumed with the new table, fscrypt
> might send a bio that uses encryption algorithm 'A' without initializing
> the blk-crypto-fallback ciphers for 'A', because it believes that the DM
> device supports 'A'. When the bio gets processed by the DM (or when
> blk-crypto does its checks to decide whether to use the fallback on that
> bio), the bio will fail because the fallback ciphers aren't initialized.
> 
> Off the top of my head, one thing we could do is to always allocate the
> fallback ciphers when the device mapper is the target device for the bio
> (by maybe adding a "encryption_capabilities_may_change_at_runtime" flag
> to struct blk_keyslot_manager that the DM will set to true, and that
> the block layer will check for and decide to appropriately allocate
> the fallback ciphers), although this does waste memory on systems
> where we know the DM device tables will never change

Yeah, I agree that'd be too wasteful.
 
> This patch also doesn't handle the case when the encryption capabilities
> of the new table are a superset of the old capabilities.  Currently, a
> DM device's capabilities can only shrink after the device is initially
> created. They can never "expand" to make use of capabilities that might
> be added due to introduction of new devices via table reloads.  I might
> be forgetting something I thought of before, but looking at it again
> now, I don't immediately see anything wrong with expanding the
> advertised capabilities on table reloadI'll look carefully into that
> again.

OK, that'd be good (expanding capabilities on reload).

And conversely, you _could_ also fail a reload if the new device(s)
don't have capabilities that are in use by the active table.
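That check could be as simple as a mask comparison; a minimal sketch, with
capabilities modeled as a single bitmask (the real keyslot manager tracks
per-mode and per-DUN-size arrays, so this only models the policy):

```c
#include <assert.h>

/*
 * Allow a table reload only if the new table still supports every
 * capability the active table currently has in use.  Expanding
 * capabilities (new_caps is a superset) is always fine.
 */
static int reload_allowed(unsigned int in_use, unsigned int new_caps)
{
	return (in_use & new_caps) == in_use;
}
```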

> > Can you help me better understand the expected consumer of this code?
> > If you have something _real_ please be explicit.  It makes justifying
> > supporting niche code like this more tolerable.
>
> So the motivation for this code was that Android currently uses a device
> mapper target on top of a phone's disk for user data. On many phones,
> that disk has inline encryption support, and it'd be great to be able to
> make use of that. The DM device configuration isn't changed at runtime.

OK, which device mapper target is used?

Thanks,
Mike



Re: [PATCH 2/3] dm: add support for passing through inline crypto support

2020-09-24 Thread Mike Snitzer
On Thu, Sep 24 2020 at  3:48am -0400,
Satya Tangirala  wrote:

> On Wed, Sep 23, 2020 at 09:21:03PM -0400, Mike Snitzer wrote:
> > On Wed, Sep 09 2020 at  7:44pm -0400,
> > Satya Tangirala  wrote:
> > 
> > > From: Eric Biggers 
> > > 
> > > Update the device-mapper core to support exposing the inline crypto
> > > support of the underlying device(s) through the device-mapper device.
> > > 
> > > This works by creating a "passthrough keyslot manager" for the dm
> > > device, which declares support for encryption settings which all
> > > underlying devices support.  When a supported setting is used, the bio
> > > cloning code handles cloning the crypto context to the bios for all the
> > > underlying devices.  When an unsupported setting is used, the blk-crypto
> > > fallback is used as usual.
> > > 
> > > Crypto support on each underlying device is ignored unless the
> > > corresponding dm target opts into exposing it.  This is needed because
> > > for inline crypto to semantically operate on the original bio, the data
> > > must not be transformed by the dm target.  Thus, targets like dm-linear
> > > can expose crypto support of the underlying device, but targets like
> > > dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> > > 
> > > When a key is evicted from the dm device, it is evicted from all
> > > underlying devices.
> > > 
> > > Signed-off-by: Eric Biggers 
> > > Co-developed-by: Satya Tangirala 
> > > Signed-off-by: Satya Tangirala 
> > > ---
> > >  block/blk-crypto.c  |  1 +
> > >  block/keyslot-manager.c | 34 
> > >  drivers/md/dm-core.h|  4 ++
> > >  drivers/md/dm-table.c   | 52 +++
> > >  drivers/md/dm.c | 92 -
> > >  include/linux/device-mapper.h   |  6 +++
> > >  include/linux/keyslot-manager.h |  7 +++
> > >  7 files changed, 195 insertions(+), 1 deletion(-)
> > > 

> > > diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> > > index c4ef1fceead6..4542050eebfc 100644
> > > --- a/drivers/md/dm-core.h
> > > +++ b/drivers/md/dm-core.h
> > > @@ -12,6 +12,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  
> > > @@ -49,6 +50,9 @@ struct mapped_device {
> > >  
> > >   int numa_node_id;
> > >   struct request_queue *queue;
> > > +#ifdef CONFIG_BLK_INLINE_ENCRYPTION
> > > + struct blk_keyslot_manager ksm;
> > > +#endif
> > >  
> > >   atomic_t holders;
> > >   atomic_t open_count;
> > 
> > Any reason you placed the ksm member where you did?
>
> As in, any reason why it's placed right after the struct request_queue
> *queue? The ksm is going to be set up in the request_queue and is a part
> of the request_queue in some sense, so it seemed reasonable to me to
> group them together, but I don't think there's any reason it *has* to
> be there, if you think it should be put elsewhere (or maybe I'm
> misunderstanding your question :) ).

Placing the full struct where you did is highly disruptive to the prior
care taken to tune alignment of struct members within mapped_device.

Switching to a pointer will be less so, but even still it might be best
to either find a hole in the struct (not looked recently, but there may
not be one) or simply put it at the end of the structure.

The pahole utility is very useful for this kind of struct member
placement, etc.  But it is increasingly unavailable in modern Linux
distros...
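To illustrate the kind of hole pahole would flag: dropping a pointer-sized
member into the middle of a struct can open padding before and after it
that reordering (or appending at the end) avoids.  Exact sizes are
ABI-dependent; the relations asserted below hold on common 64-bit ABIs.

```c
#include <assert.h>
#include <stddef.h>

struct holey {
	char flag;
	void *ksm;	/* misplaced: padding opens up around it */
	char state;
};

struct packed_order {
	void *ksm;	/* aligned members first ...          */
	char flag;	/* ... small members grouped at the end */
	char state;
};
```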

Mike



Re: [PATCH 2/3] dm: add support for passing through inline crypto support

2020-09-23 Thread Mike Snitzer
On Wed, Sep 09 2020 at  7:44pm -0400,
Satya Tangirala  wrote:

> From: Eric Biggers 
> 
> Update the device-mapper core to support exposing the inline crypto
> support of the underlying device(s) through the device-mapper device.
> 
> This works by creating a "passthrough keyslot manager" for the dm
> device, which declares support for encryption settings which all
> underlying devices support.  When a supported setting is used, the bio
> cloning code handles cloning the crypto context to the bios for all the
> underlying devices.  When an unsupported setting is used, the blk-crypto
> fallback is used as usual.
> 
> Crypto support on each underlying device is ignored unless the
> corresponding dm target opts into exposing it.  This is needed because
> for inline crypto to semantically operate on the original bio, the data
> must not be transformed by the dm target.  Thus, targets like dm-linear
> can expose crypto support of the underlying device, but targets like
> dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> 
> When a key is evicted from the dm device, it is evicted from all
> underlying devices.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 
> ---
>  block/blk-crypto.c  |  1 +
>  block/keyslot-manager.c | 34 
>  drivers/md/dm-core.h|  4 ++
>  drivers/md/dm-table.c   | 52 +++
>  drivers/md/dm.c | 92 -
>  include/linux/device-mapper.h   |  6 +++
>  include/linux/keyslot-manager.h |  7 +++
>  7 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> index 2d5e60023b08..33555cf0e3e7 100644
> --- a/block/blk-crypto.c
> +++ b/block/blk-crypto.c
> @@ -402,3 +402,4 @@ int blk_crypto_evict_key(struct request_queue *q,
>*/
>   return blk_crypto_fallback_evict_key(key);
>  }
> +EXPORT_SYMBOL_GPL(blk_crypto_evict_key);
> diff --git a/block/keyslot-manager.c b/block/keyslot-manager.c
> index 60ac406d54b9..e0f776c38d8a 100644
> --- a/block/keyslot-manager.c
> +++ b/block/keyslot-manager.c
> @@ -416,6 +416,40 @@ void blk_ksm_unregister(struct request_queue *q)
>  {
>   q->ksm = NULL;
>  }
> +EXPORT_SYMBOL_GPL(blk_ksm_unregister);
> +
> +/**
> + * blk_ksm_intersect_modes() - restrict supported modes by child device
> + * @parent: The keyslot manager for parent device
> + * @child: The keyslot manager for child device, or NULL
> + *
> + * Clear any crypto mode support bits in @parent that aren't set in @child.
> + * If @child is NULL, then all parent bits are cleared.
> + *
> + * Only use this when setting up the keyslot manager for a layered device,
> + * before it's been exposed yet.
> + */
> +void blk_ksm_intersect_modes(struct blk_keyslot_manager *parent,
> +  const struct blk_keyslot_manager *child)
> +{
> + if (child) {
> + unsigned int i;
> +
> + parent->max_dun_bytes_supported =
> + min(parent->max_dun_bytes_supported,
> + child->max_dun_bytes_supported);
> + for (i = 0; i < ARRAY_SIZE(child->crypto_modes_supported);
> +  i++) {
> + parent->crypto_modes_supported[i] &=
> + child->crypto_modes_supported[i];
> + }
> + } else {
> + parent->max_dun_bytes_supported = 0;
> + memset(parent->crypto_modes_supported, 0,
> +sizeof(parent->crypto_modes_supported));
> + }
> +}
> +EXPORT_SYMBOL_GPL(blk_ksm_intersect_modes);
>  
>  /**
>   * blk_ksm_init_passthrough() - Init a passthrough keyslot manager
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index c4ef1fceead6..4542050eebfc 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -12,6 +12,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -49,6 +50,9 @@ struct mapped_device {
>  
>   int numa_node_id;
>   struct request_queue *queue;
> +#ifdef CONFIG_BLK_INLINE_ENCRYPTION
> + struct blk_keyslot_manager ksm;
> +#endif
>  
>   atomic_t holders;
>   atomic_t open_count;

Any reason you placed the ksm member where you did?

Looking at 'struct blk_keyslot_manager' I'm really hating adding that
bloat to every DM device for a feature that really won't see much broad
use (AFAIK).

Any chance you could allocate 'struct blk_keyslot_manager' as needed so
that most users of DM would only be carrying 1 extra pointer (set to
NULL)?

Thanks,
Mike



Re: [PATCH 2/3] dm: add support for passing through inline crypto support

2020-09-23 Thread Mike Snitzer
On Mon, Sep 21 2020 at  8:32pm -0400,
Eric Biggers  wrote:

> On Wed, Sep 09, 2020 at 11:44:21PM +, Satya Tangirala wrote:
> > From: Eric Biggers 
> > 
> > Update the device-mapper core to support exposing the inline crypto
> > support of the underlying device(s) through the device-mapper device.
> > 
> > This works by creating a "passthrough keyslot manager" for the dm
> > device, which declares support for encryption settings which all
> > underlying devices support.  When a supported setting is used, the bio
> > cloning code handles cloning the crypto context to the bios for all the
> > underlying devices.  When an unsupported setting is used, the blk-crypto
> > fallback is used as usual.
> > 
> > Crypto support on each underlying device is ignored unless the
> > corresponding dm target opts into exposing it.  This is needed because
> > for inline crypto to semantically operate on the original bio, the data
> > must not be transformed by the dm target.  Thus, targets like dm-linear
> > can expose crypto support of the underlying device, but targets like
> > dm-crypt can't.  (dm-crypt could use inline crypto itself, though.)
> > 
> > When a key is evicted from the dm device, it is evicted from all
> > underlying devices.
> > 
> > Signed-off-by: Eric Biggers 
> > Co-developed-by: Satya Tangirala 
> > Signed-off-by: Satya Tangirala 
> 
> Looks good as far as Satya's changes from my original patch are concerned.
> 
> Can the device-mapper maintainers take a look at this?

In general it looks like these changes were implemented very carefully
and are reasonable if we _really_ want to enable passing through inline
crypto.

I do have concerns about the inability to handle changes at runtime (due
to a table reload that introduces new devices without the encryption
settings the existing devices in the table are using).  But the fallback
mechanism saves it from being a complete non-starter.

Can you help me better understand the expected consumer of this code?
If you have something _real_ please be explicit.  It makes justifying
supporting niche code like this more tolerable.

Thanks,
Mike



Re: [PATCH 6/6] mm: Add memalloc_nowait

2020-09-23 Thread Mike Snitzer
On Thu, Jun 25 2020 at  7:31am -0400,
Matthew Wilcox (Oracle)  wrote:

> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  drivers/md/dm-bufio.c| 30 --
>  include/linux/sched.h|  1 +
>  include/linux/sched/mm.h | 12 
>  3 files changed, 17 insertions(+), 26 deletions(-)


Hi,

Curious on the state of this patchset?  Not seeing it in next-20200923

The dm-bufio cleanup looks desirable.

Thanks,
Mike



Re: [PATCH 06/14] block: lift setting the readahead size into the block layer

2020-09-10 Thread Mike Snitzer
On Thu, Sep 10 2020 at  5:28am -0400,
Christoph Hellwig  wrote:

> On Wed, Sep 02, 2020 at 12:20:07PM -0400, Mike Snitzer wrote:
> > On Wed, Sep 02 2020 at 11:11am -0400,
> > Christoph Hellwig  wrote:
> > 
> > > On Wed, Aug 26, 2020 at 06:07:38PM -0400, Mike Snitzer wrote:
> > > > On Sun, Jul 26 2020 at 11:03am -0400,
> > > > Christoph Hellwig  wrote:
> > > > 
> > > > > Drivers shouldn't really mess with the readahead size, as that is a VM
> > > > > concept.  Instead set it based on the optimal I/O size by lifting the
> > > > > algorithm from the md driver when registering the disk.  Also set
> > > > > bdi->io_pages there as well by applying the same scheme based on
> > > > > max_sectors.
> > > > > 
> > > > > Signed-off-by: Christoph Hellwig 
> > > > > ---
> > > > >  block/blk-settings.c |  5 ++---
> > > > >  block/blk-sysfs.c|  1 -
> > > > >  block/genhd.c| 13 +++--
> > > > >  drivers/block/aoe/aoeblk.c   |  2 --
> > > > >  drivers/block/drbd/drbd_nl.c | 12 +---
> > > > >  drivers/md/bcache/super.c|  4 
> > > > >  drivers/md/dm-table.c|  3 ---
> > > > >  drivers/md/raid0.c   | 16 
> > > > >  drivers/md/raid10.c  | 24 +---
> > > > >  drivers/md/raid5.c   | 13 +
> > > > >  10 files changed, 16 insertions(+), 77 deletions(-)
> > > > 
> > > > 
> > > > In general these changes need a solid audit relative to stacking
> > > > drivers.  That is, the limits stacking methods (blk_stack_limits)
> > > > vs lower level allocation methods (__device_add_disk).
> > > > 
> > > > You optimized for lowlevel __device_add_disk establishing the bdi's
> > > > ra_pages and io_pages.  That is at the beginning of disk allocation,
> > > > well before any build up of stacking driver's queue_io_opt() -- which
> > > > was previously done in disk_stack_limits or driver specific methods
> > > > (e.g. dm_table_set_restrictions) that are called _after_ all the limits
> > > > stacking occurs.
> > > > 
> > > > By inverting the setting of the bdi's ra_pages and io_pages to be done
> > > > so early in __device_add_disk it'll break properly setting these values
> > > > for at least DM afaict.
> > > 
> > > ra_pages never got inherited by stacking drivers, check it by modifying
> > > it on an underlying device and then creating a trivial dm or md one.
> > 
> > Sure, not saying that it did.  But if the goal is to set ra_pages based
> > on io_opt then to do that correctly on stacking drivers it must be done
> > in terms of limits stacking right?  Or at least done at a location that
> > is after the limits stacking has occurred?  So should DM just open-code
> > setting ra_pages like it did for io_pages?
> > 
> > Because setting ra_pages in __device_add_disk() is way too early for DM
> > -- given it uses device_add_disk_no_queue_reg via add_disk_no_queue_reg
> > at DM device creation (before stacking all underlying devices' limits).
> 
> I'll move it to blk_register_queue, which should work just fine.

That'll work for initial DM table load as part of DM device creation
(dm_setup_md_queue).  But it won't account for DM table reloads that
might change underlying devices on a live DM device (done using
__bind).

Both dm_setup_md_queue() and __bind() call dm_table_set_restrictions()
to set/update queue_limits.  It feels like __bind() will need to call a
new block helper to set/update parts of queue_limits (e.g. ra_pages and
io_pages).

Any chance you're open to factoring out that block function as an
exported symbol for use by blk_register_queue() and code like DM's
__bind()?

Thanks,
Mike



Re: [PATCH 06/14] block: lift setting the readahead size into the block layer

2020-09-02 Thread Mike Snitzer
On Wed, Sep 02 2020 at 11:11am -0400,
Christoph Hellwig  wrote:

> On Wed, Aug 26, 2020 at 06:07:38PM -0400, Mike Snitzer wrote:
> > On Sun, Jul 26 2020 at 11:03am -0400,
> > Christoph Hellwig  wrote:
> > 
> > > Drivers shouldn't really mess with the readahead size, as that is a VM
> > > concept.  Instead set it based on the optimal I/O size by lifting the
> > > algorithm from the md driver when registering the disk.  Also set
> > > bdi->io_pages there as well by applying the same scheme based on
> > > max_sectors.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > > ---
> > >  block/blk-settings.c |  5 ++---
> > >  block/blk-sysfs.c|  1 -
> > >  block/genhd.c| 13 +++--
> > >  drivers/block/aoe/aoeblk.c   |  2 --
> > >  drivers/block/drbd/drbd_nl.c | 12 +---
> > >  drivers/md/bcache/super.c|  4 
> > >  drivers/md/dm-table.c|  3 ---
> > >  drivers/md/raid0.c   | 16 
> > >  drivers/md/raid10.c  | 24 +---
> > >  drivers/md/raid5.c   | 13 +
> > >  10 files changed, 16 insertions(+), 77 deletions(-)
> > 
> > 
> > In general these changes need a solid audit relative to stacking
> > drivers.  That is, the limits stacking methods (blk_stack_limits)
> > vs lower level allocation methods (__device_add_disk).
> > 
> > You optimized for lowlevel __device_add_disk establishing the bdi's
> > ra_pages and io_pages.  That is at the beginning of disk allocation,
> > well before any build up of stacking driver's queue_io_opt() -- which
> > was previously done in disk_stack_limits or driver specific methods
> > (e.g. dm_table_set_restrictions) that are called _after_ all the limits
> > stacking occurs.
> > 
> > By inverting the setting of the bdi's ra_pages and io_pages to be done
> > so early in __device_add_disk it'll break properly setting these values
> > for at least DM afaict.
> 
> ra_pages never got inherited by stacking drivers, check it by modifying
> it on an underlying device and then creating a trivial dm or md one.

Sure, not saying that it did.  But if the goal is to set ra_pages based
on io_opt then to do that correctly on stacking drivers it must be done
in terms of limits stacking right?  Or at least done at a location that
is after the limits stacking has occurred?  So should DM just open-code
setting ra_pages like it did for io_pages?

Because setting ra_pages in __device_add_disk() is way too early for DM
-- given it uses device_add_disk_no_queue_reg via add_disk_no_queue_reg
at DM device creation (before stacking all underlying devices' limits).
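(For reference, the io_opt-based sizing in question reduces to the
following; a minimal userspace sketch, with VM_READAHEAD_PAGES hardcoded
to its 4K-page value -- in the kernel it is 128K / PAGE_SIZE, i.e. 32.)

```c
#include <assert.h>

#define PAGE_SIZE_		4096UL
#define VM_READAHEAD_PAGES_	(131072UL / PAGE_SIZE_)	/* 32 on 4K pages */

/*
 * ra_pages = max(io_opt * 2 / PAGE_SIZE, VM_READAHEAD_PAGES):
 * readahead at least twice the optimal I/O size, never below
 * the VM default.
 */
static unsigned long ra_pages_for(unsigned long io_opt)
{
	unsigned long pages = io_opt * 2 / PAGE_SIZE_;

	return pages > VM_READAHEAD_PAGES_ ? pages : VM_READAHEAD_PAGES_;
}
```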

> And I think that is a good thing - in general we shouldn't really mess
> with this thing from drivers if we can avoid it.  I've kept the legacy
> aoe and md parity raid cases, out of which the first looks pretty weird
> and the md one at least remotely sensible.

I don't want drivers, like DM, to have to worry about these.  So I agree
with that goal ;)

> ->io_pages is still inherited in disk_stack_limits, just like before
> so no change either.

I'm missing where, but I only looked closer at this 06/14 patch.
In it I see io_pages is no longer adjusted in disk_stack_limits().

Mike



Re: [PATCH 06/14] block: lift setting the readahead size into the block layer

2020-08-26 Thread Mike Snitzer
On Sun, Jul 26 2020 at 11:03am -0400,
Christoph Hellwig  wrote:

> Drivers shouldn't really mess with the readahead size, as that is a VM
> concept.  Instead set it based on the optimal I/O size by lifting the
> algorithm from the md driver when registering the disk.  Also set
> bdi->io_pages there as well by applying the same scheme based on
> max_sectors.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-settings.c |  5 ++---
>  block/blk-sysfs.c|  1 -
>  block/genhd.c| 13 +++--
>  drivers/block/aoe/aoeblk.c   |  2 --
>  drivers/block/drbd/drbd_nl.c | 12 +---
>  drivers/md/bcache/super.c|  4 
>  drivers/md/dm-table.c|  3 ---
>  drivers/md/raid0.c   | 16 
>  drivers/md/raid10.c  | 24 +---
>  drivers/md/raid5.c   | 13 +
>  10 files changed, 16 insertions(+), 77 deletions(-)


In general these changes need a solid audit relative to stacking
drivers.  That is, the limits stacking methods (blk_stack_limits)
vs lower level allocation methods (__device_add_disk).

You optimized for lowlevel __device_add_disk establishing the bdi's
ra_pages and io_pages.  That is at the beginning of disk allocation,
well before any build up of stacking driver's queue_io_opt() -- which
was previously done in disk_stack_limits or driver specific methods
(e.g. dm_table_set_restrictions) that are called _after_ all the limits
stacking occurs.

By inverting the setting of the bdi's ra_pages and io_pages to be done
so early in __device_add_disk it'll break properly setting these values
for at least DM afaict.

Mike


> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 76a7e03bcd6cac..01049e9b998f1d 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -452,6 +452,8 @@ EXPORT_SYMBOL(blk_limits_io_opt);
>  void blk_queue_io_opt(struct request_queue *q, unsigned int opt)
>  {
>   blk_limits_io_opt(&q->limits, opt);
> + q->backing_dev_info->ra_pages =
> + max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
>  }
>  EXPORT_SYMBOL(blk_queue_io_opt);
>  
> @@ -628,9 +630,6 @@ void disk_stack_limits(struct gendisk *disk, struct 
> block_device *bdev,
>   printk(KERN_NOTICE "%s: Warning: Device %s is misaligned\n",
>  top, bottom);
>   }
> -
> - t->backing_dev_info->io_pages =
> - t->limits.max_sectors >> (PAGE_SHIFT - 9);
>  }
>  EXPORT_SYMBOL(disk_stack_limits);
>  
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7dda709f3ccb6f..ce418d9128a0b2 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -245,7 +245,6 @@ queue_max_sectors_store(struct request_queue *q, const 
> char *page, size_t count)
>  
>   spin_lock_irq(&q->queue_lock);
>   q->limits.max_sectors = max_sectors_kb << 1;
> - q->backing_dev_info->io_pages = max_sectors_kb >> (PAGE_SHIFT - 10);
>   spin_unlock_irq(&q->queue_lock);
>  
>   return ret;
> diff --git a/block/genhd.c b/block/genhd.c
> index 8b1e9f48957cb5..097d4e4bc0b8a2 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -775,6 +775,7 @@ static void __device_add_disk(struct device *parent, 
> struct gendisk *disk,
> const struct attribute_group **groups,
> bool register_queue)
>  {
> + struct request_queue *q = disk->queue;
>   dev_t devt;
>   int retval;
>  
> @@ -785,7 +786,7 @@ static void __device_add_disk(struct device *parent, 
> struct gendisk *disk,
>* registration.
>*/
>   if (register_queue)
> - elevator_init_mq(disk->queue);
> + elevator_init_mq(q);
>  
>   /* minors == 0 indicates to use ext devt from part0 and should
>* be accompanied with EXT_DEVT flag.  Make sure all
> @@ -815,10 +816,18 @@ static void __device_add_disk(struct device *parent, 
> struct gendisk *disk,
>   disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
>   disk->flags |= GENHD_FL_NO_PART_SCAN;
>   } else {
> - struct backing_dev_info *bdi = disk->queue->backing_dev_info;
> + struct backing_dev_info *bdi = q->backing_dev_info;
>   struct device *dev = disk_to_dev(disk);
>   int ret;
>  
> + /*
> +  * For read-ahead of large files to be effective, we need to
> +  * readahead at least twice the optimal I/O size.
> +  */
> + bdi->ra_pages = max(queue_io_opt(q) * 2 / PAGE_SIZE,
> + VM_READAHEAD_PAGES);
> + bdi->io_pages = queue_max_sectors(q) >> (PAGE_SHIFT - 9);
> +
>   /* Register BDI before referencing it from bdev */
>   dev->devt = devt;
>   ret = bdi_register(bdi, "%u:%u", MAJOR(devt), MINOR(devt));
> diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
> index 5ca7216e9e01f3..89b33b402b4e52 100644

Re: drivers/md/dm-mpath.c:524 multipath_clone_and_map() error: double unlocked 'm->lock' (orig line 516)

2020-08-08 Thread Mike Snitzer
On Sat, Aug 08 2020 at  8:10am -0400,
kernel test robot  wrote:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   449dc8c97089a6e09fb2dac4d92b1b7ac0eb7c1e
> commit: 374117ad4736c5a4f8012cfe59fc07d9d58191d5 dm mpath: use double checked 
> locking in fast path
> date:   4 weeks ago
> config: arm-randconfig-m031-20200808 (attached as .config)
> compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 

All 3 recent reports about smatch showing "double unlocked 'm->lock'"
appear to be bogus.  I think smatch is generating false positives (and
given "Old smatch warnings" it seems it has been doing so for some
time).

In addition, does l...@intel.com no longer test linux-next?  The dm-mpath
locking changes that were just merged into mainline have been in
linux-next since July 13.  Why were these tests only done against
mainline?

Mike



> New smatch warnings:
> drivers/md/dm-mpath.c:524 multipath_clone_and_map() error: double unlocked 
> 'm->lock' (orig line 516)
> 
> Old smatch warnings:
> drivers/md/dm-mpath.c:446 choose_pgpath() error: double unlocked 'm->lock' 
> (orig line 416)
> drivers/md/dm-mpath.c:457 choose_pgpath() error: double unlocked 'm->lock' 
> (orig line 403)
> drivers/md/dm-mpath.c:525 multipath_clone_and_map() error: double unlocked 
> 'm->lock' (orig line 516)
> drivers/md/dm-mpath.c:526 multipath_clone_and_map() error: double unlocked 
> 'm->lock' (orig line 524)
> drivers/md/dm-mpath.c:626 __map_bio() error: double unlocked 'm->lock' (orig 
> line 615)
> drivers/md/dm-mpath.c:627 __map_bio() error: double unlocked 'm->lock' (orig 
> line 615)
> drivers/md/dm-mpath.c:628 __map_bio() error: double unlocked 'm->lock' (orig 
> line 626)
> drivers/md/dm-mpath.c:629 __map_bio() error: double unlocked 'm->lock' (orig 
> line 628)
> drivers/md/dm-mpath.c:1607 pg_init_done() error: double unlocked 'm->lock' 
> (orig line 1560)
> drivers/md/dm-mpath.c:1707 multipath_end_io_bio() error: double unlocked 
> 'm->lock' (orig line 1704)
> drivers/md/dm-mpath.c:1988 multipath_prepare_ioctl() error: double unlocked 
> 'm->lock' (orig line 1984)
> drivers/md/dm-mpath.c:2012 multipath_prepare_ioctl() error: double unlocked 
> 'm->lock' (orig line 2001)
> 
> vim +524 drivers/md/dm-mpath.c
> 
>    498	
>    499	/*
>    500	 * Map cloned requests (request-based multipath)
>    501	 */
>    502	static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>    503					   union map_info *map_context,
>    504					   struct request **__clone)
>    505	{
>    506		struct multipath *m = ti->private;
>    507		size_t nr_bytes = blk_rq_bytes(rq);
>    508		struct pgpath *pgpath;
>    509		struct block_device *bdev;
>    510		struct dm_mpath_io *mpio = get_mpio(map_context);
>    511		struct request_queue *q;
>    512		struct request *clone;
>    513	
>    514		/* Do we need to select a new pgpath? */
>    515		pgpath = READ_ONCE(m->current_pgpath);
>  > 516		if (!pgpath || !mpath_double_check_test_bit(MPATHF_QUEUE_IO, m))
>    517			pgpath = choose_pgpath(m, nr_bytes);
>    518	
>    519		if (!pgpath) {
>    520			if (must_push_back_rq(m))
>    521				return DM_MAPIO_DELAY_REQUEUE;
>    522			dm_report_EIO(m);	/* Failed */
>    523			return DM_MAPIO_KILL;
>  > 524		} else if (mpath_double_check_test_bit(MPATHF_QUEUE_IO, m) ||
>    525			   mpath_double_check_test_bit(MPATHF_PG_INIT_REQUIRED, m)) {
>    526			pg_init_all_paths(m);
>    527			return DM_MAPIO_DELAY_REQUEUE;
>    528		}
>    529	
>    530		mpio->pgpath = pgpath;
>    531		mpio->nr_bytes = nr_bytes;
>    532	
>    533		bdev = pgpath->path.dev->bdev;
>    534		q = bdev_get_queue(bdev);
>    535		clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
>    536				BLK_MQ_REQ_NOWAIT);
>    537		if (IS_ERR(clone)) {
>    538			/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>    539			if (blk_queue_dying(q)) {
>    540				atomic_inc(&m->pg_init_in_progress);
>    541				activate_or_offline_path(pgpath);
>    542				return DM_MAPIO_DELAY_REQUEUE;
>    543			}
>    544	
>    545			/*
>    546			 * blk-mq's 

fixing 4.14-stable's broken DM cache writethrough support [was: Re: [(resend) PATCH v3: {linux-4.14.y} ] dm cache: submit writethrough writes in parallel to origin and cache]

2020-08-05 Thread Mike Snitzer
On Wed, Aug 05 2020 at 10:32am -0400,
Greg KH  wrote:

> On Tue, Aug 04, 2020 at 02:20:38PM -0400, Mike Snitzer wrote:
> > On Tue, Aug 04 2020 at  8:47am -0400,
> > Greg KH  wrote:
> > 
> > > On Tue, Aug 04, 2020 at 07:33:05AM -0500, John Donnelly wrote:
> > > > From: Mike Snitzer 
> > > > 
> > > > Discontinue issuing writethrough write IO in series to the origin and
> > > > then cache.
> > > > 
> > > > Use bio_clone_fast() to create a new origin clone bio that will be
> > > > mapped to the origin device and then bio_chain() it to the bio that gets
> > > > remapped to the cache device.  The origin clone bio does _not_ have a
> > > > copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
> > > > be called.
> > > > 
> > > > The cache bio (parent bio) will not complete until the origin bio has
> > > > completed -- this fulfills bio_clone_fast()'s requirements as well as
> > > > the requirement to not complete the original IO until the write IO has
> > > > completed to both the origin and cache device.
> > > > 
> > > > Signed-off-by: Mike Snitzer 
> > > > 
> > > > (cherry picked from commit 2df3bae9a6543e90042291707b8db0cbfbae9ee9)
> > > > 
> > > > Fixes: 4ec34f2196d125ff781170ddc6c3058c08ec5e73 (dm bio record:
> > > > save/restore bi_end_io and bi_integrity )
> > > > 
> > > > 4ec34f21 introduced a mkfs.ext4 hang on a LVM device that has been
> > > > modified with lvconvert --cachemode=writethrough.
> > > > 
> > > > CC:sta...@vger.kernel.org for 4.14.y
> > > > 
> > > > Signed-off-by: John Donnelly 
> > > > Reviewed-by: Somasundaram Krishnasamy 
> > > > 
> > > > 
> > > > conflicts:
> > > > drivers/md/dm-cache-target.c. -  Corrected usage of
> > > > writethrough_mode(&cache->features) that was caught by
> > > > compiler, and removed unused static functions : 
> > > > writethrough_endio(),
> > > > defer_writethrough_bio(), wake_deferred_writethrough_worker()
> > > > that generated warnings.
> > > 
> > > What is this "conflicts nonsense"?  You don't see that in any other
> > > kernel patch changelog, do you?
> > > 
> > > > ---
> > > > drivers/md/dm-cache-target.c | 92 
> > > > ++--
> > > > 1 file changed, 37 insertions(+), 55 deletions(-)
> > > 
> > > Please fix your email client up, it's totally broken and this does not
> > > work at all and is getting frustrating from my side here.
> > > 
> > > Try sending emails to yourself and see if you can apply the patches, as
> > > the one you sent here does not work, again:
> > 
> > John's inability to submit a patch that can apply aside: I do not like
> > how this patch header is constructed (yet attributed "From" me).  It is
> > devoid of detail as it relates to stable@.
> > 
> > Greg, please don't apply the v4 of this patch either.  I'll craft a
> > proper stable@ patch that explains the reason for change and why we're
> > left having to resolve conflicts in stable@.
> > 
> > But first I need to focus on sending DM changes to Linus for v5.9 merge.
> 
> Ok, no worries, I'll drop all of these from my review queue and wait for
> something from you sometime in the future.

Hey Greg,

So I've looked this required 4.14 stable@ backport over. Because 4.14
already has these commits (to fix a dm integrity issue):
1b17159e52b dm bio record: save/restore bi_end_io and bi_integrity
248aa2645aa dm integrity: use dm_bio_record and dm_bio_restore

DM-cache's 4.14 writethrough mode got broken because its implementation
(ab)used dm_hook_bio+dm_bio_record and predates 4.15's switch to using
bio_chain() via commit 2df3bae9a654.  Without commit 2df3bae9a654 the
dm_hook_bio+dm_bio_record changes from commit 1b17159e52b break
dm-cache's writethrough support.

So 4.14-stable now needs these 3 upstream 4.15 commits:
8e3c3827776f dm cache: pass cache structure to mode functions
2df3bae9a654 dm cache: submit writethrough writes in parallel to origin and 
cache
9958f1d9a04e dm cache: remove all obsolete writethrough-specific code

Applying those commits to v4.14.190 with:
git cherry-pick -x 8e3c3827776f^..9958f1d9a04e

results in a kernel that successfully builds and should fix
4.14-stable's broken dm-cache writethrough support.

Are you ok with queueing up applying these 3 upstream commits to
4.14-stable or do you need me to send a patchset?

Thanks,
Mike



Re: [(resend) PATCH v3: {linux-4.14.y} ] dm cache: submit writethrough writes in parallel to origin and cache

2020-08-04 Thread Mike Snitzer
On Tue, Aug 04 2020 at  8:47am -0400,
Greg KH  wrote:

> On Tue, Aug 04, 2020 at 07:33:05AM -0500, John Donnelly wrote:
> > From: Mike Snitzer 
> > 
> > Discontinue issuing writethrough write IO in series to the origin and
> > then cache.
> > 
> > Use bio_clone_fast() to create a new origin clone bio that will be
> > mapped to the origin device and then bio_chain() it to the bio that gets
> > remapped to the cache device.  The origin clone bio does _not_ have a
> > copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
> > be called.
> > 
> > The cache bio (parent bio) will not complete until the origin bio has
> > completed -- this fulfills bio_clone_fast()'s requirements as well as
> > the requirement to not complete the original IO until the write IO has
> > completed to both the origin and cache device.
> > 
> > Signed-off-by: Mike Snitzer 
> > 
> > (cherry picked from commit 2df3bae9a6543e90042291707b8db0cbfbae9ee9)
> > 
> > Fixes: 4ec34f2196d125ff781170ddc6c3058c08ec5e73 (dm bio record:
> > save/restore bi_end_io and bi_integrity )
> > 
> > 4ec34f21 introduced a mkfs.ext4 hang on a LVM device that has been
> > modified with lvconvert --cachemode=writethrough.
> > 
> > CC:sta...@vger.kernel.org for 4.14.y
> > 
> > Signed-off-by: John Donnelly 
> > Reviewed-by: Somasundaram Krishnasamy 
> > 
> > conflicts:
> > drivers/md/dm-cache-target.c. -  Corrected usage of
> > writethrough_mode(&cache->features) that was caught by
> > compiler, and removed unused static functions : writethrough_endio(),
> > defer_writethrough_bio(), wake_deferred_writethrough_worker()
> > that generated warnings.
> 
> What is this "conflicts nonsense"?  You don't see that in any other
> kernel patch changelog, do you?
> 
> > ---
> > drivers/md/dm-cache-target.c | 92 
> > ++--
> > 1 file changed, 37 insertions(+), 55 deletions(-)
> 
> Please fix your email client up, it's totally broken and this does not
> work at all and is getting frustrating from my side here.
> 
> Try sending emails to yourself and see if you can apply the patches, as
> the one you sent here does not work, again:

John's inability to submit a patch that can apply aside: I do not like
how this patch header is constructed (yet attributed "From" me).  It is
devoid of detail as it relates to stable@.

Greg, please don't apply the v4 of this patch either.  I'll craft a
proper stable@ patch that explains the reason for change and why we're
left having to resolve conflicts in stable@.

But first I need to focus on sending DM changes to Linus for v5.9 merge.

Thanks,
Mike



Re: (resend) [PATCH [linux-4.14.y]] dm cache: submit writethrough writes in parallel to origin and cache

2020-07-29 Thread Mike Snitzer
On Wed, Jul 29 2020 at  7:55am -0400,
Greg KH  wrote:

> On Wed, Jul 29, 2020 at 01:51:19PM +0200, Greg KH wrote:
> > On Mon, Jul 27, 2020 at 11:00:14AM -0400, Mike Snitzer wrote:
> > > This mail needs to be sent to sta...@vger.kernel.org (now cc'd).
> > > 
> > > Greg et al: please backport 2df3bae9a6543e90042291707b8db0cbfbae9ee9
> > 
> > Now backported, thanks.
> 
> Nope, it broke the build, I need something that actually works :)
> 

OK, I'll defer to John Donnelly to get back with you (and rest of
stable@).  He is more invested due to SUSE also having this issue.  I
can put focus on it if John cannot sort this out.

Mike



Re: [PATCH 4.19 84/86] dm integrity: fix integrity recalculation that is improperly skipped

2020-07-27 Thread Mike Snitzer
On Mon, Jul 27 2020 at  7:31pm -0400,
Sasha Levin  wrote:

> On Mon, Jul 27, 2020 at 10:56:35PM +0200, Pavel Machek wrote:
> >Hi!
> >
> >>From: Mikulas Patocka 
> >>
> >>commit 5df96f2b9f58a5d2dc1f30fe7de75e197f2c25f2 upstream.
> >>
> >>Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended
> >>device during destroy") broke integrity recalculation.
> >>
> >>The problem is dm_suspended() returns true not only during suspend,
> >>but also during resume. So this race condition could occur:
> >>1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
> >>2. integrity_recalc (>recalc_work) preempts the current thread
> >>3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
> >>4. integrity_recalc exits and no recalculating is done.
> >>
> >>To fix this race condition, add a function dm_post_suspending that is
> >>only true during the postsuspend phase and use it instead of
> >>dm_suspended().
> >>
> >>Signed-off-by: Mikulas Patocka 
> >
> >Something is wrong with signoff here...
> 
> Heh, and the same thing happened with the stable tag:
> 
>   Cc: stable vger kernel org # v4.18+
> 
> But given that this is the way the upstream commit looks like we can't
> do much here.

Hmm, not sure what happened on the Signed-off-by and Cc for commit
5df96f2b9f.  Sorry about this!



Re: (resend) [PATCH [linux-4.14.y]] dm cache: submit writethrough writes in parallel to origin and cache

2020-07-27 Thread Mike Snitzer
On Mon, Jul 27 2020 at  4:17pm -0400,
Sasha Levin  wrote:

> On Mon, Jul 27, 2020 at 02:38:52PM -0500, John Donnelly wrote:
> >
> >
> >>On Jul 27, 2020, at 2:18 PM, Sasha Levin  wrote:
> >>
> >>On Mon, Jul 27, 2020 at 11:00:14AM -0400, Mike Snitzer wrote:
> >>>This mail needs to be sent to sta...@vger.kernel.org (now cc'd).
> >>>
> >>>Greg et al: please backport 2df3bae9a6543e90042291707b8db0cbfbae9ee9
> >>
> >>Hm, what's the issue that this patch addresses? It's not clear from the
> >>commit message.
> >>
> >>--
> >>Thanks,
> >>Sasha
> >
> >HI Sasha ,
> >
> >In an off-line conversation I had with Mike , he indicated that :
> >
> >
> >commit 1b17159e52bb31f982f82a6278acd7fab1d3f67b
> >Author: Mike Snitzer 
> >Date:   Fri Feb 28 18:00:53 2020 -0500
> >
> >  dm bio record: save/restore bi_end_io and bi_integrity
> >
> >
> >commit 248aa2645aa7fc9175d1107c2593cc90d4af5a4e
> >Author: Mike Snitzer 
> >Date:   Fri Feb 28 18:11:53 2020 -0500
> >
> >  dm integrity: use dm_bio_record and dm_bio_restore
> >
> >
> >Were picked up  in  "stable" kernels picked up even though
> >neither was marked for sta...@vger.kernel.org
> >
> >Adding this missing  commit :
> >
> >2df3bae9a6543e90042291707b8db0cbfbae9ee9
> >
> >
> >Completes the series
> 
> Should we just revert those two commits instead if they're not needed?

I'd be fine with that, except I haven't looked to see whether any
other stable commits conflict with reverting them.

But you have my blessing to give it a shot ;)

Mike



Re: (resend) [PATCH [linux-4.14.y]] dm cache: submit writethrough writes in parallel to origin and cache

2020-07-27 Thread Mike Snitzer
This mail needs to be sent to sta...@vger.kernel.org (now cc'd).

Greg et al: please backport 2df3bae9a6543e90042291707b8db0cbfbae9ee9

Thanks,
Mike

On Mon, Jul 27 2020 at  9:40am -0400,
John Donnelly  wrote:

> From: Mike Snitzer 
> 
> Discontinue issuing writethrough write IO in series to the origin and
> then cache.
> 
> Use bio_clone_fast() to create a new origin clone bio that will be
> mapped to the origin device and then bio_chain() it to the bio that gets
> remapped to the cache device. The origin clone bio does _not_ have a
> copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
> be called.
> 
> The cache bio (parent bio) will not complete until the origin bio has
> completed -- this fulfills bio_clone_fast()'s requirements as well as
> the requirement to not complete the original IO until the write IO has
> completed to both the origin and cache device.
> 
> Signed-off-by: Mike Snitzer 
> 
> (cherry picked from commit 2df3bae9a6543e90042291707b8db0cbfbae9ee9)
> 
> Fixes: 705559706d62038b74c5088114c1799cf2c9dce8 (dm bio record:
> save/restore bi_end_io and bi_integrity, version 4.14.188)
> 
> 70555970 introduced a mkfs.ext4 hang on a LVM device that has been
> modified with lvconvert --cachemode=writethrough.
> 
> Signed-off-by: John Donnelly 
> Tested-by: John Donnelly 
> Reviewed-by: Somasundaram Krishnasamy 
> 
> conflict: drivers/md/dm-cache-target.c - Corrected syntax of
> writethrough_mode(&cache->features) that was caught by
> arm compiler.
> 
> cc: sta...@vger.kernel.org
> cc: snit...@redhat.com
> ---
> drivers/md/dm-cache-target.c | 54 
> 1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
> index 69cdb29ef6be..8241b7c36655 100644
> --- a/drivers/md/dm-cache-target.c
> +++ b/drivers/md/dm-cache-target.c
> @@ -450,6 +450,7 @@ struct cache {
> struct work_struct migration_worker;
> struct delayed_work waker;
> struct dm_bio_prison_v2 *prison;
> + struct bio_set *bs;
> mempool_t *migration_pool;
> @@ -868,16 +869,23 @@ static void check_if_tick_bio_needed(struct
> cache *cache, struct bio *bio)
> spin_unlock_irqrestore(&cache->lock, flags);
> }
> -static void remap_to_origin_clear_discard(struct cache *cache,
> struct bio *bio,
> - dm_oblock_t oblock)
> +static void __remap_to_origin_clear_discard(struct cache *cache,
> struct bio *bio,
> + dm_oblock_t oblock, bool bio_has_pbd)
> {
> - // FIXME: this is called way too much.
> - check_if_tick_bio_needed(cache, bio);
> + if (bio_has_pbd)
> + check_if_tick_bio_needed(cache, bio);
> remap_to_origin(cache, bio);
> if (bio_data_dir(bio) == WRITE)
> clear_discard(cache, oblock_to_dblock(cache, oblock));
> }
> +static void remap_to_origin_clear_discard(struct cache *cache,
> struct bio *bio,
> + dm_oblock_t oblock)
> +{
> + // FIXME: check_if_tick_bio_needed() is called way too much
> through this interface
> + __remap_to_origin_clear_discard(cache, bio, oblock, true);
> +}
> +
> static void remap_to_cache_dirty(struct cache *cache, struct bio *bio,
> dm_oblock_t oblock, dm_cblock_t cblock)
> {
> @@ -971,23 +979,25 @@ static void writethrough_endio(struct bio *bio)
> }
> /*
> - * FIXME: send in parallel, huge latency as is.
> * When running in writethrough mode we need to send writes to clean blocks
> - * to both the cache and origin devices. In future we'd like to clone the
> - * bio and send them in parallel, but for now we're doing them in
> - * series as this is easier.
> + * to both the cache and origin devices. Clone the bio and send
> them in parallel.
> */
> -static void remap_to_origin_then_cache(struct cache *cache, struct
> bio *bio,
> - dm_oblock_t oblock, dm_cblock_t cblock)
> +static void remap_to_origin_and_cache(struct cache *cache, struct bio *bio,
> + dm_oblock_t oblock, dm_cblock_t cblock)
> {
> - struct per_bio_data *pb = get_per_bio_data(bio, PB_DATA_SIZE_WT);
> + struct bio *origin_bio = bio_clone_fast(bio, GFP_NOIO, cache->bs);
> - pb->cache = cache;
> - pb->cblock = cblock;
> - dm_hook_bio(&pb->hook_info, bio, writethrough_endio, NULL);
> - dm_bio_record(&pb->bio_details, bio);
> + BUG_ON(!origin_bio);
> - remap_to_origin_clear_discard(pb->cache, bio, oblock);
> + bio_chain(origin_bio, bio);
> + /*
> + * Passing false to __remap_to_origin_clear_discard() skips
> + * all code that might use per_bio_data (since clone doesn't have it)
> + */
> + __remap_to_origin_clear_discard(cache, origin_bio, oblock, false);
> + submit_bio(origin_bio);
> +
> + remap_to_cache(cache, bio, cblock);
> }
> /*-

Re: [PATCH 01/14] dm: use bio_uninit instead of bio_disassociate_blkg

2020-07-08 Thread Mike Snitzer
On Sat, Jun 27 2020 at  3:31am -0400,
Christoph Hellwig  wrote:

> bio_uninit is the proper API to clean up a BIO that has been allocated
> on stack or inside a structure that doesn't come from the BIO allocator.
> Switch dm to use that instead of bio_disassociate_blkg, which really is
> an implementation detail.  Note that the bio_uninit calls are also moved
> to the two callers of __send_empty_flush, so that they better pair with
> the bio_init calls used to initialize them.
> 
> Signed-off-by: Christoph Hellwig 

I've picked this up as a fix for 5.8

Thanks,
Mike



Re: [PATCH 10/20] dm: stop using ->queuedata

2020-07-01 Thread Mike Snitzer
On Wed, Jul 01 2020 at  4:59am -0400,
Christoph Hellwig  wrote:

> Instead of setting up the queuedata as well just use one private data
> field.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Mike Snitzer 



Re: remove dead bdi congestion leftovers

2020-07-01 Thread Mike Snitzer
On Wed, Jul 01 2020 at  5:06am -0400,
Christoph Hellwig  wrote:

> Hi Jens,
> 
> we have a lot of bdi congestion related code that is left around without
> any use.  This series removes it in preparation of sorting out the bdi
> lifetime rules properly.

I could do some git archeology to see what the fs, mm and block core
changes were to stop using bdi congested but a pointer to associated
changes (or quick recap) would save me some time.

Also, curious to know how back-pressure should be felt back up the IO
stack now? (apologies if these are well worn topics, I haven't been
tracking this area of development).

Thanks,
Mike



Re: dm writecache: reject asynchronous pmem.

2020-06-30 Thread Mike Snitzer
On Tue, Jun 30 2020 at 10:10am -0400,
Michal Suchánek  wrote:

> On Tue, Jun 30, 2020 at 09:32:01AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 30 Jun 2020, Michal Suchanek wrote:
> > 
> > > The writecache driver does not handle asynchronous pmem. Reject it when
> > > supplied as cache.
> > > 
> > > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> > > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> > > 
> > > Signed-off-by: Michal Suchanek 
> > > ---
> > >  drivers/md/dm-writecache.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> > > index 30505d70f423..57b0a972f6fd 100644
> > > --- a/drivers/md/dm-writecache.c
> > > +++ b/drivers/md/dm-writecache.c
> > > @@ -2277,6 +2277,12 @@ static int writecache_ctr(struct dm_target *ti, 
> > > unsigned argc, char **argv)
> > >  
> > >   wc->memory_map_size -= (uint64_t)wc->start_sector << 
> > > SECTOR_SHIFT;
> > >  
> > > + if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> > > + r = -EOPNOTSUPP;
> > > + ti->error = "Asynchronous persistent memory not 
> > > supported as pmem cache";
> > > + goto bad;
> > > + }
> > > +
> > >   bio_list_init(&wc->flush_list);
> > >   wc->flush_thread = kthread_create(writecache_flush_thread, wc, 
> > > "dm_writecache_flush");
> > >   if (IS_ERR(wc->flush_thread)) {
> > > -- 
> > 
> > Hi
> > 
> > Shouldn't this be in the "if (WC_MODE_PMEM(wc))" block?
> That should always be the case at this point.
> > 
> > WC_MODE_PMEM(wc) returns true if we are using persistent memory as a 
> > cache device, otherwise we are using generic block device as a cache 
> > device.
>
> This is to prevent the situation where we have WC_MODE_PMEM(wc) but
> cannot guarantee consistency because the async flush is not handled.

The writecache operates in 2 modes: SSD or PMEM.  Mikulas is saying
your dax_synchronous() check should go within a WC_MODE_PMEM(wc) block
because it doesn't make sense to do the check when in SSD mode.

Mike



Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Mike Snitzer
On Wed, Jun 24 2020 at 12:54am -0400,
Damien Le Moal  wrote:

> On 2020/06/24 0:23, Mike Snitzer wrote:
> > On Tue, Jun 23 2020 at 11:07am -0400,
> > Ignat Korchagin  wrote:
> > 
> >> Do you think it may be better to break it in two flags: one for read
> >> path and one for write? So, depending on the needs and workflow these
> >> could be enabled independently?
> > 
> > If there is a need to split, then sure.  But I think Damien had a hard
> > requirement that writes had to be inlined but that reads didn't _need_
> > to be for his dm-zoned usecase.  Damien may not yet have assessed the
> > performance implications of not having reads inlined as much as you
> > have.
> 
> We did do performance testing :)
> The results are mixed and performance differences between inline vs workqueues
> depend on the workload (IO size, IO queue depth and number of drives being 
> used
> mostly). In many cases, inlining everything does really improve performance as
> Ignat reported.
> 
> In our testing, we used hard drives and so focused mostly on throughput rather
> than command latency. The added workqueue context switch overhead and crypto
> work latency compared to typical HDD IO times is small, and significant
> only if the backend storage has short IO times.
> 
> In the case of HDDs, especially for large IO sizes, inlining crypto work does
> not shine as it prevents an efficient use of CPU resources. This is especially
> true with reads on a large system with many drives connected to a single HBA:
> the softirq context decryption work does not lend itself well to using other
> CPUs that did not receive the HBA IRQ signaling command completions. The test
> results clearly show much higher throughputs using dm-crypt as is.
> 
> On the other hand, inlining crypto work significantly improves workloads of
> small random IOs, even for a large number of disks: removing the overhead of
> context switches allows faster completions, allowing sending more requests to
> the drives more quickly, keeping them busy.
> 
> For SMR, the inlining of write requests is *mandatory* to preserve the issuer
> write sequence, but encryption work being done in the issuer context (writes
> to SMR drives can only be O_DIRECT writes), efficient CPU resource usage can
> be achieved by simply using multiple writer threads/processes, working on
> different zones of different disks. This is a very reasonable model for SMR
> as writes into a single zone have to be done under mutual exclusion to
> ensure sequentiality.
> 
> For reads, SMR drives are essentially exactly the same as regular disks, so
> as-is or inline are both OK. Based on our performance results, allowing the
> user to have the choice of inlining reads or not, based on the target
> workload, would be great.
> 
> Of note is that zone append writes (emulated in SCSI, native with NVMe) are
> not subject to the sequential write constraint, so they can also be executed
> either inline or asynchronously.
> 
> > So let's see how Damien's work goes and if he truly doesn't need/want
> > reads to be inlined then 2 flags can be created.
> 
> For SMR, I do not need inline reads, but I do want the user to have the
> possibility of using this setup as that can provide better performance for
> some workloads. I think that splitting the inline flag in 2 is exactly what
> we want:
> 
> 1) For SMR, the write-inline flag can be automatically turned on when the
> target device is created if the backend device used is a host-managed zoned
> drive (scsi or NVMe ZNS). For reads, it would be the user's choice, based on
> the target workload.
> 2) For regular block devices, write-inline only, read-inline only or both
> would be the user's choice, to optimize for their target workload.
> 
> With the split into 2 flags, my SMR support patch becomes very simple.

OK, thanks for all the context.  Was a fun read ;)

So let's run with splitting into 2 flags.  Ignat, would you be up to
tweaking your patch to provide that and post a v2?

An added bonus would be to consolidate your 0/1 and 1/1 patch headers,
and add in the additional answers you provided in this thread to help
others understand the patch (mainly some more detail about why tasklet
is used).

Thanks,
Mike



Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Mike Snitzer
On Fri, Jun 19 2020 at  9:23pm -0400,
Herbert Xu  wrote:

> On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> >
> > I'm looking at this and I'd like to know why does the crypto API fail in 
> > hard-irq context and why does it work in tasklet context. What's the exact 
> > reason behind this?
> 
> You're not supposed to do any real work in IRQ handlers.  All
> the substantial work should be postponed to softirq context.
> 
> Why do you need to do work in hard IRQ context?

Ignat, think you may have missed Herbert's question?

My understanding is that you're doing work in hard IRQ context (via
tasklet) precisely to avoid the overhead of punting to a workqueue?  Did
you try using a workqueue and it didn't work adequately?  If so, do you
have a handle on why that is?  E.g. was it due to increased latency?  Or
IO completion occurring on different cpus than the ones that submitted
(are you leaning heavily on blk-mq's ability to pin IO completion to the
same cpu the IO was submitted from?)

I'm fishing here but I'd just like to tease out the details for _why_
you need to do work from hard IRQ via tasklet so that I can potentially
defend it if needed.

Thanks,
Mike



Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Mike Snitzer
On Tue, Jun 23 2020 at 11:07am -0400,
Ignat Korchagin  wrote:

> Do you think it may be better to break it in two flags: one for read
> path and one for write? So, depending on the needs and workflow these
> could be enabled independently?

If there is a need to split, then sure.  But I think Damien had a hard
requirement that writes had to be inlined but that reads didn't _need_
to be for his dm-zoned usecase.  Damien may not yet have assessed the
performance implications of not having reads inlined as much as you
have.

So let's see how Damien's work goes and if he truly doesn't need/want
reads to be inlined then 2 flags can be created.

Mike



Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-23 Thread Mike Snitzer
On Sun, Jun 21 2020 at  8:45pm -0400,
Damien Le Moal  wrote:

> On 2020/06/20 1:56, Mike Snitzer wrote:
> > On Fri, Jun 19 2020 at 12:41pm -0400,
> > Ignat Korchagin  wrote:
> > 
>> This is a follow up from the long-forgotten [1], but with some more
>> convincing evidence. Consider the following script:
> >>
> >> #!/bin/bash -e
> >>
> >> # create 4G ramdisk
> >> sudo modprobe brd rd_nr=1 rd_size=4194304
> >>
> >> # create a dm-crypt device with NULL cipher on top of /dev/ram0
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup create eram0
> >>
> >> # create a dm-crypt device with NULL cipher and custom force_inline flag
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | sudo dmsetup create inline-eram0
> >>
> >> # read all data from /dev/ram0
> >> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/eram0
> >> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> >>
> >> # read the same data from /dev/mapper/inline-eram0
> >> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> >>
> >> This script creates a ramdisk (to eliminate hardware bias in the 
> >> benchmark) and
> >> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> >> to eliminate potentially expensive crypto bias (the NULL cipher just uses 
> >> memcpy
> >> for "encryption"). The first instance is the current dm-crypt 
> >> implementation from
> >> 5.8-rc1, the second is the dm-crypt instance with a custom new flag 
> >> enabled from
> >> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 
> >> cores
> >> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted 
> >> for
> >> better readability):
> >>
> >> # plain ram0
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # eram0 (current dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> # inline-eram0 (patched dm-crypt)
> >> 1048576+0 records in
> >> 1048576+0 records out
> >> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> >> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> >>
> >> As we can see, current dm-crypt implementation creates a significant IO
> >> performance overhead (at least on small IO block sizes) for both latency 
> >> and
> >> throughput. We suspect offloading IO request processing into workqueues and
> >> async threads is more harmful these days with the modern fast storage. I 
> >> also
> >> did some digging into the dm-crypt git history and much of this async 
> >> processing
> >> is not needed anymore, because the reasons it was added are mostly gone 
> >> from the
> >> kernel. More details can be found in [2] (see "Git archeology" section).
> >>
> >> We have been running the attached patch on different hardware generations 
> >> in
> >> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were 
> >> very
> >> happy with the performance benefits.
> >>
> >> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> >> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> >>
> >> Ignat Korchagin (1):
> >>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> >>
> >>  drivers/md/dm-crypt.c | 55 +--
> >>  1 file changed, 43 insertions(+), 12 deletions(-)
> >>
> >> -- 
> >> 2.20.1
> >>
> > 
> > Hi,
> > 
> > I saw [2] and have been expecting something from cloudflare ever since.
> > Nice to see this submission.
> > 
> > There is useful context in your 0th patch header.  I'll likely merge
> > parts of this patch header with the more terse 1/1 header (reality is
> > there only needed to be a single patch submission).
> > 
> > Will review and stage accordingly if all looks fine to me.  Mikulas,
> > please have a look too.

Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-19 Thread Mike Snitzer
On Fri, Jun 19 2020 at 12:41pm -0400,
Ignat Korchagin  wrote:

> This is a follow up from the long-forgotten [1], but with some more convincing
> evidence. Consider the following script:
> 
> #!/bin/bash -e
> 
> # create 4G ramdisk
> sudo modprobe brd rd_nr=1 rd_size=4194304
> 
> # create a dm-crypt device with NULL cipher on top of /dev/ram0
> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | sudo dmsetup 
> create eram0
> 
> # create a dm-crypt device with NULL cipher and custom force_inline flag
> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 force_inline' | 
> sudo dmsetup create inline-eram0
> 
> # read all data from /dev/ram0
> sudo dd if=/dev/ram0 bs=4k iflag=direct | sha256sum
> 
> # read the same data from /dev/mapper/eram0
> sudo dd if=/dev/mapper/eram0 bs=4k iflag=direct | sha256sum
> 
> # read the same data from /dev/mapper/inline-eram0
> sudo dd if=/dev/mapper/inline-eram0 bs=4k iflag=direct | sha256sum
> 
> This script creates a ramdisk (to eliminate hardware bias in the benchmark) 
> and
> two dm-crypt instances on top. Both dm-crypt instances use the NULL cipher
> to eliminate potentially expensive crypto bias (the NULL cipher just uses 
> memcpy
> for "encryption"). The first instance is the current dm-crypt implementation 
> from
> 5.8-rc1, the second is the dm-crypt instance with a custom new flag enabled 
> from
> the patch attached to this thread. On my VM (Debian in VirtualBox with 4 cores
> on 2.8 GHz Quad-Core Intel Core i7) I get the following output (formatted for
> better readability):
> 
> # plain ram0
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.2305 s, 202 MB/s
> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> 
> # eram0 (current dm-crypt)
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 53.2212 s, 80.7 MB/s
> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> 
> # inline-eram0 (patched dm-crypt)
> 1048576+0 records in
> 1048576+0 records out
> 4294967296 bytes (4.3 GB, 4.0 GiB) copied, 21.3472 s, 201 MB/s
> 8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca  -
> 
> As we can see, current dm-crypt implementation creates a significant IO
> performance overhead (at least on small IO block sizes) for both latency and
> throughput. We suspect offloading IO request processing into workqueues and
> async threads is more harmful these days with the modern fast storage. I also
> did some digging into the dm-crypt git history and much of this async 
> processing
> is not needed anymore, because the reasons it was added are mostly gone from 
> the
> kernel. More details can be found in [2] (see "Git archeology" section).
> 
> We have been running the attached patch on different hardware generations in
> more than 200 datacentres on both SATA SSDs and NVME SSDs and so far were very
> happy with the performance benefits.
> 
> [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> 
> Ignat Korchagin (1):
>   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> 
>  drivers/md/dm-crypt.c | 55 +--
>  1 file changed, 43 insertions(+), 12 deletions(-)
> 
> -- 
> 2.20.1
> 

Hi,

I saw [2] and have been expecting something from cloudflare ever since.
Nice to see this submission.

There is useful context in your 0th patch header.  I'll likely merge
parts of this patch header with the more terse 1/1 header (reality is
there only needed to be a single patch submission).

Will review and stage accordingly if all looks fine to me.  Mikulas,
please have a look too.

Thanks,
Mike



Re: [PATCH 21/29] docs: device-mapper: add dm-ebs.rst to an index file

2020-06-19 Thread Mike Snitzer
On Mon, Jun 15 2020 at  2:47am -0400,
Mauro Carvalho Chehab  wrote:

> Solves this Sphinx warning:
>   Documentation/admin-guide/device-mapper/dm-ebs.rst: WARNING: document 
> isn't included in any toctree
> 
> Signed-off-by: Mauro Carvalho Chehab 
> ---
>  Documentation/admin-guide/device-mapper/index.rst | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/admin-guide/device-mapper/index.rst 
> b/Documentation/admin-guide/device-mapper/index.rst
> index ec62fcc8eece..6cf8adc86fa8 100644
> --- a/Documentation/admin-guide/device-mapper/index.rst
> +++ b/Documentation/admin-guide/device-mapper/index.rst
> @@ -11,6 +11,7 @@ Device Mapper
>  dm-clone
>  dm-crypt
>  dm-dust
> +dm-ebs
>  dm-flakey
>  dm-init
>  dm-integrity
> -- 
> 2.26.2
> 

Didn't see this fix staged in linux-next so I've picked it up for 5.8.

Thanks,
Mike



Re: New mode DM-Verity error handling

2020-06-18 Thread Mike Snitzer
On Thu, Jun 18 2020 at 12:50pm -0400,
Sami Tolvanen  wrote:

> On Thu, Jun 18, 2020 at 11:44:45AM -0400, Mike Snitzer wrote:
> > I do not accept that panicing the system because of verity failure is
> > reasonable.
> > 
> > In fact, even rebooting (via DM_VERITY_MODE_RESTART) looks very wrong.
> > 
> > The device should be put in a failed state and left for admin recovery.
> 
> That's exactly how the restart mode works on some Android devices. The
> bootloader sees the verification error and puts the device in recovery
> mode. Using the restart mode on systems without firmware support won't
> make sense, obviously.

OK, so I need further justification from Samsung why they are asking for
this panic mode.

Thanks,
Mike



Re: New mode DM-Verity error handling

2020-06-18 Thread Mike Snitzer
On Thu, Jun 18 2020 at  2:56am -0400,
JeongHyeon Lee  wrote:

> Hello, dear device-mapper maintainers.
> 
> I'm JeongHyeon Lee; I work at Samsung. I'm in charge of the DM-Verity 
> feature together with Mr. Sunwook Eom, and I have a patch/suggestion 
> about DM-Verity error handling.
> 
> Our device (a smartphone) needs the DM-Verity feature, so I hope there 
> can be a new mode of DM-Verity error handling.
> The concept of this new mode is: when a corrupted block is detected, the 
> device panics.
> 
> This is because our team's policy is that when a DM-Verity error is 
> found on a device, it should panic, and we then analyze what kind of 
> device fault occurred (UFS crash, IO error, DRAM bit flip, etc.).
> 
> In addition to the smart phone, I would like to have an option that 
> users or administrators can use accordingly.
> There are patch contents in the attachment. I would really appreciate it 
> if you could check it.
> 
> I will look forward to hearing from yours.
> Thank you :)
> 

I do not accept that panicing the system because of verity failure is
reasonable.

In fact, even rebooting (via DM_VERITY_MODE_RESTART) looks very wrong.

The device should be put in a failed state and left for admin recovery.

Mike



Re: dm writecache: correct uncommitted_block when discarding uncommitted entry

2020-06-17 Thread Mike Snitzer
On Sat, Jun 13 2020 at  8:40am -0400,
Mikulas Patocka  wrote:

> 
> 
> On Fri, 12 Jun 2020, Huaisheng Ye wrote:
> 
> > From: Huaisheng Ye 
> > 
> > When uncommitted entry has been discarded, correct wc->uncommitted_block
> > for getting the exact number.
> > 
> > Signed-off-by: Huaisheng Ye 
> 
> Acked-by: Mikulas Patocka 
> 
> Also, add:
> Cc: sta...@vger.kernel.org

I picked this up for 5.8 but I inverted the patch order because this
stable@ fix was dependent on the prior patch to skip waiting if in pmem
mode (due to locality of changes within same function).

See:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.8=39495b12ef1cf602e6abd350dce2ef4199906531

Mike



Re: dm ioctl: Use struct_size() helper

2020-06-17 Thread Mike Snitzer
On Tue, Jun 16 2020 at  6:06pm -0400,
Gustavo A. R. Silva  wrote:

> Hi all,
> 
> Friendly ping: who can take this?
> 
> It's been almost a year... and I just noticed there was a problem
> with the email addresses back then...
> 
> I just fixed the issue and this patch should now appear on
> dm-de...@redhat.com and LKML.

I don't see any resubmit from you on either list.  But I've applied the
fix by hand and attributed it to you.

Thanks,
Mike


> On 8/28/19 13:38, Gustavo A. R. Silva wrote:
> > One of the more common cases of allocation size calculations is finding
> > the size of a structure that has a zero-sized array at the end, along
> > with memory for some number of elements for that array. For example:
> > 
> > struct dm_target_deps {
> > ...
> > __u64 dev[0];   /* out */
> > };
> > 
> > Make use of the struct_size() helper instead of an open-coded version
> > in order to avoid any potential type mistakes.
> > 
> > So, replace the following form:
> > 
> > sizeof(*deps) + (sizeof(*deps->dev) * count)
> > 
> > with:
> > 
> > struct_size(deps, dev, count)
> > 
> > This code was detected with the help of Coccinelle.
> > 
> > Signed-off-by: Gustavo A. R. Silva 
> > ---
> >  drivers/md/dm-ioctl.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> > index fb6f8fb1f13d..b2d52cec70d4 100644
> > --- a/drivers/md/dm-ioctl.c
> > +++ b/drivers/md/dm-ioctl.c
> > @@ -1446,7 +1446,7 @@ static void retrieve_deps(struct dm_table *table,
> > /*
> >  * Check we have enough space.
> >  */
> > -   needed = sizeof(*deps) + (sizeof(*deps->dev) * count);
> > +   needed = struct_size(deps, dev, count);
> > if (len < needed) {
> > param->flags |= DM_BUFFER_FULL_FLAG;
> > return;
> > 
> 



Re: [PATCH][next] dm zoned: fix memory leak of newly allocated zone on xa_insert failure

2020-06-03 Thread Mike Snitzer
On Wed, Jun 03 2020 at 12:02pm -0400,
Colin King  wrote:

> From: Colin Ian King 
> 
> Currently if an xa_insert fails then there is a memory leak of the
> recently allocated zone object. Fix this by kfree'ing zone before
> returning on the error return path.
> 
> Addresses-Coverity: ("Resource leak")
> Fixes: 1a311efa3916 ("dm zoned: convert to xarray")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/md/dm-zoned-metadata.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
> index b23ff090c056..130b5a6d9f12 100644
> --- a/drivers/md/dm-zoned-metadata.c
> +++ b/drivers/md/dm-zoned-metadata.c
> @@ -313,8 +313,10 @@ static struct dm_zone *dmz_insert(struct dmz_metadata 
> *zmd,
>   if (!zone)
>   return ERR_PTR(-ENOMEM);
>  
> - if (xa_insert(&zmd->zones, zone_id, zone, GFP_KERNEL))
> + if (xa_insert(&zmd->zones, zone_id, zone, GFP_KERNEL)) {
> + kfree(zone);
>   return ERR_PTR(-EBUSY);
> + }
>  
>   INIT_LIST_HEAD(&zone->link);
>   atomic_set(&zone->refcount, 0);
> -- 
> 2.25.1
> 

Thanks, I folded this in.



Re: next-20200514 - build issue in drivers/md/dm-zoned-target.c

2020-05-18 Thread Mike Snitzer
On Mon, May 18 2020 at  2:25am -0400,
Hannes Reinecke  wrote:

> On 5/16/20 1:19 PM, Valdis Klētnieks wrote:
> >Am seeing a build error in next-0514.  -0420 built OK.
> >building a 'make allmodconfig' on a RPi4 in 32-bit mode.
> >
> >   MODPOST 7575 modules
> >ERROR: modpost: "__aeabi_uldivmod" [drivers/md/dm-zoned.ko] undefined!
> >
> >objdump and 'make drivers/md/dm-zoned-target.s' tells
> >me that the problem is in function dmz_fixup_devices(), near here:
> >
> >@ drivers/md/dm-zoned-target.c:806: reg_dev->nr_zones = 
> >DIV_ROUND_UP(reg_dev->capacity,
> > ldr r0, [r6, #56]   @ reg_dev_166->capacity, 
> > reg_dev_166->capacity
> > adds r1, r3, r1  @ tmp316, _227, reg_dev_166->capacity
> > adc r0, r2, r0  @ tmp315, _227, reg_dev_166->capacity
> > subs r1, r1, #1  @, tmp316,
> >@ drivers/md/dm-zoned-target.c:805: reg_dev->zone_nr_sectors = 
> >zoned_dev->zone_nr_sectors;
> > strd r2, [r6, #80]   @, reg_dev,
> >@ drivers/md/dm-zoned-target.c:806: reg_dev->nr_zones = 
> >DIV_ROUND_UP(reg_dev->capacity,
> > sbc r0, r0, #0  @, tmp315,
> > bl  __aeabi_uldivmod@
> >@ drivers/md/dm-zoned-target.c:806: reg_dev->nr_zones = 
> >DIV_ROUND_UP(reg_dev->capacity,
> > str r1, [r6, #64]   @ tmp306, reg_dev_166->nr_zones
> >
> >git blame points at this commit:
> >
> >commit 70978208ec91d798066f4c291bc98ff914bea222
> >Author: Hannes Reinecke 
> >Date:   Mon May 11 10:24:30 2020 +0200
> >
> > dm zoned: metadata version 2
> >
> >Reverting that commit lets the build complete.
> >
> >
> I thought I've send a patch to fix that up; DIV_ROUND_UP() needs to
> be changed to DIV_ROUND_UP_ULL().
> I'll be checking and will be sending a patch if necessary.

Unless I'm missing something it was fixed up with this commit last
wednesday (13th):

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.8=81a3a1453ec4e5da081e1395732801a600feb352




Re: [PATCH v4 0/2] Historical Service Time Path Selector

2020-05-11 Thread Mike Snitzer
On Mon, May 11 2020 at  1:31pm -0400,
Mike Snitzer  wrote:

> On Mon, May 11 2020 at  1:11pm -0400,
> Gabriel Krisman Bertazi  wrote:
> 
> > Mike Snitzer  writes:
> > 
> > > On Mon, May 11 2020 at 12:39pm -0400,
> > > Gabriel Krisman Bertazi  wrote:
> > >
> > >> Hi,
> > >> 
> > >> This fourth version of HST applies the suggestion from Mikulas Patocka
> > >> to do the ktime_get_ns inside the mpath map_bio instead of generic
> > >> device-mapper code. This means that struct dm_mpath_io gained another
> > >> 64bit field.  For the request-based case, we continue to use the block
> > >> layer start time information.
> > >> 
> > >> With this modification, I was able to obtain similar performance on BIO
> > >> to request-based multipath with HST on the benchmarks shared in v1.
> > >> 
> > >> v3: https://www.redhat.com/archives/dm-devel/2020-April/msg00308.html
> > >> v2: https://www.redhat.com/archives/dm-devel/2020-April/msg00270.html
> > >> v1: https://www.redhat.com/archives/dm-devel/2020-April/msg00176.html
> > >
> > > I already staged your v3 in linux-next.  Please provide an incremental
> > > patch that layers on this git branch:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.8
> > >
> > > I was hopeful for a flag to be set (e.g. in 'struct path_selector') to
> > > reflect whether the path selector expects highres start_time.  Makes
> > > little sense to incur that extra cost of providing the time if the path
> > > selector doesn't even use it.
> > >
> > > Alternatively, could split out the setting of the time needed by .end_io
> > > to a new path_selector_type method (e.g. .set_start_time).  And then
> > > only use ktime_get_ns() for bio-based if .set_start_time is defined.
> > > Would get a little fiddly needing to make sure a stale start_time isn't
> > > used... also, makes more sense to conditionally call this
> > > .set_start_time just after .start_io is.
> > 
> > Oh, my apologies, I hadn't noticed it was merged.  I will make the time 
> > fetch
> > conditional and submit a new patch based on that branch.
> 
> I don't want to waste your time so please don't run with that idea just yet.
> 
> There is a possibility we really _do_ need higher resolution time.
> 
> I'm about to have a concall to discuss some disk IO stat issues with DM
> disk stats vs NVMe disk stats (provided by block core).
> 
> I'll let you know the outcome and we can discuss further.

OK, that concall's issue had nothing to do with needing higher
resolution time (it was about IOPs realized with request-based vs
bio-based).

Reality is, DM won't need anything higher resolution than jiffies until
block core's interfaces require something other than jiffies
(e.g. generic_end_io_acct).

So feel free to proceed with the conditional time fetch solution you
were going to run with (prior to my previous mail asking you to hold
off).

Sorry for the noise.  Thanks,
Mike



Re: [PATCH v4 0/2] Historical Service Time Path Selector

2020-05-11 Thread Mike Snitzer
On Mon, May 11 2020 at  1:11pm -0400,
Gabriel Krisman Bertazi  wrote:

> Mike Snitzer  writes:
> 
> > On Mon, May 11 2020 at 12:39pm -0400,
> > Gabriel Krisman Bertazi  wrote:
> >
> >> Hi,
> >> 
> >> This fourth version of HST applies the suggestion from Mikulas Patocka
> >> to do the ktime_get_ns inside the mpath map_bio instead of generic
> >> device-mapper code. This means that struct dm_mpath_io gained another
> >> 64bit field.  For the request-based case, we continue to use the block
> >> layer start time information.
> >> 
> >> With this modification, I was able to obtain similar performance on BIO
> >> to request-based multipath with HST on the benchmarks shared in v1.
> >> 
> >> v3: https://www.redhat.com/archives/dm-devel/2020-April/msg00308.html
> >> v2: https://www.redhat.com/archives/dm-devel/2020-April/msg00270.html
> >> v1: https://www.redhat.com/archives/dm-devel/2020-April/msg00176.html
> >
> > I already staged your v3 in linux-next.  Please provide an incremental
> > patch that layers on this git branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.8
> >
> > I was hopeful for a flag to be set (e.g. in 'struct path_selector') to
> > reflect whether the path selector expects highres start_time.  Makes
> > little sense to incur that extra cost of providing the time if the path
> > selector doesn't even use it.
> >
> > Alternatively, could split out the setting of the time needed by .end_io
> > to a new path_selector_type method (e.g. .set_start_time).  And then
> > only use ktime_get_ns() for bio-based if .set_start_time is defined.
> > Would get a little fiddly needing to make sure a stale start_time isn't
> > used... also, makes more sense to conditionally call this
> > .set_start_time just after .start_io is.
> 
> Oh, my apologies, I hadn't noticed it was merged.  I will make the time fetch
> conditional and submit a new patch based on that branch.

I don't want to waste your time so please don't run with that idea just yet.

There is a possibility we really _do_ need higher resolution time.

I'm about to have a concall to discuss some disk IO stat issues with DM
disk stats vs NVMe disk stats (provided by block core).

I'll let you know the outcome and we can discuss further.

Thanks,
Mike



Re: [PATCH v4 0/2] Historical Service Time Path Selector

2020-05-11 Thread Mike Snitzer
On Mon, May 11 2020 at 12:39pm -0400,
Gabriel Krisman Bertazi  wrote:

> Hi,
> 
> This fourth version of HST applies the suggestion from Mikulas Patocka
> to do the ktime_get_ns inside the mpath map_bio instead of generic
> device-mapper code. This means that struct dm_mpath_io gained another
> 64bit field.  For the request-based case, we continue to use the block
> layer start time information.
> 
> With this modification, I was able to obtain similar performance on BIO
> to request-based multipath with HST on the benchmarks shared in v1.
> 
> v3: https://www.redhat.com/archives/dm-devel/2020-April/msg00308.html
> v2: https://www.redhat.com/archives/dm-devel/2020-April/msg00270.html
> v1: https://www.redhat.com/archives/dm-devel/2020-April/msg00176.html

I already staged your v3 in linux-next.  Please provide an incremental
patch that layers on this git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.8

I was hopeful for a flag to be set (e.g. in 'struct path_selector') to
reflect whether the path selector expects highres start_time.  Makes
little sense to incur that extra cost of providing the time if the path
selector doesn't even use it.

Alternatively, could split out the setting of the time needed by .end_io
to a new path_selector_type method (e.g. .set_start_time).  And then
only use ktime_get_ns() for bio-based if .set_start_time is defined.
Would get a little fiddly needing to make sure a stale start_time isn't
used... also, makes more sense to conditionally call this
.set_start_time just after .start_io is.

Mike



Re: [PATCH -next] dm mpath: Remove unused variable ret

2020-05-07 Thread Mike Snitzer
On Thu, May 07 2020 at  8:26am -0400,
Samuel Zou  wrote:

> This patch fixes below warning reported by coccicheck:
> 
> drivers/md/dm-historical-service-time.c:240:14-16: Unneeded variable: "sz". 
> Return "0" on line 261
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Samuel Zou 

Nack.

DMEMIT() uses the local sz variable.



Re: [PATCH v2 1/1] blk-mq: Inline request status checkers

2019-09-30 Thread Mike Snitzer
On Mon, Sep 30 2019 at  3:53pm -0400,
Bart Van Assche  wrote:

> On 9/30/19 12:43 PM, Pavel Begunkov (Silence) wrote:
> > @@ -282,7 +282,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, 
> > unsigned int bitnr, void *data)
>  * test and set the bit before assigning ->rqs[].
> >  */
> > rq = tags->rqs[bitnr];
> > -   if (rq && blk_mq_request_started(rq))
> > +   if (rq && blk_mq_rq_state(rq) != MQ_RQ_IDLE)
> > return iter_data->fn(rq, iter_data->data, reserved);
> >  
> > return true;
> > @@ -360,7 +360,7 @@ static bool blk_mq_tagset_count_completed_rqs(struct 
> > request *rq,
> >  {
> > unsigned *count = data;
> >  
> > -   if (blk_mq_request_completed(rq))
> > +   if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
> > (*count)++;
> > return true;
> >  }
> 
> Changes like the above significantly reduce readability of the code in
> the block layer core. I don't like this. I think this patch is a step
> backwards instead of a step forwards.

I agree, not helpful.


Re: dax: Arrange for dax_supported check to span multiple devices

2019-05-16 Thread Mike Snitzer
On Tue, May 14 2019 at 11:48pm -0400,
Dan Williams  wrote:

> Pankaj reports that starting with commit ad428cdb525a "dax: Check the
> end of the block-device capacity with dax_direct_access()" device-mapper
> no longer allows dax operation. This results from the stricter checks in
> __bdev_dax_supported() that validate that the start and end of a
> block-device map to the same 'pagemap' instance.
> 
> Teach the dax-core and device-mapper to validate the 'pagemap' on a
> per-target basis. This is accomplished by refactoring the
> bdev_dax_supported() internals into generic_fsdax_supported() which
> takes a sector range to validate. Consequently generic_fsdax_supported()
> is suitable to be used in a device-mapper ->iterate_devices() callback.
> A new ->dax_supported() operation is added to allow composite devices to
> split and route upper-level bdev_dax_supported() requests.
> 
> Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
> Cc: 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: Dave Jiang 
> Cc: Mike Snitzer 
> Cc: Keith Busch 
> Cc: Matthew Wilcox 
> Cc: Vishal Verma 
> Cc: Heiko Carstens 
> Cc: Martin Schwidefsky 
> Reported-by: Pankaj Gupta 
> Signed-off-by: Dan Williams 
> ---
> Hi Mike,
> 
> Another day another new dax operation to allow device-mapper to better
> scope dax operations.
> 
> Let me know if the device-mapper changes look sane. This passes a new
> unit test that indeed fails on current mainline.
> 
> https://github.com/pmem/ndctl/blob/device-mapper-pending/test/dm.sh
> 
>  drivers/dax/super.c  |   88 
> +++---
>  drivers/md/dm-table.c|   17 +---
>  drivers/md/dm.c  |   20 ++
>  drivers/md/dm.h  |1 
>  drivers/nvdimm/pmem.c|1 
>  drivers/s390/block/dcssblk.c |1 
>  include/linux/dax.h  |   19 +
>  7 files changed, 110 insertions(+), 37 deletions(-)
> 

...

> diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> index 2d539b82ec08..e5e240bfa2d0 100644
> --- a/drivers/md/dm.h
> +++ b/drivers/md/dm.h
> @@ -78,6 +78,7 @@ void dm_unlock_md_type(struct mapped_device *md);
>  void dm_set_md_type(struct mapped_device *md, enum dm_queue_mode type);
>  enum dm_queue_mode dm_get_md_type(struct mapped_device *md);
>  struct target_type *dm_get_immutable_target_type(struct mapped_device *md);
> +bool dm_table_supports_dax(struct dm_table *t, int blocksize);
>  
>  int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
>  

I'd prefer to have dm_table_supports_dax come just after
dm_table_get_md_mempools in the preceding dm_table section of dm.h (just
above this mapped_device section you extended).

But other than that nit, patch looks great on a DM level:

Reviewed-by: Mike Snitzer 


Re: dm ioctl: fix hang in early create error condition

2019-05-15 Thread Mike Snitzer
On Wed, May 15 2019 at 12:12pm -0400,
Helen Koike  wrote:

> Hi,
> 
> On 5/13/19 10:37 PM, Mike Snitzer wrote:
> > On Mon, May 13 2019 at  3:25P -0400,
> > Helen Koike  wrote:
> > 
> >> The dm_early_create() function (which deals with "dm-mod.create=" kernel
> >> command line option) calls dm_hash_insert() who gets an extra reference
> >> to the md object.
> >>
> >> In case of failure, this reference wasn't being released, causing
> >> dm_destroy() to hang, thus hanging the whole boot process.
> >>
> >> Fix this by calling __hash_remove() in the error path.
> >>
> >> Fixes: 6bbc923dfcf57d ("dm: add support to directly boot to a mapped 
> >> device")
> >> Cc: sta...@vger.kernel.org
> >> Signed-off-by: Helen Koike 
> >>
> >> ---
> >> Hi,
> >>
> >> I tested this patch by adding a new test case in the following test
> >> script:
> >>
> >> https://gitlab.collabora.com/koike/dm-cmdline-test/commit/d2d7a0ee4a49931cdb59f08a837b516c2d5d743d
> >>
> >> This test was failing, but with this patch it works correctly.
> >>
> >> Thanks
> >> Helen
> > 
> > Thanks for the patch but I'd prefer the following simpler fix.  What do
> > you think?
> > 
> > That said, I can provide a follow-on patch (inspired by the patch you
> > provided) that encourages more code sharing between dm_early_create()
> > and dev_create() by factoring out __dev_create().
> 
> Sounds great.
> 
> > 
> > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> > index c740153b4e52..0eb0b462c736 100644
> > --- a/drivers/md/dm-ioctl.c
> > +++ b/drivers/md/dm-ioctl.c
> > @@ -2117,6 +2117,7 @@ int __init dm_early_create(struct dm_ioctl *dmi,
> >  err_destroy_table:
> > dm_table_destroy(t);
> >  err_destroy_dm:
> > +   (void) __hash_remove(__find_device_hash_cell(dmi));
> > dm_put(md);
> > dm_destroy(md);
> > return r;
> > 
> 
> This doesn't really work for two reasons:
> 
> 1) __find_device_hash_cell() requires a mutual exclusivity between name,
> uuid and dev. In dm_early_create(), dmi can have more then one of these.

__find_device_hash_cell's exclusivity requirements are strange; I'll try
to understand what requires this.

> 2) I can fix (1) by calling __get_name_cell(), as the name is mandatory
> anyway, but this function also grabs another reference to the md object,
> so I need to add an extra dm_put(md) there:
> 
>  err_destroy_table:
> dm_table_destroy(t);
> +err_hash_remove:
> +   (void) __hash_remove(__get_name_cell(dmi->name));
> +   dm_put(md);
>  err_destroy_dm:
> dm_put(md);
> dm_destroy(md);
> 
> 
> What do you think? Is this ok?

I think so.  Please submit a v2 and I'll rebase my followon patch
accordingly and will get it posted.

Thanks,
Mike


Re: x86 VM Boot hang with latest linux-next

2019-03-04 Thread Mike Snitzer
On Sun, Mar 03 2019 at 12:06pm -0500,
Alexander Duyck  wrote:

> On Sat, Mar 2, 2019 at 7:48 PM Mike Snitzer  wrote:
> >
> > On Sat, Mar 02 2019 at  6:34pm -0500,
> > Alexander Duyck  wrote:
> >
> > > So I have been seeing an issue with an intermittent boot hang on my
> > > x86 KVM VM with the latest linux-next and have bisected it down to the
> > > following commit:
> > > 1efa3bb79d3de8ca1b7f6770313a1fc0bebe25c7 is the first bad commit
> > > commit 1efa3bb79d3de8ca1b7f6770313a1fc0bebe25c7
> > > Author: Mike Snitzer 
> > > Date:   Fri Feb 22 11:23:01 2019 -0500
> > >
> > > dm: must allocate dm_noclone for stacked noclone devices
> > >
> > > Otherwise various lvm2 testsuite tests fail because the lower layers 
> > > of
> > > the stacked noclone device aren't updated to allocate a new 'struct
> > > dm_clone' that reflects the upper layer bio that was issued to it.
> > >
> > > Fixes: 97a89458020b38 ("dm: improve noclone bio support")
> > > Reported-by: Mikulas Patocka 
> > > Signed-off-by: Mike Snitzer 
> > >
> > > What I am seeing is in about 3 out of 4 boots the startup just hangs
> > > at the filesystem check stage with the following message:
> > > [  OK  ] Reached target Local File Systems (Pre).
> > >  Starting File System Check on 
> > > /dev/…127-ad57-426f-bb45-363950544c0c...
> > > [**] (1 of 2) A start job is running for…n on device 252:2 (19s / no 
> > > limit)
> > >
> > > I did some googling and it looks like a similar issue has been
> > > reported for s390. Based on the request for data there I have the
> > > following info:
> > > [root@localhost ~]# dmsetup ls --tree
> > > fedora-swap (253:1)
> > >  └─ (252:2)
> > > fedora-root (253:0)
> > >  └─ (252:2)
> > >
> > > [root@localhost ~]# dmsetup table
> > > fedora-swap: 0 4194304 linear 252:2 2048
> > > fedora-root: 0 31457280 linear 252:2 4196352
> >
> > Thanks, which version of Fedora are you running?
> 
> The VM is running Fedora 27 with a kernel built off of latest
> linux-next as of March 1st.
> 
> > Your case is more straightforward in that you're clearly using bio-based
> > DM linear (which was updated to leverage "noclone" support); whereas the
> > s390 case is using request-based DM which isn't impacted by the commit
> > in question at all.
> >
> > I'll attempt to reproduce first thing Monday.
> >
> > Mike
> 
> Thanks. The behavior of it has me wondering if we are looking at
> something like an uninitialized data issue or something like that
> since as I mentioned I don't see this occur on every boot, just on
> most of them. So every now and then I can boot up the VM without any
> issues, but most of the time it will boot and then get stuck waiting
> on jobs that take forever.

I just copied you on another related thread, but for the benefit of
anyone on LKML, please see the following for a fix that works for me:

https://www.redhat.com/archives/dm-devel/2019-March/msg00027.html


Re: x86 VM Boot hang with latest linux-next

2019-03-04 Thread Mike Snitzer
On Sun, Mar 03 2019 at 12:06pm -0500,
Alexander Duyck  wrote:

> Thanks. The behavior of it has me wondering if we are looking at
> something like an uninitialized data issue or something like that
> since as I mentioned I don't see this occur on every boot, just on
> most of them. So every now and then I can boot up the VM without any
> issues, but most of the time it will boot and then get stuck waiting
> on jobs that take forever.

The commit in question (1efa3bb79d3 "dm: must allocate dm_noclone for
stacked noclone devices") conditionally allocates the 'struct
dm_noclone' and I can only infer that the change is causing a negative
side-effect in virtualized environments (both x86_64 and s390).

I haven't been able to reproduce on x86_64 kvm with virtio-scsi on
fedora 25 though.  Will try fedora 29 shortly, and also try virtio-blk.

Could you please provide your guest's .config (even if just off-list)?
Also, what kernel are you running on the host?

Thanks,
Mike


Re: x86 VM Boot hang with latest linux-next

2019-03-02 Thread Mike Snitzer
On Sat, Mar 02 2019 at  6:34pm -0500,
Alexander Duyck  wrote:

> So I have been seeing an issue with an intermittent boot hang on my
> x86 KVM VM with the latest linux-next and have bisected it down to the
> following commit:
> 1efa3bb79d3de8ca1b7f6770313a1fc0bebe25c7 is the first bad commit
> commit 1efa3bb79d3de8ca1b7f6770313a1fc0bebe25c7
> Author: Mike Snitzer 
> Date:   Fri Feb 22 11:23:01 2019 -0500
> 
> dm: must allocate dm_noclone for stacked noclone devices
> 
> Otherwise various lvm2 testsuite tests fail because the lower layers of
> the stacked noclone device aren't updated to allocate a new 'struct
> dm_clone' that reflects the upper layer bio that was issued to it.
> 
> Fixes: 97a89458020b38 ("dm: improve noclone bio support")
>     Reported-by: Mikulas Patocka 
> Signed-off-by: Mike Snitzer 
> 
> What I am seeing is in about 3 out of 4 boots the startup just hangs
> at the filesystem check stage with the following message:
> [  OK  ] Reached target Local File Systems (Pre).
>  Starting File System Check on 
> /dev/…127-ad57-426f-bb45-363950544c0c...
> [**] (1 of 2) A start job is running for…n on device 252:2 (19s / no 
> limit)
> 
> I did some googling and it looks like a similar issue has been
> reported for s390. Based on the request for data there I have the
> following info:
> [root@localhost ~]# dmsetup ls --tree
> fedora-swap (253:1)
>  └─ (252:2)
> fedora-root (253:0)
>  └─ (252:2)
> 
> [root@localhost ~]# dmsetup table
> fedora-swap: 0 4194304 linear 252:2 2048
> fedora-root: 0 31457280 linear 252:2 4196352

Thanks, which version of Fedora are you running?

Your case is more straightforward in that you're clearly using bio-based
DM linear (which was updated to leverage "noclone" support); whereas the
s390 case is using request-based DM which isn't impacted by the commit
in question at all.

I'll attempt to reproduce first thing Monday.

Mike


Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers

2019-02-28 Thread Mike Snitzer
On Thu, Feb 28 2019 at  4:32pm -0500,
Mike Snitzer  wrote:

> On Thu, Dec 20 2018 at  1:06pm -0500,
> Nikos Tsironis  wrote:
> 
> > Add hlist_bl_add_before/behind helpers to add an element before/after an
> > existing element in a bl_list.
> > 
> > Signed-off-by: Nikos Tsironis 
> > Signed-off-by: Ilias Tsitsimpis 
> > ---
> >  include/linux/list_bl.h | 27 +++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> > index 3fc2cc57ba1b..2fd918e5fd48 100644
> > --- a/include/linux/list_bl.h
> > +++ b/include/linux/list_bl.h
> > @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct 
> > hlist_bl_node *n,
> > hlist_bl_set_first(h, n);
> >  }
> >  
> > +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> > +  struct hlist_bl_node *next)
> > +{
> > +   struct hlist_bl_node **pprev = next->pprev;
> > +
> > +   n->pprev = pprev;
> > +   n->next = next;
> > +   next->pprev = &n->next;
> > +
> > +   /* pprev may be `first`, so be careful not to lose the lock bit */
> > +   WRITE_ONCE(*pprev,
> > +  (struct hlist_bl_node *)
> > +   ((unsigned long)n |
> > +((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> > +}
> > +
> > +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> > +  struct hlist_bl_node *prev)
> > +{
> > +   n->next = prev->next;
> > +   n->pprev = &prev->next;
> > +   WRITE_ONCE(prev->next, n);
> > +
> > +   if (n->next)
> > +   n->next->pprev = &n->next;
> > +}
> > +
> >  static inline void __hlist_bl_del(struct hlist_bl_node *n)
> >  {
> > struct hlist_bl_node *next = n->next;
> > -- 
> > 2.11.0
> 
> Hi Paul and Christoph,
> 
> You've added your Signed-off-by to include/linux/list_bl.h commits in
> the past.  I'm not sure how this proposed patch should be handled.
> 
> These new hlist_bl_add_{before,behind} changes are a prereq for
> dm-snapshot changes that Nikos has proposed, please see:
> https://patchwork.kernel.org/patch/10739265/
> 
> Any assistance/review you, or others on LKML, might be able to provide
> would be appreciated.

I should've clarified that: I'm asking for the purpose of getting these
changes staged for Linux 5.2.

Mike


Re: [PATCH 1/3] list_bl: Add hlist_bl_add_before/behind helpers

2019-02-28 Thread Mike Snitzer
On Thu, Dec 20 2018 at  1:06pm -0500,
Nikos Tsironis  wrote:

> Add hlist_bl_add_before/behind helpers to add an element before/after an
> existing element in a bl_list.
> 
> Signed-off-by: Nikos Tsironis 
> Signed-off-by: Ilias Tsitsimpis 
> ---
>  include/linux/list_bl.h | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> index 3fc2cc57ba1b..2fd918e5fd48 100644
> --- a/include/linux/list_bl.h
> +++ b/include/linux/list_bl.h
> @@ -86,6 +86,33 @@ static inline void hlist_bl_add_head(struct hlist_bl_node 
> *n,
>   hlist_bl_set_first(h, n);
>  }
>  
> +static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> +struct hlist_bl_node *next)
> +{
> + struct hlist_bl_node **pprev = next->pprev;
> +
> + n->pprev = pprev;
> + n->next = next;
> + next->pprev = &n->next;
> +
> + /* pprev may be `first`, so be careful not to lose the lock bit */
> + WRITE_ONCE(*pprev,
> +(struct hlist_bl_node *)
> + ((unsigned long)n |
> +  ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> +}
> +
> +static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> +struct hlist_bl_node *prev)
> +{
> + n->next = prev->next;
> + n->pprev = &prev->next;
> + WRITE_ONCE(prev->next, n);
> +
> + if (n->next)
> + n->next->pprev = &n->next;
> +}
> +
>  static inline void __hlist_bl_del(struct hlist_bl_node *n)
>  {
>   struct hlist_bl_node *next = n->next;
> -- 
> 2.11.0

Hi Paul and Christoph,

You've added your Signed-off-by to include/linux/list_bl.h commits in
the past.  I'm not sure how this proposed patch should be handled.

These new hlist_bl_add_{before,behind} changes are a prereq for
dm-snapshot changes that Nikos has proposed, please see:
https://patchwork.kernel.org/patch/10739265/

Any assistance/review you, or others on LKML, might be able to provide
would be appreciated.

Thanks,
Mike


Re: [PATCH AUTOSEL 4.20 42/77] dm: fix clone_bio() to trigger blk_recount_segments()

2019-02-27 Thread Mike Snitzer
On Wed, Feb 27 2019 at 12:38pm -0500,
Sasha Levin  wrote:

> On Thu, Feb 14, 2019 at 10:49:09PM -0500, Mike Snitzer wrote:
> >On Thu, Feb 14 2019 at  9:08pm -0500,
> >Sasha Levin  wrote:
> >
> >>From: Mike Snitzer 
> >>
> >>[ Upstream commit 57c36519e4b949f89381053f7283f5d605595b42 ]
> >>
> >>DM's clone_bio() now benefits from using bio_trim() by fixing the fact
> >>that clone_bio() wasn't clearing BIO_SEG_VALID like bio_trim() does;
> >>which triggers blk_recount_segments() via bio_phys_segments().
> >>
> >>Reviewed-by: Ming Lei 
> >>Signed-off-by: Mike Snitzer 
> >>Signed-off-by: Sasha Levin 
> >
> >Please no, I later effectively reverted this change with commit
> >fa8db4948f522 ("dm: don't use bio_trim() afterall")
> 
> I've dropped it, thank you.
> 
> >(As and aside, I really shouldn't have to defend against stable@ bots
> >picking up a commit, like 57c36519e4b949f, that wasn't marked for
> >stable@.)
> 
> Is it the case that this commit isn't appropriate for stable for some
> reason, or was it just buggy?

Commit 57c36519e4b9 exposed a bug elsewhere, as fixed by a truly
"stable" fix:
ff0c129d3b5ec ("dm crypt: don't overallocate the integrity tag space")

So the end result is commit 57c36519e4b9 is just bad to bring to a
"stable" kernel.  It unlocks another bug for no meaningful benefit.

Mike


Re: block: be more careful about status in __bio_chain_endio

2019-02-22 Thread Mike Snitzer
On Fri, Feb 22 2019 at  9:02pm -0500,
John Dorminy  wrote:

> I am perhaps not understanding the intricacies here, or not seeing a
> barrier protecting it, so forgive me if I'm off base. I think reading
> parent->bi_status here is unsafe.
> Consider the following sequence of events on two threads.
> 
> Thread 0 Thread 1
> In __bio_chain_endio:In __bio_chain_endio:
> [A] Child 0 reads parent->bi_status,
> no error.
>  Child bio 1 reads parent, no error 
> seen
>  It sets parent->bi_status to an error
>  It calls bio_put.
> Child bio 0 calls bio_put
> [end __bio_chain_endio]  [end __bio_chain_endio]
>  In bio_chain_endio(), 
> bio_endio(parent)
>  is called, calling 
> bio_remaining_done()
>  which decrements __bi_remaining to 1
>  and returns false, so no further 
> endio
>  stuff is done.
> In bio_chain_endio(), bio_endio(parent)
> is called, calling bio_remaining_done(),
> decrementing parent->__bi_remaining to
>  0, and continuing to finish parent.
> Either for block tracing or for parent's
> bi_end_io(), this thread tries to read
> parent->bi_status again.
> 
> The compiler or the CPU may cache the read from [A], and since there
> are no intervening barriers, parent->bi_status is still believed on
> thread 0 to be success. Thus the bio may still be falsely believed to
> have completed successfully, even though child 1 set an error in it.
> 
> Am I missing a subtlety here?

Either neilb's original or even Jens' suggestion would be fine though.

>   if (!parent->bi_status && bio->bi_status)
>   parent->bi_status = bio->bi_status;

Even if your scenario did play out (which I agree it looks possible)
it'd just degenerate to neilb's version:

>   if (bio->bi_status)
>   parent->bi_status = bio->bi_status;

Which also accomplishes fixing what Neil originally detailed in his
patch header.


Re: block: be more careful about status in __bio_chain_endio

2019-02-22 Thread Mike Snitzer
On Fri, Feb 22 2019 at  5:46pm -0500,
Jens Axboe  wrote:

> On 2/22/19 2:10 PM, Mike Snitzer wrote:
> > On Thu, Feb 15 2018 at  4:09am -0500,
> > NeilBrown  wrote:
> > 
> >>
> >> If two bios are chained under the one parent (with bio_chain())
> >> it is possible that one will succeed and the other will fail.
> >> __bio_chain_endio must ensure that the failure error status
> >> is reported for the whole, rather than the success.
> >>
> >> It currently tries to be careful, but this test is racy.
> >> If both children finish at the same time, they might both see that
> >> parent->bi_status as zero, and so will assign their own status.
> >> If the assignment to parent->bi_status by the successful bio happens
> >> last, the error status will be lost which can lead to silent data
> >> corruption.
> >>
> >> Instead, __bio_chain_endio should only assign a non-zero status
> >> to parent->bi_status.  There is then no need to test the current
> >> value of parent->bi_status - a test that would be racy anyway.
> >>
> >> Note that this bug hasn't been seen in practice.  It was only discovered
> >> by examination after a similar bug was found in dm.c
> >>
> >> Signed-off-by: NeilBrown 
> >> ---
> >>  block/bio.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/block/bio.c b/block/bio.c
> >> index e1708db48258..ad77140edc6f 100644
> >> --- a/block/bio.c
> >> +++ b/block/bio.c
> >> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> >>  {
> >>struct bio *parent = bio->bi_private;
> >>  
> >> -  if (!parent->bi_status)
> >> +  if (bio->bi_status)
> >>parent->bi_status = bio->bi_status;
> >>bio_put(bio);
> >>return parent;
> >> -- 
> >> 2.14.0.rc0.dirty
> >>
> > 
> > Reviewed-by: Mike Snitzer 
> > 
> > Jens, this one slipped through the crack just over a year ago.
> > It is available in patchwork here:
> > https://patchwork.kernel.org/patch/10220727/
> 
> Should this be:
> 
>   if (!parent->bi_status && bio->bi_status)
>   parent->bi_status = bio->bi_status;
> 
> perhaps?

Yeap, even better.  Not seeing any reason to have the last error win,
the first in the chain is likely the most important.


Re: dm crypt: fix memory leak in dm_crypt_integrity_io_alloc() error path

2019-02-22 Thread Mike Snitzer
On Sat, Feb 16 2019 at  4:00pm -0500,
sul...@kerneltoast.com  wrote:

> From: Sultan Alsawaf 
> 
> dm_crypt_integrity_io_alloc() allocates space for an integrity payload but
> doesn't free it in the error path, leaking memory. Add a bio_integrity_free()
> invocation upon error to fix the memory leak.
> 
> Signed-off-by: Sultan Alsawaf 
> ---
>  drivers/md/dm-crypt.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index dd538e6b2..f731e1fe0 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -939,8 +939,10 @@ static int dm_crypt_integrity_io_alloc(struct 
> dm_crypt_io *io, struct bio *bio)
>  
>   ret = bio_integrity_add_page(bio, virt_to_page(io->integrity_metadata),
>tag_len, 
> offset_in_page(io->integrity_metadata));
> - if (unlikely(ret != tag_len))
> + if (unlikely(ret != tag_len)) {
> + bio_integrity_free(bio);
>   return -ENOMEM;
> + }
>  
>   return 0;
>  }

Since commit 7c20f11680a4 bio_integrity_free() is no longer an exported
symbol.

But that aside, this dm-crypt clone bio's endio will clean up the bip
once -ENOMEM return starts to make its way out via
crypt_alloc_buffer()'s bio_put().

Mike



Re: block: be more careful about status in __bio_chain_endio

2019-02-22 Thread Mike Snitzer
On Thu, Feb 15 2018 at  4:09am -0500,
NeilBrown  wrote:

> 
> If two bios are chained under the one parent (with bio_chain())
> it is possible that one will succeed and the other will fail.
> __bio_chain_endio must ensure that the failure error status
> is reported for the whole, rather than the success.
> 
> It currently tries to be careful, but this test is racy.
> If both children finish at the same time, they might both see that
> parent->bi_status as zero, and so will assign their own status.
> If the assignment to parent->bi_status by the successful bio happens
> last, the error status will be lost which can lead to silent data
> corruption.
> 
> Instead, __bio_chain_endio should only assign a non-zero status
> to parent->bi_status.  There is then no need to test the current
> value of parent->bi_status - a test that would be racy anyway.
> 
> Note that this bug hasn't been seen in practice.  It was only discovered
> by examination after a similar bug was found in dm.c
> 
> Signed-off-by: NeilBrown 
> ---
>  block/bio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index e1708db48258..ad77140edc6f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>  {
>   struct bio *parent = bio->bi_private;
>  
> - if (!parent->bi_status)
> + if (bio->bi_status)
>   parent->bi_status = bio->bi_status;
>   bio_put(bio);
>   return parent;
> -- 
> 2.14.0.rc0.dirty
> 

Reviewed-by: Mike Snitzer 

Jens, this one slipped through the crack just over a year ago.
It is available in patchwork here:
https://patchwork.kernel.org/patch/10220727/


Re: [PATCH v11] dm: add support to directly boot to a mapped device

2019-02-21 Thread Mike Snitzer
On Mon, Feb 18 2019 at  1:18pm -0500,
Helen Koike  wrote:

> Add a dm-mod.create= kernel module parameter.
> It allows device-mapper targets to be configured at boot time for use early
> in the boot process (as the root device or otherwise).
> 
> Signed-off-by: Will Drewry 
> Signed-off-by: Kees Cook 
> [rework to use dm_ioctl calls]
> Signed-off-by: Enric Balletbo i Serra 
> [refactored for upstream]
> Signed-off-by: Helen Koike 

Can I get confirmation from Will and Kees that this v11 is safe to carry
their Signed-off-by?  Are you guys "happy" with this patch?

Helen, the patch header is severely lacking.  All the detail you've
provided is outside of the proposed patch header.  That needs fixing.  I
can write a proper header but if you were to beat me to it, say
today.. hint hint ;)  I'd greatly appreciate it.  What follows below is
actually quite good (I tweaked slightly to be more suitable, but
hopefully you get the idea, less "name dropping" is still needed though):

> The need to create an initramfs adds another layer of complexity when
> performing tests or preparing a BSP for a board.
> Taking care of initramfs is always a bit painful, it also occupies space and 
> is
> another level of complexity in the stack.
> A practical example as mentioned by Kees is that Chrome OS has a limited 
> amount
> of storage available for the boot image as it is covered by the static root of
> trust signature.
> This feature is already used by Android and Chrome OS in devices already 
> shipped
> to the market/end-users.
> 
> Fix this by allowing dm devices to be configured in the kernel command
> line parameter for use early in the boot process without an initramfs.
> 
> One of the main differences in this version is that instead of using "dm=",
> I'm using a parameter in the module "dm-mod.create=" (thanks Ezequiel for
> the idea).
> 
> The syntax used in the boot param is based on the concise format from the 
> dmsetup
> tool as described in its man page 
> http://man7.org/linux/man-pages/man8/dmsetup.8.html#CONCISE_FORMAT
> 
> Which is:
> 
> dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
> 
> Where,
>  <name>        ::= The device name.
>  <uuid>        ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
>  <minor>       ::= The device minor number | ""
>  <flags>       ::= "ro" | "rw"
>  <table>       ::= <start_sector> <num_sectors> <target_type> <target_args>
> 
>  <target_type> ::= "verity" | "linear" | ...
> 
> Example, the following could be added in the boot parameters.
> dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" 
> root=/dev/dm-0
> 
> Please check the patch with the documentation on the format.
> 
> The idea to make it compatible with the dmsetup concise format is to make it
> easier for users, allowing just copy & paste from the output of the command:
> 
>dmsetup table --concise /dev/mapper/lroot
> 
> I refactored the code for this version, instead of pretending to be
> userspace and performing ioctls, the code just parse the cmd line
> argument and create the device directly, which simplified and reduced the
> code a lot (and it is much easier to read).
> 
> Also, not all dm targets are allowed. Only the ones that I tested were
> allowed and the ones that doesn't change any block device when the dm is
> created as read-only, i.e. mirror and cache are not allowed, because if
> the user makes a mistake and chose the wrong device to be the mirror or
> choose the wrong device to be the cache, it can corrupt data.
> 
> So the only targets allowed are:
> * crypt
> * delay
> * linear
> * snapshot-origin
> * striped
> * verity
> 
> I wrote a script to perform several tests using qemu that can be found at
> https://gitlab.collabora.com/koike/dm-cmdline-test
> 
> And you can see the result of the tests at:
> https://people.collabora.com/~koike/dm-test-module-param-logs.txt
> At the end of this file:
> "Total successes: 52. Total failures: 0"
> 
> And the output of qemu for each run:
> https://people.collabora.com/~koike/dm-test-module-param-logs.tar.gz
> 
> I think it would nice to integrate these tests in some CI (maybe kernel
> CI?)
> 
> Please let me know your comments.
> 
> Changes in v11:
>  - Configure the device directly instead of performing an IOCTL (this
>  removed a lot of parsing code)
>  - Just enable the targets that were tested.
>  - Simplify/refactor parsing, as a consequence, escaping characters is not
>  allowed (but it wasn't properly used in the previous version anyway)
>  - don't use sscanf, the size wasn't being limited and we could have a
>  buffer overflow.
>  - change constrained targets list
>  - remove code from init/
>  - use a module parameter instead of a kernel command line parameter in
>  init/
>  - rename dm-boot to dm-init
> 
> Changes in v10:
>  - https://lore.kernel.org/patchwork/project/lkml/list/?series=371523
>  - new file: drivers/md/dm-boot.c
>  - most of the parsing code was moved from init/do_mounts_dm.c to 
> drivers/md/dm-boot.c
>  - parsing code was in essence replaced by the 

  1   2   3   4   5   6   7   8   9   10   >