Re: [dm-devel] [PATCH] fix writing to the filesystem after unmount

2023-09-08 Thread John Stoffel
> "Christian" == Christian Brauner  writes:

>> Well, currently you click some "Eject / safely remove / whatever" button
>> and then you get a "wait" dialog until everything is done after which
>> you're told the stick is safe to remove. What I imagine is that the "wait"
>> dialog needs to be there while there are any (or exclusive at minimum)
>> openers of the device. Not until umount(2) syscall has returned. And yes, the

> Agreed. umount(2) doesn't give guarantees about a filesystem being
> really gone once it has returned. And it really shouldn't. There's
> too many cases where that doesn't work and it's not a commitment we
> should make.

So how the heck is someone supposed to know, from userspace, that a
filesystem is unmounted?  Just wearing my SysAdmin hat, this strikes
me as really potentially painful and annoying.  But then again, so are
bind mounts from a lot of views too.

Don't people remember how bad it can be when you are trying to
shut down a system and it hangs because a remote NFS server is down
and not responding?  And your client system hangs completely?

> And there are ways to wait until superblock shutdown that I've
> mentioned before in other places where it somehow really
> matters. inotify's IN_UMOUNT will notify about superblock
> shutdown. IOW, when it really hits generic_shutdown_super() which
> can only be hit after unfreezing as that requires active references.

Can we maybe invert this discussion and think about it from the
userspace side of things?  How does the user(space) mount/unmount
devices cleanly and reliably?  

> So this really can be used to wait for a filesystem to go away across
> all namespaces, and across filesystem freezing and it's available to
> unprivileged users. Simple example:

> # shell 1
> sudo mount -t xfs /dev/sda /mnt
> sudo mount --bind /mnt /opt
> inotifywait -e unmount /mnt

> #shell 2
> sudo umount /opt # nothing happens in shell 1
> sudo umount /mnt # shell 1 gets woken

So what makes this *useful* to anyone?  Why doesn't the bind mount
A) lock /mnt into place, but B) give you some way of seeing that
there's a bind mount in place?
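
For what it's worth, the IN_UNMOUNT wait above can also be done from a
program, not just via inotifywait.  A minimal ctypes sketch (my own
illustration, not from the thread; the flag value comes from
<sys/inotify.h> and this is Linux-only):

```python
# Watch for superblock shutdown via inotify's IN_UNMOUNT, using raw
# libc calls through ctypes (Linux only).  The event fires at
# generic_shutdown_super(), i.e. when the filesystem really goes away,
# not merely when umount(2) returns.
import ctypes
import ctypes.util

IN_UNMOUNT = 0x00002000  # value from <sys/inotify.h>

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6",
                   use_errno=True)

def watch_unmount(path: str) -> int:
    """Register an IN_UNMOUNT watch on `path`; return the inotify fd."""
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    wd = libc.inotify_add_watch(fd, path.encode(), IN_UNMOUNT)
    if wd < 0:
        raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
    return fd

# A real program would now read() from fd; the read blocks until the
# filesystem containing `path` is shut down.
fd = watch_unmount("/tmp")
```

Unlike polling /proc/mounts, this works across mount namespaces and
for unprivileged users, as Christian notes above.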

>> corner-cases. So does the current behavior, I agree, but improving
>> situation for one usecase while breaking another usecase isn't really a way
>> forward...

> Agreed.

>> Well, the filesystem (struct superblock to be exact) is invisible
>> in /proc/mounts (or whatever), that is true. But it is still very
>> much associated with that block device and if you do 'mount
>>  ', you'll get it back. But yes, the filesystem
>> will not go away

Then should it even be possible to unmount it in the first place?  I
mean yes, we always need a way to force an unmount no matter what,
even if that breaks some other process on the system, but for regular
use, shouldn't it just give back an error like:

  /mnt in use by bind mount /opt

or something like that?  Give the poor sysadmin some information on
what's going on here. 

> And now we at least have an api to detect that case and refuse to reuse
> the superblock.

>> until all references to it are dropped and you cannot easily find
>> who holds those references and how to get rid of them.

ding ding ding  I don't want to have to run 'lsof' or something
like that.

> Namespaces make this even messier. You have no easy way of knowing
> whether the filesystem isn't pinned somewhere else through an
> explicit bind-mount or when it was copied during mount namespace
> creation.

This is the biggest downside of namespaces and bind mounts in my
mind.  The lack of traceability back to the source.  

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] RAID4 with no striping mode request

2023-02-13 Thread John Stoffel
> "Kyle" == Kyle Sanderson  writes:

> hi DM and Linux-RAID,
> There have been multiple proprietary solutions (some nearly 20 years
> old now) with a number of (userspace) bugs that are becoming untenable
> for me as an end user. Basically how they work is a closed MD module
> (typically administered through DM) that uses RAID4 for a dedicated
> parity disk across multiple other disks.

You need to explain what you want in *much* better detail.  Give simple
concrete examples.  From the sound of it, you want RAID6 but with
RAID4 dedicated Parity so you can spin down some of the data disks in
the array?  But if need be, spin up idle disks to recover data if you
lose an active disk?  

Really hard to understand what exactly you're looking for here.


> As there is no striping, the maximum size of the protected data is the
> size of the parity disk (so a set of 4+8+12+16 disks can be protected
> by a single dedicated 16 disk). When a block is written on any disk,
> the parity bit is read from the parity disk again, and updated
> depending on the existing + new bit value (so writing disk + parity
> disk spun up). Additionally, if enough disks are already spun up, the
> parity information can be recalculated from all of the spinning disks,
> resulting in a single write to the parity disk (without a read on the
> parity, doubling throughput). Finally any of the data disks can be
> moved around within the array without impacting parity as the layout
> has not changed. I don't necessarily need all of these features, just
> the ability to remove a disk and still access the data that was on
> there by spinning up every other disk until the rebuild is complete is
> important.
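
For readers unfamiliar with dedicated-parity arithmetic, the update
rule Kyle describes is plain XOR parity.  A toy sketch (my own
illustration, with short byte strings standing in for disk blocks):

```python
# RAID4-style dedicated parity on toy "blocks": P = D0 ^ D1 ^ ... ^ Dn.
# Updating one data block needs only the old data, the new data, and the
# old parity (so only the written disk and the parity disk spin up).
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    p = blocks[0]
    for blk in blocks[1:]:
        p = xor(p, blk)
    return p

disks = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4]
p = parity(disks)

# Read-modify-write: new_P = old_P ^ old_D ^ new_D
new_d1 = b"\x7f" * 4
p = xor(xor(p, disks[1]), new_d1)
disks[1] = new_d1
assert p == parity(disks)  # matches a full recalculation over all disks

# Rebuild after losing a disk: missing = P ^ (all surviving disks)
lost = disks[0]
rebuilt = parity([p] + disks[1:])
assert rebuilt == lost
```

This is also why disks can be shuffled within the array without
touching parity: XOR is commutative, so the layout doesn't matter.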

> The benefit of this can be the data disks are all zoned, and you can
> have a fast parity disk and still maintain excellent performance in
> the array (limited only by the speed of the disk in question +
> parity). Additionally, should 2 disks fail, you've either lost the
> parity and data disk, or 2 data disks with the parity and other disks
> not lost.

> I was reading through the DM and MD code and it looks like everything
> may already be there to do this, just needs (significant) stubs to be
> added to support this mode (or new code). Snapraid is a friendly (and
> respectable) implementation of this. Unraid and Synology SHR compete
> in this space, as well as other NAS and enterprise SAN providers.

> Kyle.




Re: [dm-devel] [PATCH 1/4] block: introduce submit_bio_noacct_add_head

2020-12-30 Thread John Stoffel
>>>>> "Danny" == Danny Shih  writes:

Danny> Hi, John,
Danny> Thank you for taking the time to write a review.


Danny> John Stoffel  writes:
>>>>>>> "dannyshih" == dannyshih   writes:
dannyshih> From: Danny Shih 
dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
>> 
>> You're spelling needs to be fixed in these messages.


Danny> Sorry for so many spelling errors.

Danny> The message should be

Danny> "Provide a way for stacking block device to re-submit

Danny> the bio which should be handled first."

Danny> I will fix it.

Great, though my second question is *why* it needs to be handled
first?  What is the difference between stacked and un-stacked devices
and how could it be done in a way that doesn't require a separate
function like this?




dannyshih> Signed-off-by: Danny Shih 
dannyshih> Reviewed-by: Allen Peng 
dannyshih> Reviewed-by: Alex Wu 
dannyshih> ---
dannyshih> block/blk-core.c   | 44 +---
dannyshih> include/linux/blkdev.h |  1 +
dannyshih> 2 files changed, 34 insertions(+), 11 deletions(-)
>> 
dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
dannyshih> index 96e5fcd..693dc83 100644
dannyshih> --- a/block/blk-core.c
dannyshih> +++ b/block/blk-core.c
dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
dannyshih> return ret;
dannyshih> }
>> 
dannyshih> -/**
dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> - * @bio:  The bio describing the location in memory and on the device.
dannyshih> - *
dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> - * resubmitted to lower level drivers by stacking block drivers.  All file
dannyshih> - * systems and other upper level users of the block layer should use
dannyshih> - * submit_bio() instead.
dannyshih> - */
dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
dannyshih> {
dannyshih> if (!submit_bio_checks(bio))
dannyshih> return BLK_QC_T_NONE;
dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> * it is active, and then process them after it returned.
dannyshih> */
dannyshih> if (current->bio_list) {
dannyshih> -bio_list_add(&current->bio_list[0], bio);
dannyshih> +if (add_head)
dannyshih> +bio_list_add_head(&current->bio_list[0], bio);
dannyshih> +else
dannyshih> +bio_list_add(&current->bio_list[0], bio);
dannyshih> return BLK_QC_T_NONE;
dannyshih> }
>> 
dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> return __submit_bio_noacct_mq(bio);
dannyshih> return __submit_bio_noacct(bio);
dannyshih> }
dannyshih> +
dannyshih> +/**
dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> + * @bio:  The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> + * resubmitted to lower level drivers by stacking block drivers.  All file
dannyshih> + * systems and other upper level users of the block layer should use
dannyshih> + * submit_bio() instead.
dannyshih> + */
dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +{
dannyshih> +return do_submit_bio_noacct(bio, false);
dannyshih> +}
dannyshih> EXPORT_SYMBOL(submit_bio_noacct);
>> 
>> So why is it named "submit_bio_noacct" when it's supposed to be only
>> used by layers submitting to lower level drivers.  How can this be
>> figured out by drivers automatically, so the writer doesn't have to
>> know about this?


Danny> There is no logical change while using submit_bio_noacct() after my 
Danny> patch. So I didn't change

Danny> the name and the documentation of submit_bio_noacct().


>> 
dannyshih> /**
dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
dannyshih> + * to the block device layer for I/O
dannyshih> + * @bio:  The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
dannyshih> + * current->bio_list.
dannyshih> + */
>> 
>> Firstly isn't proper English.  Maybe something like:
>> 
>> submit_bio_noacct - re-submit a bio which needs to be handled first
>> because

Re: [dm-devel] [PATCH 1/4] block: introduce submit_bio_noacct_add_head

2020-12-30 Thread John Stoffel
>>>>> "antlists" == antlists   writes:

antlists> On 30/12/2020 00:00, John Stoffel wrote:
dannyshih> From: Danny Shih
dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
>> 
>> You're spelling needs to be fixed in these messages.

antlists>^^

antlists> It is traditional, when correcting someone else's spelling,
antlists> to make one of your own ... :-)

LOL!  Yes, touché!  I'm completely abashed.

antlists> Sorry, but if we're being pedantic for furriners, it behoves
antlists> us to be correct ourselves :-)

It does for sure, thanks for the nudge.  




Re: [dm-devel] [PATCH 1/4] block: introduce submit_bio_noacct_add_head

2020-12-29 Thread John Stoffel
> "dannyshih" == dannyshih   writes:

dannyshih> From: Danny Shih 
dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.

You're spelling needs to be fixed in these messages.  

dannyshih> Signed-off-by: Danny Shih 
dannyshih> Reviewed-by: Allen Peng 
dannyshih> Reviewed-by: Alex Wu 
dannyshih> ---
dannyshih>  block/blk-core.c   | 44 +---
dannyshih>  include/linux/blkdev.h |  1 +
dannyshih>  2 files changed, 34 insertions(+), 11 deletions(-)

dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
dannyshih> index 96e5fcd..693dc83 100644
dannyshih> --- a/block/blk-core.c
dannyshih> +++ b/block/blk-core.c
dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
dannyshih>  return ret;
dannyshih>  }
 
dannyshih> -/**
dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> - * @bio:  The bio describing the location in memory and on the device.
dannyshih> - *
dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> - * resubmitted to lower level drivers by stacking block drivers.  All file
dannyshih> - * systems and other upper level users of the block layer should use
dannyshih> - * submit_bio() instead.
dannyshih> - */
dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
dannyshih>  {
dannyshih>  if (!submit_bio_checks(bio))
dannyshih>  return BLK_QC_T_NONE;
dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih>   * it is active, and then process them after it returned.
dannyshih>   */
dannyshih>  if (current->bio_list) {
dannyshih> -bio_list_add(&current->bio_list[0], bio);
dannyshih> +if (add_head)
dannyshih> +bio_list_add_head(&current->bio_list[0], bio);
dannyshih> +else
dannyshih> +bio_list_add(&current->bio_list[0], bio);
dannyshih>  return BLK_QC_T_NONE;
dannyshih>  }
 
dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih>  return __submit_bio_noacct_mq(bio);
dannyshih>  return __submit_bio_noacct(bio);
dannyshih>  }
dannyshih> +
dannyshih> +/**
dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> + * @bio:  The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> + * resubmitted to lower level drivers by stacking block drivers.  All file
dannyshih> + * systems and other upper level users of the block layer should use
dannyshih> + * submit_bio() instead.
dannyshih> + */
dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +{
dannyshih> +return do_submit_bio_noacct(bio, false);
dannyshih> +}
dannyshih>  EXPORT_SYMBOL(submit_bio_noacct);

So why is it named "submit_bio_noacct" when it's supposed to be only
used by layers submitting to lower level drivers.  How can this be
figured out by drivers automatically, so the writer doesn't have to
know about this?
 
dannyshih>  /**
dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
dannyshih> + * to the block device layer for I/O
dannyshih> + * @bio:  The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
dannyshih> + * current->bio_list.
dannyshih> + */

Firstly isn't proper English.  Maybe something like:

submit_bio_noacct - re-submit a bio which needs to be handled first
because  to the block device layer for I/O

But the name still sucks, and the *reason* the bio needs to be handled
differently isn't well explained. 
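
For what it's worth, the behavioral difference between the two submit
paths is just where the bio lands in current->bio_list.  A toy Python
model (my own illustration, not kernel code):

```python
# Bios queued while current->bio_list is active are drained in FIFO
# order, so bio_list_add_head() makes a re-submitted bio jump ahead of
# everything already queued, while bio_list_add() puts it at the back.
from collections import deque

queued = deque(["bioA", "bioB"])  # bios already on current->bio_list

def bio_list_add(lst, bio):       # models bio_list_add(): append at tail
    lst.append(bio)

def bio_list_add_head(lst, bio):  # models bio_list_add_head(): push at head
    lst.appendleft(bio)

tail = deque(queued)
bio_list_add(tail, "urgent")
assert list(tail) == ["bioA", "bioB", "urgent"]   # serviced last

head = deque(queued)
bio_list_add_head(head, "urgent")
assert list(head) == ["urgent", "bioA", "bioB"]   # serviced first
```

Which is exactly why the commit message needs to say *why* a stacking
driver would ever need its bio serviced before the already-queued ones.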

dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
dannyshih> +{
dannyshih> +return do_submit_bio_noacct(bio, true);
dannyshih> +}
dannyshih> +EXPORT_SYMBOL(submit_bio_noacct_add_head);
dannyshih> +
dannyshih> +/**
dannyshih>   * submit_bio - submit a bio to the block device layer for I/O
dannyshih>   * @bio: The &struct bio which describes the I/O
dannyshih>   *
dannyshih> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
dannyshih> index 070de09..b0080d0 100644
dannyshih> --- a/include/linux/blkdev.h
dannyshih> +++ b/include/linux/blkdev.h
dannyshih> @@ -905,6 +905,7 @@ static inline void rq_flush_dcache_pages(struct request *rq)
dannyshih>  extern int blk_register_queue(struct gendisk *disk);
dannyshih>  extern void blk_unregister_queue(struct gendisk *disk);
dannyshih>  blk_qc_t submit_bio_noacct(struct bio *bio);
dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio);
dannyshih>  extern void blk_rq_init(struct request_queue *q, struct request *rq);
dannyshih>  ext

Re: [dm-devel] [general question] rare silent data corruption when writing data

2020-05-08 Thread John Stoffel
>>>>> "Michal" == Michal Soltys  writes:

And of course it should also go to dm-devel@redhat.com, my fault for
not including that as well.  I strongly suspect it's a thin-lv
problem somewhere, but I don't know enough to help chase down the
problem in detail.

John


Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.

Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys  writes:
>> 
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin  writes:
>>>> 
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>> 
>>>> Are these 1tb files VMDK or COW images of VMs?  How are these files
>>>> made.  And does it ever happen with *smaller* files?  What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>> 
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>> 
Michal> Yes, the errors were also found on small files.
>> 
>> Those errors are in small files inside the VM, which is running btrfs
>> ontop of block storage provided by your thin-lv, right?
>> 

Michal> Yea, the small files were in this case on that thin-lv.

Michal> We also discovered (yesterday) file corruptions with VM hosting gitlab
Michal> registry - this one was using the same thin-lv underneath, but the guest
Michal> itself was using ext4 (in this case, docker simply reported incorrect
Michal> sha checksum on (so far) 2 layers.

>> 
>> 
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption

Michal> Those particular guests, yea. The host case it's just w/o "guest" step.

Michal> But (so far) all corruption ended going via one of the lv-thin layers 
(and via one of md raids).

>> 
>> 
Michal> Since then we recreated the issue directly on the host, just
Michal> by making ext4 filesystem on some LV, then doing write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always a full 4KiB chunks (always
Michal> same content, always same position).
>> 
>> What position?  Is it a 4k, 1.5m or some other consistent offset?  And
>> how far into the file?  And this LV is a plain LV or a thin-lv?   I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you.  Can you provide the output of:
>> 

Michal> What I meant that it doesn't "move" when verifying the same file (aka
Michal> different reads from same test file). Between the tests, the errors are
Michal> of course in different places - but it's always some 4KiB piece(s) -
Michal> that look like correct pieces belonging somewhere else.
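
The write/sync/re-read test Michal describes can be sketched in
userspace like this (my own illustration; note that without dropping
the page cache, which needs root via `echo 3 > /proc/sys/vm/drop_caches`,
a run like this mostly exercises the cache rather than the disk):

```python
# Write a file with known contents, fsync it, then re-read and compare
# 4 KiB chunks so the offset of any corrupted piece can be reported.
import hashlib
import os
import tempfile

CHUNK = 4096

def write_and_verify(path: str, data: bytes):
    """Return offsets of 4 KiB chunks that read back differently."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push dirty pages down to the block layer
    bad = []
    with open(path, "rb") as f:
        for i in range(0, len(data), CHUNK):
            got = f.read(CHUNK)
            want = data[i:i + CHUNK]
            if hashlib.sha256(got).digest() != hashlib.sha256(want).digest():
                bad.append(i)  # byte offset of a corrupted 4 KiB piece
    return bad

with tempfile.NamedTemporaryFile(delete=False) as tf:
    path = tf.name
bad_offsets = write_and_verify(path, os.urandom(1 << 20))
os.unlink(path)
```

On a healthy stack bad_offsets is empty; on Michal's setup it would
contain the positions of the wrong 4 KiB pieces.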

>> /sbin/lvs --version

Michal>   LVM version: 2.03.02(2) (2018-12-18)
Michal>   Library version: 1.02.155 (2018-12-18)
Michal>   Driver version:  4.41.0
Michal>   Configuration:   ./configure --build=x86_64-linux-gnu --prefix=/usr 
--includedir=${prefix}/include --mandir=${prefix}/share/man 
--infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var 
--disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu 
--libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run 
--disable-maintainer-mode --disable-dependency-tracking --exec-prefix= 
--bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin 
--with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2 
--with-cache=internal --with-device-uid=0 --with-device-gid=6 
--with-device-mode=0660 --with-default-pid-dir=/run 
--with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm 
--with-thin=internal --with-thin-check=/usr/sbin/thin_check 
--with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair 
--enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd 
--enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanloc
 k --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-readline 
--enable-udev_rules --enable-udev_sync

>> 
>> too?
>> 
>> Can you post your:
>> 
>> /sbin/dmsetup status
>> 
>> output too?  There's a better command to use here, but I'm not an
>> expert.  You might really want to copy this over to the
>> lin

Re: [dm-devel] [PATCH 10/15] libmultipath: change how the checker async is set

2020-01-21 Thread John Stoffel
>>>>> "Benjamin" == Benjamin Marzinski  writes:

Benjamin> On Mon, Jan 20, 2020 at 03:20:14PM -0500, John Stoffel wrote:
>> >>>>> "Benjamin" == Benjamin Marzinski  writes:
>> 
Benjamin> On Sat, Jan 18, 2020 at 12:43:49PM -0500, John Stoffel wrote:
>> >> >>>>> "Benjamin" == Benjamin Marzinski  writes:
>> >> 
Benjamin> The way that the checkers async mode worked in multipathd didn't make
Benjamin> much sense. When multipathd starts up, all checker classes are in the
Benjamin> sync mode, and any pathinfo() calls on devices would run the checker in
Benjamin> sync mode. However, the First time a checker class was used in
Benjamin> checkerloop, it would set that checker class to async (assuming
Benjamin> force_sync wasn't set).  After that, no matter when a checker from that
Benjamin> class was called, it would always run in async mode.  Multipathd doesn't
Benjamin> need to run checkers in sync mode at all, so don't.
>> >> 
>> >> Sorry, I had a hard time parsing this description, especially the last
>> >> sentence.  Do you mean that checkers should default to async then,
>> >> instead of sync mode?  And from looking at the code, don't you mean
>> >> that you force sync mode?  I guess the question is whether checker
>> >> classes should default sync, or async.  And if they move to async,
>> >> should they stay there?
>> >> 
>> 
Benjamin> Sorry. I mean that right now multipathd runs the checkers from pathinfo(),
Benjamin> wait_for_pending_paths() and check_path(). When multipathd starts, all
Benjamin> checkers are in sync mode. The first time a checker is run from
Benjamin> check_path(), it is switched to async mode, and stays there for the rest
Benjamin> of the time multipathd is running.
>> 
Benjamin> There is no need for multipathd to run checkers in sync mode at all, so
Benjamin> we shouldn't be doing so for these first checks.
>> 
>> 
>> Thanks, that makes the entire issue much more clear.  This raises the
>> question of why there is a sync mode at all then?  In any case, the
>> above makes the issue more understandable.

Benjamin> The multipath command, which uses the same checkers, runs in sync mode.

Crystal clear now.  Thanks for your patience!





Re: [dm-devel] [PATCH 10/15] libmultipath: change how the checker async is set

2020-01-20 Thread John Stoffel
>>>>> "Benjamin" == Benjamin Marzinski  writes:

Benjamin> On Sat, Jan 18, 2020 at 12:43:49PM -0500, John Stoffel wrote:
>> >>>>> "Benjamin" == Benjamin Marzinski  writes:
>> 
Benjamin> The way that the checkers async mode worked in multipathd didn't make
Benjamin> much sense. When multipathd starts up, all checker classes are in the
Benjamin> sync mode, and any pathinfo() calls on devices would run the checker in
Benjamin> sync mode. However, the First time a checker class was used in
Benjamin> checkerloop, it would set that checker class to async (assuming
Benjamin> force_sync wasn't set).  After that, no matter when a checker from that
Benjamin> class was called, it would always run in async mode.  Multipathd doesn't
Benjamin> need to run checkers in sync mode at all, so don't.
>> 
>> Sorry, I had a hard time parsing this description, especially the last
>> sentence.  Do you mean that checkers should default to async then,
>> instead of sync mode?  And from looking at the code, don't you mean
>> that you force sync mode?  I guess the question is whether checker
>> classes should default sync, or async.  And if they move to async,
>> should they stay there?
>> 

Benjamin> Sorry. I mean that right now multipathd runs the checkers from pathinfo(),
Benjamin> wait_for_pending_paths() and check_path(). When multipathd starts, all
Benjamin> checkers are in sync mode. The first time a checker is run from
Benjamin> check_path(), it is switched to async mode, and stays there for the rest
Benjamin> of the time multipathd is running.

Benjamin> There is no need for multipathd to run checkers in sync mode at all, so
Benjamin> we shouldn't be doing so for these first checks.

Thanks, that makes the entire issue much more clear.  This raises the
question of why there is a sync mode at all then?  In any case, the
above makes the issue more understandable.

John





Re: [dm-devel] [PATCH 10/15] libmultipath: change how the checker async is set

2020-01-18 Thread John Stoffel
> "Benjamin" == Benjamin Marzinski  writes:

Benjamin> The way that the checkers async mode worked in multipathd didn't make
Benjamin> much sense. When multipathd starts up, all checker classes are in the
Benjamin> sync mode, and any pathinfo() calls on devices would run the checker in
Benjamin> sync mode. However, the First time a checker class was used in
Benjamin> checkerloop, it would set that checker class to async (assuming
Benjamin> force_sync wasn't set).  After that, no matter when a checker from that
Benjamin> class was called, it would always run in async mode.  Multipathd doesn't
Benjamin> need to run checkers in sync mode at all, so don't.

Sorry, I had a hard time parsing this description, especially the last
sentence.  Do you mean that checkers should default to async then,
instead of sync mode?  And from looking at the code, don't you mean
that you force sync mode?  I guess the question is whether checker
classes should default sync, or async.  And if they move to async,
should they stay there?



Benjamin> Signed-off-by: Benjamin Marzinski 
Benjamin> ---
Benjamin>  libmpathpersist/mpath_persist.c |  2 +-
Benjamin>  libmultipath/discovery.c| 10 --
Benjamin>  multipath/main.c|  1 +
Benjamin>  3 files changed, 6 insertions(+), 7 deletions(-)

Benjamin> diff --git a/libmpathpersist/mpath_persist.c b/libmpathpersist/mpath_persist.c
Benjamin> index 603cfc3b..b2238f00 100644
Benjamin> --- a/libmpathpersist/mpath_persist.c
Benjamin> +++ b/libmpathpersist/mpath_persist.c
Benjamin> @@ -47,7 +47,7 @@ mpath_lib_init (void)
Benjamin>   condlog(0, "Failed to initialize multipath config.");
Benjamin>   return NULL;
Benjamin>   }
Benjamin> -
Benjamin> + conf->force_sync = 1;
Benjamin>   set_max_fds(conf->max_fds);
 
Benjamin>   return conf;
Benjamin> diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c
Benjamin> index d2773c3a..1ab093f4 100644
Benjamin> --- a/libmultipath/discovery.c
Benjamin> +++ b/libmultipath/discovery.c
Benjamin> @@ -1643,12 +1643,10 @@ get_state (struct path * pp, struct config *conf, int daemon, int oldstate)
Benjamin>   if (pp->mpp && !c->mpcontext)
Benjamin>   checker_mp_init(c, &pp->mpp->mpcontext);
Benjamin>   checker_clear_message(c);
Benjamin> - if (daemon) {
Benjamin> - if (conf->force_sync == 0)
Benjamin> - checker_set_async(c);
Benjamin> - else
Benjamin> - checker_set_sync(c);
Benjamin> - }
Benjamin> + if (conf->force_sync == 0)
Benjamin> + checker_set_async(c);
Benjamin> + else
Benjamin> + checker_set_sync(c);
Benjamin>   if (!conf->checker_timeout &&
Benjamin>   sysfs_get_timeout(pp, &(c->timeout)) <= 0)
Benjamin>   c->timeout = DEF_TIMEOUT;
Benjamin> diff --git a/multipath/main.c b/multipath/main.c
Benjamin> index 4f4d8e89..aebabd9b 100644
Benjamin> --- a/multipath/main.c
Benjamin> +++ b/multipath/main.c
Benjamin> @@ -905,6 +905,7 @@ main (int argc, char *argv[])
Benjamin>   exit(RTVL_FAIL);
Benjamin>   multipath_conf = conf;
Benjamin>   conf->retrigger_tries = 0;
Benjamin> + conf->force_sync = 1;
Benjamin>   while ((arg = getopt(argc, argv, ":adcChl::FfM:v:p:b:BrR:itTquUwW")) != EOF ) {
Benjamin>   switch(arg) {
Benjamin>   case 1: printf("optarg : %s\n",optarg);
Benjamin> -- 
Benjamin> 2.17.2






Re: [dm-devel] [PATCH 09/15] libmultipath: add code to get vendor specific vpd data

2020-01-18 Thread John Stoffel
> "Benjamin" == Benjamin Marzinski  writes:

Benjamin> This adds the wildcard 'g' for multipath and path formatted
Benjamin> printing, which returns extra data from a device's vendor
Benjamin> specific vpd page.  The specific vendor vpd page to use, and
Benjamin> the vendor/product id to decode it can be set in the hwentry
Benjamin> with vpd_vendor_pg and vpd_vendor_id. It can be configured
Benjamin> in the devices section of multipath.conf with the vpd_vendor
Benjamin> parameter. Currently, the only devices that use this are HPE
Benjamin> 3PAR arrays, to return the Volume Name.

So I've looked over this patch and I see how you've hard-coded the
3par entry into the various .c files, but the multipath.conf example
really doesn't show how to use it, nor does the code give a good
example, or better yet a template set of functions for parsing custom
VPD info pages.  This is really just a nit.

Also, is the Vendor VPD page always at 0xc0, or is that just a 3par
setting?  I wonder if that constant should be pulled into a #define
somewhere, so that it's simpler to update/change as needed?
Especially if other vendors use different pages.
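
For context, the common part of any VPD page is standardized (SPC):
byte 1 is the page code and bytes 2-3 the big-endian payload length;
only the payload layout is vendor-specific.  A small sketch of
checking the header before trusting a vendor page (the page contents
below are made up for illustration):

```python
# Decode the standard 4-byte SCSI VPD page header (per SPC) before
# applying any vendor-specific interpretation of the payload.
import struct

def parse_vpd_header(page: bytes):
    """Return (page_code, payload_length, payload) from a raw VPD page."""
    if len(page) < 4:
        raise ValueError("short VPD page")
    peripheral, page_code, length = struct.unpack(">BBH", page[:4])
    return page_code, length, page[4:4 + length]

# A made-up page-0xc0 response: standard header plus 8 payload bytes.
raw = bytes([0x00, 0xc0]) + struct.pack(">H", 8) + b"VOLNAME1"
code, length, payload = parse_vpd_header(raw)
assert code == 0xc0 and length == 8 and payload == b"VOLNAME1"
```

Code like this is where a #define for the vendor page number would
naturally live, so swapping in another vendor's page is one change.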

John



Benjamin> Signed-off-by: Benjamin Marzinski 
Benjamin> ---
Benjamin>  libmultipath/config.c  |  4 
Benjamin>  libmultipath/config.h  |  2 ++
Benjamin>  libmultipath/dict.c| 34 ++
Benjamin>  libmultipath/discovery.c   | 34 +-
Benjamin>  libmultipath/hwtable.c |  2 ++
Benjamin>  libmultipath/print.c   | 27 +++
Benjamin>  libmultipath/propsel.c | 24 
Benjamin>  libmultipath/propsel.h |  2 ++
Benjamin>  libmultipath/structs.h |  9 +
Benjamin>  multipath/multipath.conf.5 |  8 
Benjamin>  10 files changed, 145 insertions(+), 1 deletion(-)

Benjamin> diff --git a/libmultipath/config.c b/libmultipath/config.c
Benjamin> index 85626e96..72b8d37c 100644
Benjamin> --- a/libmultipath/config.c
Benjamin> +++ b/libmultipath/config.c
Benjamin> @@ -369,6 +369,8 @@ merge_hwe (struct hwentry * dst, struct hwentry * src)
Benjamin>   merge_num(max_sectors_kb);
Benjamin>   merge_num(ghost_delay);
Benjamin>   merge_num(all_tg_pt);
Benjamin> + merge_num(vpd_vendor_pg);
Benjamin> + merge_num(vpd_vendor_id);
Benjamin>   merge_num(san_path_err_threshold);
Benjamin>   merge_num(san_path_err_forget_rate);
Benjamin>   merge_num(san_path_err_recovery_time);
Benjamin> @@ -517,6 +519,8 @@ store_hwe (vector hwtable, struct hwentry * dhwe)
Benjamin>   hwe->detect_prio = dhwe->detect_prio;
Benjamin>   hwe->detect_checker = dhwe->detect_checker;
Benjamin>   hwe->ghost_delay = dhwe->ghost_delay;
Benjamin> + hwe->vpd_vendor_pg = dhwe->vpd_vendor_pg;
Benjamin> + hwe->vpd_vendor_id = dhwe->vpd_vendor_id;
 
Benjamin>   if (dhwe->bl_product && !(hwe->bl_product = set_param_str(dhwe->bl_product)))
Benjamin>   goto out;
Benjamin> diff --git a/libmultipath/config.h b/libmultipath/config.h
Benjamin> index e69aa07c..589146de 100644
Benjamin> --- a/libmultipath/config.h
Benjamin> +++ b/libmultipath/config.h
Benjamin> @@ -87,6 +87,8 @@ struct hwentry {
Benjamin>   int max_sectors_kb;
Benjamin>   int ghost_delay;
Benjamin>   int all_tg_pt;
Benjamin> + int vpd_vendor_pg;
Benjamin> + int vpd_vendor_id;
Benjamin>   char * bl_product;
Benjamin>  };
 
Benjamin> diff --git a/libmultipath/dict.c b/libmultipath/dict.c
Benjamin> index 2b046e1d..d6d8b79b 100644
Benjamin> --- a/libmultipath/dict.c
Benjamin> +++ b/libmultipath/dict.c
Benjamin> @@ -1366,6 +1366,39 @@ def_uxsock_timeout_handler(struct config *conf, vector strvec)
Benjamin>   return 0;
Benjamin>  }
 
Benjamin> +static int
Benjamin> +hw_vpd_vendor_handler(struct config *conf, vector strvec)
Benjamin> +{
Benjamin> + int rc = 0;
Benjamin> + char *buff;
Benjamin> +
Benjamin> + struct hwentry * hwe = VECTOR_LAST_SLOT(conf->hwtable);
Benjamin> + if (!hwe)
Benjamin> + return 1;
Benjamin> +
Benjamin> + buff = set_value(strvec);
Benjamin> + if (!buff)
Benjamin> + return 1;
Benjamin> + if (strcmp(buff, "hp3par") == 0) {
Benjamin> + hwe->vpd_vendor_pg = 0xc0;
Benjamin> + hwe->vpd_vendor_id = VPD_VP_HP3PAR;
Benjamin> + } else
Benjamin> + rc = 1;
Benjamin> + FREE(buff);
Benjamin> + return rc;
Benjamin> +}
Benjamin> +
Benjamin> +static int
Benjamin> +snprint_hw_vpd_vendor(struct config *conf, char * buff, int len,
Benjamin> +   const void * data)
Benjamin> +{
Benjamin> + const struct hwentry * hwe = (const struct hwentry *)data;
Benjamin> +
Benjamin> + if (hwe->vpd_vendor_pg == 0xc0 && hwe->vpd_vendor_id == VPD_VP_HP3PAR)
Benjamin> + return snprintf(buff, len, "hp3par");
Benjamin> + return 0;
Benjamin> +}
Benjamin> +
Benjamin>  /*
Benjamin>   * blacklist block handlers
Benjamin>   */
Benjamin> @@ -

Re: [dm-devel] [PATCH V2] dm-raid: fix updating of max_discard_sectors limit

2019-09-11 Thread John Stoffel
> "Mike" == Mike Snitzer  writes:

Mike> On Wed, Sep 11 2019 at  7:31am -0400,
Mike> Ming Lei  wrote:

>> Unit of 'chunk_size' is byte, instead of sector, so fix it.
>> 
>> Without this fix, too big max_discard_sectors is applied on the request queue
>> of dm-raid, finally raid code has to split the bio again.
>> 
>> This re-split done by raid causes the following nested clone_endio:
>> 
>> 1) one big bio 'A' is submitted to dm queue, and served as the original
>> bio
>> 
>> 2) one new bio 'B' is cloned from the original bio 'A', and .map()
>> is run on this bio of 'B', and B's original bio points to 'A'
>> 
>> 3) raid code sees that 'B' is too big, splits 'B' and re-submits
>> the remaining part of 'B' to the dm-raid queue via generic_make_request().
>> 
>> 4) now dm will handle 'B' as a new original bio, then allocate a new
>> clone bio of 'C' and run .map() on 'C'. Meantime C's original bio
>> points to 'B'.
>> 
>> 5) suppose now 'C' is completed by raid directly, then the following
>> clone_endio() is called recursively:
>> 
>> clone_endio(C)
>>   -> clone_endio(B)   # 'B' is the original bio of 'C'
>>     -> bio_endio(A)
>> 
>> 'A' can be big enough to cause hundreds of nested clone_endio() calls, then
>> stack can be corrupted easily.
>> 
>> Cc: 
>> Signed-off-by: Ming Lei 
>> ---
>> V2:
>> - fix commit log a bit
>> 
>> drivers/md/dm-raid.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>> index 8a60a4a070ac..c26aa4e8207a 100644
>> --- a/drivers/md/dm-raid.c
>> +++ b/drivers/md/dm-raid.c
>> @@ -3749,7 +3749,7 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
>> */
>> if (rs_is_raid1(rs) || rs_is_raid10(rs)) {
>> limits->discard_granularity = chunk_size;
>> -limits->max_discard_sectors = chunk_size;
>> +limits->max_discard_sectors = chunk_size >> 9;
>> }
>> }
>> 
>> -- 
>> 2.20.1
>> 

Mike> Thanks a lot Ming!  But oof, really embarrassing oversight on my part!

Mike> FYI, I added a "Fixes:" tag to the commit header and switched to
Mike> shifting by SECTOR_SHIFT instead of 9, staged commit for 5.4 is here:

Mike> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=509818079bf1fefff4ed02d6a1b994e20efc0480

Mike> --
Mike> dm-devel mailing list
Mike> dm-devel@redhat.com
Mike> https://www.redhat.com/mailman/listinfo/dm-devel



Would it make sense to rename the variable to chunk_size_bytes as
well?  

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Serial console is causing system lock-up

2019-03-07 Thread John Stoffel
> "John" == John Ogness  writes:

John> On 2019-03-06, Steven Rostedt  wrote:
>>> This bug only happens if we select large logbuffer (millions of
>>> characters). With smaller log buffer, there are messages "** X printk
>>> messages dropped", but there's no lockup.
>>> 
>>> The kernel apparently puts 2 million characters into a console log
>>> buffer, then takes some lock and then tries to write all of them to a
>>> slow serial line.
>>> 
>>> [...]
>>> 
>>> The MD-RAID is supposed to recalculate data for the corrupted device
>>> and bring it back to life. However, scrubbing the MD-RAID device
>>> resulted in a lot of reads from the device with bad checksums, these
>>> were reported to the log and killed the machine.
>>> 
>>> I made a patch to dm-integrity to rate-limit the error messages. But
>>> anyway - killing the machine in case of too many log messages seems
>>> bad.  If the log messages are produced faster than the kernel can
>>> write them, the kernel should discard some of them, not kill itself.
>> 
>> Sounds like another argument for the new printk design.

John> Assuming the bad checksum messages are considered an emergency
John> (for example, at least loglevel KERN_WARN), then the new printk
John> design would print those messages synchronously to the slow
John> serial line in the context of the driver as the driver is
John> producing them.

John> There wouldn't be a lock-up, but it would definitely slow down
John> the driver. The situation of "messages being produced faster
John> than the kernel can write them" would never exist because the
John> printk() call will only return after the writing is completed. I
John> am curious if that would be acceptable here?

The real problem is the disconnect between the speed and capacity (in
bits/sec) of a serial console and that of the regular console.  Serial,
especially at 9600 baud, is just a slow, limited resource which needs
to be handled differently from a graphical console.

I'm also big on ratelimiting messages, even critical warning
messages.  Too much redundant info doesn't help anyone.  And what a
subsystem thinks is critical, may not be critical to the system as a
whole.

In this case, if these checksum messages are telling us that there's
corruption, why isn't dm-integrity going read-only and forcing the
filesystem above the block device read-only as well, to stop the damage
right away?

If it's just an informational warning, then please rate-limit the
messages, or summarize them in some more useful way.  Or even log them
somewhere other than the console once the problem is noted.

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Possible bug in mirror target

2019-02-12 Thread John Stoffel
>>>>> "Zdenek" == Zdenek Kabelac  writes:

Zdenek> Dne 10. 02. 19 v 22:58 John Stoffel napsal(a):
>>>>>>> "Zdenek" == Zdenek Kabelac  writes:
>> 
Zdenek> Dne 05. 02. 19 v 1:47 Drew Hastings napsal(a):
>>>> Hi,
>>>> 
>>>> I'm assuming all user space code is expected to use the handle_errors 
>>>> feature,
>>>> so this isn't that big of a deal. I'm also using 4.19.13, which I think is
>>>> more recent than the latest update to dm-raid1.c
>>>> 
>>>> That said, there may be a bug that causes the entire mirror to crash if 
>>>> there
>>>> is an error on the first leg of the mirror, but not on the 2nd leg.
>>>> 
>>>> Works fine if you do cause IO errors on the 2nd leg:
>>>> 
>>>> root@kernel-dev:~# dmsetup create zerodev --table "0 1073741824 zero"
>>>> root@kernel-dev:~# dmsetup create errordev --table "0 1073741824 error"
>>>> 
>>>> 
>>>> root@kernel-dev:~# dmsetup create mirror-table --table "0 1073741824 mirror
>>>> core 2 524288 nosync 2 /dev/mapper/zerodev 0 /dev/mapper/errordev 0 0"
>> 
Zdenek> There are 2 operational modes for old dm mirror target.
Zdenek> One requires to handle errors externally. Please check i.e. the following
Zdenek> guide for mirror target:
>> 
Zdenek> https://wiki.gentoo.org/wiki/Device-mapper
>> 
Zdenek> Lvm2 is using 'dmeventd' to service this error handling
Zdenek> (i.e. dropping the mirror leg or allocating replacement one).
>> 
>> Is it time to remove the dm-mirror target then?  Or to deprecate it?
>> Or to just silently replace it with something that does the right
>> thing when errors happen?  I can't think of why *anyone* would want to
>> use the dm-mirror target as it now seems to work.


Zdenek> The old dm mirror works differently than the new dm raid - so if
Zdenek> you have an old one present/running (i.e. LVM mirrored volume)
Zdenek> using this target - you need to have this target present to be
Zdenek> able to access such volume.

Ok, I see that.  I just wonder if A) anyone still uses this target and
B) if it should be deprecated since it doesn't provide the protections
that the dm_raid target provides.

Getting rid of legacy stuff isn't a bad thing... :-)

Zdenek> You can (in lvm2) convert such mirror to use newer dm raid
Zdenek> target - but this requires some extra space (even though very
Zdenek> small metadata volume) which is now required per each mirrored
Zdenek> leg.

Not a bad thing.

Zdenek> Both targets are still maintained and bugfixed.

Zdenek> There is also something about very high complexity of mdraid
Zdenek> code and relative simplicity of old mirror target.

I'm thinking about the complexity of maintaining both in parallel, when
the older one doesn't provide the protections of the newer one. 

Zdenek> As for comment about *anyone* - majority of users are
Zdenek> consuming DM targets via lvm2 - there you get this transition
Zdenek> automatically - when you ask to create mirrored LV - lvm2 will
Zdenek> use (if present in kernel) newer md raid variant.  If anyone
Zdenek> is using DM directly with 'dmsetup' command - it's assumed
Zdenek> these are skilled users familiar with kernel doc (dmsetup is
Zdenek> low-level admins tool).  Lvm2 should be seen as
Zdenek> 'friendly-face' of these DM targets - which otherwise do
Zdenek> require pretty complex setup. If you do not want to use lvm2,
Zdenek> the tool replacing lvm2 needs to reproduce this extra logic.

Right, so how many use the old version?  

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Possible bug in mirror target

2019-02-10 Thread John Stoffel
> "Zdenek" == Zdenek Kabelac  writes:

Zdenek> Dne 05. 02. 19 v 1:47 Drew Hastings napsal(a):
>> Hi,
>> 
>> I'm assuming all user space code is expected to use the handle_errors 
>> feature, 
>> so this isn't that big of a deal. I'm also using 4.19.13, which I think is 
>> more recent than the latest update to dm-raid1.c
>> 
>> That said, there may be a bug that causes the entire mirror to crash if 
>> there 
>> is an error on the first leg of the mirror, but not on the 2nd leg.
>> 
>> Works fine if you do cause IO errors on the 2nd leg:
>> 
>> root@kernel-dev:~# dmsetup create zerodev --table "0 1073741824 zero"
>> root@kernel-dev:~# dmsetup create errordev --table "0 1073741824 error"
>> 
>> 
>> root@kernel-dev:~# dmsetup create mirror-table --table "0 1073741824 mirror 
>> core 2 524288 nosync 2 /dev/mapper/zerodev 0 /dev/mapper/errordev 0 0"

Zdenek> There are 2 operational modes for old dm mirror target.
Zdenek> One requires to handle errors externally. Please check i.e. the following
Zdenek> guide for mirror target:

Zdenek> https://wiki.gentoo.org/wiki/Device-mapper

Zdenek> Lvm2 is using 'dmeventd' to service this error handling
Zdenek> (i.e. dropping the mirror leg or allocating replacement one).

Is it time to remove the dm-mirror target then?  Or to deprecate it?
Or to just silently replace it with something that does the right
thing when errors happen?  I can't think of why *anyone* would want to
use the dm-mirror target as it now seems to work.

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o

2018-07-05 Thread John Stoffel
> "Mikulas" == Mikulas Patocka  writes:

Mikulas> It has been noticed that congestion throttling can slow down
Mikulas> allocation paths that participate in the IO and thus help the
Mikulas> memory reclaim.  Stalling those allocations is therefore not
Mikulas> productive. Moreover the mempool allocator and md variants of the
Mikulas> same already implement their own throttling which has a
Mikulas> better way to be feedback driven. Stalling at the page
Mikulas> allocator is therefore even counterproductive.

Can you show numbers for this claim?  I would think that throttling
needs to be done as close to the disk as possible, and propagate back
up the layers to have this all work well, so that faster devices (and
the layers stacked on them) will work better without stalling on a
slow device.


Mikulas> PF_LESS_THROTTLE is a task flag denoting allocation context that is
Mikulas> participating in the memory reclaim which fits into these IO paths
Mikulas> model, so use the flag and make the page allocator aware they are
Mikulas> special and they do not really want any dirty data throttling.

Mikulas> The throttling causes stalls on Android - it uses the dm-verity driver 
Mikulas> that uses dm-bufio. Allocations in dm-bufio were observed to sleep in 
Mikulas> wait_iff_congested repeatedly.

Mikulas> Signed-off-by: Mikulas Patocka 
Mikulas> Acked-by: Michal Hocko  # mempool_alloc and bvec_alloc
Mikulas> Cc: sta...@vger.kernel.org

Mikulas> ---
Mikulas>  block/bio.c   |4 
Mikulas>  drivers/md/dm-bufio.c |   14 +++---
Mikulas>  drivers/md/dm-crypt.c |8 
Mikulas>  drivers/md/dm-integrity.c |4 
Mikulas>  drivers/md/dm-kcopyd.c|3 +++
Mikulas>  drivers/md/dm-verity-target.c |4 
Mikulas>  drivers/md/dm-writecache.c|4 
Mikulas>  mm/mempool.c  |4 
Mikulas>  8 files changed, 42 insertions(+), 3 deletions(-)

Mikulas> Index: linux-2.6/mm/mempool.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/mm/mempool.c2018-06-29 03:47:16.29000 
+0200
Mikulas> +++ linux-2.6/mm/mempool.c 2018-06-29 03:47:16.27000 +0200
Mikulas> @@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
Mikulas>unsigned long flags;
Mikulas>wait_queue_entry_t wait;
Mikulas>gfp_t gfp_temp;
Mikulas> +  unsigned old_flags;
 
Mikulas>VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
Mikulas>might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
Mikulas> @@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
 
Mikulas>  repeat_alloc:
 
Mikulas> +  old_flags = current->flags & PF_LESS_THROTTLE;
Mikulas> +  current->flags |= PF_LESS_THROTTLE;
Mikulas>element = pool->alloc(gfp_temp, pool->pool_data);
Mikulas> +  current_restore_flags(old_flags, PF_LESS_THROTTLE);
Mikulas>if (likely(element != NULL))
Mikulas>return element;
 
Mikulas> Index: linux-2.6/block/bio.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/block/bio.c 2018-06-29 03:47:16.29000 +0200
Mikulas> +++ linux-2.6/block/bio.c  2018-06-29 03:47:16.27000 +0200
Mikulas> @@ -217,6 +217,7 @@ fallback:
Mikulas>} else {
Mikulas>struct biovec_slab *bvs = bvec_slabs + *idx;
Mikulas>gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | 
__GFP_IO);
Mikulas> +  unsigned old_flags;
 
Mikulas>/*
Mikulas> * Make this allocation restricted and don't dump info 
on
Mikulas> @@ -229,7 +230,10 @@ fallback:
Mikulas> * Try a slab allocation. If this fails and 
__GFP_DIRECT_RECLAIM
Mikulas> * is set, retry with the 1-entry mempool
Mikulas> */
Mikulas> +  old_flags = current->flags & PF_LESS_THROTTLE;
Mikulas> +  current->flags |= PF_LESS_THROTTLE;
Mikulas>bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
Mikulas> +  current_restore_flags(old_flags, PF_LESS_THROTTLE);
Mikulas>if (unlikely(!bvl && (gfp_mask & 
__GFP_DIRECT_RECLAIM))) {
Mikulas>*idx = BVEC_POOL_MAX;
Mikulas>goto fallback;
Mikulas> Index: linux-2.6/drivers/md/dm-bufio.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/drivers/md/dm-bufio.c   2018-06-29 
03:47:16.29000 +0200
Mikulas> +++ linux-2.6/drivers/md/dm-bufio.c2018-06-29 03:47:16.27000 
+0200
Mikulas> @@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
Mikulas>  static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t 
gfp_mask,
Mikulas>   unsigned char *data_mode)
Mikulas>  {
Mikulas> +  void *ptr;
Mikulas>if (unlikely(c->slab_cache != NULL)) {
Mikulas>*data_mode = DATA_MODE_SL

Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-05-02 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Mon, 30 Apr 2018, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> 
Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
>> 
Mikulas> I see your point - and I think the misunderstanding is this.
>> 
>> Thanks.
>> 
Mikulas> This patch is not really helping people to debug existing crashes. It is
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a
Mikulas> page that suggests to turn this option on" - "you turn it on and solve the
Mikulas> crash".
>> 
Mikulas> What this patch really does is that - it makes the kernel deliberately 
Mikulas> crash in a situation when the code violates the specification, but it 
Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
Mikulas> detect specification violations.
>> 
Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report
Mikulas> it. How is the user or developer supposed to learn about this option, if
Mikulas> he gets no crash at all?
>> 
>> So why do we make this a KConfig option at all?

Mikulas> Because other people see the KConfig option (so, they may enable it) and
Mikulas> they don't see the kernel parameter (so, they won't enable it).

Mikulas> Close your eyes and say how many kernel parameters do you remember :-)

>> Just turn it on and let it rip.

Mikulas> I can't test if all the networking drivers use kvmalloc properly, because
Mikulas> I don't have the hardware. You can't test it either. No one has all the
Mikulas> hardware that is supported by Linux.

Mikulas> Driver issues can only be tested by a mass of users. And if the users 
Mikulas> don't know about the debugging option, they won't enable it.

>> >> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> >> tells me *nothing* about why I should pick one or the other, as an
>> >> example.

Mikulas> BTW. You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or
Mikulas> with the kernel parameter "slub_debug" - and most users who compile their
Mikulas> own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible.

You miss my point, which is that there's no explanation of what the
difference is between SLAB and SLUB and which I should choose.  The
same goes here.  If the KConfig option doesn't give useful info, it's
useless.

>> Now I also think that Linus has the right idea to not just sprinkle 
>> BUG_ONs into the code, just dump and oops and keep going if you can.  
>> If it's a filesystem or a device, turn it read only so that people 
>> notice right away.

Mikulas> This vmalloc fallback is similar to
Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE.  CONFIG_DEBUG_KOBJECT_RELEASE
Mikulas> changes the behavior of kobject_put in order to cause
Mikulas> deliberate crashes (that wouldn't happen otherwise) in
Mikulas> drivers that misuse kobject_put. In the same sense, we want
Mikulas> to cause deliberate crashes (that wouldn't happen otherwise)
Mikulas> in drivers that misuse kvmalloc.

Mikulas> The crashes will only happen in debugging kernels, not in
Mikulas> production kernels.

Says you.  What about people or distros that enable it
unconditionally?  They're going to get all kinds of reports and then
turn it off again.  Crashing the system isn't the answer here.  

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-05-02 Thread John Stoffel
> "Mike" == Mike Snitzer  writes:

Mike> On Tue, May 01 2018 at  8:36pm -0400,
Mike> Andrew Morton  wrote:

>> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka 
>>  wrote:
>> 
>> > 
>> > 
>> > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > 
>> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
>> > > > 
>> > > > 
>> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
>> > > > 
>> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
>> > > > > 
>> > > > > > Fixing __vmalloc code 
>> > > > > > is easy and it doesn't require cooperation with maintainers.
>> > > > > 
>> > > > > But it is a hack against the intention of the scope api.
>> > > > 
>> > > > It is not!
>> > > 
>> > > This discussion simply doesn't make much sense it seems. The scope API
>> > > is to document the scope of the reclaim recursion critical section. That
>> > > certainly is not a utility function like vmalloc.
>> > 
>> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
>> > developer) from converting the code to the scope API. You make nonsensical 
>> > excuses.
>> > 
>> 
>> Fun thread!
>> 
>> Winding back to the original problem, I'd state it as
>> 
>> - Caller uses kvmalloc() but passes the address into vmalloc-naive
>> DMA API and
>> 
>> - Caller uses kvmalloc() but passes the address into kfree()
>> 
>> Yes?

Mike> I think so.

>> If so, then...
>> 
>> Is there a way in which, in the kvmalloc-called-kmalloc path, we can
>> tag the slab-allocated memory with a "this memory was allocated with
>> kvmalloc()" flag?  I *think* there's extra per-object storage available
>> with suitable slab/slub debugging options?  Perhaps we could steal one
>> bit from the redzone, dunno.
>> 
>> If so then we can
>> 
>> a) set that flag in kvmalloc() if the kmalloc() call succeeded
>> 
>> b) check for that flag in the DMA code, WARN if it is set.
>> 
>> c) in kvfree(), clear that flag before calling kfree()
>> 
>> d) in kfree(), check for that flag and go WARN() if set.
>> 
>> So both potential bugs are detected all the time, dependent upon
>> CONFIG_SLUB_DEBUG (and perhaps other slub config options).

Mike> Thanks Andrew, definitely the most sane proposal I've seen to resolve
Mike> this.

Cuts to the heart of the issue I think, and seems pretty sane.  Should
the WARN be rate limited as well?

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-30 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:

>> >>>>> "James" == James Bottomley  
>> >>>>> writes:
>> 
James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.
>> 
>> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> tells me *nothing* about why I should pick one or the other, as an
>> example.
>> 
>> John

Mikulas> I see your point - and I think the misunderstanding is this.

Thanks.

Mikulas> This patch is not really helping people to debug existing crashes. It is
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a
Mikulas> page that suggests to turn this option on" - "you turn it on and solve the
Mikulas> crash".

Mikulas> What this patch really does is that - it makes the kernel deliberately 
Mikulas> crash in a situation when the code violates the specification, but it 
Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
Mikulas> detect specification violations.

Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report
Mikulas> it. How is the user or developer supposed to learn about this option, if
Mikulas> he gets no crash at all?

So why do we make this a KConfig option at all?  Just turn it on and
let it rip.  Now I also think that Linus has the right idea to not
just sprinkle BUG_ONs into the code, just dump and oops and keep going
if you can.  If it's a filesystem or a device, turn it read only so
that people notice right away.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread John Stoffel
> "James" == James Bottomley  writes:

James> On Wed, 2018-04-25 at 19:00 -0400, Mikulas Patocka wrote:
>> 
>> On Wed, 25 Apr 2018, James Bottomley wrote:
>> 
>> > > > Do we really need the new config option?  This could just be
>> > > > manually  tunable via fault injection IIUC.
>> > > 
>> > > We do, because we want to enable it in RHEL and Fedora debugging
>> > > kernels, so that it will be tested by the users.
>> > > 
>> > > The users won't use some extra magic kernel options or debugfs
>> files.
>> > 
>> > If it can be enabled via a tunable, then the distro can turn it on
>> > without the user having to do anything.  If you want to present the
>> > user with a different boot option, you can (just have the tunable
>> set
>> > on the command line), but being tunable driven means that you don't
>> > have to choose that option, you could automatically enable it under
>> a
>> > range of circumstances.  I think most sane distributions would want
>> > that flexibility.
>> > 
>> > Kconfig proliferation, conversely, is a bit of a nightmare from
>> both
>> > the user and the tester's point of view, so we're trying to avoid
>> it
>> > unless absolutely necessary.
>> > 
>> > James
>> 
>> BTW. even developers who compile their own kernel should have this
>> enabled by a CONFIG option - because if the developer sees the option
>> when browsing through menuconfig, he may enable it. If he doesn't see
>> the option, he won't even know that such an option exists.

James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.

I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
tells me *nothing* about why I should pick one or the other, as an
example.

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-02-01 Thread John Stoffel
> "Bart" == Bart Van Assche  writes:

Bart> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> But regardless of which wins the race, the queue will have been run.
>> Which is all we care about right?

Bart> Running the queue is not sufficient. With this patch applied it can happen
Bart> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
Bart> concurrent queue runs finish before sufficient device resources are available
Bart> to execute the request and that blk_mq_delay_run_hw_queue() does not get
Bart> called at all. If no other activity triggers a queue run, e.g. request
Bart> completion, this will result in a queue stall.

Doesn't this argue that you really want some sort of completions to be
run in this case instead?  Instead of busy looping or waiting for a
set amount of time, just fire off a callback to run once you have the
resources available, no?

I can't provide any code examples, I don't know the code base nearly
well enough.

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages

2017-08-21 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Fri, 18 Aug 2017, John Stoffel wrote:

Mikulas> The limit can't be changed. I expect that no one will run a system with
Mikulas> such tight requirements that the memory is used up to the last 2%.
>> 
Mikulas> If the user is near hitting the memory limit, he can just add swap space
Mikulas> to solve the problem instead of tuning dm-crypt.
>> 
>> My real concern is that we build up too many outstanding BIOs so that
>> they take a long time to process.  I'm already seeing some reports
>> about timeouts after 120 seconds due to stacking of dm-crypt with
>> other stuff.
>> 
>> What I'm trying to say is that as memory grows, you need to slow down
>> the growth of the amount of memory that dm-crypt can allocate.  It's
>> more important to scale by the speed of the IO sub-system than the
>> size of memory.
>> 
>> But hey, you really addresed my concern with smaller memory systems
>> already.
>> 
>> John

Mikulas> This patch doesn't restrict the number of in-flight bios. Lowering the
Mikulas> memory limit has no effect on the number of bios being allocated by some
Mikulas> other part of the kernel and submitted for dm-crypt.

Good to know.  

Mikulas> Some time ago I made a patch that limits the number of in-flight bios in
Mikulas> device mapper - see this:
Mikulas> http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/

Interesting.  I'll try to find some time to look this over. 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages

2017-08-18 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Thu, 17 Aug 2017, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> 
Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote:
>> 
>> >> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> >> 
Mikulas> dm-crypt consumes an excessive amount of memory when the user attempts to zero
Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls
Mikulas> the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout,
Mikulas> __blkdev_issue_zeroout sends a large amount of write bios that contain the
Mikulas> zero page as their payload.
>> >> 
Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate the amount of memory that is equal to the size of the device.
Mikulas> This can trigger OOM killer or cause system crash.
>> >> 
Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.
>> >> 
>> Is this limit per-device or system-wide?  It's not clear to me if the
>> minimum is just one page or more, so it might be nice to clarify
>> that.
>> 
Mikulas> The limit is system-wide. The limit is divided by the number
Mikulas> of active dm-crypt devices and each device receives an equal
Mikulas> share.
>> 
>> Ok, thanks for the clarification.  Can it be expanded dynamically?  I
>> could see that for large systems, it might not scale well.  
>> 
Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device
Mikulas> limit is never lower than this.
>> 
>> Nice.
>> 
>> >> Also, for large memory systems, that's still a lot of pages.  Maybe it
>> >> should be more exponential in its clamping as memory sizes go up?  2%
>> >> of 2T is 40G, which is pretty darn big...
>> 
Mikulas> The more we restrict it, the less parallelism is used in dm-crypt,
Mikulas> so it makes sense to have a big value on machines with many cores.
>> 
>> True... but it's still a concern that it's hardcoded.
>> 
>> Thanks for your replies.

Mikulas> The limit can't be changed. I expect that no one will run a system
Mikulas> with such tight requirements that the memory is used up to the
Mikulas> last 2%.

Mikulas> If the user is near hitting the memory limit, he can just add swap
Mikulas> space to solve the problem instead of tuning dm-crypt.

My real concern is that we build up too many outstanding BIOs, so that
they take a long time to process.  I'm already seeing some reports
about timeouts after 120 seconds due to stacking of dm-crypt with
other stuff.

What I'm trying to say is that as memory grows, you need to slow down
the growth of the amount of memory that dm-crypt can allocate.  It's
more important to scale by the speed of the IO sub-system than the
size of memory.

But hey, you really addressed my concern with smaller memory systems
already.

John



Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages

2017-08-17 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> 
Mikulas> dm-crypt consumes an excessive amount of memory when the user
Mikulas> attempts to zero a dm-crypt device with "blkdiscard -z". The command
Mikulas> "blkdiscard -z" calls the BLKZEROOUT ioctl, which goes to the
Mikulas> function __blkdev_issue_zeroout; __blkdev_issue_zeroout sends a
Mikulas> large amount of write bios that contain the zero page as their
Mikulas> payload.
>> 
Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate the amount of memory that is equal to the size of the device.
Mikulas> This can trigger OOM killer or cause system crash.
>> 
Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.
>> 
>> Is this limit per-device or system-wide?  It's not clear to me
>> whether the minimum is just one page or more, so it might be nice to
>> clarify that.

Mikulas> The limit is system-wide. The limit is divided by the number
Mikulas> of active dm-crypt devices and each device receives an equal
Mikulas> share.

Ok, thanks for the clarification.  Can it be expanded dynamically?  I
could see that for large systems, it might not scale well.  

Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device
Mikulas> limit is never lower than this.

Nice.

>> Also, for large memory systems, that's still a lot of pages.  Maybe it
>> should be more exponential in its clamping as memory sizes go up?  2%
>> of 2T is 40G, which is pretty darn big...

Mikulas> The more we restrict it, the less parallelism is used in dm-crypt,
Mikulas> so it makes sense to have a big value on machines with many cores.

True... but it's still a concern that it's hardcoded.

Thanks for your replies.



Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages

2017-08-14 Thread John Stoffel
> "Mikulas" == Mikulas Patocka  writes:

Mikulas> dm-crypt consumes an excessive amount of memory when the user
Mikulas> attempts to zero a dm-crypt device with "blkdiscard -z". The command
Mikulas> "blkdiscard -z" calls the BLKZEROOUT ioctl, which goes to the
Mikulas> function __blkdev_issue_zeroout; __blkdev_issue_zeroout sends a
Mikulas> large amount of write bios that contain the zero page as their
Mikulas> payload.

Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate the amount of memory that is equal to the size of the device.
Mikulas> This can trigger OOM killer or cause system crash.

Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.

Is this limit per-device or system-wide?  It's not clear to me whether
the minimum is just one page or more, so it might be nice to clarify
that.

Also, for large memory systems, that's still a lot of pages.  Maybe it
should be more exponential in its clamping as memory sizes go up?  2%
of 2T is 40G, which is pretty darn big...


Mikulas> Signed-off-by: Mikulas Patocka 
Mikulas> Cc: sta...@vger.kernel.org

Mikulas> ---
Mikulas>  drivers/md/dm-crypt.c |   60 +-
Mikulas>  1 file changed, 59 insertions(+), 1 deletion(-)

Mikulas> Index: linux-2.6/drivers/md/dm-crypt.c
Mikulas> ===================================================================
Mikulas> --- linux-2.6.orig/drivers/md/dm-crypt.c
Mikulas> +++ linux-2.6/drivers/md/dm-crypt.c
Mikulas> @@ -148,6 +148,8 @@ struct crypt_config {
Mikulas>mempool_t *tag_pool;
Mikulas>unsigned tag_pool_max_sectors;
 
Mikulas> +  struct percpu_counter n_allocated_pages;
Mikulas> +
Mikulas>struct bio_set *bs;
Mikulas>struct mutex bio_alloc_lock;
 
Mikulas> @@ -219,6 +221,12 @@ struct crypt_config {
Mikulas>  #define MAX_TAG_SIZE  480
Mikulas>  #define POOL_ENTRY_SIZE   512
 
Mikulas> +static DEFINE_SPINLOCK(dm_crypt_clients_lock);
Mikulas> +static unsigned dm_crypt_clients_n = 0;
Mikulas> +static volatile unsigned long dm_crypt_pages_per_client;
Mikulas> +#define DM_CRYPT_MEMORY_PERCENT   2
Mikulas> +#define DM_CRYPT_MIN_PAGES_PER_CLIENT (BIO_MAX_PAGES * 16)
Mikulas> +
Mikulas>  static void clone_init(struct dm_crypt_io *, struct bio *);
Mikulas>  static void kcryptd_queue_crypt(struct dm_crypt_io *io);
Mikulas>  static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
Mikulas> @@ -2158,6 +2166,37 @@ static int crypt_wipe_key(struct crypt_c
Mikulas>return r;
Mikulas>  }
 
Mikulas> +static void crypt_calculate_pages_per_client(void)
Mikulas> +{
Mikulas> +  unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100;
Mikulas> +  if (!dm_crypt_clients_n)
Mikulas> +  return;
Mikulas> +  pages /= dm_crypt_clients_n;

Would it make sense to use a shift here, or does this only run once on
setup?  And shouldn't you return a value here, if only to make it
clear what you're defaulting to?  

Mikulas> +  if (pages < DM_CRYPT_MIN_PAGES_PER_CLIENT)
Mikulas> +  pages = DM_CRYPT_MIN_PAGES_PER_CLIENT;
Mikulas> +  dm_crypt_pages_per_client = pages;
Mikulas> +}
Mikulas> +
Mikulas> +static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
Mikulas> +{
Mikulas> +  struct crypt_config *cc = pool_data;
Mikulas> +  struct page *page;
Mikulas> +  if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
Mikulas> +      likely(gfp_mask & __GFP_NORETRY))
Mikulas> +          return NULL;
Mikulas> +  page = alloc_page(gfp_mask);
Mikulas> +  if (likely(page != NULL))
Mikulas> +  percpu_counter_add(&cc->n_allocated_pages, 1);
Mikulas> +  return page;
Mikulas> +}
Mikulas> +
Mikulas> +static void crypt_page_free(void *page, void *pool_data)
Mikulas> +{
Mikulas> +  struct crypt_config *cc = pool_data;
Mikulas> +  __free_page(page);
Mikulas> +  percpu_counter_sub(&cc->n_allocated_pages, 1);
Mikulas> +}
Mikulas> +
Mikulas>  static void crypt_dtr(struct dm_target *ti)
Mikulas>  {
Mikulas>struct crypt_config *cc = ti->private;
Mikulas> @@ -2184,6 +2223,10 @@ static void crypt_dtr(struct dm_target *
Mikulas>mempool_destroy(cc->req_pool);
Mikulas>mempool_destroy(cc->tag_pool);
 
Mikulas> +  if (cc->page_pool)
Mikulas> +          WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
Mikulas> +  percpu_counter_destroy(&cc->n_allocated_pages);
Mikulas> +
Mikulas>if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
Mikulas>            cc->iv_gen_ops->dtr(cc);
 
Mikulas> @@ -2198,6 +2241,12 @@ static void crypt_dtr(struct dm_target *
 
Mikulas>/* Must zero key material before freeing */
Mikulas>kzfree(cc);
Mikulas> +
Mikulas> +  spin_lock(&dm_crypt_c

Re: [dm-devel] dm-integrity: fix inefficient allocation of stack space

2017-07-20 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Wed, 19 Jul 2017, John Stoffel wrote:

>> I'd like to argue that you should never use BUG_ON at all, especially
>> since if you have integrity running on just one critical device, but
>> have other devices that work just fine, bringing down the entire
>> system because you don't think things are ok is terrible.
>> 
>> We should rip out ALL the BUG_ONs in dm-integrity and just do
>> proper error handling instead.  For testing, sure, use them on your
>> code.  But please, not for general use!
>> 
>> John

Mikulas> Linus sometimes argued against BUG_ON and people are
Mikulas> repeating it after him like sheep.

I understand Linus' rants on BUG_ON and I agree with them.  He doesn't
argue that they're not appropriate at times, but just using them
because a random driver or module freaks out is NOT appropriate.  

Mikulas> If a programmer made a bug in his code, he should fix that
Mikulas> bug, not write additional recovery code for the bug.

If a programmer has a bug which kills the system for no damn good
reason, then I object.  Don't think that dm-integrity is oh so
important that you need to kill the system because of a mistake.

Try to be robust.  90% of all code is error handling.

John



Re: [dm-devel] [PATCH] dm-integrity: fix inefficient allocation of stack space

2017-07-20 Thread John Stoffel
>>>>> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Wed, 19 Jul 2017, John Stoffel wrote:

>> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> 
Mikulas> When using a block size greater than 512 bytes, the dm-integrity
Mikulas> target allocates journal space inefficiently: it allocates one entry
Mikulas> for each 512-byte chunk of data, fills entries for each block of data
Mikulas> and leaves the remaining entries unused. This doesn't cause data
Mikulas> corruption, but it causes severe performance degradation.
>> 
Mikulas> This patch fixes the journal allocation. As a safety measure it
Mikulas> also adds a BUG that fires if the variables representing journal
Mikulas> usage get out of sync (it's better to crash than to continue and
Mikulas> corrupt data, so the BUG is justified).
>> 
>> No, I don't agree.  You should lock down the block device(s) using the
>> dm-integrity target, but you should NOT take down the entire system
>> because of this.  Just return a failure up the stack that forces the
>> device into read only mode so that there's a chance to debug this
>> issue.
>> 
>> Using a BUG_ON, especially for code that isn't proven, is just wrong.
>> Do a WARN_ONCE() and then return an error instead.
>> 
>> John

Mikulas> That BUG can only be triggered by a bug in the code, it can't be
Mikulas> triggered by bad data on the disk, so why should we write code to
Mikulas> recover from it?

Because it's damn obnoxious to kill a system because of bad coding,
without giving the user a chance to recover or even investigate
the root cause.   If it's a bug in the code, just warn the user and
then try to continue.  If you can't continue, then just make the block
device go read-only to protect the data.

Again, why do you think BUG_ON is an appropriate thing to do?  What
advantage does it have for you?

John



Re: [dm-devel] dm-integrity: fix inefficient allocation of stack space

2017-07-19 Thread John Stoffel
>>>>> "Mike" == Mike Snitzer  writes:

Mike> On Wed, Jul 19 2017 at  2:39pm -0400,
Mike> John Stoffel  wrote:

>> >>>>> "Mikulas" == Mikulas Patocka  writes:
>> 
Mikulas> When using a block size greater than 512 bytes, the dm-integrity
Mikulas> target allocates journal space inefficiently: it allocates one entry
Mikulas> for each 512-byte chunk of data, fills entries for each block of data
Mikulas> and leaves the remaining entries unused. This doesn't cause data
Mikulas> corruption, but it causes severe performance degradation.
>> 
Mikulas> This patch fixes the journal allocation. As a safety measure it
Mikulas> also adds a BUG that fires if the variables representing journal
Mikulas> usage get out of sync (it's better to crash than to continue and
Mikulas> corrupt data, so the BUG is justified).
>> 
>> No, I don't agree.  You should lock down the block device(s) using the
>> dm-integrity target, but you should NOT take down the entire system
>> because of this.  Just return a failure up the stack that forces the
>> device into read only mode so that there's a chance to debug this
>> issue.
>> 
>> Using a BUG_ON, especially for code that isn't proven, is just wrong.
>> Do a WARN_ONCE() and then return an error instead.

Mike> I'll remove the BUG_ON from this stable@ fix.  Mikulas, please have a
Mike> look at handling the failure more gracefully (maybe like John suggests).
Mike> In general, there are too many BUG_ON()s in the dm-integrity code.  I
Mike> let them slide for inclusion but it would probably be a good idea to
Mike> look at eliminating them and putting the dm-integrity device into "fail
Mike> mode" in the face of critical errors -- much like dm-thinp and dm-cache
Mike> has.

I'd like to argue that you should never use BUG_ON at all, especially
since if you have integrity running on just one critical device, but
have other devices that work just fine, bringing down the entire
system because you don't think things are ok is terrible.

We should rip out ALL the BUG_ONs in dm-integrity and just do
proper error handling instead.  For testing, sure, use them on your
code.  But please, not for general use!

John



Re: [dm-devel] [PATCH] dm-integrity: fix inefficient allocation of stack space

2017-07-19 Thread John Stoffel
> "Mikulas" == Mikulas Patocka  writes:

Mikulas> When using a block size greater than 512 bytes, the dm-integrity
Mikulas> target allocates journal space inefficiently: it allocates one entry
Mikulas> for each 512-byte chunk of data, fills entries for each block of data
Mikulas> and leaves the remaining entries unused. This doesn't cause data
Mikulas> corruption, but it causes severe performance degradation.

Mikulas> This patch fixes the journal allocation. As a safety measure it
Mikulas> also adds a BUG that fires if the variables representing journal
Mikulas> usage get out of sync (it's better to crash than to continue and
Mikulas> corrupt data, so the BUG is justified).

No, I don't agree.  You should lock down the block device(s) using the
dm-integrity target, but you should NOT take down the entire system
because of this.  Just return a failure up the stack that forces the
device into read only mode so that there's a chance to debug this
issue.

Using a BUG_ON, especially for code that isn't proven, is just wrong.
Do a WARN_ONCE() and then return an error instead.

John



Mikulas> Signed-off-by: Mikulas Patocka 
Mikulas> Cc: sta...@vger.kernel.org
Mikulas> Fixes: 7eada909bfd7 ("dm: add integrity target")

Mikulas> ---
Mikulas>  drivers/md/dm-integrity.c |    7 +++----
Mikulas>  1 file changed, 4 insertions(+), 3 deletions(-)

Mikulas> Index: linux-2.6/drivers/md/dm-integrity.c
Mikulas> ===================================================================
Mikulas> --- linux-2.6.orig/drivers/md/dm-integrity.c
Mikulas> +++ linux-2.6/drivers/md/dm-integrity.c
Mikulas> @@ -1589,14 +1589,14 @@ retry:
Mikulas>unsigned next_entry, i, pos;
Mikulas>unsigned ws, we;
 
Mikulas> -		dio->range.n_sectors = min(dio->range.n_sectors, ic->free_sectors);
Mikulas> +		dio->range.n_sectors = min(dio->range.n_sectors, ic->free_sectors << ic->sb->log2_sectors_per_block);
Mikulas> 		if (unlikely(!dio->range.n_sectors))
Mikulas> 			goto sleep;
Mikulas> -		ic->free_sectors -= dio->range.n_sectors;
Mikulas> +		ic->free_sectors -= dio->range.n_sectors >> ic->sb->log2_sectors_per_block;
Mikulas> 		journal_section = ic->free_section;
Mikulas> 		journal_entry = ic->free_section_entry;

Mikulas> -		next_entry = ic->free_section_entry + dio->range.n_sectors;
Mikulas> +		next_entry = ic->free_section_entry + (dio->range.n_sectors >> ic->sb->log2_sectors_per_block);
Mikulas> 		ic->free_section_entry = next_entry % ic->journal_section_entries;
Mikulas> 		ic->free_section += next_entry / ic->journal_section_entries;
Mikulas> 		ic->n_uncommitted_sections += next_entry / ic->journal_section_entries;
Mikulas> @@ -1727,6 +1727,7 @@ static void pad_uncommitted(struct dm_in
Mikulas> 		wraparound_section(ic, &ic->free_section);
Mikulas> 		ic->n_uncommitted_sections++;
Mikulas> 	}
Mikulas> +	BUG_ON((ic->n_uncommitted_sections + ic->n_committed_sections) * ic->journal_section_entries + ic->free_sectors != ic->journal_sections * ic->journal_section_entries);
Mikulas>  }
 
Mikulas>  static void integrity_commit(struct work_struct *w)



Re: [dm-devel] dm-cache issue

2016-11-15 Thread John Stoffel

Alexander> I am in a process of evaluating dm-cache for our backup system.

Why are you bothering to cache your backup destination?  What
filesystems are you using for your destinations?

Alexander> Currently I have an issue when restart the backup
Alexander> server. The server is booting for hours, because of IO
Alexander> load. It seems is triggered a flush from SSD disk (that is
Alexander> used for a cache device) to the raid controllers (they are
Alexander> with slow SATA disks).  I have 10 cached logical volumes in
Alexander> writethrough mode, each with 2T of data over 2 raid
Alexander> controllers. I use a single SSD disk for the cache.  The
Alexander> backup system is with lvm2-2.02.164-1 & kernel 4.4.30.

How big is the SSD cache?  Do you have any output from the bootup
(syslog, dmesg, etc.) you can share?

Alexander> Do you have any ideas why such flush is triggered? In
Alexander> writethrough cache mode we shouldn't have dirty blocks in
Alexander> the cache.  

I wonder if it makes more sense to use 'lvcache' instead, since you
can add/remove cache from LVs at will.  With bcache, once you set up a
volume with the cache, you're stuck with it.

Also, did you mirror your SSD in case it fails?

In my experience, using RAID6 on a bunch of SATA disks for backup is
fine, you're writing large chunks of data in a few sets of files, so
fragmentation and the Read-Modify-Write cycle is much less painful,
since you tend to stream your writes to the storage, which is
perfectly fine with SATA drives.

Also, bcache skips sequential IO by default, which is what you want
when doing backups... so I'm wondering what your thinking was here for
using this?

John



Re: [dm-devel] getting feedback on dm-cache statistics tool

2016-05-27 Thread John Stoffel
>>>>> "Ben" == Ben England  writes:

Ben> John, thanks for reply, comments inline...
Ben> - Original Message -
>> From: "John Stoffel" 
>> To: "Ben England" 
>> Cc: dm-devel@redhat.com
>> Sent: Friday, May 27, 2016 11:15:51 AM
>> Subject: Re: [dm-devel] getting feedback on dm-cache statistics tool
>> 
>> 
Ben> I have a git repo containing a tool I wrote to look at dm-cache
Ben> statistics, not the raw counters in "dmsetup status" but derived
Ben> values that are more directly useful to system administrators and
Ben> developers that want to see whether dm-cache is doing what they
Ben> want it to.  https://github.com/bengland2/dmcache-stat
>> 
Ben> Any feedback on this?  Anything missing?  I don't mind adding or
Ben> taking pull requests.  If some other tool can provide the same
Ben> functionality within RHEL then I'm happy to use that.
>> 
>> I'm using Debian Jessie on x86_64 running kernel 4.4.0-rc7 (self
>> compiled) and it's dying with an error:
>> 

Ben> I don't have a Debian system handy, can you do "dmsetup status"
Ben> on it and mail me the log so I know what the format difference
Ben> is?

It's mostly because I think I have bad UUIDs on my LVM volumes,
because I've basically just upgraded this system for the past six
years, without building new volumes/LVMs.

Here's the output though:

> sudo dmsetup status
[sudo] password for john:
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVLvpBly2guNYp3XUqTVCgGCHequQOBEf9"
should be mangled b.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkuRxXnQASgLcdsdFmK9OviYa88Q6buOU"
should be mangled b.
quad-swap_1: 0 15622144 linear
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVM32OGu9FYyW2J5u8Q1zSNh826zw6BnFX"
should be mangled b.
quad-root: 0 78118912 linear
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVabjo7c4R5l6twigsqqfc55LVN6XVag4W-cdata"
should be man.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVIWMZ8hjG32sj1LIdQja6QWB4OWkUvUCl"
should be mangled b.
bacula-backups: 0 5857345536 linear
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkOhak3U4uF82eWXH5JY1F3VhHzMHYV8c-cmeta"
should be man.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVR1c11WDl2e86Ko9OmuWYRIUNXExHQnX1"
should be mangled b.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVvdF6NWzpggiPLD202jpru363Z5LfB5Lo"
should be mangled b.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wV40890zsM04q5zMIs0qdeCDfG9fc9FwbF"
should be mangled b.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVrdBEgHTcrXLHDWNrbFotKPeLt2TfbRVo"
should be mangled b.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVi37feovUsvWbysOk8PI2bbdjqokieGx2-cdata"
should be man.
quad-var: 0 39059456 linear
quad-var: 39059456 83886080 linear
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVnp1To0klyekIS85gueDeq2EsYg5Osj44"
should be mangled b.
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVas2sZsZM8mw0VXDgx03ryaq351RUEL4j-cmeta"
should be man.
bacula-incrs: 0 5662310400 linear
The UUID
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVbbcg3pVFZyDngdRLS4aECbc571Lyb5MI"
should be mangled b.
Command failed



>> > sudo ./dmcache_stat.py 10 0
>> volname, size(GiB), policy, mode
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVLvpBly2guNYp3XUqTVCgGCHequQOBEf9"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkuRxXnQASgLcdsdFmK9OviYa88Q6buOU"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVM32OGu9FYyW2J5u8Q1zSNh826zw6BnFX"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVabjo7c4R5l6twigsqqfc55LVN6XVag4W-cdata"
>> should be man.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVIWMZ8hjG32sj1LIdQja6QWB4OWkUvUCl"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkOhak3U4uF82eWXH5JY1F3VhHzMHYV8c-cmeta"
>> should be man.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVR1c11WDl2e86Ko9OmuWYRIUNXExHQnX1"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVvdF6NWzpggiPLD202jpru363Z5LfB5Lo"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wV40890zsM04q5zMIs0qdeCDfG9fc9FwbF"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVrdBEgHTcrXLHDWNrbFotKPeLt2TfbRVo"
>> should be mangled b.
>> The UUID
>> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVi37feovUsvWbysOk8PI2bbdjqokieGx2-cdata"
>> should be man.
>> The U

Re: [dm-devel] getting feedback on dm-cache statistics tool

2016-05-27 Thread John Stoffel

Ben> I have a git repo containing a tool I wrote to look at dm-cache
Ben> statistics, not the raw counters in "dmsetup status" but derived
Ben> values that are more directly useful to system administrators and
Ben> developers that want to see whether dm-cache is doing what they
Ben> want it to.  https://github.com/bengland2/dmcache-stat

Ben> Any feedback on this?  Anything missing?  I don't mind adding or
Ben> taking pull requests.  If some other tool can provide the same
Ben> functionality within RHEL then I'm happy to use that.

I'm using Debian Jessie on x86_64 running kernel 4.4.0-rc7 (self
compiled) and it's dying with an error:

> sudo ./dmcache_stat.py 10 0
volname, size(GiB), policy, mode
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVLvpBly2guNYp3XUqTVCgGCHequQOBEf9" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkuRxXnQASgLcdsdFmK9OviYa88Q6buOU" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVM32OGu9FYyW2J5u8Q1zSNh826zw6BnFX" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVabjo7c4R5l6twigsqqfc55LVN6XVag4W-cdata" 
should be man.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVIWMZ8hjG32sj1LIdQja6QWB4OWkUvUCl" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkOhak3U4uF82eWXH5JY1F3VhHzMHYV8c-cmeta" 
should be man.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVR1c11WDl2e86Ko9OmuWYRIUNXExHQnX1" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVvdF6NWzpggiPLD202jpru363Z5LfB5Lo" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wV40890zsM04q5zMIs0qdeCDfG9fc9FwbF" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVrdBEgHTcrXLHDWNrbFotKPeLt2TfbRVo" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVi37feovUsvWbysOk8PI2bbdjqokieGx2-cdata" 
should be man.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVnp1To0klyekIS85gueDeq2EsYg5Osj44" should 
be mangled b.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVas2sZsZM8mw0VXDgx03ryaq351RUEL4j-cmeta" 
should be man.
The UUID 
"LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVbbcg3pVFZyDngdRLS4aECbc571Lyb5MI" should 
be mangled b.
Command failed
Traceback (most recent call last):
  File "./dmcache_stat.py", line 198, in 
  s2 = poll_dmcache()
  File "./dmcache_stat.py", line 172, in poll_dmcache
dmsetup_out = subprocess.check_output(['dmsetup', 'status'])
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
raise CalledProcessError(retcode, cmd, output=output)
  subprocess.CalledProcessError: Command '['dmsetup', 'status']' returned 
non-zero exit status 1


But this is partly because my UUIDs aren't being handled properly and I've
been loath to follow the process for rebuilding them because I'm scared.
And it's my main fileserver.  Maybe this weekend if I have time.


The other comment is that the usage should be more like:

dmcache-stat <interval> [<count>]

where if you don't provide the count, it defaults to going forever, like
iostat/vmstat, etc.

And now that I think of it, I'm using lvmcache, not dmcache or bcache...

> sudo lvs
  LV           VG     Attr       LSize   Pool         Origin        Data% Meta% Move Log Cpy%Sync
  backups      bacula -wi-ao       2.73t
  incrs        bacula -wi-ao       2.64t
  drupal       data   -wi-ao      50.00g
  home         data   Cwi-aoC--- 550.00g homecacheLV  [home_corig]
  homecacheLV  data   Cwi---C---  50.00g
  local        data   Cwi-aoC--- 335.00g localcacheLV [local_corig]
  localcacheLV data   Cwi---C---  50.00g
  minecraft    data   -wc-ao      20.00g
  nas          data   -wi-ao     600.00g
  pete         data   -wi-a-     800.00g
  vm1          data   -wc-ao      20.00g
  winxp        data   -wi-a-     170.00g
  root         quad   -wi-ao      37.25g
  swap_1       quad   -wi-ao       7.45g
  var          quad   -wi-ao      58.62g

Though looking at this list, I really should just nuke that ancient WinXP VM 
image I have.  LOL.



Re: [dm-devel] [PATCH 01/17] multipathd: use /run instead of /var/run

2016-03-30 Thread John Stoffel
>>>>> "Benjamin" == Benjamin Marzinski  writes:

Benjamin> On Tue, Mar 29, 2016 at 09:57:25AM -0400, John Stoffel wrote:
>> 
Benjamin> /var/run is now usually a symlink to /run.  If /var is on a separate
Benjamin> filesystem, when multipathd starts up, it might end up writing to
Benjamin> /var/run before the /var filesystem is mounted and thus not have its
Benjamin> pidfile accessible at /var/run afterwards.  On most distributions
Benjamin> /run is now a tmpfs and should always be available before multipathd
Benjamin> is started, so multipath should just write there directly, instead
Benjamin> of through the symlink.
>> 
>> I'm not sure I agree with this, it really smells more like a
>> distribution problem, than a multipathd problem.  If multipathd starts
>> up in an initramfs, how does it get reset to the proper /var/run or
>> /run directory then?

Benjamin> Huh? /run is a tmpfs on almost all distros now. It gets
Benjamin> remounted when you switch the root from the initramfs, but
Benjamin> it doesn't go away. So there isn't a "proper" /run that
Benjamin> appears later. /var/run just gets symlinked to it.

Gah!  I apologize for the noise.  I did a quick check on some RHEL5/6
systems (which are ancient, I agree!) but didn't look at newer
Debian/Mint systems.  And I do see that /run is there.

John



Re: [dm-devel] [PATCH 10/17] multipathd: delay reloads during creation

2016-03-29 Thread John Stoffel
> "Benjamin" == Benjamin Marzinski  writes:

Benjamin> lvm needs PV devices to not be suspended while the udev
Benjamin> rules are running, for them to be correctly identified as
Benjamin> PVs. However, multipathd will often be in a situation where
Benjamin> it will create a multipath device upon seeing a path, and
Benjamin> then immediately reload the device upon seeing another path.
Benjamin> If multipath is reloading a device while processing the udev
Benjamin> event from its creation, lvm can fail to identify it as a
Benjamin> PV. This can cause systems to fail to boot. Unfortunately,
Benjamin> using udev synchronization cookies to solve this issue would
Benjamin> cause a host of other issues that could only be avoided by a
Benjamin> pretty substantial change in how multipathd does locking and
Benjamin> event processing. The good news is that multipathd is
Benjamin> already listening to udev events itself, and can make sure
Benjamin> that it isn't reloading when it shouldn't be.

Benjamin> This patch makes multipathd delay or refuse any reloads that
Benjamin> would happen between the time when it creates a device, and
Benjamin> when it receives the change uevent from the device
Benjamin> creation. The only reloads that it refuses are from the
Benjamin> multipathd interactive commands that make no sense on a not
Benjamin> fully started device.  Otherwise, it processes the event or
Benjamin> command, and sets a flag to either mark that device for an
Benjamin> update, or to signal that multipathd needs a
Benjamin> reconfigure. When the udev event for the creation arrives,
Benjamin> multipath will reload the device if necessary. If a
Benjamin> reconfigure has been requested, and no devices are currently
Benjamin> being created, multipathd will also do the reconfigure then.

Benjamin> Also this patch adds a configurable timer
Benjamin> "missing_uev_msg_delay" defaulting to 30 seconds. If the
Benjamin> udev creation event has not arrived after this timeout has
Benjamin> triggered, multipathd will start printing messages alerting
Benjamin> the user of this every "missing_uev_msg_delay" seconds.

Should this really keep printing this message every 30 seconds for
eternity?  I would think that having it give up after 30 * N seconds
would be better instead.  I'm worried that this might block or slow
down system boots forever, instead of at least failing and falling
through so that maybe something can be recovered here.

Basically, what can the user do if they start getting these messages?
We should prompt them with a possible cause/solution if at all
possible.

John



Re: [dm-devel] [PATCH 01/17] multipathd: use /run instead of /var/run

2016-03-29 Thread John Stoffel

Benjamin> /var/run is now usually a symlink to /run.  If /var is on a separate
Benjamin> filesystem, when multipathd starts up, it might end up writing to
Benjamin> /var/run before the /var filesystem is mounted and thus not have its
Benjamin> pidfile accessible at /var/run afterwards.  On most distributions /run
Benjamin> is now a tmpfs and should always be available before multipathd is
Benjamin> started, so multipath should just write there directly, instead of
Benjamin> through the symlink.
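The build-time test the patch uses (quoted in the Makefile hunk below) can be written out as plain shell: choose /run when /var/run is a symlink or does not exist as a directory, otherwise keep /var/run.

```shell
#!/bin/sh
# Mirrors the patch's Makefile test for picking the pid-file directory.
if [ -L /var/run ] || [ ! -d /var/run ]; then
    RUN=run
else
    RUN=var/run
fi
echo "multipathd pid file would live in /$RUN"
```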

I'm not sure I agree with this; it smells more like a distribution
problem than a multipathd problem.  If multipathd starts up in an
initramfs, how does the pidfile get reset to the proper /var/run or
/run directory afterwards?

And especially if it's a symlink, we shouldn't need to care at all.

What distros are using /run instead of /var/run these days?

John


Benjamin> If /var/run is not a symlink, continue using it.

Benjamin> Signed-off-by: Benjamin Marzinski 
Benjamin> ---
Benjamin>  Makefile.inc| 10 +-
Benjamin>  libmultipath/defaults.h |  2 +-
Benjamin>  multipathd/multipathd.init.suse |  2 +-
Benjamin>  3 files changed, 11 insertions(+), 3 deletions(-)

Benjamin> diff --git a/Makefile.inc b/Makefile.inc
Benjamin> index c3ed73f..357dd33 100644
Benjamin> --- a/Makefile.inc
Benjamin> +++ b/Makefile.inc
Benjamin> @@ -21,6 +21,14 @@ ifndef LIB
Benjamin>   endif
Benjamin>  endif
 
Benjamin> +ifndef RUN
Benjamin> + ifeq ($(shell test -L /var/run -o ! -d /var/run && echo 1),1)
Benjamin> + RUN=run
Benjamin> + else
Benjamin> + RUN=var/run
Benjamin> + endif
Benjamin> +endif
Benjamin> +
Benjamin>  ifndef SYSTEMD
Benjamin>   ifeq ($(shell systemctl --version > /dev/null 2>&1 && echo 1), 1)
Benjamin>   SYSTEMD = $(shell systemctl --version 2> /dev/null | sed -n 's/systemd \([0-9]*\)/\1/p')
Benjamin> @@ -54,7 +62,7 @@ ifndef RPM_OPT_FLAGS
Benjamin>  endif
 
Benjamin>  OPTFLAGS = $(RPM_OPT_FLAGS) -Wunused -Wstrict-prototypes
Benjamin> -CFLAGS= $(OPTFLAGS) -fPIC -DLIB_STRING=\"${LIB}\"
Benjamin> +CFLAGS= $(OPTFLAGS) -fPIC -DLIB_STRING=\"${LIB}\" -DRUN_DIR=\"${RUN}\"
Benjamin>  SHARED_FLAGS = -shared
 
Benjamin>  %.o: %.c
Benjamin> diff --git a/libmultipath/defaults.h b/libmultipath/defaults.h
Benjamin> index 8902f40..cf1d5be 100644
Benjamin> --- a/libmultipath/defaults.h
Benjamin> +++ b/libmultipath/defaults.h
Benjamin> @@ -26,7 +26,7 @@
Benjamin>  #define MAX_CHECKINT(a)  (a << 2)
 
Benjamin>  #define MAX_DEV_LOSS_TMO 0x7FFF
Benjamin> -#define DEFAULT_PIDFILE  "/var/run/multipathd.pid"
Benjamin> +#define DEFAULT_PIDFILE  "/" RUN_DIR "/multipathd.pid"
Benjamin>  #define DEFAULT_SOCKET   "/org/kernel/linux/storage/multipathd"
Benjamin>  #define DEFAULT_CONFIGFILE   "/etc/multipath.conf"
Benjamin>  #define DEFAULT_BINDINGS_FILE"/etc/multipath/bindings"
Benjamin> diff --git a/multipathd/multipathd.init.suse b/multipathd/multipathd.init.suse
Benjamin> index d1319b1..ed699fa 100644
Benjamin> --- a/multipathd/multipathd.init.suse
Benjamin> +++ b/multipathd/multipathd.init.suse
Benjamin> @@ -17,7 +17,7 @@
 
Benjamin>  PATH=/bin:/usr/bin:/sbin:/usr/sbin
Benjamin>  DAEMON=/sbin/multipathd
Benjamin> -PIDFILE=/var/run/multipathd.pid
Benjamin> +PIDFILE=/run/multipathd.pid
Benjamin>  MPATH_INIT_TIMEOUT=10
Benjamin>  ARGS=""
 
Benjamin> -- 
Benjamin> 1.8.3.1




Re: [dm-devel] [PATCH] [dm-cache] Make the mq policy an alias for smq

2016-02-11 Thread John Stoffel
>>>>> "Joe" == Joe Thornber  writes:

Joe> On Wed, Feb 10, 2016 at 12:06:00PM -0500, John Stoffel wrote:
>> 
>> Can you add in some documentation on how you tell which dm_cache
>> policy is actually being used, and how to measure it, etc?  It's a
>> black box and some info would be nice.

Joe> You can get some stats on the cache performance via the status
Joe> ioctl (eg, dmsetup status ...).  This will tell you about
Joe> promotions/demotion to the cache, write hits, read hits etc.

I've looked at this output, but I can't make heads or tails of it, nor
do the docs (v4.4-rc7) seem to provide any useful headings.

# dmsetup --manglename none status --target cache
data-home: 0 1153433600 cache 8 2443/32768 128 819200/819200 379170 190334 1010554 121237 232 232 0 1 writeback 2 migration_threshold 2048 smq 0 rw -
data-local: 0 702545920 cache 8 2443/32768 128 804148/819200 574195 28503 81586 19118 0 85798 0 1 writeback 2 migration_threshold 2048 smq 0 rw -

But at least I see that I'm using smq for my cache policy.  
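For what it's worth, here is my reading of the first status line above, going by Documentation/device-mapper/cache.txt; the field order may differ across kernel versions, so treat this as a sketch rather than a reference:

```shell
#!/bin/sh
# Decode one `dmsetup status --target cache` line.  Fields 1-4 are the
# device name, start sector, length, and target type; the rest is the
# cache status per Documentation/device-mapper/cache.txt (hedged).
line='data-home: 0 1153433600 cache 8 2443/32768 128 819200/819200 379170 190334 1010554 121237 232 232 0 1 writeback 2 migration_threshold 2048 smq 0 rw -'
decoded=$(echo "$line" | awk '{
    printf "metadata block size:        %s sectors\n", $5
    printf "metadata blocks used/total: %s\n",         $6
    printf "cache block size:           %s sectors\n", $7
    printf "cache blocks used/total:    %s\n",         $8
    printf "read hits/misses:           %s/%s\n",      $9, $10
    printf "write hits/misses:          %s/%s\n",      $11, $12
    printf "demotions/promotions:       %s/%s\n",      $13, $14
    printf "dirty blocks:               %s\n",         $15
}')
echo "$decoded"
```

The trailing fields are the feature list (here "1 writeback"), core args ("2 migration_threshold 2048"), the policy name and its arg count ("smq 0"), the read/write mode, and the needs_check flag.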

Joe> There is documentation on cache policies in
Joe> Documentation/device-mapper/cache-policies.txt

Joe> 
https://github.com/jthornber/linux-2.6/blob/2016-02-10-thin-dev-on-4.4/Documentation/device-mapper/cache-policies.txt

Joe> As for knowing which policy is running; I'm not sure what to say.
Joe> It'll be the one you ask for.  If the above patch goes in, then
Joe> there'll be a kernel version where mq becomes the same as smq.
Joe> I'll bump the policy version number to make it clear that mq has
Joe> undergone a big change.

I never explicitly asked for any policy when using lvcache, so that's
why I was asking.  *grin*

Using the -c or -C options doesn't really change things.  I see
there's a --noheadings option, but that's not useful without the
inverse --headings option so I can figure out what the results mean
without looking into the kernel source code.

Should I pull down the source and try to make up a patch for this
case?

John



Re: [dm-devel] [PATCH] [dm-cache] Make the mq policy an alias for smq

2016-02-10 Thread John Stoffel

Can you add in some documentation on how you tell which dm_cache
policy is actually being used, and how to measure it, etc?  It's a
black box and some info would be nice.

John



Re: [dm-devel] dm-cache + storage client experiment

2016-01-20 Thread John Stoffel
> "mlody3k" == mlody3k   writes:

mlody3k> Hi dm developers!
mlody3k> I'm running an experiment with dm-cache and a storage client, and I
mlody3k> would be very grateful if some of you could help me with a few
mlody3k> questions :).

mlody3k> Simply put, I would like to implement write-through caching in the
mlody3k> client where:
mlody3k>  - dm-cache is only responsible for checking whether the requested
mlody3k> data is on the cache device (return the requested data on a hit,
mlody3k> ENODATA on a miss - do not read from the origin) - on a miss, my
mlody3k> client will read the data from the origin.
mlody3k>  - dm-cache will write only to the cache device - the client will
mlody3k> write the data to the origin.

So instead, why don't you just have two different mount points, one
being /cache, the other being /slow, and have your app look for data
in /cache first, then pull from /slow.  If you write data, then you
just write it yourself where you want.
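The two-mount-point idea could be sketched like this (the mount points, the function name, and the choice to copy data into the fast tier on a miss are all illustrative assumptions, not part of any real tool):

```shell
#!/bin/sh
# App-level lookup: check the fast tier first, fall back to the slow
# tier, and promote the file into the fast tier on a miss.
CACHE="${CACHE:-/cache}"
SLOW="${SLOW:-/slow}"

fetch() {
    name="$1"
    if [ -f "$CACHE/$name" ]; then
        cat "$CACHE/$name"                 # cache hit
    elif [ -f "$SLOW/$name" ]; then
        cp "$SLOW/$name" "$CACHE/$name"    # miss: populate the cache
        cat "$CACHE/$name"
    else
        return 1                           # not found in either tier
    fi
}

# Example: fetch myfile.dat
```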

dm-cache is more of a block level tool, and you're looking for more of
a file level tool from what I can see.

Probably what you really want is to be able to tune and tweak the
algorithm inside dm-cache to do what you want.  In that case, I don't
have a clue how you would do that, unfortunately.


