Re: [dm-devel] [PATCH] fix writing to the filesystem after unmount
> "Christian" == Christian Brauner writes:

>> Well, currently you click some "Eject / safely remove / whatever" button
>> and then you get a "wait" dialog until everything is done, after which
>> you're told the stick is safe to remove. What I imagine is that the "wait"
>> dialog needs to be there while there are any (or exclusive at minimum)
>> openers of the device. Not until the umount(2) syscall has returned. And yes, the

> Agreed. umount(2) doesn't give guarantees about a filesystem being
> really gone once it has returned. And it really shouldn't. There are
> too many cases where that doesn't work and it's not a commitment we
> should make.

So how the heck is someone supposed to know, from userspace, that a
filesystem is unmounted? Just wearing my SysAdmin hat, this strikes me
as really potentially painful and annoying. But then again, so are bind
mounts from a lot of viewpoints too.

Don't people remember how bad it can be when you are trying to shut down
a system and it hangs because a remote NFS server is down and not
responding? And your client system hangs completely?

> And there are ways to wait until superblock shutdown that I've
> mentioned before in other places where it somehow really
> matters. inotify's IN_UNMOUNT will notify about superblock
> shutdown. IOW, when it really hits generic_shutdown_super(), which
> can only be hit after unfreezing, as that requires active references.

Can we maybe invert this discussion and think about it from the
userspace side of things? How does the user(space) mount/unmount
devices cleanly and reliably?

> So this really can be used to wait for a filesystem to go away across
> all namespaces, and across filesystem freezing, and it's available to
> unprivileged users. Simple example:

> # shell 1
> sudo mount -t xfs /dev/sda /mnt
> sudo mount --bind /mnt /opt
> inotifywait -e unmount /mnt

> # shell 2
> sudo umount /opt  # nothing happens in shell 1
> sudo umount /mnt  # shell 1 gets woken

So what makes this *useful* to anyone?
Why doesn't the bind mount A) lock /mnt into place, but B) give you some
way of seeing that there's a bind mount in place?

>> corner-cases. So does the current behavior, I agree, but improving the
>> situation for one use case while breaking another use case isn't really
>> a way forward...

> Agreed.

>> Well, the filesystem (struct superblock to be exact) is invisible
>> in /proc/mounts (or whatever), that is true. But it is still very
>> much associated with that block device and if you do 'mount
>> ', you'll get it back. But yes, the filesystem
>> will not go away

Then should it be unmountable in the first place? I mean yes, we always
need a way to force an unmount no matter what, even if that breaks some
other process on the system, but for regular use, shouldn't it just give
back an error like:

    /mnt in use by bind mount /opt

or something like that? Give the poor sysadmin some information on
what's going on here.

> And now we at least have an API to detect that case and refuse to reuse
> the superblock.

>> until all references to it are dropped and you cannot easily find
>> who holds those references and how to get rid of them.

Ding ding ding! I don't want to have to run 'lsof' or something like that.

> Namespaces make this even messier. You have no easy way of knowing
> whether the filesystem isn't pinned somewhere else through an
> explicit bind-mount or when it was copied during mount namespace
> creation.

This is the biggest downside of namespaces and bind mounts in my mind:
the lack of traceability back to the source.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] RAID4 with no striping mode request
> "Kyle" == Kyle Sanderson writes:

> hi DM and Linux-RAID,

> There have been multiple proprietary solutions (some nearly 20 years
> old now) with a number of (userspace) bugs that are becoming untenable
> for me as an end user. Basically how they work is a closed MD module
> (typically administered through DM) that uses RAID4 for a dedicated
> parity disk across multiple other disks.

You need to explain what you want in *much* better detail. Give simple,
concrete examples. From the sound of it, you want RAID6 but with RAID4
dedicated parity, so you can spin down some of the data disks in the
array? But if need be, spin up idle disks to recover data if you lose
an active disk? It's really hard to understand what exactly you're
looking for here.

> As there is no striping, the maximum size of the protected data is the
> size of the parity disk (so a set of 4+8+12+16 disks can be protected
> by a single dedicated 16 disk). When a block is written on any disk,
> the parity bit is read from the parity disk again, and updated
> depending on the existing + new bit value (so writing disk + parity
> disk spun up). Additionally, if enough disks are already spun up, the
> parity information can be recalculated from all of the spinning disks,
> resulting in a single write to the parity disk (without a read on the
> parity, doubling throughput). Finally, any of the data disks can be
> moved around within the array without impacting parity as the layout
> has not changed. I don't necessarily need all of these features; just
> the ability to remove a disk and still access the data that was on
> there by spinning up every other disk until the rebuild is complete is
> important.

> The benefit of this can be the data disks are all zoned, and you can
> have a fast parity disk and still maintain excellent performance in
> the array (limited only by the speed of the disk in question +
> parity).
> Additionally, should 2 disks fail, you've either lost the
> parity and data disk, or 2 data disks with the parity and other disks
> not lost.

> I was reading through the DM and MD code and it looks like everything
> may already be there to do this, just needs (significant) stubs to be
> added to support this mode (or new code). Snapraid is a friendly (and
> respectable) implementation of this. Unraid and Synology SHR compete
> in this space, as well as other NAS and enterprise SAN providers.

> Kyle.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
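The parity scheme Kyle describes (read-modify-write against a dedicated parity disk, or a full recalculation when enough disks are already spinning) is plain XOR. A rough Python sketch of both update paths, purely to illustrate the arithmetic — the function names and the byte-level granularity are mine, not taken from any of the implementations mentioned:

```python
def rmw_parity_update(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Read-modify-write update of one block on a dedicated parity disk:
    new_parity = old_parity XOR old_data XOR new_data.  Only the written
    data disk and the parity disk need to be spun up."""
    assert len(old_data) == len(new_data) == len(old_parity)
    return bytes(p ^ od ^ nd
                 for p, od, nd in zip(old_parity, old_data, new_data))

def full_stripe_parity(blocks: list) -> bytes:
    """Recalculate parity from all spinning data disks: one write to the
    parity disk, no parity read.  Shorter disks are treated as
    zero-padded, which is why differently sized data disks can share a
    single parity disk as large as the biggest one."""
    size = max(len(b) for b in blocks)
    parity = bytearray(size)
    for b in blocks:
        for i, byte in enumerate(b):
            parity[i] ^= byte
    return bytes(parity)
```

Either path leaves the parity disk consistent, which is also why a data disk can be physically moved within the array without touching parity: the XOR is position-independent across disks.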
Re: [dm-devel] [PATCH 1/4] block: introduce submit_bio_noacct_add_head
>>>>> "Danny" == Danny Shih writes:

Danny> Hi, John,
Danny> Thank you for taking the time to write a review.

Danny> John Stoffel writes:
>>>>>>> "dannyshih" == dannyshih writes:

dannyshih> From: Danny Shih

dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
>>
>> You're spelling needs to be fixed in these messages.

Danny> Sorry for so many spelling errors.
Danny> The message should be
Danny> "Provide a way for stacking block device to re-submit
Danny> the bio which should be handled first."
Danny> I will fix it.

Great, though my second question is *why* it needs to be handled first?
What is the difference between stacked and un-stacked devices, and how
could it be done in a way that doesn't require a separate function like
this?

dannyshih> Signed-off-by: Danny Shih
dannyshih> Reviewed-by: Allen Peng
dannyshih> Reviewed-by: Alex Wu
dannyshih> ---
dannyshih>  block/blk-core.c       | 44 +---
dannyshih>  include/linux/blkdev.h |  1 +
dannyshih>  2 files changed, 34 insertions(+), 11 deletions(-)
>>
dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
dannyshih> index 96e5fcd..693dc83 100644
dannyshih> --- a/block/blk-core.c
dannyshih> +++ b/block/blk-core.c
dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
dannyshih>  	return ret;
dannyshih>  }
>>
dannyshih> -/**
dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> - * @bio: The bio describing the location in memory and on the device.
dannyshih> - *
dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> - * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> - * systems and other upper level users of the block layer should use
dannyshih> - * submit_bio() instead.
dannyshih> - */
dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
dannyshih>  {
dannyshih>  	if (!submit_bio_checks(bio))
dannyshih>  		return BLK_QC_T_NONE;
dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih>  	 * it is active, and then process them after it returned.
dannyshih>  	 */
dannyshih>  	if (current->bio_list) {
dannyshih> -		bio_list_add(&current->bio_list[0], bio);
dannyshih> +		if (add_head)
dannyshih> +			bio_list_add_head(&current->bio_list[0], bio);
dannyshih> +		else
dannyshih> +			bio_list_add(&current->bio_list[0], bio);
dannyshih>  		return BLK_QC_T_NONE;
dannyshih>  	}
>>
dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih>  		return __submit_bio_noacct_mq(bio);
dannyshih>  	return __submit_bio_noacct(bio);
dannyshih>  }
dannyshih> +
dannyshih> +/**
dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> + * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> + * systems and other upper level users of the block layer should use
dannyshih> + * submit_bio() instead.
dannyshih> + */
dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +{
dannyshih> +	return do_submit_bio_noacct(bio, false);
dannyshih> +}
dannyshih>  EXPORT_SYMBOL(submit_bio_noacct);
>>
>> So why is it named "submit_bio_noacct" when it's supposed to be only
>> used by layers submitting to lower level drivers? How can this be
>> figured out by drivers automatically, so the writer doesn't have to
>> know about this?

Danny> There is no logical change while using submit_bio_noacct() after my
Danny> patch. So I didn't change
Danny> the name and the documentation of submit_bio_noacct().
>>
dannyshih>  /**
dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
dannyshih> + * to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
dannyshih> + * current->bio_list.
dannyshih> + */
>>
>> Firstly isn't proper English. Maybe something like:
>>
>> submit_bio_noacct - re-submit a bio which needs to be handled first
>> because
Re: [dm-devel] [PATCH 1/4] block: introduce submit_bio_noacct_add_head
>>>>> "antlists" == antlists writes:

antlists> On 30/12/2020 00:00, John Stoffel wrote:

dannyshih> From: Danny Shih

dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.
>>
>> You're spelling needs to be fixed in these messages.

antlists>    ^^
antlists> It is traditional, when correcting someone else's spelling,
antlists> to make one of your own ... :-)

LOL! Yes, touche! I'm completely abashed.

antlists> Sorry, but if we're being pedantic for furriners, it behoves
antlists> us to be correct ourselves :-)

It does for sure, thanks for the nudge.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 1/4] block: introduce submit_bio_noacct_add_head
> "dannyshih" == dannyshih writes:

dannyshih> From: Danny Shih

dannyshih> Porvide a way for stacking block device to re-submit the bio
dannyshih> which sholud be handled firstly.

You're spelling needs to be fixed in these messages.

dannyshih> Signed-off-by: Danny Shih
dannyshih> Reviewed-by: Allen Peng
dannyshih> Reviewed-by: Alex Wu
dannyshih> ---
dannyshih>  block/blk-core.c       | 44 +---
dannyshih>  include/linux/blkdev.h |  1 +
dannyshih>  2 files changed, 34 insertions(+), 11 deletions(-)

dannyshih> diff --git a/block/blk-core.c b/block/blk-core.c
dannyshih> index 96e5fcd..693dc83 100644
dannyshih> --- a/block/blk-core.c
dannyshih> +++ b/block/blk-core.c
dannyshih> @@ -1031,16 +1031,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
dannyshih>  	return ret;
dannyshih>  }

dannyshih> -/**
dannyshih> - * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> - * @bio: The bio describing the location in memory and on the device.
dannyshih> - *
dannyshih> - * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> - * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> - * systems and other upper level users of the block layer should use
dannyshih> - * submit_bio() instead.
dannyshih> - */
dannyshih> -blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +static blk_qc_t do_submit_bio_noacct(struct bio *bio, bool add_head)
dannyshih>  {
dannyshih>  	if (!submit_bio_checks(bio))
dannyshih>  		return BLK_QC_T_NONE;
dannyshih> @@ -1052,7 +1043,10 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih>  	 * it is active, and then process them after it returned.
dannyshih>  	 */
dannyshih>  	if (current->bio_list) {
dannyshih> -		bio_list_add(&current->bio_list[0], bio);
dannyshih> +		if (add_head)
dannyshih> +			bio_list_add_head(&current->bio_list[0], bio);
dannyshih> +		else
dannyshih> +			bio_list_add(&current->bio_list[0], bio);
dannyshih>  		return BLK_QC_T_NONE;
dannyshih>  	}

dannyshih> @@ -1060,9 +1054,37 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih>  		return __submit_bio_noacct_mq(bio);
dannyshih>  	return __submit_bio_noacct(bio);
dannyshih>  }
dannyshih> +
dannyshih> +/**
dannyshih> + * submit_bio_noacct - re-submit a bio to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * This is a version of submit_bio() that shall only be used for I/O that is
dannyshih> + * resubmitted to lower level drivers by stacking block drivers. All file
dannyshih> + * systems and other upper level users of the block layer should use
dannyshih> + * submit_bio() instead.
dannyshih> + */
dannyshih> +blk_qc_t submit_bio_noacct(struct bio *bio)
dannyshih> +{
dannyshih> +	return do_submit_bio_noacct(bio, false);
dannyshih> +}
dannyshih>  EXPORT_SYMBOL(submit_bio_noacct);

So why is it named "submit_bio_noacct" when it's supposed to be only
used by layers submitting to lower level drivers? How can this be
figured out by drivers automatically, so the writer doesn't have to
know about this?

dannyshih>  /**
dannyshih> + * submit_bio_noacct - re-submit a bio, which needs to be handle firstly,
dannyshih> + * to the block device layer for I/O
dannyshih> + * @bio: The bio describing the location in memory and on the device.
dannyshih> + *
dannyshih> + * alternative submit_bio_noacct() which add bio to the head of
dannyshih> + * current->bio_list.
dannyshih> + */

Firstly isn't proper English. Maybe something like:

    submit_bio_noacct - re-submit a bio which needs to be handled first
    because to the block device layer for I/O

But the name still sucks, and the *reason* the bio needs to be handled
differently isn't well explained.

dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio)
dannyshih> +{
dannyshih> +	return do_submit_bio_noacct(bio, true);
dannyshih> +}
dannyshih> +EXPORT_SYMBOL(submit_bio_noacct_add_head);
dannyshih> +
dannyshih> +/**
dannyshih>  * submit_bio - submit a bio to the block device layer for I/O
dannyshih>  * @bio: The &struct bio which describes the I/O
dannyshih>  *

dannyshih> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
dannyshih> index 070de09..b0080d0 100644
dannyshih> --- a/include/linux/blkdev.h
dannyshih> +++ b/include/linux/blkdev.h
dannyshih> @@ -905,6 +905,7 @@ static inline void rq_flush_dcache_pages(struct request *rq)
dannyshih>  extern int blk_register_queue(struct gendisk *disk);
dannyshih>  extern void blk_unregister_queue(struct gendisk *disk);
dannyshih>  blk_qc_t submit_bio_noacct(struct bio *bio);
dannyshih> +blk_qc_t submit_bio_noacct_add_head(struct bio *bio);
dannyshih>  extern void blk_rq_init(struct request_queue *q, struct request *rq);
dannyshih>  ext
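For what it's worth, the only behavioral difference the patch introduces is *where* a re-submitted bio lands on current->bio_list while submission recursion is being flattened. A toy userspace model of that ordering (Python; the kernel's bio_list is a C singly-linked list and these names mimic, but are not, the kernel API):

```python
from collections import deque

class BioList:
    """Toy model of the kernel's current->bio_list: bios submitted while
    the list is active are queued and processed only after the current
    submission returns, which avoids unbounded recursion in stacked
    drivers.  Purely illustrative, not the kernel API."""

    def __init__(self):
        self.q = deque()

    def add(self, bio):
        # mimics bio_list_add(): append at the tail (default behavior)
        self.q.append(bio)

    def add_head(self, bio):
        # mimics bio_list_add_head(): this bio runs before everything
        # already queued -- the semantics the new helper exposes
        self.q.appendleft(bio)

    def drain(self):
        """Process queued bios in order, like the submit loop would."""
        order = []
        while self.q:
            order.append(self.q.popleft())
        return order
```

So the review question stands: the patch is mechanically simple, but the commit message never says *why* a stacking driver's re-submitted bio must jump the queue.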
Re: [dm-devel] [general question] rare silent data corruption when writing data
>>>>> "Michal" == Michal Soltys writes:

And of course it should also go to dm-devel@redhat.com; my fault for not
including that as well. I strongly suspect it's a thin-lv problem
somewhere, but I don't know enough to help chase down the problem in
detail.

John

Michal> note: as suggested, I'm also CCing this to linux-lvm; the full
Michal> context with replies starts at:
Michal> https://www.spinics.net/lists/raid/msg64364.html There is also
Michal> the initial post at the bottom as well.

Michal> On 5/8/20 2:54 AM, John Stoffel wrote:
>>>>>>> "Michal" == Michal Soltys writes:
>>
Michal> On 20/05/07 23:01, John Stoffel wrote:
>>>>>>>>> "Roger" == Roger Heflin writes:
>>>>
Roger> Have you tried the same file 2x and verified the corruption is in the
Roger> same places and looks the same?
>>>>
>>>> Are these 1tb files VMDK or COW images of VMs? How are these files
>>>> made? And does it ever happen with *smaller* files? What about if
>>>> you just use a sparse 2tb file and write blocks out past 1tb to see if
>>>> there's a problem?
>>
Michal> The VMs are always directly on lvm volumes. (e.g.
Michal> /dev/mapper/vg0-gitlab). The guest (btrfs inside the guest) detected the
Michal> errors after we ran scrub on the filesystem.
>>
Michal> Yes, the errors were also found on small files.
>>
>> Those errors are in small files inside the VM, which is running btrfs
>> on top of block storage provided by your thin-lv, right?
>>
Michal> Yea, the small files were in this case on that thin-lv.

Michal> We also discovered (yesterday) file corruptions with the VM hosting the
Michal> gitlab registry - this one was using the same thin-lv underneath, but
Michal> the guest itself was using ext4 (in this case, docker simply reported
Michal> an incorrect sha checksum on (so far) 2 layers).
>>
>>
>> disks -> md raid5 -> pv -> vg -> lv-thin -> guest QCOW/LUN ->
>> filesystem -> corruption

Michal> Those particular guests, yea. The host case is just w/o the "guest" step.
Michal> But (so far) all corruption ended up going via one of the lv-thin
Michal> layers (and via one of the md raids).
>>
>>
Michal> Since then we recreated the issue directly on the host, just
Michal> by making an ext4 filesystem on some LV, then doing a write with
Michal> checksum, sync, drop_caches, read and check checksum. The
Michal> errors are, as I mentioned - always full 4KiB chunks (always
Michal> same content, always same position).
>>
>> What position? Is it a 4k, 1.5m or some other consistent offset? And
>> how far into the file? And is this LV a plain LV or a thin-lv? I'm
>> running a debian box at home with RAID1 and I haven't seen this, but
>> I'm not nearly as careful as you. Can you provide the output of:
>>
Michal> What I meant is that it doesn't "move" when verifying the same file
Michal> (aka different reads from the same test file). Between the tests, the
Michal> errors are of course in different places - but it's always some 4KiB
Michal> piece(s) - that look like correct pieces belonging somewhere else.
>>
>> /sbin/lvs --version

Michal> LVM version:     2.03.02(2) (2018-12-18)
Michal> Library version: 1.02.155 (2018-12-18)
Michal> Driver version:  4.41.0

Michal> Configuration: ./configure --build=x86_64-linux-gnu --prefix=/usr
Michal> --includedir=${prefix}/include --mandir=${prefix}/share/man
Michal> --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var
Michal> --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu
Michal> --libexecdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run
Michal> --disable-maintainer-mode --disable-dependency-tracking --exec-prefix=
Michal> --bindir=/bin --libdir=/lib/x86_64-linux-gnu --sbindir=/sbin
Michal> --with-usrlibdir=/usr/lib/x86_64-linux-gnu --with-optimisation=-O2
Michal> --with-cache=internal --with-device-uid=0 --with-device-gid=6
Michal> --with-device-mode=0660 --with-default-pid-dir=/run
Michal> --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm
Michal> --with-thin=internal --with-thin-check=/usr/sbin/thin_check
Michal> --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair
Michal> --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd
Michal> --enable-dbus-service --enable-lvmlockd-dlm --enable-lvmlockd-sanlock
Michal> --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig
Michal> --enable-readline --enable-udev_rules --enable-udev_sync
>>
>> too?
>>
>> Can you post your:
>>
>> /sbin/dmsetup status
>>
>> output too? There's a better command to use here, but I'm not an
>> expert. You might really want to copy this over to the
>> lin
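Michal's reproduction procedure (write a file with a known checksum, sync, drop caches, re-read, and look for corrupt 4 KiB pieces) can be sketched in userspace roughly like this — the paths and chunk size are assumptions for illustration, and the drop_caches step needs root so the re-read actually comes from the device:

```python
import hashlib
import os

CHUNK = 4096  # the corruption reported above is always whole 4 KiB pieces

def write_and_checksum(path, data):
    """Write the test data, fsync it to the device, and return its digest."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return hashlib.sha256(data).hexdigest()

def drop_caches():
    """Requires root; forces the subsequent read to hit the block device."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def find_bad_chunks(path, expected):
    """Return byte offsets of 4 KiB chunks that differ from what was
    written, matching the 'always full 4KiB chunks' symptom."""
    bad = []
    with open(path, "rb") as f:
        off = 0
        while True:
            got = f.read(CHUNK)
            if not got:
                break
            if got != expected[off:off + len(got)]:
                bad.append(off)
            off += len(got)
    return bad
```

Run on a file sitting on the suspect LV (write, drop_caches(), then find_bad_chunks()), a clean run returns an empty list; stable non-empty offsets across re-reads would match the behavior described above.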
Re: [dm-devel] [PATCH 10/15] libmultipath: change how the checker async is set
>>>>> "Benjamin" == Benjamin Marzinski writes:

Benjamin> On Mon, Jan 20, 2020 at 03:20:14PM -0500, John Stoffel wrote:
>> >>>>> "Benjamin" == Benjamin Marzinski writes:
>>
Benjamin> On Sat, Jan 18, 2020 at 12:43:49PM -0500, John Stoffel wrote:
>> >> >>>>> "Benjamin" == Benjamin Marzinski writes:
>> >>
Benjamin> The way that the checkers async mode worked in multipathd didn't make
Benjamin> much sense. When multipathd starts up, all checker classes are in the
Benjamin> sync mode, and any pathinfo() calls on devices would run the checker in
Benjamin> sync mode. However, the first time a checker class was used in
Benjamin> checkerloop, it would set that checker class to async (assuming
Benjamin> force_sync wasn't set). After that, no matter when a checker from that
Benjamin> class was called, it would always run in async mode. Multipathd doesn't
Benjamin> need to run checkers in sync mode at all, so don't.
>> >>
>> >> Sorry, I had a hard time parsing this description, especially the last
>> >> sentence. Do you mean that checkers should default to async then,
>> >> instead of sync mode? And from looking at the code, don't you mean
>> >> that you force sync mode? I guess the question is whether checker
>> >> classes should default to sync, or async. And if they move to async,
>> >> should they stay there?
>> >>
Benjamin> Sorry. I mean that right now multipathd runs the checkers from pathinfo(),
Benjamin> wait_for_pending_paths() and check_path(). When multipathd starts, all
Benjamin> checkers are in sync mode. The first time a checker is run from
Benjamin> check_path(), it is switched to async mode, and stays there for the rest
Benjamin> of the time multipathd is running.
>>
Benjamin> There is no need for multipathd to run checkers in sync mode at all, so
Benjamin> we shouldn't be doing so for these first checks.
>>
>> Thanks, that makes the entire issue much more clear. This raises the
>> question of why there is a sync mode at all then? In any case, the
>> above makes the issue more understandable.

Benjamin> The multipath command, which uses the same checkers, runs in sync mode.

Crystal clear now. Thanks for your patience!

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
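For anyone following along, the knob being discussed is the force_sync option; the fragment below is a sketch of how it would look in multipath.conf (an assumption on my part — verify the exact option name and placement against multipath.conf(5) for your version):

```
defaults {
        # Force path checkers to run synchronously, as the multipath
        # command does internally; multipathd normally leaves this off
        # so checkers run async.
        force_sync yes
}
```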
Re: [dm-devel] [PATCH 10/15] libmultipath: change how the checker async is set
>>>>> "Benjamin" == Benjamin Marzinski writes:

Benjamin> On Sat, Jan 18, 2020 at 12:43:49PM -0500, John Stoffel wrote:
>> >>>>> "Benjamin" == Benjamin Marzinski writes:
>>
Benjamin> The way that the checkers async mode worked in multipathd didn't make
Benjamin> much sense. When multipathd starts up, all checker classes are in the
Benjamin> sync mode, and any pathinfo() calls on devices would run the checker in
Benjamin> sync mode. However, the first time a checker class was used in
Benjamin> checkerloop, it would set that checker class to async (assuming
Benjamin> force_sync wasn't set). After that, no matter when a checker from that
Benjamin> class was called, it would always run in async mode. Multipathd doesn't
Benjamin> need to run checkers in sync mode at all, so don't.
>>
>> Sorry, I had a hard time parsing this description, especially the last
>> sentence. Do you mean that checkers should default to async then,
>> instead of sync mode? And from looking at the code, don't you mean
>> that you force sync mode? I guess the question is whether checker
>> classes should default to sync, or async. And if they move to async,
>> should they stay there?
>>
Benjamin> Sorry. I mean that right now multipathd runs the checkers from pathinfo(),
Benjamin> wait_for_pending_paths() and check_path(). When multipathd starts, all
Benjamin> checkers are in sync mode. The first time a checker is run from
Benjamin> check_path(), it is switched to async mode, and stays there for the rest
Benjamin> of the time multipathd is running.

Benjamin> There is no need for multipathd to run checkers in sync mode at all, so
Benjamin> we shouldn't be doing so for these first checks.

Thanks, that makes the entire issue much more clear. This raises the
question of why there is a sync mode at all then? In any case, the
above makes the issue more understandable.

John

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 10/15] libmultipath: change how the checker async is set
> "Benjamin" == Benjamin Marzinski writes:

Benjamin> The way that the checkers async mode worked in multipathd didn't make
Benjamin> much sense. When multipathd starts up, all checker classes are in the
Benjamin> sync mode, and any pathinfo() calls on devices would run the checker in
Benjamin> sync mode. However, the first time a checker class was used in
Benjamin> checkerloop, it would set that checker class to async (assuming
Benjamin> force_sync wasn't set). After that, no matter when a checker from that
Benjamin> class was called, it would always run in async mode. Multipathd doesn't
Benjamin> need to run checkers in sync mode at all, so don't.

Sorry, I had a hard time parsing this description, especially the last
sentence. Do you mean that checkers should default to async then,
instead of sync mode? And from looking at the code, don't you mean
that you force sync mode? I guess the question is whether checker
classes should default to sync, or async. And if they move to async,
should they stay there?
Benjamin> Signed-off-by: Benjamin Marzinski Benjamin> --- Benjamin> libmpathpersist/mpath_persist.c | 2 +- Benjamin> libmultipath/discovery.c| 10 -- Benjamin> multipath/main.c| 1 + Benjamin> 3 files changed, 6 insertions(+), 7 deletions(-) Benjamin> diff --git a/libmpathpersist/mpath_persist.c b/libmpathpersist/mpath_persist.c Benjamin> index 603cfc3b..b2238f00 100644 Benjamin> --- a/libmpathpersist/mpath_persist.c Benjamin> +++ b/libmpathpersist/mpath_persist.c Benjamin> @@ -47,7 +47,7 @@ mpath_lib_init (void) Benjamin> condlog(0, "Failed to initialize multipath config."); Benjamin> return NULL; Benjamin> } Benjamin> - Benjamin> + conf->force_sync = 1; Benjamin> set_max_fds(conf->max_fds); Benjamin> return conf; Benjamin> diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c Benjamin> index d2773c3a..1ab093f4 100644 Benjamin> --- a/libmultipath/discovery.c Benjamin> +++ b/libmultipath/discovery.c Benjamin> @@ -1643,12 +1643,10 @@ get_state (struct path * pp, struct config *conf, int daemon, int oldstate) Benjamin> if (pp->mpp && !c->mpcontext) Benjamin> checker_mp_init(c, &pp->mpp->mpcontext); Benjamin> checker_clear_message(c); Benjamin> - if (daemon) { Benjamin> - if (conf->force_sync == 0) Benjamin> - checker_set_async(c); Benjamin> - else Benjamin> - checker_set_sync(c); Benjamin> - } Benjamin> + if (conf->force_sync == 0) Benjamin> + checker_set_async(c); Benjamin> + else Benjamin> + checker_set_sync(c); Benjamin> if (!conf->checker_timeout && Benjamin> sysfs_get_timeout(pp, &(c->timeout)) <= 0) c-> timeout = DEF_TIMEOUT; Benjamin> diff --git a/multipath/main.c b/multipath/main.c Benjamin> index 4f4d8e89..aebabd9b 100644 Benjamin> --- a/multipath/main.c Benjamin> +++ b/multipath/main.c Benjamin> @@ -905,6 +905,7 @@ main (int argc, char *argv[]) Benjamin> exit(RTVL_FAIL); Benjamin> multipath_conf = conf; conf-> retrigger_tries = 0; Benjamin> + conf->force_sync = 1; Benjamin> while ((arg = getopt(argc, argv, ":adcChl::FfM:v:p:b:BrR:itTquUwW")) != EOF 
) { Benjamin> switch(arg) { Benjamin> case 1: printf("optarg : %s\n",optarg); Benjamin> -- Benjamin> 2.17.2 Benjamin> -- Benjamin> dm-devel mailing list Benjamin> dm-devel@redhat.com Benjamin> https://www.redhat.com/mailman/listinfo/dm-devel -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 09/15] libmultipath: add code to get vendor specific vpd data
> "Benjamin" == Benjamin Marzinski writes:

Benjamin> This adds the wildcard 'g' for multipath and path formatted
Benjamin> printing, which returns extra data from a device's vendor
Benjamin> specific vpd page. The specific vendor vpd page to use, and
Benjamin> the vendor/product id to decode it, can be set in the hwentry
Benjamin> with vpd_vendor_pg and vpd_vendor_id. It can be configured
Benjamin> in the devices section of multipath.conf with the vpd_vendor
Benjamin> parameter. Currently, the only devices that use this are HPE
Benjamin> 3PAR arrays, to return the Volume Name.

So I've looked over this patch and I see how you've hard-coded the 3PAR
entry into the various .c files, but the multipath.conf example really
doesn't give a good example of how to use it, nor does the code give a
good example, or maybe better yet a template set of functions for
parsing custom VPD info pages. This is really just a nit.

Also, is the vendor VPD page always at 0xc0, or is that just a 3PAR
setting? I wonder if that constant should be pulled into a #define
somewhere, so that it's simpler to update/change as needed? Especially
if other vendors use different pages.
John Benjamin> Signed-off-by: Benjamin Marzinski Benjamin> --- Benjamin> libmultipath/config.c | 4 Benjamin> libmultipath/config.h | 2 ++ Benjamin> libmultipath/dict.c| 34 ++ Benjamin> libmultipath/discovery.c | 34 +- Benjamin> libmultipath/hwtable.c | 2 ++ Benjamin> libmultipath/print.c | 27 +++ Benjamin> libmultipath/propsel.c | 24 Benjamin> libmultipath/propsel.h | 2 ++ Benjamin> libmultipath/structs.h | 9 + Benjamin> multipath/multipath.conf.5 | 8 Benjamin> 10 files changed, 145 insertions(+), 1 deletion(-) Benjamin> diff --git a/libmultipath/config.c b/libmultipath/config.c Benjamin> index 85626e96..72b8d37c 100644 Benjamin> --- a/libmultipath/config.c Benjamin> +++ b/libmultipath/config.c Benjamin> @@ -369,6 +369,8 @@ merge_hwe (struct hwentry * dst, struct hwentry * src) Benjamin> merge_num(max_sectors_kb); Benjamin> merge_num(ghost_delay); Benjamin> merge_num(all_tg_pt); Benjamin> + merge_num(vpd_vendor_pg); Benjamin> + merge_num(vpd_vendor_id); Benjamin> merge_num(san_path_err_threshold); Benjamin> merge_num(san_path_err_forget_rate); Benjamin> merge_num(san_path_err_recovery_time); Benjamin> @@ -517,6 +519,8 @@ store_hwe (vector hwtable, struct hwentry * dhwe) hwe-> detect_prio = dhwe->detect_prio; hwe-> detect_checker = dhwe->detect_checker; hwe-> ghost_delay = dhwe->ghost_delay; Benjamin> + hwe->vpd_vendor_pg = dhwe->vpd_vendor_pg; Benjamin> + hwe->vpd_vendor_id = dhwe->vpd_vendor_id; Benjamin> if (dhwe->bl_product && !(hwe->bl_product = set_param_str(dhwe->bl_product))) Benjamin> goto out; Benjamin> diff --git a/libmultipath/config.h b/libmultipath/config.h Benjamin> index e69aa07c..589146de 100644 Benjamin> --- a/libmultipath/config.h Benjamin> +++ b/libmultipath/config.h Benjamin> @@ -87,6 +87,8 @@ struct hwentry { Benjamin> int max_sectors_kb; Benjamin> int ghost_delay; Benjamin> int all_tg_pt; Benjamin> + int vpd_vendor_pg; Benjamin> + int vpd_vendor_id; Benjamin> char * bl_product; Benjamin> }; Benjamin> diff --git a/libmultipath/dict.c 
b/libmultipath/dict.c Benjamin> index 2b046e1d..d6d8b79b 100644 Benjamin> --- a/libmultipath/dict.c Benjamin> +++ b/libmultipath/dict.c Benjamin> @@ -1366,6 +1366,39 @@ def_uxsock_timeout_handler(struct config *conf, vector strvec) Benjamin> return 0; Benjamin> } Benjamin> +static int Benjamin> +hw_vpd_vendor_handler(struct config *conf, vector strvec) Benjamin> +{ Benjamin> + int rc = 0; Benjamin> + char *buff; Benjamin> + Benjamin> + struct hwentry * hwe = VECTOR_LAST_SLOT(conf->hwtable); Benjamin> + if (!hwe) Benjamin> + return 1; Benjamin> + Benjamin> + buff = set_value(strvec); Benjamin> + if (!buff) Benjamin> + return 1; Benjamin> + if (strcmp(buff, "hp3par") == 0) { Benjamin> + hwe->vpd_vendor_pg = 0xc0; Benjamin> + hwe->vpd_vendor_id = VPD_VP_HP3PAR; Benjamin> + } else Benjamin> + rc = 1; Benjamin> + FREE(buff); Benjamin> + return rc; Benjamin> +} Benjamin> + Benjamin> +static int Benjamin> +snprint_hw_vpd_vendor(struct config *conf, char * buff, int len, Benjamin> + const void * data) Benjamin> +{ Benjamin> + const struct hwentry * hwe = (const struct hwentry *)data; Benjamin> + Benjamin> + if (hwe->vpd_vendor_pg == 0xc0 && hwe->vpd_vendor_id == VPD_VP_HP3PAR) Benjamin> + return snprintf(buff, len, "hp3par"); Benjamin> + return 0; Benjamin> +} Benjamin> + Benjamin> /* Benjamin> * blacklist block handlers Benjamin> */ Benjamin> @@ -
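On the 0xc0 question: vendor-specific pages live behind the same 4-byte VPD header as any other page (per SPC, byte 1 is the page code and bytes 2-3 the big-endian payload length), and the vendor range is 0xC0-0xFF, so 0xC0 is just what this patch hard-codes for 3PAR, not a universal constant. A minimal Python sketch of pulling the payload out of a raw page buffer — the buffer contents here are invented for illustration:

```python
def parse_vpd_page(buf):
    """Parse the standard SCSI VPD page header (SPC): byte 0 carries the
    peripheral qualifier/device type, byte 1 the page code, and bytes
    2-3 the big-endian payload length.  Returns (page_code, payload)."""
    if len(buf) < 4:
        raise ValueError("short VPD header")
    page_code = buf[1]
    length = int.from_bytes(buf[2:4], "big")
    return page_code, buf[4:4 + length]
```

How the vendor payload is decoded past that header (e.g. where 3PAR puts the Volume Name) is entirely vendor-defined, which is exactly why a template parser, as suggested in the review, would help the next array vendor.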
Re: [dm-devel] [PATCH V2] dm-raid: fix updating of max_discard_sectors limit
> "Mike" == Mike Snitzer writes:

Mike> On Wed, Sep 11 2019 at 7:31am -0400,
Mike> Ming Lei wrote:

>> Unit of 'chunk_size' is byte, instead of sector, so fix it.
>>
>> Without this fix, a too-big max_discard_sectors is applied on the request
>> queue of dm-raid, and the raid code finally has to split the bio again.
>>
>> This re-split done by raid causes the following nested clone_endio:
>>
>> 1) one big bio 'A' is submitted to the dm queue, and served as the
>> original bio
>>
>> 2) one new bio 'B' is cloned from the original bio 'A', and .map()
>> is run on 'B', and B's original bio points to 'A'
>>
>> 3) raid code sees that 'B' is too big, and splits 'B' and re-submits
>> the remaining part of 'B' to the dm-raid queue via generic_make_request().
>>
>> 4) now dm will handle 'B' as a new original bio, then allocate a new
>> clone bio 'C' and run .map() on 'C'. Meanwhile C's original bio
>> points to 'B'.
>>
>> 5) suppose now 'C' is completed by raid directly, then the following
>> clone_endio() is called recursively:
>>
>> clone_endio(C) -> clone_endio(B) #B is original bio of 'C' -> bio_endio(A)
>>
>> 'A' can be big enough to cause hundreds of nested clone_endio() calls,
>> so the stack can be corrupted easily.
>>
>> Cc:
>> Signed-off-by: Ming Lei
>> ---
>> V2:
>> - fix commit log a bit
>>
>> drivers/md/dm-raid.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>> index 8a60a4a070ac..c26aa4e8207a 100644
>> --- a/drivers/md/dm-raid.c
>> +++ b/drivers/md/dm-raid.c
>> @@ -3749,7 +3749,7 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
>> 	 */
>> 	if (rs_is_raid1(rs) || rs_is_raid10(rs)) {
>> 		limits->discard_granularity = chunk_size;
>> -		limits->max_discard_sectors = chunk_size;
>> +		limits->max_discard_sectors = chunk_size >> 9;
>> 	}
>> }
>>
>> --
>> 2.20.1
>>

Mike> Thanks a lot Ming! But oof, really embarrassing oversight on my part!
Mike> FYI, I added a "Fixes:" tag to the commit header and switched to
Mike> shifting by SECTOR_SHIFT instead of 9; the staged commit for 5.4 is here:

Mike> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.4&id=509818079bf1fefff4ed02d6a1b994e20efc0480

Mike> --
Mike> dm-devel mailing list
Mike> dm-devel@redhat.com
Mike> https://www.redhat.com/mailman/listinfo/dm-devel

Would it make sense to rename the variable to chunk_size_bytes as well?
Re: [dm-devel] Serial console is causing system lock-up
> "John" == John Ogness writes: John> On 2019-03-06, Steven Rostedt wrote: >>> This bug only happens if we select a large log buffer (millions of >>> characters). With a smaller log buffer, there are messages "** X printk >>> messages dropped", but there's no lockup. >>> >>> The kernel apparently puts 2 million characters into a console log >>> buffer, then takes some lock and then tries to write all of them to a >>> slow serial line. >>> >>> [...] >>> >>> The MD-RAID is supposed to recalculate data for the corrupted device >>> and bring it back to life. However, scrubbing the MD-RAID device >>> resulted in a lot of reads from the device with bad checksums; these >>> were reported to the log and killed the machine. >>> >>> I made a patch to dm-integrity to rate-limit the error messages. But >>> anyway - killing the machine in case of too many log messages seems >>> bad. If the log messages are produced faster than the kernel can >>> write them, the kernel should discard some of them, not kill itself. >> >> Sounds like another argument for the new printk design. John> Assuming the bad checksum messages are considered an emergency John> (for example, at least loglevel KERN_WARN), then the new printk John> design would print those messages synchronously to the slow John> serial line in the context of the driver as the driver is John> producing them. John> There wouldn't be a lock-up, but it would definitely slow down John> the driver. The situation of "messages being produced faster John> than the kernel can write them" would never exist because the John> printk() call will only return after the writing is completed. I John> am curious if that would be acceptable here? The real problem is the disconnect between serial console speed and capacity in bits/sec and that of the regular console. Serial, especially at 9600 baud, is just a slow and limited resource which needs to be handled differently than a graphical console.
I'm also big on ratelimiting messages, even critical warning messages. Too much redundant info doesn't help anyone. And what a subsystem thinks is critical may not be critical to the system as a whole. In this case, if these checksum messages are telling us that there's corruption, why isn't dm-integrity going readonly and having the block device force the filesystem above it readonly as well, to stop the damage right away? If it's just an informational warning, then please rate limit them, or summarize them in some more useful way. Or even log them somewhere other than the console once the problem is noted. John
Re: [dm-devel] Possible bug in mirror target
>>>>> "Zdenek" == Zdenek Kabelac writes: Zdenek> On 10. 02. 19 at 22:58, John Stoffel wrote: >>>>>>> "Zdenek" == Zdenek Kabelac writes: >> Zdenek> On 05. 02. 19 at 1:47, Drew Hastings wrote: >>>> Hi, >>>> >>>> I'm assuming all user space code is expected to use the handle_errors >>>> feature, >>>> so this isn't that big of a deal. I'm also using 4.19.13, which I think is >>>> more recent than the latest update to dm-raid1.c >>>> >>>> That said, there may be a bug that causes the entire mirror to crash if >>>> there >>>> is an error on the first leg of the mirror, but not on the 2nd leg. >>>> >>>> Works fine if you do cause IO errors on the 2nd leg: >>>> >>>> root@kernel-dev:~# dmsetup create zerodev --table "0 1073741824 zero" >>>> root@kernel-dev:~# dmsetup create errordev --table "0 1073741824 error" >>>> >>>> >>>> root@kernel-dev:~# dmsetup create mirror-table --table "0 1073741824 mirror >>>> core 2 524288 nosync 2 /dev/mapper/zerodev 0 /dev/mapper/errordev 0 0" >> Zdenek> There are 2 operational modes for the old dm mirror target. Zdenek> One requires errors to be handled externally. Please check e.g. the following Zdenek> guide for the mirror target: >> Zdenek> https://wiki.gentoo.org/wiki/Device-mapper >> Zdenek> Lvm2 is using 'dmeventd' to service this error handling Zdenek> (i.e. dropping the mirror leg or allocating a replacement one). >> >> Is it time to remove the dm-mirror target then? Or to deprecate it? >> Or to just silently replace it with something that does the right >> thing when errors happen? I can't think of why *anyone* would want to >> use the dm-mirror target, given how it now seems to work. Zdenek> The old dm mirror works differently than the new dm raid - so if Zdenek> you have an old one present/running (i.e. an LVM mirrored volume) Zdenek> using this target - you need to have this target present to be Zdenek> able to access such a volume. Ok, I see that.
I just wonder if A) anyone still uses this target and B) if it should be deprecated since it doesn't provide the protections that the dm_raid target provides. Getting rid of legacy stuff isn't a bad thing... :-) Zdenek> You can (in lvm2) convert such mirror to use newer dm raid Zdenek> target - but this requires some extra space (even though very Zdenek> small metadata volume) which is now required per each mirrored Zdenek> leg. Not a bad thing. Zdenek> Both targets are still maintained and bugfixed. Zdenek> There is also something about very high complexity of mdraid Zdenek> code and relative simplicity of old mirror target. I'm thinking about the complexity of maintaining both in parallel, when the older one doesn't provide the protections of the newer one. Zdenek> As for comment about *anyone* - majority of users are Zdenek> consuming DM targets via lvm2 - there you get this transition Zdenek> automatically - when you ask to create mirrored LV - lvm2 will Zdenek> use (if present in kernel) newer md raid variant. If anyone Zdenek> is using DM directly with 'dmsetup' command - it's assumed Zdenek> these are skilled users familiar with kernel doc (dmsetup is Zdenek> low-level admins tool). Lvm2 should be seen as Zdenek> 'friendly-face' of these DM targets - which otherwise do Zdenek> require pretty complex setup. If you do not want to use lvm2, Zdenek> the tool replacing lvm2 needs to reproduce this extra logic. Right, so how many use the old version?
Re: [dm-devel] Possible bug in mirror target
> "Zdenek" == Zdenek Kabelac writes: Zdenek> On 05. 02. 19 at 1:47, Drew Hastings wrote: >> Hi, >> >> I'm assuming all user space code is expected to use the handle_errors >> feature, >> so this isn't that big of a deal. I'm also using 4.19.13, which I think is >> more recent than the latest update to dm-raid1.c >> >> That said, there may be a bug that causes the entire mirror to crash if >> there >> is an error on the first leg of the mirror, but not on the 2nd leg. >> >> Works fine if you do cause IO errors on the 2nd leg: >> >> root@kernel-dev:~# dmsetup create zerodev --table "0 1073741824 zero" >> root@kernel-dev:~# dmsetup create errordev --table "0 1073741824 error" >> >> >> root@kernel-dev:~# dmsetup create mirror-table --table "0 1073741824 mirror >> core 2 524288 nosync 2 /dev/mapper/zerodev 0 /dev/mapper/errordev 0 0" Zdenek> There are 2 operational modes for the old dm mirror target. Zdenek> One requires errors to be handled externally. Please check e.g. the following Zdenek> guide for the mirror target: Zdenek> https://wiki.gentoo.org/wiki/Device-mapper Zdenek> Lvm2 is using 'dmeventd' to service this error handling Zdenek> (i.e. dropping the mirror leg or allocating a replacement one). Is it time to remove the dm-mirror target then? Or to deprecate it? Or to just silently replace it with something that does the right thing when errors happen? I can't think of why *anyone* would want to use the dm-mirror target, given how it now seems to work. John
Re: [dm-devel] [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o
> "Mikulas" == Mikulas Patocka writes: Mikulas> It has been noticed that congestion throttling can slow down Mikulas> allocation paths that participate in the IO and thus help the Mikulas> memory reclaim. Stalling those allocations is therefore not Mikulas> productive. Moreover, the mempool allocator and md variants of the Mikulas> same already implement their own throttling, which is feedback Mikulas> driven and therefore better. Stalling at the page Mikulas> allocator is therefore even counterproductive. Can you show numbers for this claim? I would think that throttling needs to be done as close to the disk as possible, and propagate back up the layers to have this all work well, so that faster devices (and the layers stacked on them) will work better without stalling on a slow device. Mikulas> PF_LESS_THROTTLE is a task flag denoting allocation context that is Mikulas> participating in the memory reclaim which fits into these IO paths Mikulas> model, so use the flag and make the page allocator aware they are Mikulas> special and they do not really want any dirty data throttling. Mikulas> The throttling causes stalls on Android - it uses the dm-verity driver Mikulas> that uses dm-bufio. Allocations in dm-bufio were observed to sleep in Mikulas> wait_iff_congested repeatedly.
Mikulas> Signed-off-by: Mikulas Patocka
Mikulas> Acked-by: Michal Hocko # mempool_alloc and bvec_alloc
Mikulas> Cc: sta...@vger.kernel.org
Mikulas> ---
Mikulas>  block/bio.c                   |  4
Mikulas>  drivers/md/dm-bufio.c         | 14 +++---
Mikulas>  drivers/md/dm-crypt.c         |  8
Mikulas>  drivers/md/dm-integrity.c     |  4
Mikulas>  drivers/md/dm-kcopyd.c        |  3 +++
Mikulas>  drivers/md/dm-verity-target.c |  4
Mikulas>  drivers/md/dm-writecache.c    |  4
Mikulas>  mm/mempool.c                  |  4
Mikulas>  8 files changed, 42 insertions(+), 3 deletions(-)

Mikulas> Index: linux-2.6/mm/mempool.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/mm/mempool.c	2018-06-29 03:47:16.29000 +0200
Mikulas> +++ linux-2.6/mm/mempool.c	2018-06-29 03:47:16.27000 +0200
Mikulas> @@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
Mikulas> 	unsigned long flags;
Mikulas> 	wait_queue_entry_t wait;
Mikulas> 	gfp_t gfp_temp;
Mikulas> +	unsigned old_flags;
Mikulas> 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
Mikulas> 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
Mikulas> @@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
Mikulas>  repeat_alloc:
Mikulas> +	old_flags = current->flags & PF_LESS_THROTTLE;
Mikulas> +	current->flags |= PF_LESS_THROTTLE;
Mikulas> 	element = pool->alloc(gfp_temp, pool->pool_data);
Mikulas> +	current_restore_flags(old_flags, PF_LESS_THROTTLE);
Mikulas> 	if (likely(element != NULL))
Mikulas> 		return element;

Mikulas> Index: linux-2.6/block/bio.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/block/bio.c	2018-06-29 03:47:16.29000 +0200
Mikulas> +++ linux-2.6/block/bio.c	2018-06-29 03:47:16.27000 +0200
Mikulas> @@ -217,6 +217,7 @@ fallback:
Mikulas> 	} else {
Mikulas> 		struct biovec_slab *bvs = bvec_slabs + *idx;
Mikulas> 		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
Mikulas> +		unsigned old_flags;
Mikulas> 		/*
Mikulas> 		 * Make this allocation restricted and don't dump info on
Mikulas> @@ -229,7 +230,10 @@ fallback:
Mikulas> 		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
Mikulas> 		 * is set, retry with the 1-entry mempool
Mikulas> 		 */
Mikulas> +		old_flags = current->flags & PF_LESS_THROTTLE;
Mikulas> +		current->flags |= PF_LESS_THROTTLE;
Mikulas> 		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
Mikulas> +		current_restore_flags(old_flags, PF_LESS_THROTTLE);
Mikulas> 		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
Mikulas> 			*idx = BVEC_POOL_MAX;
Mikulas> 			goto fallback;

Mikulas> Index: linux-2.6/drivers/md/dm-bufio.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/drivers/md/dm-bufio.c	2018-06-29 03:47:16.29000 +0200
Mikulas> +++ linux-2.6/drivers/md/dm-bufio.c	2018-06-29 03:47:16.27000 +0200
Mikulas> @@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
Mikulas>  static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
Mikulas> 			       unsigned char *data_mode)
Mikulas>  {
Mikulas> +	void *ptr;
Mikulas> 	if (unlikely(c->slab_cache != NULL)) {
Mikulas> 		*data_mode = DATA_MODE_SL
Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
>>>>> "Mikulas" == Mikulas Patocka writes: Mikulas> On Mon, 30 Apr 2018, John Stoffel wrote: >> >>>>> "Mikulas" == Mikulas Patocka writes: >> Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote: >> Mikulas> I see your point - and I think the misunderstanding is this. >> >> Thanks. >> Mikulas> This patch is not really helping people to debug existing crashes. It is Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a Mikulas> page that suggests to turn this option on" - "you turn it on and solve the Mikulas> crash". >> Mikulas> What this patch really does is that - it makes the kernel deliberately Mikulas> crash in a situation when the code violates the specification, but it Mikulas> would not crash otherwise or it would crash very rarely. It helps to Mikulas> detect specification violations. >> Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report Mikulas> it. How is the user or developer supposed to learn about this option, if Mikulas> he gets no crash at all? >> >> So why do we make this a KConfig option at all? Mikulas> Because other people see the KConfig option (so, they may enable it) and Mikulas> they don't see the kernel parameter (so, they won't enable it). Mikulas> Close your eyes and say how many kernel parameters you remember :-) >> Just turn it on and let it rip. Mikulas> I can't test if all the networking drivers use kvmalloc properly, because Mikulas> I don't have the hardware. You can't test it either. No one has all the Mikulas> hardware that is supported by Linux. Mikulas> Driver issues can only be tested by a mass of users. And if the users Mikulas> don't know about the debugging option, they won't enable it. >> >> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries >> >> tells me *nothing* about why I should pick one or the other, as an >> >> example. Mikulas> BTW.
You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or Mikulas> with the kernel parameter "slub_debug" - and most users who compile their Mikulas> own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible. You miss my point, which is that there's no explanation of what the difference is between SLAB and SLUB and which I should choose. The same goes here. If the KConfig option doesn't give useful info, it's useless. >> Now I also think that Linus has the right idea to not just sprinkle >> BUG_ONs into the code, just dump and oops and keep going if you can. >> If it's a filesystem or a device, turn it read only so that people >> notice right away. Mikulas> This vmalloc fallback is similar to Mikulas> CONFIG_DEBUG_KOBJECT_RELEASE. CONFIG_DEBUG_KOBJECT_RELEASE Mikulas> changes the behavior of kobject_put in order to cause Mikulas> deliberate crashes (that wouldn't happen otherwise) in Mikulas> drivers that misuse kobject_put. In the same sense, we want Mikulas> to cause deliberate crashes (that wouldn't happen otherwise) Mikulas> in drivers that misuse kvmalloc. Mikulas> The crashes will only happen in debugging kernels, not in Mikulas> production kernels. Says you. What about people or distros that enable it unconditionally? They're going to get all kinds of reports and then turn it off again. Crashing the system isn't the answer here.
Re: [dm-devel] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
> "Mike" == Mike Snitzer writes: Mike> On Tue, May 01 2018 at 8:36pm -0400, Mike> Andrew Morton wrote: >> On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka >> wrote: >> >> > >> > >> > On Tue, 24 Apr 2018, Michal Hocko wrote: >> > >> > > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote: >> > > > >> > > > >> > > > On Tue, 24 Apr 2018, Michal Hocko wrote: >> > > > >> > > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote: >> > > > > >> > > > > > Fixing __vmalloc code >> > > > > > is easy and it doesn't require cooperation with maintainers. >> > > > > >> > > > > But it is a hack against the intention of the scope api. >> > > > >> > > > It is not! >> > > >> > > This discussion simply doesn't make much sense it seems. The scope API >> > > is to document the scope of the reclaim recursion critical section. That >> > > certainly is not a utility function like vmalloc. >> > >> > That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel >> > developer) from converting the code to the scope API. You make nonsensical >> > excuses. >> > >> >> Fun thread! >> >> Winding back to the original problem, I'd state it as >> >> - Caller uses kvmalloc() but passes the address into vmalloc-naive >> DMA API and >> >> - Caller uses kvmalloc() but passes the address into kfree() >> >> Yes? Mike> I think so. >> If so, then... >> >> Is there a way in which, in the kvmalloc-called-kmalloc path, we can >> tag the slab-allocated memory with a "this memory was allocated with >> kvmalloc()" flag? I *think* there's extra per-object storage available >> with suitable slab/slub debugging options? Perhaps we could steal one >> bit from the redzone, dunno. >> >> If so then we can >> >> a) set that flag in kvmalloc() if the kmalloc() call succeeded >> >> b) check for that flag in the DMA code, WARN if it is set. >> >> c) in kvfree(), clear that flag before calling kfree() >> >> d) in kfree(), check for that flag and go WARN() if set. 
>> >> So both potential bugs are detected all the time, dependent upon >> CONFIG_SLUB_DEBUG (and perhaps other slub config options). Mike> Thanks Andrew, definitely the most sane proposal I've seen to resolve Mike> this. Cuts to the heart of the issue I think, and seems pretty sane. Should the WARN be rate limited as well? John
Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
>>>>> "Mikulas" == Mikulas Patocka writes: Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote: >> >>>>> "James" == James Bottomley >> >>>>> writes: >> James> I may be an atypical developer but I'd rather have a root canal James> than browse through menuconfig options. The way to get people James> to learn about new debugging options is to blog about it (or James> write an lwn.net article) which google will find the next time James> I ask it how I debug XXX. Google (probably as a service to James> humanity) rarely turns up Kconfig options in response to a James> query. >> >> I agree with James here. Looking at the SLAB vs SLUB Kconfig entries >> tells me *nothing* about why I should pick one or the other, as an >> example. >> >> John Mikulas> I see your point - and I think the misunderstanding is this. Thanks. Mikulas> This patch is not really helping people to debug existing crashes. It is Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a Mikulas> page that suggests to turn this option on" - "you turn it on and solve the Mikulas> crash". Mikulas> What this patch really does is that - it makes the kernel deliberately Mikulas> crash in a situation when the code violates the specification, but it Mikulas> would not crash otherwise or it would crash very rarely. It helps to Mikulas> detect specification violations. Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report Mikulas> it. How is the user or developer supposed to learn about this option, if Mikulas> he gets no crash at all? So why do we make this a KConfig option at all? Just turn it on and let it rip. Now I also think that Linus has the right idea to not just sprinkle BUG_ONs into the code, just dump and oops and keep going if you can. If it's a filesystem or a device, turn it read only so that people notice right away. 
Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
> "James" == James Bottomley writes: James> On Wed, 2018-04-25 at 19:00 -0400, Mikulas Patocka wrote: >> >> On Wed, 25 Apr 2018, James Bottomley wrote: >> >> > > > Do we really need the new config option? This could just be >> > > > manually tunable via fault injection IIUC. >> > > >> > > We do, because we want to enable it in RHEL and Fedora debugging >> > > kernels, so that it will be tested by the users. >> > > >> > > The users won't use some extra magic kernel options or debugfs >> files. >> > >> > If it can be enabled via a tunable, then the distro can turn it on >> > without the user having to do anything. If you want to present the >> > user with a different boot option, you can (just have the tunable >> set >> > on the command line), but being tunable driven means that you don't >> > have to choose that option, you could automatically enable it under >> a >> > range of circumstances. I think most sane distributions would want >> > that flexibility. >> > >> > Kconfig proliferation, conversely, is a bit of a nightmare from >> both >> > the user and the tester's point of view, so we're trying to avoid >> it >> > unless absolutely necessary. >> > >> > James >> >> BTW. even developers who compile their own kernel should have this >> enabled by a CONFIG option - because if the developer sees the option >> when browsing through menuconfig, he may enable it. If he doesn't see >> the option, he won't even know that such an option exists. James> I may be an atypical developer but I'd rather have a root canal James> than browse through menuconfig options. The way to get people James> to learn about new debugging options is to blog about it (or James> write an lwn.net article) which google will find the next time James> I ask it how I debug XXX. Google (probably as a service to James> humanity) rarely turns up Kconfig options in response to a James> query. I agree with James here. 
Looking at the SLAB vs SLUB Kconfig entries tells me *nothing* about why I should pick one or the other, as an example. John
Re: [dm-devel] [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
> "Bart" == Bart Van Assche writes: Bart> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote: >> But regardless of which wins the race, the queue will have been run. >> Which is all we care about right? Bart> Running the queue is not sufficient. With this patch applied it can happen Bart> that the block driver returns BLK_STS_DEV_RESOURCE, that two or more Bart> concurrent queue runs finish before sufficient device resources are available Bart> to execute the request, and that blk_mq_delay_run_hw_queue() does not get Bart> called at all. If no other activity triggers a queue run, e.g. request Bart> completion, this will result in a queue stall. Doesn't this argue that you really want some sort of completion callback to be run in this case instead? Instead of busy looping or waiting for a set amount of time, just fire off a callback to run once you have the resources available, no? I can't provide any code examples, since I don't know the code base nearly well enough. John
Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages
>>>>> "Mikulas" == Mikulas Patocka writes: Mikulas> On Fri, 18 Aug 2017, John Stoffel wrote: Mikulas> The limit can't be changed. I expect that no one will run a system with Mikulas> such tight requirements that the memory is used up to the last 2%. >> Mikulas> If the user is near hitting the memory limit, he can just add swap space Mikulas> to solve the problem instead of tuning dm-crypt. >> >> My real concern is that we build up too many outstanding BIOs so that >> they take a long time to process. I'm already seeing some reports >> about timeouts after 120 seconds due to stacking of dm-crypt with >> other stuff. >> >> What I'm trying to say is that as memory grows, you need to slow down >> the growth of the amount of memory that dm-crypt can allocate. It's >> more important to scale by the speed of the IO sub-system than the >> size of memory. >> >> But hey, you really addressed my concern with smaller memory systems >> already. >> >> John Mikulas> This patch doesn't restrict the number of in-flight bios. Lowering the Mikulas> memory limit has no effect on the number of bios being allocated by some Mikulas> other part of the kernel and submitted to dm-crypt. Good to know. Mikulas> Some time ago I made a patch that limits the number of in-flight bios in Mikulas> device mapper - see this: Mikulas> http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/ Interesting. I'll try to find some time to look this over.
Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages
>>>>> "Mikulas" == Mikulas Patocka writes: Mikulas> On Thu, 17 Aug 2017, John Stoffel wrote: >> >>>>> "Mikulas" == Mikulas Patocka writes: >> Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote: >> >> >> >>>>> "Mikulas" == Mikulas Patocka writes: >> >> Mikulas> dm-crypt consumes an excessive amount of memory when the user attempts to zero Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls Mikulas> the BLKZEROOUT ioctl, which goes to the function __blkdev_issue_zeroout; Mikulas> __blkdev_issue_zeroout sends a large number of write bios that contain the Mikulas> zero page as their payload. >> >> Mikulas> For each incoming page, dm-crypt allocates another page that holds the Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to Mikulas> allocate an amount of memory that is equal to the size of the device. Mikulas> This can trigger the OOM killer or cause a system crash. >> >> Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt Mikulas> allocates to 2% of total system memory. >> >> >> >> Is this limit per-device or system-wide? It's not clear to me if the >> >> minimum is just one page or more, so it might be nice to clarify >> >> that. >> Mikulas> The limit is system-wide. The limit is divided by the number Mikulas> of active dm-crypt devices and each device receives an equal Mikulas> share. >> >> Ok, thanks for the clarification. Can it be expanded dynamically? I >> could see that for large systems, it might not scale well. >> Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device Mikulas> limit is never lower than this. >> >> Nice. >> >> >> Also, for large memory systems, that's still a lot of pages. Maybe its >> >> clamping should be more aggressive as memory sizes go up? 2% >> >> of 2T is about 40G, which is pretty darn big... >> Mikulas> The more we restrict it, the less parallelism is used in dm-crypt, so it Mikulas> makes sense to have a big value on machines with many cores.
>> >> True... but it's still a concern that it's hardcoded. >> >> Thanks for your replies. Mikulas> The limit can't be changed. I expect that no one will run a system with Mikulas> such tight requirements that the memory is used up to the last 2%. Mikulas> If the user is near hitting the memory limit, he can just add swap space Mikulas> to solve the problem instead of tuning dm-crypt. My real concern is that we build up too many outstanding BIOs so that they take a long time to process. I'm already seeing some reports about timeouts after 120 seconds due to stacking of dm-crypt with other stuff. What I'm trying to say is that as memory grows, you need to slow down the growth of the amount of memory that dm-crypt can allocate. It's more important to scale by the speed of the IO sub-system than the size of memory. But hey, you really addressed my concern with smaller memory systems already. John
Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages
>>>>> "Mikulas" == Mikulas Patocka writes: Mikulas> On Mon, 14 Aug 2017, John Stoffel wrote: >> >>>>> "Mikulas" == Mikulas Patocka writes: >> Mikulas> dm-crypt consumes an excessive amount of memory when the user attempts to zero Mikulas> a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls Mikulas> the BLKZEROOUT ioctl, which goes to the function __blkdev_issue_zeroout; Mikulas> __blkdev_issue_zeroout sends a large number of write bios that contain the Mikulas> zero page as their payload. >> Mikulas> For each incoming page, dm-crypt allocates another page that holds the Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to Mikulas> allocate an amount of memory that is equal to the size of the device. Mikulas> This can trigger the OOM killer or cause a system crash. >> Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt Mikulas> allocates to 2% of total system memory. >> >> Is this limit per-device or system-wide? It's not clear to me if the >> minimum is just one page or more, so it might be nice to clarify >> that. Mikulas> The limit is system-wide. The limit is divided by the number Mikulas> of active dm-crypt devices and each device receives an equal Mikulas> share. Ok, thanks for the clarification. Can it be expanded dynamically? I could see that for large systems, it might not scale well. Mikulas> There is a lower bound of BIO_MAX_PAGES * 16. The per-device Mikulas> limit is never lower than this. Nice. >> Also, for large memory systems, that's still a lot of pages. Maybe its >> clamping should be more aggressive as memory sizes go up? 2% >> of 2T is about 40G, which is pretty darn big... Mikulas> The more we restrict it, the less parallelism is used in dm-crypt, so it Mikulas> makes sense to have a big value on machines with many cores. True... but it's still a concern that it's hardcoded. Thanks for your replies.
Re: [dm-devel] [PATCH] dm-crypt: limit the number of allocated pages
> "Mikulas" == Mikulas Patocka writes:

Mikulas> dm-crypt consumes an excessive amount of memory when the user attempts
Mikulas> to zero a dm-crypt device with "blkdiscard -z". The command
Mikulas> "blkdiscard -z" calls the BLKZEROOUT ioctl, which goes to the function
Mikulas> __blkdev_issue_zeroout; __blkdev_issue_zeroout sends a large number of
Mikulas> write bios that contain the zero page as their payload.

Mikulas> For each incoming page, dm-crypt allocates another page that holds the
Mikulas> encrypted data, so when processing "blkdiscard -z", dm-crypt tries to
Mikulas> allocate an amount of memory equal to the size of the device. This can
Mikulas> trigger the OOM killer or crash the system.

Mikulas> This patch fixes the bug by limiting the amount of memory that dm-crypt
Mikulas> allocates to 2% of total system memory.

Is this limit per-device or system-wide? It's not clear to me whether the minimum is just one page or more, so it might be nice to clarify that.

Also, for large-memory systems, that's still a lot of pages. Maybe it should be more exponential in its clamping as memory sizes go up? 2% of 2T is 4G, which is pretty darn big...
Mikulas> Signed-off-by: Mikulas Patocka
Mikulas> Cc: sta...@vger.kernel.org
Mikulas> ---
Mikulas>  drivers/md/dm-crypt.c | 60 +-
Mikulas>  1 file changed, 59 insertions(+), 1 deletion(-)

Mikulas> Index: linux-2.6/drivers/md/dm-crypt.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/drivers/md/dm-crypt.c
Mikulas> +++ linux-2.6/drivers/md/dm-crypt.c
Mikulas> @@ -148,6 +148,8 @@ struct crypt_config {
Mikulas>  	mempool_t *tag_pool;
Mikulas>  	unsigned tag_pool_max_sectors;
Mikulas> +	struct percpu_counter n_allocated_pages;
Mikulas> +
Mikulas>  	struct bio_set *bs;
Mikulas>  	struct mutex bio_alloc_lock;
Mikulas> @@ -219,6 +221,12 @@ struct crypt_config {
Mikulas>  #define MAX_TAG_SIZE 480
Mikulas>  #define POOL_ENTRY_SIZE 512
Mikulas> +static DEFINE_SPINLOCK(dm_crypt_clients_lock);
Mikulas> +static unsigned dm_crypt_clients_n = 0;
Mikulas> +static volatile unsigned long dm_crypt_pages_per_client;
Mikulas> +#define DM_CRYPT_MEMORY_PERCENT 2
Mikulas> +#define DM_CRYPT_MIN_PAGES_PER_CLIENT (BIO_MAX_PAGES * 16)
Mikulas> +
Mikulas>  static void clone_init(struct dm_crypt_io *, struct bio *);
Mikulas>  static void kcryptd_queue_crypt(struct dm_crypt_io *io);
Mikulas>  static struct scatterlist *crypt_get_sg_data(struct crypt_config *cc,
Mikulas> @@ -2158,6 +2166,37 @@ static int crypt_wipe_key(struct crypt_c
Mikulas>  	return r;
Mikulas>  }
Mikulas> +static void crypt_calculate_pages_per_client(void)
Mikulas> +{
Mikulas> +	unsigned long pages = (totalram_pages - totalhigh_pages) * DM_CRYPT_MEMORY_PERCENT / 100;
Mikulas> +	if (!dm_crypt_clients_n)
Mikulas> +		return;
Mikulas> +	pages /= dm_crypt_clients_n;

Would it make sense to use a shift here, or does this only run once on setup? And shouldn't you return a value here, if only to make it clear what you're defaulting to?
Mikulas> +	if (pages < DM_CRYPT_MIN_PAGES_PER_CLIENT)
Mikulas> +		pages = DM_CRYPT_MIN_PAGES_PER_CLIENT;
Mikulas> +	dm_crypt_pages_per_client = pages;
Mikulas> +}
Mikulas> +
Mikulas> +static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
Mikulas> +{
Mikulas> +	struct crypt_config *cc = pool_data;
Mikulas> +	struct page *page;
Mikulas> +	if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
Mikulas> +	    likely(gfp_mask & __GFP_NORETRY))
Mikulas> +		return NULL;
Mikulas> +	page = alloc_page(gfp_mask);
Mikulas> +	if (likely(page != NULL))
Mikulas> +		percpu_counter_add(&cc->n_allocated_pages, 1);
Mikulas> +	return page;
Mikulas> +}
Mikulas> +
Mikulas> +static void crypt_page_free(void *page, void *pool_data)
Mikulas> +{
Mikulas> +	struct crypt_config *cc = pool_data;
Mikulas> +	__free_page(page);
Mikulas> +	percpu_counter_sub(&cc->n_allocated_pages, 1);
Mikulas> +}
Mikulas> +
Mikulas>  static void crypt_dtr(struct dm_target *ti)
Mikulas>  {
Mikulas>  	struct crypt_config *cc = ti->private;
Mikulas> @@ -2184,6 +2223,10 @@ static void crypt_dtr(struct dm_target *
Mikulas>  	mempool_destroy(cc->req_pool);
Mikulas>  	mempool_destroy(cc->tag_pool);
Mikulas> +	if (cc->page_pool)
Mikulas> +		WARN_ON(percpu_counter_sum(&cc->n_allocated_pages) != 0);
Mikulas> +	percpu_counter_destroy(&cc->n_allocated_pages);
Mikulas> +
Mikulas>  	if (cc->iv_gen_ops && cc->iv_gen_ops->dtr)
Mikulas>  		cc->iv_gen_ops->dtr(cc);
Mikulas> @@ -2198,6 +2241,12 @@ static void crypt_dtr(struct dm_target *
Mikulas>  	/* Must zero key material before freeing */
Mikulas>  	kzfree(cc);
Mikulas> +
Mikulas> +	spin_lock(&dm_crypt_c
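The key behavior in crypt_page_alloc() above is that an over-budget allocation is refused only for "no retry" (opportunistic) requests, so a caller that is allowed to block can still make forward progress. A loose userspace model of that gate, with illustrative names rather than kernel API:

```python
# Model of the crypt_page_alloc() gate: an allocation over the
# per-client budget fails only when the caller passed the "no retry"
# flag; a blocking request is still allowed through. GFP_NORETRY is a
# stand-in bit for the kernel's __GFP_NORETRY.
GFP_NORETRY = 0x1

class Client:
    def __init__(self, limit):
        self.limit = limit
        self.n_allocated = 0   # models the percpu counter

    def page_alloc(self, gfp_mask):
        if self.n_allocated >= self.limit and (gfp_mask & GFP_NORETRY):
            return None        # over budget: fail the opportunistic alloc
        self.n_allocated += 1  # "allocate" a page
        return object()

    def page_free(self, page):
        self.n_allocated -= 1

c = Client(limit=2)
assert c.page_alloc(GFP_NORETRY) and c.page_alloc(GFP_NORETRY)
assert c.page_alloc(GFP_NORETRY) is None   # third opportunistic alloc refused
assert c.page_alloc(0) is not None         # blocking alloc still proceeds
```

This is why the limit throttles "blkdiscard -z" without deadlocking normal I/O: the mempool's non-blocking attempts fail first, forcing reuse of already-allocated pages.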
Re: [dm-devel] dm-integrity: fix inefficient allocation of stack space
>>>>> "Mikulas" == Mikulas Patocka writes:

Mikulas> On Wed, 19 Jul 2017, John Stoffel wrote:
>> I'd like to argue that you should never use BUG_ON at all, especially since
>> if you have integrity running on just one critical device, but have
>> other devices that work just fine, bringing down the entire system
>> because you don't think things are ok is terrible.
>>
>> We should rip out ALL the BUG_ONs in dm-integrity and just do
>> proper error handling instead. For testing, sure, use them on your
>> code. But please, not for general use!
>>
>> John

Mikulas> Linus sometimes argued against BUG_ON and people are
Mikulas> repeating it after him like sheep.

I understand Linus' rants on BUG_ON and I agree with them. He doesn't argue that they're never appropriate, but using them just because a random driver or module freaks out is NOT appropriate.

Mikulas> If a programmer made a bug in his code, he should fix that
Mikulas> bug, not write additional recovery code for the bug.

If a programmer has a bug which kills the system for no good reason, then I object. Don't think that dm-integrity is so important that you need to kill the system because of a mistake. Try to be robust. 90% of all code is error handling.

John
Re: [dm-devel] [PATCH] dm-integrity: fix inefficient allocation of stack space
>>>>> "Mikulas" == Mikulas Patocka writes:

Mikulas> On Wed, 19 Jul 2017, John Stoffel wrote:
>> >>>>> "Mikulas" == Mikulas Patocka writes:
>>
Mikulas> When using a block size greater than 512 bytes, the dm-integrity target
Mikulas> allocates journal space inefficiently: it allocates one entry for each
Mikulas> 512-byte chunk of data, fills entries for each block of data and leaves
Mikulas> the remaining entries unused. This doesn't cause data corruption, but it
Mikulas> causes severe performance degradation.
>>
Mikulas> This patch fixes the journal allocation. As a safety measure it also
Mikulas> adds a BUG that fires if the variables representing journal
Mikulas> usage get out of sync (it's better to crash than to continue and
Mikulas> corrupt data, so BUG is justified).
>>
>> No, I don't agree. You should lock down the block device(s) using the
>> dm-integrity target, but you should NOT take down the entire system
>> because of this. Just return a failure up the stack that forces the
>> device into read-only mode so that there's a chance to debug this
>> issue.
>>
>> Using a BUG_ON, especially for code that isn't proven, is just wrong.
>> Do a WARN_ONCE() and then return an error instead.
>>
>> John

Mikulas> That BUG can only be triggered by a bug in the code, it can't be triggered
Mikulas> by bad data on the disk, so why should we write code to recover from it?

Because it's obnoxious to kill a system because of bad coding, without giving the user a chance to recover or even investigate the root cause. If it's a bug in the code, just warn the user and then try to continue. If you can't continue, then make the block device go read-only to protect the data.

Again, why do you think BUG_ON is an appropriate thing to do? What advantage does it have for you?

John
Re: [dm-devel] dm-integrity: fix inefficient allocation of stack space
>>>>> "Mike" == Mike Snitzer writes:

Mike> On Wed, Jul 19 2017 at 2:39pm -0400,
Mike> John Stoffel wrote:
>> >>>>> "Mikulas" == Mikulas Patocka writes:
>>
Mikulas> When using a block size greater than 512 bytes, the dm-integrity target
Mikulas> allocates journal space inefficiently: it allocates one entry for each
Mikulas> 512-byte chunk of data, fills entries for each block of data and leaves
Mikulas> the remaining entries unused. This doesn't cause data corruption, but it
Mikulas> causes severe performance degradation.
>>
Mikulas> This patch fixes the journal allocation. As a safety measure it also
Mikulas> adds a BUG that fires if the variables representing journal
Mikulas> usage get out of sync (it's better to crash than to continue and
Mikulas> corrupt data, so BUG is justified).
>>
>> No, I don't agree. You should lock down the block device(s) using the
>> dm-integrity target, but you should NOT take down the entire system
>> because of this. Just return a failure up the stack that forces the
>> device into read-only mode so that there's a chance to debug this
>> issue.
>>
>> Using a BUG_ON, especially for code that isn't proven, is just wrong.
>> Do a WARN_ONCE() and then return an error instead.

Mike> I'll remove the BUG_ON from this stable@ fix. Mikulas, please have a
Mike> look at handling the failure more gracefully (maybe like John suggests).

Mike> In general, there are too many BUG_ON()s in the dm-integrity code. I
Mike> let them slide for inclusion, but it would probably be a good idea to
Mike> look at eliminating them and putting the dm-integrity device into "fail
Mike> mode" in the face of critical errors -- much like dm-thinp and dm-cache
Mike> do.

I'd like to argue that you should never use BUG_ON at all, especially since if you have integrity running on just one critical device, but have other devices that work just fine, bringing down the entire system because you don't think things are ok is terrible.
We should rip out ALL the BUG_ONs in dm-integrity and just do proper error handling instead. For testing, sure, use them on your own code. But please, not for general use!

John
Re: [dm-devel] [PATCH] dm-integrity: fix inefficient allocation of stack space
> "Mikulas" == Mikulas Patocka writes:

Mikulas> When using a block size greater than 512 bytes, the dm-integrity target
Mikulas> allocates journal space inefficiently: it allocates one entry for each
Mikulas> 512-byte chunk of data, fills entries for each block of data and leaves
Mikulas> the remaining entries unused. This doesn't cause data corruption, but it
Mikulas> causes severe performance degradation.

Mikulas> This patch fixes the journal allocation. As a safety measure it also
Mikulas> adds a BUG that fires if the variables representing journal
Mikulas> usage get out of sync (it's better to crash than to continue and
Mikulas> corrupt data, so BUG is justified).

No, I don't agree. You should lock down the block device(s) using the dm-integrity target, but you should NOT take down the entire system because of this. Just return a failure up the stack that forces the device into read-only mode so that there's a chance to debug this issue.

Using a BUG_ON, especially for code that isn't proven, is just wrong. Do a WARN_ONCE() and then return an error instead.
John

Mikulas> Signed-off-by: Mikulas Patocka
Mikulas> Cc: sta...@vger.kernel.org
Mikulas> Fixes: 7eada909bfd7 ("dm: add integrity target")
Mikulas> ---
Mikulas>  drivers/md/dm-integrity.c | 7 ---
Mikulas>  1 file changed, 4 insertions(+), 3 deletions(-)

Mikulas> Index: linux-2.6/drivers/md/dm-integrity.c
Mikulas> ===
Mikulas> --- linux-2.6.orig/drivers/md/dm-integrity.c
Mikulas> +++ linux-2.6/drivers/md/dm-integrity.c
Mikulas> @@ -1589,14 +1589,14 @@ retry:
Mikulas>  	unsigned next_entry, i, pos;
Mikulas>  	unsigned ws, we;
Mikulas> -	dio->range.n_sectors = min(dio->range.n_sectors, ic->free_sectors);
Mikulas> +	dio->range.n_sectors = min(dio->range.n_sectors, ic->free_sectors << ic->sb->log2_sectors_per_block);
Mikulas>  	if (unlikely(!dio->range.n_sectors))
Mikulas>  		goto sleep;
Mikulas> -	ic->free_sectors -= dio->range.n_sectors;
Mikulas> +	ic->free_sectors -= dio->range.n_sectors >> ic->sb->log2_sectors_per_block;
Mikulas>  	journal_section = ic->free_section;
Mikulas>  	journal_entry = ic->free_section_entry;
Mikulas> -	next_entry = ic->free_section_entry + dio->range.n_sectors;
Mikulas> +	next_entry = ic->free_section_entry + (dio->range.n_sectors >> ic->sb->log2_sectors_per_block);
Mikulas>  	ic->free_section_entry = next_entry % ic->journal_section_entries;
Mikulas>  	ic->free_section += next_entry / ic->journal_section_entries;
Mikulas>  	ic->n_uncommitted_sections += next_entry / ic->journal_section_entries;
Mikulas> @@ -1727,6 +1727,7 @@ static void pad_uncommitted(struct dm_in
Mikulas>  		wraparound_section(ic, &ic->free_section);
Mikulas>  		ic->n_uncommitted_sections++;
Mikulas>  	}
Mikulas> +	BUG_ON((ic->n_uncommitted_sections + ic->n_committed_sections) * ic->journal_section_entries + ic->free_sectors != ic->journal_sections * ic->journal_section_entries);
Mikulas>  }
Mikulas>  static void integrity_commit(struct work_struct *w)
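The alternative John and Mike advocate -- warn once and drop the device into a read-only "fail mode" instead of calling BUG_ON -- can be sketched as a loose userspace model of the journal-accounting check above. Everything here is illustrative: the class, the field names, and the -5 (-EIO) return are stand-ins, not dm-integrity code.

```python
# Sketch of the "fail mode" policy: on a broken internal invariant,
# warn once (the WARN_ONCE analogue), flip the device read-only to
# protect the data, and return an error up the stack instead of
# crashing the machine. Names and values are illustrative only.
import sys

class IntegrityDev:
    def __init__(self):
        self.read_only = False
        self._warned = False

    def check_journal_invariant(self, uncommitted, committed,
                                entries_per_section, free, total_sections):
        # Loose model of the patch's BUG_ON condition.
        used = (uncommitted + committed) * entries_per_section + free
        if used != total_sections * entries_per_section:
            if not self._warned:            # warn once, not every time
                print("journal accounting out of sync", file=sys.stderr)
                self._warned = True
            self.read_only = True           # fail mode: data preserved
            return -5                       # -EIO up the stack
        return 0

dev = IntegrityDev()
assert dev.check_journal_invariant(1, 1, 4, 8, 4) == 0   # 2*4+8 == 4*4
assert dev.check_journal_invariant(1, 1, 4, 9, 4) == -5  # out of sync
assert dev.read_only
```

The point of the model: the system stays up, the broken device stops accepting writes, and the administrator can collect the warning and debug, which is exactly what BUG_ON forecloses.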
Re: [dm-devel] dm-cache issue
Alexander> I am in the process of evaluating dm-cache for our backup system.

Why are you bothering to cache your backup destination? What filesystems are you using for your destinations?

Alexander> Currently I have an issue when restarting the backup
Alexander> server. The server is booting for hours, because of IO
Alexander> load. It seems a flush is triggered from the SSD disk (that is
Alexander> used as a cache device) to the raid controllers (they are
Alexander> with slow SATA disks). I have 10 cached logical volumes in
Alexander> writethrough mode, each with 2T of data over 2 raid
Alexander> controllers. I use a single SSD disk for the cache. The
Alexander> backup system is with lvm2-2.02.164-1 & kernel 4.4.30.

How big is the SSD cache? Do you have any output from the bootup (syslog, dmesg, etc.) you can share?

Alexander> Do you have any ideas why such a flush is triggered? In
Alexander> writethrough cache mode we shouldn't have dirty blocks in
Alexander> the cache.

I wonder if it makes more sense to use 'lvcache' instead, since you can add/remove cache from LVs at will. With bcache, once you set up a volume with the cache, you're stuck with it. Also, did you mirror your SSD in case it fails?

In my experience, using RAID6 on a bunch of SATA disks for backup is fine: you're writing large chunks of data in a few sets of files, so fragmentation and the read-modify-write cycle is much less painful, since you tend to stream your writes to the storage, which is perfectly fine with SATA drives.

Also, bcache skips sequential IO by default, which is what you want when doing backups... so I'm wondering what your thinking was here in using this?

John
Re: [dm-devel] getting feedback on dm-cache statistics tool
>>>>> "Ben" == Ben England writes: Ben> John, thanks for reply, comments inline... Ben> - Original Message - >> From: "John Stoffel" >> To: "Ben England" >> Cc: dm-devel@redhat.com >> Sent: Friday, May 27, 2016 11:15:51 AM >> Subject: Re: [dm-devel] getting feedback on dm-cache statistics tool >> >> Ben> I have a git repo containing a tool I wrote to look at dm-cache Ben> statistics, not the raw counters in "dmsetup status" but derived Ben> values that are more directly useful to system administrators and Ben> developers that want to see whether dm-cache is doing what they Ben> want it to. https://github.com/bengland2/dmcache-stat >> Ben> Any feedback on this? Anything missing? I don't mind adding or Ben> taking pull requests. If some other tool can provide the same Ben> functionality within RHEL then I'm happy to use that. >> >> I'm using Debian Jessie on x86_64 running kernel 4.4.0-rc7 (self >> compiled) and it's dying with an error: >> Ben> I don't have a Debian system handy, can you do "dmsetup status" Ben> on it and mail me the log so I know what the format difference Ben> is? It's mostly because I think I have bad UUIDs on my LVM volumes, because I've basically just upgraded this system for the past six years, without building new volumes/LVMs. Here's the output though: > sudo dmsetup status [sudo] password for john: The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVLvpBly2guNYp3XUqTVCgGCHequQOBEf9" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkuRxXnQASgLcdsdFmK9OviYa88Q6buOU" should be mangled b. quad-swap_1: 0 15622144 linear The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVM32OGu9FYyW2J5u8Q1zSNh826zw6BnFX" should be mangled b. quad-root: 0 78118912 linear The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVabjo7c4R5l6twigsqqfc55LVN6XVag4W-cdata" should be man. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVIWMZ8hjG32sj1LIdQja6QWB4OWkUvUCl" should be mangled b. 
bacula-backups: 0 5857345536 linear The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkOhak3U4uF82eWXH5JY1F3VhHzMHYV8c-cmeta" should be man. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVR1c11WDl2e86Ko9OmuWYRIUNXExHQnX1" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVvdF6NWzpggiPLD202jpru363Z5LfB5Lo" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wV40890zsM04q5zMIs0qdeCDfG9fc9FwbF" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVrdBEgHTcrXLHDWNrbFotKPeLt2TfbRVo" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVi37feovUsvWbysOk8PI2bbdjqokieGx2-cdata" should be man. quad-var: 0 39059456 linear quad-var: 39059456 83886080 linear The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVnp1To0klyekIS85gueDeq2EsYg5Osj44" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVas2sZsZM8mw0VXDgx03ryaq351RUEL4j-cmeta" should be man. bacula-incrs: 0 5662310400 linear The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVbbcg3pVFZyDngdRLS4aECbc571Lyb5MI" should be mangled b. Command failed >> > sudo ./dmcache_stat.py 10 0 >> volname, size(GiB), policy, mode >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVLvpBly2guNYp3XUqTVCgGCHequQOBEf9" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkuRxXnQASgLcdsdFmK9OviYa88Q6buOU" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVM32OGu9FYyW2J5u8Q1zSNh826zw6BnFX" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVabjo7c4R5l6twigsqqfc55LVN6XVag4W-cdata" >> should be man. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVIWMZ8hjG32sj1LIdQja6QWB4OWkUvUCl" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkOhak3U4uF82eWXH5JY1F3VhHzMHYV8c-cmeta" >> should be man. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVR1c11WDl2e86Ko9OmuWYRIUNXExHQnX1" >> should be mangled b. 
>> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVvdF6NWzpggiPLD202jpru363Z5LfB5Lo" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wV40890zsM04q5zMIs0qdeCDfG9fc9FwbF" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVrdBEgHTcrXLHDWNrbFotKPeLt2TfbRVo" >> should be mangled b. >> The UUID >> "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVi37feovUsvWbysOk8PI2bbdjqokieGx2-cdata" >> should be man. >> The U
Re: [dm-devel] getting feedback on dm-cache statistics tool
Ben> I have a git repo containing a tool I wrote to look at dm-cache Ben> statistics, not the raw counters in "dmsetup status" but derived Ben> values that are more directly useful to system administrators and Ben> developers that want to see whether dm-cache is doing what they Ben> want it to. https://github.com/bengland2/dmcache-stat Ben> Any feedback on this? Anything missing? I don't mind adding or Ben> taking pull requests. If some other tool can provide the same Ben> functionality within RHEL then I'm happy to use that. I'm using Debian Jessie on x86_64 running kernel 4.4.0-rc7 (self compiled) and it's dying with an error: > sudo ./dmcache_stat.py 10 0 volname, size(GiB), policy, mode The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVLvpBly2guNYp3XUqTVCgGCHequQOBEf9" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkuRxXnQASgLcdsdFmK9OviYa88Q6buOU" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVM32OGu9FYyW2J5u8Q1zSNh826zw6BnFX" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVabjo7c4R5l6twigsqqfc55LVN6XVag4W-cdata" should be man. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVIWMZ8hjG32sj1LIdQja6QWB4OWkUvUCl" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVkOhak3U4uF82eWXH5JY1F3VhHzMHYV8c-cmeta" should be man. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVR1c11WDl2e86Ko9OmuWYRIUNXExHQnX1" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVvdF6NWzpggiPLD202jpru363Z5LfB5Lo" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wV40890zsM04q5zMIs0qdeCDfG9fc9FwbF" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVrdBEgHTcrXLHDWNrbFotKPeLt2TfbRVo" should be mangled b. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVi37feovUsvWbysOk8PI2bbdjqokieGx2-cdata" should be man. The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVnp1To0klyekIS85gueDeq2EsYg5Osj44" should be mangled b. 
The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVas2sZsZM8mw0VXDgx03ryaq351RUEL4j-cmeta" should be man.
The UUID "LVM-HWDbRMPFL85!mVbEi4wru5#G7YfWK3wVbbcg3pVFZyDngdRLS4aECbc571Lyb5MI" should be mangled b.
Command failed
Traceback (most recent call last):
  File "./dmcache_stat.py", line 198, in
    s2 = poll_dmcache()
  File "./dmcache_stat.py", line 172, in poll_dmcache
    dmsetup_out = subprocess.check_output(['dmsetup', 'status'])
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['dmsetup', 'status']' returned non-zero exit status 1

But this is partly because my UUIDs aren't being handled properly, and I've been loath to follow the process for rebuilding them because I'm scared. And it's my main fileserver. Maybe this weekend if I have time.

The other comment is that the usage should be more like: dmcache-stat [] where if you don't provide the count, it defaults to going forever, like iostat/vmstat, etc.

And now that I think of it, I'm using lvmcache, not dmcache or bcache...

> sudo lvs
  LV           VG     Attr       LSize   Pool         Origin         Data%  Meta%  Move Log Cpy%Sync
  backups      bacula -wi-ao       2.73t
  incrs        bacula -wi-ao       2.64t
  drupal       data   -wi-ao      50.00g
  home         data   Cwi-aoC--- 550.00g homecacheLV  [home_corig]
  homecacheLV  data   Cwi---C---  50.00g
  local        data   Cwi-aoC--- 335.00g localcacheLV [local_corig]
  localcacheLV data   Cwi---C---  50.00g
  minecraft    data   -wc-ao      20.00g
  nas          data   -wi-ao     600.00g
  pete         data   -wi-a-     800.00g
  vm1          data   -wc-ao      20.00g
  winxp        data   -wi-a-     170.00g
  root         quad   -wi-ao      37.25g
  swap_1       quad   -wi-ao       7.45g
  var          quad   -wi-ao      58.62g

Though looking at this list, I really should just nuke that ancient WinXP VM image I have. LOL.
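The iostat-style calling convention suggested above -- an interval plus an optional count, where omitting the count means "run forever" -- is a small loop. A sketch, where poll() is a stub standing in for the tool's parsing of `dmsetup status`:

```python
# iostat/vmstat-style sampling loop: poll() every `interval` seconds,
# `count` times, with count=None meaning run forever. poll and sleep
# are injected so the loop is testable; they are illustrative names,
# not dmcache_stat.py's actual API.
import itertools
import time

def run(poll, interval, count=None, sleep=time.sleep):
    """Collect samples from poll(); None count loops indefinitely."""
    samples = []
    iterator = itertools.count() if count is None else range(count)
    for i in iterator:
        samples.append(poll())
        if count is not None and i == count - 1:
            break          # don't sleep after the last sample
        sleep(interval)
    return samples

ticks = run(poll=lambda: "sample", interval=10, count=3, sleep=lambda s: None)
assert ticks == ["sample", "sample", "sample"]
```

With count=None the loop never returns, which matches the "defaults to going forever" behavior of iostat and vmstat.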
Re: [dm-devel] [PATCH 01/17] multipathd: use /run instead of /var/run
>>>>> "Benjamin" == Benjamin Marzinski writes:

Benjamin> On Tue, Mar 29, 2016 at 09:57:25AM -0400, John Stoffel wrote:
>>
Benjamin> /var/run is now usually a symlink to /run. If /var is on a separate
Benjamin> filesystem, when multipathd starts up, it might end up writing to
Benjamin> /var/run before the /var filesystem is mounted and thus not have its
Benjamin> pidfile accessible at /var/run afterwards. On most distributions /run
Benjamin> is now a tmpfs and should always be available before multipathd is
Benjamin> started, so multipath should just write there directly, instead of
Benjamin> through the symlink.
>>
>> I'm not sure I agree with this, it really smells more like a
>> distribution problem than a multipathd problem. If multipathd starts
>> up in an initramfs, how does it get reset to the proper /var/run or
>> /run directory then?

Benjamin> Huh? /run is a tmpfs on almost all distros now. It gets
Benjamin> remounted when you switch the root from the initramfs, but
Benjamin> it doesn't go away. So there isn't a "proper" /run that
Benjamin> appears later. /var/run just gets symlinked to it.

Gah! I apologize for the noise. I did a quick check on some RHEL5/6 systems (which are ancient, I agree!) and didn't look at newer Debian/Mint systems as well. And I do see that /run is there.

John
Re: [dm-devel] [PATCH 10/17] multipathd: delay reloads during creation
> "Benjamin" == Benjamin Marzinski writes:

Benjamin> lvm needs PV devices to not be suspended while the udev
Benjamin> rules are running, for them to be correctly identified as
Benjamin> PVs. However, multipathd will often be in a situation where
Benjamin> it will create a multipath device upon seeing a path, and
Benjamin> then immediately reload the device upon seeing another path.
Benjamin> If multipath is reloading a device while processing the udev
Benjamin> event from its creation, lvm can fail to identify it as a
Benjamin> PV. This can cause systems to fail to boot. Unfortunately,
Benjamin> using udev synchronization cookies to solve this issue would
Benjamin> cause a host of other issues that could only be avoided by a
Benjamin> pretty substantial change in how multipathd does locking and
Benjamin> event processing. The good news is that multipathd is
Benjamin> already listening to udev events itself, and can make sure
Benjamin> that it isn't reloading when it shouldn't be.

Benjamin> This patch makes multipathd delay or refuse any reloads that
Benjamin> would happen between the time when it creates a device and
Benjamin> when it receives the change uevent from the device
Benjamin> creation. The only reloads that it refuses are from the
Benjamin> multipathd interactive commands that make no sense on a not
Benjamin> fully started device. Otherwise, it processes the event or
Benjamin> command, and sets a flag to either mark that device for an
Benjamin> update, or to signal that multipathd needs a
Benjamin> reconfigure. When the udev event for the creation arrives,
Benjamin> multipath will reload the device if necessary. If a
Benjamin> reconfigure has been requested, and no devices are currently
Benjamin> being created, multipathd will also do the reconfigure then.

Benjamin> Also, this patch adds a configurable timer
Benjamin> "missing_uev_msg_delay" defaulting to 30 seconds. If the
Benjamin> udev creation event has not arrived after this timeout has
Benjamin> triggered, multipathd will start printing messages alerting
Benjamin> the user of this every "missing_uev_msg_delay" seconds.

Should this really keep printing this message every 30 seconds for eternity? I would think that having it give up after 30 * N seconds would be better instead. I'm worried that this might block or slow down system boots forever, instead of at least failing and falling through so that maybe something can be recovered here.

Basically, what can the user do if they start getting these messages? We should prompt them with a possible cause/solution if at all possible.

John
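John's "give up after 30 * N seconds" suggestion amounts to capping the number of warnings rather than nagging forever. A minimal sketch, where the 30-second default mirrors missing_uev_msg_delay and the cap of 10 is an assumed value, not anything in the patch:

```python
# Capped warning schedule: warn every `delay` seconds that the udev
# creation event is still missing, but stop after `max_msgs` warnings
# instead of printing for eternity. `max_msgs` is a hypothetical knob.
def missing_uev_messages(waited_seconds, delay=30, max_msgs=10):
    """Return how many warnings would have been emitted by now."""
    return min(waited_seconds // delay, max_msgs)

assert missing_uev_messages(29) == 0     # not yet due
assert missing_uev_messages(95) == 3     # warned at 30s, 60s, 90s
assert missing_uev_messages(3600) == 10  # capped, not eternal
```

After the cap is hit, the daemon could fall through (e.g. treat the device as started anyway, or fail it) so a boot is never blocked indefinitely, which is the recovery path John is asking for.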
Re: [dm-devel] [PATCH 01/17] multipathd: use /run instead of /var/run
Benjamin> /var/run is now usually a symlink to /run. If /var is on a separate
Benjamin> filesystem, when multipathd starts up, it might end up writing to
Benjamin> /var/run before the /var filesystem is mounted and thus not have its
Benjamin> pidfile accessible at /var/run afterwards. On most distributions /run
Benjamin> is now a tmpfs and should always be available before multipathd is
Benjamin> started, so multipath should just write there directly, instead of
Benjamin> through the symlink.

I'm not sure I agree with this, it really smells more like a distribution problem than a multipathd problem. If multipathd starts up in an initramfs, how does it get reset to the proper /var/run or /run directory then? And especially if it's a symlink, we shouldn't care at all either. What distros are using /run instead of /var/run these days?

John

Benjamin> If /var/run is not a symlink, continue using it.

Benjamin> Signed-off-by: Benjamin Marzinski
Benjamin> ---
Benjamin>  Makefile.inc                    | 10 +-
Benjamin>  libmultipath/defaults.h         |  2 +-
Benjamin>  multipathd/multipathd.init.suse |  2 +-
Benjamin>  3 files changed, 11 insertions(+), 3 deletions(-)

Benjamin> diff --git a/Makefile.inc b/Makefile.inc
Benjamin> index c3ed73f..357dd33 100644
Benjamin> --- a/Makefile.inc
Benjamin> +++ b/Makefile.inc
Benjamin> @@ -21,6 +21,14 @@ ifndef LIB
Benjamin>  	endif
Benjamin>  endif
Benjamin> +ifndef RUN
Benjamin> +	ifeq ($(shell test -L /var/run -o ! -d /var/run && echo 1),1)
Benjamin> +		RUN=run
Benjamin> +	else
Benjamin> +		RUN=var/run
Benjamin> +	endif
Benjamin> +endif
Benjamin> +
Benjamin>  ifndef SYSTEMD
Benjamin>  	ifeq ($(shell systemctl --version > /dev/null 2>&1 && echo 1), 1)
Benjamin>  		SYSTEMD = $(shell systemctl --version 2> /dev/null | sed -n 's/systemd \([0-9]*\)/\1/p')
Benjamin> @@ -54,7 +62,7 @@ ifndef RPM_OPT_FLAGS
Benjamin>  endif
Benjamin>  OPTFLAGS = $(RPM_OPT_FLAGS) -Wunused -Wstrict-prototypes
Benjamin> -CFLAGS = $(OPTFLAGS) -fPIC -DLIB_STRING=\"${LIB}\"
Benjamin> +CFLAGS = $(OPTFLAGS) -fPIC -DLIB_STRING=\"${LIB}\" -DRUN_DIR=\"${RUN}\"
Benjamin>  SHARED_FLAGS = -shared
Benjamin>  %.o: %.c

Benjamin> diff --git a/libmultipath/defaults.h b/libmultipath/defaults.h
Benjamin> index 8902f40..cf1d5be 100644
Benjamin> --- a/libmultipath/defaults.h
Benjamin> +++ b/libmultipath/defaults.h
Benjamin> @@ -26,7 +26,7 @@
Benjamin>  #define MAX_CHECKINT(a) (a << 2)
Benjamin>  #define MAX_DEV_LOSS_TMO 0x7FFF
Benjamin> -#define DEFAULT_PIDFILE "/var/run/multipathd.pid"
Benjamin> +#define DEFAULT_PIDFILE "/" RUN_DIR "/multipathd.pid"
Benjamin>  #define DEFAULT_SOCKET "/org/kernel/linux/storage/multipathd"
Benjamin>  #define DEFAULT_CONFIGFILE "/etc/multipath.conf"
Benjamin>  #define DEFAULT_BINDINGS_FILE "/etc/multipath/bindings"

Benjamin> diff --git a/multipathd/multipathd.init.suse b/multipathd/multipathd.init.suse
Benjamin> index d1319b1..ed699fa 100644
Benjamin> --- a/multipathd/multipathd.init.suse
Benjamin> +++ b/multipathd/multipathd.init.suse
Benjamin> @@ -17,7 +17,7 @@
Benjamin>  PATH=/bin:/usr/bin:/sbin:/usr/sbin
Benjamin>  DAEMON=/sbin/multipathd
Benjamin> -PIDFILE=/var/run/multipathd.pid
Benjamin> +PIDFILE=/run/multipathd.pid
Benjamin>  MPATH_INIT_TIMEOUT=10
Benjamin>  ARGS=""
Benjamin> --
Benjamin> 1.8.3.1
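The Makefile hunk's run-directory choice boils down to one predicate: `test -L /var/run -o ! -d /var/run`, i.e. use /run when /var/run is a symlink or absent, else keep /var/run. Restated as a tiny testable function (the boolean-argument form is for illustration; the real check probes the filesystem):

```python
# Mirror of the Makefile test: RUN=run when /var/run is a symlink or
# does not exist as a directory, RUN=var/run otherwise. The booleans
# stand in for os.path.islink/isdir probes so the logic is testable.
def run_dir(is_link, is_dir):
    """Pick the pidfile directory the way the Makefile's shell test does."""
    return "run" if (is_link or not is_dir) else "var/run"

assert run_dir(is_link=True,  is_dir=True)  == "run"      # modern distros
assert run_dir(is_link=False, is_dir=False) == "run"      # no /var/run at all
assert run_dir(is_link=False, is_dir=True)  == "var/run"  # legacy layout
```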
Re: [dm-devel] [PATCH] [dm-cache] Make the mq policy an alias for smq
>>>>> "Joe" == Joe Thornber writes:

Joe> On Wed, Feb 10, 2016 at 12:06:00PM -0500, John Stoffel wrote:
>>
>> Can you add in some documentation on how you tell which dm-cache
>> policy is actually being used, and how to measure it, etc.? It's a
>> black box and some info would be nice.

Joe> You can get some stats on the cache performance via the status
Joe> ioctl (eg, dmsetup status ...). This will tell you about
Joe> promotions/demotions to the cache, write hits, read hits etc.

I've looked at this output, but I can't make heads or tails of it, nor do the docs (v4.4-rc7) seem to provide any useful headings.

# dmsetup --manglename none status --target cache
data-home: 0 1153433600 cache 8 2443/32768 128 819200/819200 379170 190334 1010554 121237 232 232 0 1 writeback 2 migration_threshold 2048 smq 0 rw -
data-local: 0 702545920 cache 8 2443/32768 128 804148/819200 574195 28503 81586 19118 0 85798 0 1 writeback 2 migration_threshold 2048 smq 0 rw -

But at least I see that I'm using smq for my cache policy.

Joe> There is documentation on cache policies in
Joe> Documentation/device-mapper/cache-policies.txt
Joe> https://github.com/jthornber/linux-2.6/blob/2016-02-10-thin-dev-on-4.4/Documentation/device-mapper/cache-policies.txt

Joe> As for knowing which policy is running; I'm not sure what to say.
Joe> It'll be the one you ask for. If the above patch goes in, then
Joe> there'll be a kernel version where mq becomes the same as smq.
Joe> I'll bump the policy version number to make it clear that mq has
Joe> undergone a big change.

I never explicitly asked for any policy when using lvcache, so that's why I was asking. *grin* Using the -c or -C options doesn't really change things.

I see there's a --noheadings option, but that's not useful without the inverse --headings option, so I can figure out what the results mean without looking into the kernel source code. Should I pull down the source and try to make up a patch for this case?
John
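The headings John is asking for do exist in Documentation/device-mapper/cache.txt, which defines the field order of the status line. Until dmsetup grows a --headings option, a small decoder can label the leading fields of the lines quoted above; the field names below follow that document and should be re-checked against your kernel version:

```python
# Hedged decoder for the leading dm-cache status fields, field order
# per Documentation/device-mapper/cache.txt (verify against your
# kernel): policy args, feature args etc. follow these and are skipped.
FIELDS = ["metadata_block_size", "used/total_metadata_blocks",
          "cache_block_size", "used/total_cache_blocks",
          "read_hits", "read_misses", "write_hits", "write_misses",
          "demotions", "promotions", "dirty"]

def decode(status_tail):
    """Label the numeric fields that follow 'cache' in a status line."""
    return dict(zip(FIELDS, status_tail.split()))

s = decode("8 2443/32768 128 819200/819200 379170 190334 "
           "1010554 121237 232 232 0")
assert s["read_hits"] == "379170"
assert s["promotions"] == "232"
```

Against the data-home line above, this says roughly 379k read hits to 190k read misses and only 232 promotions, which is the kind of at-a-glance summary the thread is after.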
Re: [dm-devel] [PATCH] [dm-cache] Make the mq policy an alias for smq
Can you add in some documentation on how you tell which dm-cache policy is actually being used, and how to measure it, etc.? It's a black box and some info would be nice.

John
Re: [dm-devel] dm-cache + storage client experiment
> "mlody3k" == mlody3k writes:

mlody3k> Hi dm developers!

mlody3k> I'm in the middle of an experiment with dm-cache and a storage client,
mlody3k> and I would be very grateful if some of you could help me with some of
mlody3k> my questions :).

mlody3k> Simply put, I would like to implement write-through caching in the client where:
mlody3k> - dm-cache is only responsible for checking whether requested data is in
mlody3k>   the cache device (return the requested data on a hit, ENODATA on a miss -
mlody3k>   do not read from the origin) - in case of a miss, my client will read the
mlody3k>   data from the origin.
mlody3k> - dm-cache will write only to the cache device - the client will write the
mlody3k>   data to the origin.

So instead, why don't you just have two different mount points, one being /cache, the other being /slow, and have your app look for data in /cache first, then pull from /slow. If you write data, then you just write it yourself where you want.

dm-cache is more of a block-level tool, and you're looking for more of a file-level tool from what I can see. Probably what you really want is to be able to tune and tweak the algorithm inside dm-cache to do what you want. In that case, I don't have a clue how this would happen, unfortunately.
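The two-mount-point scheme suggested above is a file-level lookup with fallback: try the fast tier, then the origin. A sketch where the /cache and /slow tiers are modeled as dicts (the mount points are the hypothetical names from the mail, not anything dm-cache provides):

```python
# File-level write-through lookup: check the fast tier first, fall
# back to the slow origin on a miss, and optionally populate the fast
# tier afterwards. The tiers model /cache and /slow from the mail.
def read_with_fallback(name, cache, slow):
    """Return (data, hit): hit is True when the fast tier had it."""
    if name in cache:
        return cache[name], True
    data = slow[name]          # miss: fetch from the origin
    cache[name] = data         # populate the fast tier for next time
    return data, False

cache, slow = {}, {"a": b"payload"}
assert read_with_fallback("a", cache, slow) == (b"payload", False)  # miss
assert read_with_fallback("a", cache, slow) == (b"payload", True)   # now a hit
```

This keeps the hit/miss policy entirely in the client, which is what the poster wants and what block-level dm-cache does not expose.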