[dm-devel] [PATCH 1/1] dm: remove unnecessary (void*) conversions
Pointer variables of void * type do not require a type cast.

Signed-off-by: XU pengfei
---
 drivers/md/dm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b424a6ee27ba..669f2821376a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2139,7 +2139,7 @@ static void event_callback(void *context)
 {
 	unsigned long flags;
 	LIST_HEAD(uevents);
-	struct mapped_device *md = (struct mapped_device *) context;
+	struct mapped_device *md = context;
 
 	spin_lock_irqsave(&md->uevent_lock, flags);
 	list_splice_init(&md->uevent_list, &uevents);
--
2.18.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] On allowing remounting the partition containing dm-mapped boot disk as read-write
Firstly, thank you for your reply. However, I still think you are mistaken: "There is no double re-mounting of the filesystem". I will try again, but in case my writing isn't convincing, let's agree to disagree until I have more conclusive proof, either confirming what you say or what I say.

I have captured some logs from the initramfs, included near the end of this post. I also tried my antiventoy idea, and I believe it worked! It is included at the end of this post. The logs help to see what's going on.

> Besides caching at the dm layer, there is caching at the filesystem
> layer (files, directories and other stuff). The caching at the DM
> layer is not the issue.

If there is an issue at all, it is at the DM layer. But if you say there is no issue at the DM layer, then maybe there is no issue at all. The two filesystems in question, the one that sits inside a partition of the dm-mapped vdisk-file and the one that sits in the outside partition, do not touch or interfere with one another's operation and metadata structures as they change and modify their own filesystem structures. Any block write from the dm-mapped disk does not go through the outside filesystem driver; it goes directly to the blocks on disk.

> If rootfs mounts the dm-device someplace as a
> (ntfs)filesystem(readonly), and then later you also mount that same
> (ntfs)filesystem (read-write) someplace then you have mounted the
> filesystem twice (not a shared bind mount that appears to be 2 mounts
> but in reality is only a single mount), and the filesystem code is not
> safe when there are 2 unique separate mount instances, the fs caching
> is not shared.

This is not what is happening. The same filesystem is not being re-mounted. In fact, by using rd.break=pre-mount in the initramfs, before pivot, I have confirmed that no mounts have been done. Other than the initramfs filesystem, no disk filesystem exists. But, using a block table, a device map has been created directly to the blocks on disk.
So one can issue:

  dmsetup table /dev/mapper/ventoy

and see the block map. The blocks correspond to the exact block locations of the file on disk. After pivot, the filesystem in one of the partitions inside /dev/mapper/ventoy (specifically /dev/mapper/ventoy4) becomes the rootfs of the OS being booted up. The outside /dev/sdc1 is still unmounted. But after booting up and logging in, it cannot be mounted because, I think, the fs driver cannot get exclusive access to all the blocks that constitute /dev/sdc1, since the device mapper is holding onto some blocks. That's what, I think, the ventoy dm_patch overrides. (You say it's because of a clash with a prior filesystem mount, which is incorrect.)

> ie if the filesystem is already mounted as a filesystem someplace

It is not already mounted as a filesystem someplace.

> (mapping a dm device is not mounting) and you mount it a 2nd time then
> you are at risk of corruption.

As the filesystems are separate, there won't be corruption for that reason. It is not a remount.

> For the mount command to get a busy indicates that something has
> already mounted it someplace and has locked the device (to prevent a
> 2nd corrupting mount from happening), and the entire purpose of the
> module appears to be to disable the safety provided by this locking
> and allow one to mount the fs again.

I've grabbed /proc/mounts to be sure of what I am saying. I think this mount-busy is not due to a filesystem-vs-filesystem clash but a dm-map-vs-filesystem clash. My guess is that the dm_patch is a crude way to make the device mapper think the blocks are still available.

> And note that with double mounts (usually I have seen/debugged the
> crash/outage from them with shared SAN storage) they can "APPEAR" to
> kind of work if you do not use them too much, but eventually the 2
> instances of filesystem code step on each other and it ends badly
> (kernel crash, data corruption, loss of the filesystem).

I know how terrible a double mount can get.
Not exactly a double mount, but I once experienced something similar. I had a VirtualBox VM in Windows accessing a btrfs partition on a disk partition (pass-through). I put the VM to sleep/suspend (not shutdown) and forgot about it, shut down Windows, and rebooted into Linux. There were probably uncommitted writes in its journal. In Linux, simple mounting failed; Linux warned me, but I unwittingly force-mounted the btrfs partition, which remounted the partition from some earlier state, and each successive modification put the filesystem in a more inconsistent state, until it was a complete mess. I don't remember the specific btrfs terminology, but I had to sift through a binary dump of the btrfs fs to find superblocks of the older btrfs root tree of that previous state and then use special btrfs commands to extract the files out.

This ventoy situation is more like: the filesystem on the outside partition has full control of itself and all its files and folders; no one is touching or interfering with it.
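As an aside for readers who have not looked at such a table before: what dmsetup prints for a ventoy-style device is just a list of extents mapped straight onto the underlying disk. The sketch below is hypothetical — the use of the linear target, the sector offsets, and the 8:33 major:minor for /dev/sdc1 are all invented; the real table depends entirely on where the vdisk file's blocks happen to sit:

```
# dmsetup table ventoy        (hypothetical output)
0 204800 linear 8:33 2048
204800 1843200 linear 8:33 411648
```

Each line maps a run of the virtual disk's sectors (start, length) directly onto sectors of /dev/sdc1, which is why writes through the mapping bypass the outer filesystem driver entirely.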
Re: [dm-devel] [v2 PATCH 0/5] Introduce mempool pages bulk allocator and use it in dm-crypt
On Thu, Feb 16, 2023 at 9:45 AM Mikulas Patocka wrote:
>
> On Wed, 15 Feb 2023, Yang Shi wrote:
>
> > On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka wrote:
> > >
> > > On Tue, 14 Feb 2023, Yang Shi wrote:
> > >
> > > > Changelog:
> > > > RFC -> v2:
> > > >   * Added callback variant for page bulk allocator and mempool bulk
> > > >     allocator per Mel Gorman.
> > > >   * Used the callback version in dm-crypt driver.
> > > >   * Some code cleanup and refactor to reduce duplicate code.
> > > >
> > > > rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828...@gmail.com/
> > >
> > > Hi
> > >
> > > This seems like unneeded complication to me. We have alloc_pages(), it
> > > can allocate multiple pages efficiently, so why not use it?
> >
> > The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
> > need contiguous pages at all. This may incur unnecessary compaction
>
> It doesn't hurt that the pages are contiguous - and allocating and freeing
> a few compound pages is even faster than allocating and freeing many
> 0-order pages.

If "allocating many order-0 pages" means calling alloc_page() multiple times, just like what the dm-crypt code does before this patchset, then yes, allocating a compound page may be faster. But that may not be true with the bulk allocator. It also depends on how bad the fragmentation is and how contended the zone lock is. When allocating order-0 pages, the bulk allocator can just take the pages from the pcp list within one call. And the pcp list can actually hold a lot of pages; on my test machine each pcp list can have more than 1000 pages.

> > overhead to the dm-crypt layer when memory is fragmented.
>
> The compaction overhead may be suppressed by the GFP flags (i.e. don't use
> __GFP_DIRECT_RECLAIM).

You could, but you may pressure the mempool much more often when light memory pressure and fragmentation exist.
The bulk allocator may just succeed with light reclamation, without allocating from the mempool.

> > The bulk allocator is a good fit for this use case, which allocates
> > multiple order-0 pages.
> >
> > In addition, filesystem writeback doesn't guarantee power-of-2 pages
> > every time IIUC. But alloc_pages() can only allocate power-of-2 pages.
>
> So, we can allocate more compound pages for the non-power-of-2 case - see
> the next patch that I'm sending.

Thanks for the patch. If the callers are willing to handle the complexity (calculating the proper orders, dealing with the compound pages, etc.), it is definitely an option for them.

> > > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> > > alloc_pages() fails (either because the system is low on memory or
> > > because memory is too fragmented), fall back to the existing code that
> > > does mempool_alloc().
> >
> > My PoC patches just did it this way, but called the bulk allocator.
> > There may be other potential mempool users, as I listed in this cover
> > letter, which may benefit from the bulk allocator. So introducing a new
> > bulk mempool API seems better for the long run, although we just have
> > one user for now. And it makes it easier for other users to gain the
> > benefit by just calling the new API.
>
> This mempool bulk refactoring just makes the code bigger. And it is not
> needed - dm-crypt can fall back to non-bulk mempool allocations.

Do you mean the mempool code? It may be inevitable when adding a new API, but it is not significantly bigger. And the API hides all the details and complexity from the callers, as well as handling all the allocation corner cases in the mm layer. It would make the users' lives much easier. Of course, if the callers are happy to handle all the complexity by themselves, they don't have to call the API.

> In the next email, I'm sending a patch that is noticeably smaller and that
> uses alloc_pages()/__free_pages().

Thanks for the patch.
But if other potential users would like to do the same optimization, the code has to be duplicated everywhere. Maybe not everyone is happy to handle this by themselves.

> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] A hang bug of dm on s390x
Hi Ming,

Thank you for looking into this.

Let me loop in Alasdair, Mike and Zdenek for further comment on the LVM stuff.

Thanks,

Pingfan

On Thu, Feb 16, 2023 at 8:08 AM Ming Lei wrote:
>
> On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> > Hi guys,
> >
> > I encountered a hang issue on a s390x system. The tested kernel is
> > not preemptible and is booted with "nr_cpus=1".
> >
> > The test steps:
> >   umount /home
> >   lvremove /dev/rhel_s390x-kvm-011/home
> >   ## uncomment "snapshot_autoextend_threshold = 70" and
> >   ## "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> >   systemctl enable lvm2-monitor.service
> >   systemctl start lvm2-monitor.service
> >
> >   lvremove -y rhel_s390x-kvm-011/thinp
> >   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
> >   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
> >   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
> >   mount /dev/rhel_s390x-kvm-011/src /mnt
> >   for ((i = 0; i < 4; i++)); do dd if=/dev/zero of=/mnt/test$i.img bs=100M count=1; done
> >
> > And the system hangs with the console log [1].
> >
> > The related kernel config:
> >
> >   CONFIG_PREEMPT_NONE_BUILD=y
> >   CONFIG_PREEMPT_NONE=y
> >   CONFIG_PREEMPT_COUNT=y
> >   CONFIG_SCHED_CORE=y
> >
> > It turns out that when hanging, the kernel is stuck in the dead loop
> > in the function dm_wq_work():
> >
> >   while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >           spin_lock_irq(&md->deferred_lock);
> >           bio = bio_list_pop(&md->deferred);
> >           spin_unlock_irq(&md->deferred_lock);
> >
> >           if (!bio)
> >                   break;
> >           thread_cpu = smp_processor_id();
> >           submit_bio_noacct(bio);
> >   }
> >
> > where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> > keeps generating new bios, and the condition "if (!bio)" cannot be
> > met.
> >
> > After applying the following patch, the issue is gone.
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index e1ea3a7bd9d9..95c9cb07a42f 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
> >  			break;
> >
> >  		submit_bio_noacct(bio);
> > +		cond_resched();
> >  	}
> >  }
> >
> > But I think it is not a proper solution. And without this patch, if
> > "nr_cpus=1" is removed (the system has two cpus), the issue cannot be
> > triggered. That says that with more than one cpu, the above loop can
> > exit via the condition "if (!bio)".
> >
> > Any ideas?
>
> I think the patch is correct.
>
> For a kernel built without CONFIG_PREEMPT, in the case of a single cpu
> core, if the dm target (such as dm-thin) needs another wq or kthread for
> handling IO, then the dm target side is blocked because dm_wq_work()
> holds the single cpu; sooner or later, the dm target may not have
> resources to handle new io from dm core and returns REQUEUE.
>
> Then dm_wq_work becomes one dead loop.
>
> Thanks,
> Ming
>
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
On Thu, Feb 16 2023 at 2:27P -0500, Uday Shankar wrote:

> On Wed, Feb 15, 2023 at 10:09:36PM -0800, Christoph Hellwig wrote:
> > I'd just remove the debug check entirely
>
> Older kernels have these checks in a separate function called
> blk_cloned_rq_check_limits, which carries the following comment:
>
> /**
>  * blk_cloned_rq_check_limits - Helper function to check a cloned request
>  * for the new queue limits
>  * @q: the queue
>  * @rq: the request being checked
>  *
>  * Description:
>  *    @rq may have been made based on weaker limitations of upper-level queues
>  *    in request stacking drivers, and it may violate the limitation of @q.
>  *    Since the block layer and the underlying device driver trust @rq
>  *    after it is inserted to @q, it should be checked against @q before
>  *    the insertion using this generic function.
>  *
>  *    Request stacking drivers like request-based dm may change the queue
>  *    limits when retrying requests on other queues. Those requests need
>  *    to be checked against the new queue limits again during dispatch.
>  */
>
> Is this concern no longer relevant?

Still relevant.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH v2] dm-crypt: allocate compound pages if possible
It was reported that allocating pages for the write buffer in dm-crypt causes measurable overhead [1].

This patch changes dm-crypt to allocate compound pages if they are available. If not, we fall back to the mempool.

[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

Signed-off-by: Mikulas Patocka
---
 drivers/md/dm-crypt.c | 48 +---
 1 file changed, 33 insertions(+), 15 deletions(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2023-02-16 20:40:15.0 +0100
+++ linux-2.6/drivers/md/dm-crypt.c	2023-02-16 21:56:34.0 +0100
@@ -1657,6 +1657,9 @@ static void crypt_free_buffer_pages(stru
  * In order to not degrade performance with excessive locking, we try
  * non-blocking allocations without a mutex first but on failure we fallback
  * to blocking allocations with a mutex.
+ *
+ * In order to reduce allocation overhead, we try to allocate compound pages in
+ * the first pass. If they are not available, we fall back to the mempool.
  */
 static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 {
@@ -1664,8 +1667,8 @@ static struct bio *crypt_alloc_buffer(st
 	struct bio *clone;
 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
-	unsigned i, len, remaining_size;
-	struct page *page;
+	unsigned remaining_size;
+	unsigned order = MAX_ORDER - 1;
 
 retry:
 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1678,20 +1681,34 @@ retry:
 
 	remaining_size = size;
 
-	for (i = 0; i < nr_iovecs; i++) {
-		page = mempool_alloc(&cc->page_pool, gfp_mask);
-		if (!page) {
+	while (remaining_size) {
+		struct page *pages;
+		unsigned size_to_add;
+		unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
+		order = min(order, remaining_order);
+
+		while (order > 0) {
+			pages = alloc_pages(gfp_mask
+				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
+				order);
+			if (likely(pages != NULL))
+				goto have_pages;
+			order--;
+		}
+
+		pages = mempool_alloc(&cc->page_pool, gfp_mask);
+		if (!pages) {
 			crypt_free_buffer_pages(cc, clone);
 			bio_put(clone);
 			gfp_mask |= __GFP_DIRECT_RECLAIM;
+			order = 0;
 			goto retry;
 		}
 
-		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
-
-		bio_add_page(clone, page, len, 0);
-
-		remaining_size -= len;
+have_pages:
+		size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
+		bio_add_page(clone, pages, size_to_add, 0);
+		remaining_size -= size_to_add;
 	}
 
 	/* Allocate space for integrity tags */
@@ -1709,12 +1726,13 @@ retry:
 
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
-	struct bio_vec *bv;
-	struct bvec_iter_all iter_all;
+	struct folio_iter fi;
 
-	bio_for_each_segment_all(bv, clone, iter_all) {
-		BUG_ON(!bv->bv_page);
-		mempool_free(bv->bv_page, &cc->page_pool);
+	bio_for_each_folio_all(fi, clone) {
+		if (folio_test_large(fi.folio))
+			folio_put(fi.folio);
+		else
+			mempool_free(&fi.folio->page, &cc->page_pool);
 	}
 }

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH] dm-crypt: allocate compound pages if possible
On Thu, 16 Feb 2023, Matthew Wilcox wrote:

> > -		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
> > -
> > -		bio_add_page(clone, page, len, 0);
> > +have_pages:
> > +		page->compound_order = order;
>
> No. You'll corrupt the next page if page is order-0, which it is if it
> came from the mempool. Also we've deleted page->compound_order in -next
> so you can't make this mistake. Using __GFP_COMP will set this field
> for you, so you can just drop this line.

OK

> > -		remaining_size -= len;
> > +		for (o = 0; o < 1U << order; o++) {
> > +			unsigned len = min((unsigned)PAGE_SIZE, remaining_size);
> > +			bio_add_page(clone, page, len, 0);
> > +			remaining_size -= len;
> > +			page++;
>
> You can add multiple pages at once, whether they're compound or not. So
> replace this entire loop with:
>
> 	bio_add_page(clone, page, remaining_size, 0);

This should be min((unsigned)PAGE_SIZE << order, remaining_size), because we may allocate less than remaining_size.

> > @@ -1711,10 +1732,23 @@ static void crypt_free_buffer_pages(stru
> >  {
> >  	struct bio_vec *bv;
> >  	struct bvec_iter_all iter_all;
> > +	unsigned skip_entries = 0;
> >
> >  	bio_for_each_segment_all(bv, clone, iter_all) {
> > -		BUG_ON(!bv->bv_page);
> > -		mempool_free(bv->bv_page, &cc->page_pool);
> > +		unsigned order;
> > +		struct page *page = bv->bv_page;
> > +		BUG_ON(!page);
> > +		if (skip_entries) {
> > +			skip_entries--;
> > +			continue;
> > +		}
> > +		order = page->compound_order;
> > +		if (order) {
> > +			__free_pages(page, order);
> > +			skip_entries = (1U << order) - 1;
> > +		} else {
> > +			mempool_free(page, &cc->page_pool);
> > +		}
>
> You can simplify this by using the folio code.
>
> 	struct folio_iter fi;
>
> 	bio_for_each_folio_all(fi, bio) {
> 		if (folio_test_large(fi.folio))
> 			folio_put(fi.folio);
> 		else
> 			mempool_free(&fi.folio->page, &cc->page_pool);
> 	}

OK. I'm sending version 2 of the patch.

> (further work would actually convert this driver to use folios instead
> of pages)

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
On Wed, Feb 15, 2023 at 10:09:36PM -0800, Christoph Hellwig wrote:
> I'd just remove the debug check entirely

Older kernels have these checks in a separate function called blk_cloned_rq_check_limits, which carries the following comment:

/**
 * blk_cloned_rq_check_limits - Helper function to check a cloned request
 * for the new queue limits
 * @q: the queue
 * @rq: the request being checked
 *
 * Description:
 *    @rq may have been made based on weaker limitations of upper-level queues
 *    in request stacking drivers, and it may violate the limitation of @q.
 *    Since the block layer and the underlying device driver trust @rq
 *    after it is inserted to @q, it should be checked against @q before
 *    the insertion using this generic function.
 *
 *    Request stacking drivers like request-based dm may change the queue
 *    limits when retrying requests on other queues. Those requests need
 *    to be checked against the new queue limits again during dispatch.
 */

Is this concern no longer relevant?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH] dm-crypt: allocate compound pages if possible
On Thu, Feb 16, 2023 at 12:47:08PM -0500, Mikulas Patocka wrote:
> +		while (order > 0) {
> +			page = alloc_pages(gfp_mask
> +				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN, order);

	... | __GFP_COMP

> 		page = mempool_alloc(&cc->page_pool, gfp_mask);
> 		if (!page) {
> 			crypt_free_buffer_pages(cc, clone);
> 			bio_put(clone);
> 			gfp_mask |= __GFP_DIRECT_RECLAIM;
> +			order = 0;
> 			goto retry;
> 		}
>
> -		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
> -
> -		bio_add_page(clone, page, len, 0);
> +have_pages:
> +		page->compound_order = order;

No.  You'll corrupt the next page if page is order-0, which it is if it
came from the mempool.  Also we've deleted page->compound_order in -next
so you can't make this mistake.  Using __GFP_COMP will set this field
for you, so you can just drop this line.

> -		remaining_size -= len;
> +		for (o = 0; o < 1U << order; o++) {
> +			unsigned len = min((unsigned)PAGE_SIZE, remaining_size);
> +			bio_add_page(clone, page, len, 0);
> +			remaining_size -= len;
> +			page++;

You can add multiple pages at once, whether they're compound or not.  So
replace this entire loop with:

	bio_add_page(clone, page, remaining_size, 0);

> @@ -1711,10 +1732,23 @@ static void crypt_free_buffer_pages(stru
>  {
>  	struct bio_vec *bv;
>  	struct bvec_iter_all iter_all;
> +	unsigned skip_entries = 0;
>
>  	bio_for_each_segment_all(bv, clone, iter_all) {
> -		BUG_ON(!bv->bv_page);
> -		mempool_free(bv->bv_page, &cc->page_pool);
> +		unsigned order;
> +		struct page *page = bv->bv_page;
> +		BUG_ON(!page);
> +		if (skip_entries) {
> +			skip_entries--;
> +			continue;
> +		}
> +		order = page->compound_order;
> +		if (order) {
> +			__free_pages(page, order);
> +			skip_entries = (1U << order) - 1;
> +		} else {
> +			mempool_free(page, &cc->page_pool);
> +		}

You can simplify this by using the folio code.

	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		if (folio_test_large(fi.folio))
			folio_put(fi.folio);
		else
			mempool_free(&fi.folio->page, &cc->page_pool);
	}

(further work would actually convert this driver to use folios instead
of pages)

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
[dm-devel] [PATCH] dm-crypt: allocate compound pages if possible
It was reported that allocating pages for the write buffer in dm-crypt causes measurable overhead [1].

This patch changes dm-crypt to allocate compound pages if they are available. If not, we fall back to the mempool.

[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

Signed-off-by: Mikulas Patocka
---
 drivers/md/dm-crypt.c | 50 ++
 1 file changed, 42 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2023-01-20 13:22:38.0 +0100
+++ linux-2.6/drivers/md/dm-crypt.c	2023-02-16 18:33:42.0 +0100
@@ -1657,6 +1657,9 @@ static void crypt_free_buffer_pages(stru
  * In order to not degrade performance with excessive locking, we try
  * non-blocking allocations without a mutex first but on failure we fallback
  * to blocking allocations with a mutex.
+ *
+ * In order to reduce allocation overhead, we try to allocate compound pages in
+ * the first pass. If they are not available, we fall back to the mempool.
  */
 static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 {
@@ -1664,8 +1667,9 @@ static struct bio *crypt_alloc_buffer(st
 	struct bio *clone;
 	unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
-	unsigned i, len, remaining_size;
+	unsigned remaining_size;
 	struct page *page;
+	unsigned order = MAX_ORDER - 1;
 
 retry:
 	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1678,20 +1682,37 @@ retry:
 
 	remaining_size = size;
 
-	for (i = 0; i < nr_iovecs; i++) {
+	while (remaining_size) {
+		unsigned o;
+		unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
+		order = min(order, remaining_order);
+
+		while (order > 0) {
+			page = alloc_pages(gfp_mask
+				| __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN, order);
+			if (likely(page != NULL))
+				goto have_pages;
+			order--;
+		}
+
 		page = mempool_alloc(&cc->page_pool, gfp_mask);
 		if (!page) {
 			crypt_free_buffer_pages(cc, clone);
 			bio_put(clone);
 			gfp_mask |= __GFP_DIRECT_RECLAIM;
+			order = 0;
 			goto retry;
 		}
 
-		len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
-
-		bio_add_page(clone, page, len, 0);
+have_pages:
+		page->compound_order = order;
 
-		remaining_size -= len;
+		for (o = 0; o < 1U << order; o++) {
+			unsigned len = min((unsigned)PAGE_SIZE, remaining_size);
+			bio_add_page(clone, page, len, 0);
+			remaining_size -= len;
+			page++;
+		}
 	}
 
 	/* Allocate space for integrity tags */
@@ -1711,10 +1732,23 @@ static void crypt_free_buffer_pages(stru
 {
 	struct bio_vec *bv;
 	struct bvec_iter_all iter_all;
+	unsigned skip_entries = 0;
 
 	bio_for_each_segment_all(bv, clone, iter_all) {
-		BUG_ON(!bv->bv_page);
-		mempool_free(bv->bv_page, &cc->page_pool);
+		unsigned order;
+		struct page *page = bv->bv_page;
+		BUG_ON(!page);
+		if (skip_entries) {
+			skip_entries--;
+			continue;
+		}
+		order = page->compound_order;
+		if (order) {
+			__free_pages(page, order);
+			skip_entries = (1U << order) - 1;
+		} else {
+			mempool_free(page, &cc->page_pool);
+		}
 	}
 }

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [v2 PATCH 0/5] Introduce mempool pages bulk allocator and use it in dm-crypt
On Wed, 15 Feb 2023, Yang Shi wrote:

> On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka wrote:
> >
> > On Tue, 14 Feb 2023, Yang Shi wrote:
> >
> > > Changelog:
> > > RFC -> v2:
> > >   * Added callback variant for page bulk allocator and mempool bulk
> > >     allocator per Mel Gorman.
> > >   * Used the callback version in dm-crypt driver.
> > >   * Some code cleanup and refactor to reduce duplicate code.
> > >
> > > rfc: https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828...@gmail.com/
> >
> > Hi
> >
> > This seems like unneeded complication to me. We have alloc_pages(), it
> > can allocate multiple pages efficiently, so why not use it?
>
> The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
> need contiguous pages at all. This may incur unnecessary compaction

It doesn't hurt that the pages are contiguous - and allocating and freeing
a few compound pages is even faster than allocating and freeing many
0-order pages.

> overhead to the dm-crypt layer when memory is fragmented.

The compaction overhead may be suppressed by the GFP flags (i.e. don't use
__GFP_DIRECT_RECLAIM).

> The bulk allocator is a good fit for this use case, which allocates
> multiple order-0 pages.
>
> In addition, filesystem writeback doesn't guarantee power-of-2 pages
> every time IIUC. But alloc_pages() can only allocate power-of-2 pages.

So, we can allocate more compound pages for the non-power-of-2 case - see
the next patch that I'm sending.

> > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> > alloc_pages() fails (either because the system is low on memory or
> > because memory is too fragmented), fall back to the existing code that
> > does mempool_alloc().
>
> My PoC patches just did it this way, but called the bulk allocator.
> There may be other potential mempool users, as I listed in this cover
> letter, which may benefit from the bulk allocator. So introducing a new
> bulk mempool API seems better for the long run, although we just have
> one user for now. And it makes it easier for other users to gain the
> benefit by just calling the new API.

This mempool bulk refactoring just makes the code bigger. And it is not
needed - dm-crypt can fall back to non-bulk mempool allocations.

In the next email, I'm sending a patch that is noticeably smaller and that
uses alloc_pages()/__free_pages().

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] A hang bug of dm on s390x
[Top-posting but please don't...]

I've staged this fix for 6.3 inclusion and marked it for stable@:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.3&id=0ca44fcef241768fd25ee763b3d203b9852f269b

Ming, I also staged this similar fix (I have not reasoned through a scenario where dm_wq_requeue_work would actually loop endlessly, but it is good practice to include cond_resched() in such a workqueue while loop):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.3&id=f77692d65d54665d81815349cc727baa85e8b71d

Thanks,
Mike

On Thu, Feb 16 2023 at 3:30P -0500, Pingfan Liu wrote:

> Hi Ming,
>
> Thank you for looking into this.
>
> Let me loop in Alasdair, Mike and Zdenek for further comment on the LVM stuff.
>
> Thanks,
>
> Pingfan
>
> On Thu, Feb 16, 2023 at 8:08 AM Ming Lei wrote:
> >
> > On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> > > Hi guys,
> > >
> > > I encountered a hang issue on a s390x system. The tested kernel is
> > > not preemptible and is booted with "nr_cpus=1".
> > >
> > > The test steps:
> > >   umount /home
> > >   lvremove /dev/rhel_s390x-kvm-011/home
> > >   ## uncomment "snapshot_autoextend_threshold = 70" and
> > >   ## "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> > >   systemctl enable lvm2-monitor.service
> > >   systemctl start lvm2-monitor.service
> > >   lvremove -y rhel_s390x-kvm-011/thinp
> > >   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
> > >   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
> > >   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
> > >   mount /dev/rhel_s390x-kvm-011/src /mnt
> > >   for ((i = 0; i < 4; i++)); do dd if=/dev/zero of=/mnt/test$i.img bs=100M count=1; done
> > >
> > > And the system hangs with the console log [1].
> > >
> > > The related kernel config:
> > >   CONFIG_PREEMPT_NONE_BUILD=y
> > >   CONFIG_PREEMPT_NONE=y
> > >   CONFIG_PREEMPT_COUNT=y
> > >   CONFIG_SCHED_CORE=y
> > >
> > > It turns out that when hanging, the kernel is stuck in the dead loop
> > > in the function dm_wq_work():
> > >
> > >   while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > >           spin_lock_irq(&md->deferred_lock);
> > >           bio = bio_list_pop(&md->deferred);
> > >           spin_unlock_irq(&md->deferred_lock);
> > >
> > >           if (!bio)
> > >                   break;
> > >           thread_cpu = smp_processor_id();
> > >           submit_bio_noacct(bio);
> > >   }
> > >
> > > where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> > > keeps generating new bios, and the condition "if (!bio)" cannot be
> > > met.
> > >
> > > After applying the following patch, the issue is gone.
> > >
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index e1ea3a7bd9d9..95c9cb07a42f 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
> > >  			break;
> > >
> > >  		submit_bio_noacct(bio);
> > > +		cond_resched();
> > >  	}
> > >  }
> > >
> > > But I think it is not a proper solution. And without this patch, if
> > > "nr_cpus=1" is removed (the system has two cpus), the issue cannot be
> > > triggered. That says that with more than one cpu, the above loop can
> > > exit via the condition "if (!bio)".
> > >
> > > Any ideas?
> >
> > I think the patch is correct.
> >
> > For a kernel built without CONFIG_PREEMPT, in the case of a single cpu
> > core, if the dm target (such as dm-thin) needs another wq or kthread
> > for handling IO, then the dm target side is blocked because
> > dm_wq_work() holds the single cpu; sooner or later, the dm target may
> > not have resources to handle new io from dm core and returns REQUEUE.
> >
> > Then dm_wq_work becomes one dead loop.
> >
> > Thanks,
> > Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
Hi Uday,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 6bea9ac7c6481c09eb2b61d7cd844fc64a526e3e]

url: https://github.com/intel-lab-lkp/linux/commits/Uday-Shankar/blk-mq-enforce-op-specific-segment-limits-in-blk_insert_cloned_request/20230216-041718
base: 6bea9ac7c6481c09eb2b61d7cd844fc64a526e3e
patch link: https://lore.kernel.org/r/20230215201507.494152-1-ushankar%40purestorage.com
patch subject: [PATCH] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
config: i386-randconfig-a004-20230213 (https://download.01.org/0day-ci/archive/20230216/202302162040.fai25ul2-...@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2cd958b09698bedea1a5a5f6298f0d25ec522bf9
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Uday-Shankar/blk-mq-enforce-op-specific-segment-limits-in-blk_insert_cloned_request/20230216-041718
        git checkout 2cd958b09698bedea1a5a5f6298f0d25ec522bf9
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add the following tags where applicable:
| Reported-by: kernel test robot
| Link: https://lore.kernel.org/oe-kbuild-all/202302162040.fai25ul2-...@intel.com/

All warnings (new ones prefixed by >>):

>> block/blk-mq.c:3032:36: warning: format specifies type 'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                           __func__, rq->nr_phys_segments, max_segments);
                                                           ^~~~
   include/linux/printk.h:457:60: note: expanded from macro 'printk'
   #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
                                                            ~~~^~~
   include/linux/printk.h:429:19: note: expanded from macro 'printk_index_wrap'
                   _p_func(_fmt, ##__VA_ARGS__);                           \
                          ^~~
   1 warning generated.

vim +3032 block/blk-mq.c

  2993
  2994  #ifdef CONFIG_BLK_MQ_STACKING
  2995  /**
  2996   * blk_insert_cloned_request - Helper for stacking drivers to submit a request
  2997   * @rq: the request being queued
  2998   */
  2999  blk_status_t blk_insert_cloned_request(struct request *rq)
  3000  {
  3001          struct request_queue *q = rq->q;
  3002          unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
  3003          unsigned int max_segments = blk_queue_get_max_segments(q, req_op(rq));
  3004          blk_status_t ret;
  3005
  3006          if (blk_rq_sectors(rq) > max_sectors) {
  3007                  /*
  3008                   * SCSI device does not have a good way to return if
  3009                   * Write Same/Zero is actually supported. If a device rejects
  3010                   * a non-read/write command (discard, write same, etc.) the
  3011                   * low-level device driver will set the relevant queue limit to
  3012                   * 0 to prevent blk-lib from issuing more of the offending
  3013                   * operations. Commands queued prior to the queue limit being
  3014                   * reset need to be completed with BLK_STS_NOTSUPP to avoid I/O
  3015                   * errors being propagated to upper layers.
  3016                   */
  3017                  if (max_sectors == 0)
  3018                          return BLK_STS_NOTSUPP;
  3019
  3020                  printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
  3021                         __func__, blk_rq_sectors(rq), max_sectors);
  3022                  return BLK_STS_IOERR;
  3023          }
  3024
  3025          /*
  3026           * The queue settings related to segment counting may differ from the
  3027           * original queue.
  3028           */
  3029          rq->nr_phys_segments = blk_recalc_rq_segments(rq);
  3030          if (rq->nr_phys_segments > max_segments) {
  3031                  printk(KERN_ERR "%s: over max segments limit. (%hu > %hu)\n",
> 3032                         __func__, rq->nr_phys_segments, max_segments);
  3033                  return BLK_STS_IOERR;
  3034          }
  3035
  3036          if (q->disk && should_fail_request(q->disk->part0, blk_rq_bytes(rq)))
  3037                  re
Re: [dm-devel] A hang bug of dm on s390x
On 16. 02. 23 at 9:30, Pingfan Liu wrote:
> Hi Ming,
>
> Thank you for looking into this.
>
> Let me loop in Alasdair, Mike and Zdenek for further comment on the LVM stuff.
>
> Thanks,
> Pingfan

Hi

From the lvm2 POV - a couple of clarifications - to let a thin-pool auto-extend, the user has to configure:

  thin_pool_autoextend_threshold = 70
  thin_pool_autoextend_percent = 20

If thin_pool_autoextend_threshold is left at the default value 100, no extension of the thin-pool is made.

The default behavior of the thin-pool kernel target when it runs out of space is to put all in-flight IO operations on hold for 60s (configurable by a kernel module parameter); then all such operations start to be errored and the thin-pool goes to an out-of-space error state.

To get this state immediately, use '--errorwhenfull=y' with the thin-pool (lvcreate, lvconvert, lvchange) - this avoids any delay if the user doesn't want expansion of the thin-pool and wants to get the error ASAP.

But this all might be unrelated to the issue you are getting on your hw.

Regards
Zdenek

On Thu, Feb 16, 2023 at 8:08 AM Ming Lei wrote:
> On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> > Hi guys,
> >
> > I encountered a hang issue on a s390x system.
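For reference, the autoextend settings Zdenek mentions live in the activation section of lvm.conf. A minimal fragment might look like this (values taken from the mail above; the surrounding section layout is assumed from a stock lvm.conf):

```
# /etc/lvm/lvm.conf
activation {
    # Autoextend only triggers when threshold < 100;
    # the default of 100 disables autoextension entirely.
    thin_pool_autoextend_threshold = 70
    thin_pool_autoextend_percent = 20
}
```

The immediate-error behavior can likewise be set per-LV, e.g. `lvchange --errorwhenfull y VG/thinpool`, instead of waiting out the queue-IO grace period.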
> > The tested kernel is not preemptible and boots with "nr_cpus=1".
> >
> > The test steps:
> >   umount /home
> >   lvremove /dev/rhel_s390x-kvm-011/home
> >   ## uncomment "snapshot_autoextend_threshold = 70" and
> >   ## "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> >   systemctl enable lvm2-monitor.service
> >   systemctl start lvm2-monitor.service
> >   lvremove -y rhel_s390x-kvm-011/thinp
> >   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
> >   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
> >   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
> >   mount /dev/rhel_s390x-kvm-011/src /mnt
> >   for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img bs=100M count=1; done
> >
> > And the system hangs with the console log [1].
> >
> > The related kernel config:
> >
> >   CONFIG_PREEMPT_NONE_BUILD=y
> >   CONFIG_PREEMPT_NONE=y
> >   CONFIG_PREEMPT_COUNT=y
> >   CONFIG_SCHED_CORE=y
> >
> > It turns out that when hanging, the kernel is stuck in the dead loop
> > in the function dm_wq_work():
> >
> >   while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >           spin_lock_irq(&md->deferred_lock);
> >           bio = bio_list_pop(&md->deferred);
> >           spin_unlock_irq(&md->deferred_lock);
> >
> >           if (!bio)
> >                   break;
> >           thread_cpu = smp_processor_id();
> >           submit_bio_noacct(bio);
> >   }
> >
> > where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> > keeps generating new bios, so the condition "if (!bio)" can never be
> > met.
> >
> > After applying the following patch, the issue is gone:
> >
> >   diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >   index e1ea3a7bd9d9..95c9cb07a42f 100644
> >   --- a/drivers/md/dm.c
> >   +++ b/drivers/md/dm.c
> >   @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
> >                           break;
> >
> >                   submit_bio_noacct(bio);
> >   +               cond_resched();
> >           }
> >    }
> >
> > But I think it is not a proper solution. And without this patch, if
> > removing nr_cpus=1 (the system has two cpus), the issue can not be
> > triggered. That says that with more than one cpu, the above loop can
> > exit via the condition "if (!bio)".
> >
> > Any ideas?
>
> I think the patch is correct.
> For a kernel built without CONFIG_PREEMPT, in case of a single cpu core,
> if the dm target (such as dm-thin) needs another wq or kthread for
> handling IO, then the dm target side is blocked because dm_wq_work()
> holds the single cpu; sooner or later, the dm target may not have
> resources to handle new IO from dm core and returns REQUEUE.
>
> Then dm_wq_work becomes one dead loop.
>
> Thanks,
> Ming
Re: [dm-devel] On allowing remounting the partition containing dm-mapped boot disk as read-write
Besides caching at the dm layer, there is caching at the filesystem layer (files, directories and other stuff).

If rootfs mounts the dm-device someplace as a (ntfs) filesystem (read-only), and then later you also mount that same (ntfs) filesystem (read-write) someplace, then you have mounted the filesystem twice (not a shared bind mount that appears to be 2 mounts but in reality is only a single mount), and the filesystem code is not safe when there are 2 unique separate mount instances; the fs caching is not shared.

I.e. if the filesystem is already mounted as a filesystem someplace (mapping a dm device is not mounting) and you mount it a 2nd time, then you are at risk of corruption. The mount command getting a "busy" indicates that something has already mounted it someplace and has locked the device (to prevent a 2nd corrupting mount from happening), and the entire purpose of the module appears to be to disable the safety provided by this locking and allow one to mount the fs again.

And note that with double mounts (usually I have seen/debugged the crash/outage from them with shared SAN storage) they can "APPEAR" to kind of work if you do not use them too much, but eventually the 2 instances of filesystem code step on each other and it ends badly (kernel crash, data corruption, loss of the filesystem).

If you need this module so that you can use another (non-bind) mount command someplace on the same device, then you are double mounting. The point of the busy device is that it is mounted someplace, and the entire point of the module is to bypass that protection.

It is the filesystem code and filesystem caching that is not safe. Both filesystems assume they have total control of the filesystem data structures; at the very least the writing mount will change those structures, and the read-only mount may read blocks that were part of one file but now aren't part of that file, returning the wrong data.
On Wed, Feb 15, 2023 at 11:23 PM Ganapathi Kamath wrote:
>
> Firstly, thank you for your reply. I'm not a kernel expert, so I value what
> you say, but as I raised the issue I felt I had to defend the usefulness,
> userbase and the need.
>
> > Typically double mounts are done via bind mounts (not really double
> > mounted, just the device showing someplace else). Or one would do a
> > mount -o remount,rw and remount it rw so you could write to it.
>
> > A real double mount, where the kernel fs module manages both mounts as
> > if it was a separate device, won't work reliably, as the filesystem
> > module/caching assumes it has total control. So writing to a
> > read-write device that has a separate read-only mount with some data
> > in the read cache will not return the true data in all cases. 2
> > read-write (done any way--except with a clustering fs module) are
> > going to cause corruption of the underlying filesystem.
>
> I want to clarify: even though ventoy uses the word 'remount' to describe the
> feature, the host file system isn't mounted twice. There is no loopback-fs
> to bind mount, and dmsetup status shows linear sequences of blocks allocated
> to the devmapper device.
>
> For this feature to work, the Linux being booted up creates the devmapper
> device after having first somehow determined the sectors occupied by
> the file in the filesystem. It then mounts the partitions inside the
> devmapper device, pivots to the discovered rootfs and continues booting.
>
> So what I think you are saying is that a mount of /dev/sdc1 and
> /dev/mapper/ventoy are both claiming to use the same hard-disk sectors,
> asking the kernel to consider them as part of its disk-sector caching
> mechanism.
>
> Booting virtual disks this way is also called nativeboot.
> The way this nativeboot works so far has a little danger.
> Three safeguards are to be followed by self-contract:
> 1) The virtual-disk-file must be a fixed size; it cannot be allowed to grow
> or shrink.
> 2) The virtual-disk-file must not be manipulated/touched/read from the
> host-partition.
> 3) The filesystem driver shouldn't defrag, scrub or relocate the
> virtual-disk-file.
> This is so that the file entry in the outside fs remains untouched.
> Usually, as this is done by root, such an administrative user knows what he
> is doing, so it is not so much of a problem.
> If one adheres to the above self-contract, the filesystem code for
> partitions inside the dm-device does not interfere with the filesystem code
> for the outside partition.
>
> An idea just came to me. There may be a way around the disk-sector caching.
> Would this work, or be safer?
> What if, in addition to /dev/mapper/ventoy, a second dm device
> /dev/mapper/antiventoy, the same size as the host partition /dev/sdc1, is
> created by stitching together the other remaining sectors of the host
> partition /dev/sdc1, with the vdisk sectors swapped for null sectors. Then
> the two dm-created disks:
>
Re: [dm-devel] On allowing remounting the partition containing dm-mapped boot disk as read-write
Firstly, thank you for your reply. I'm not a kernel expert, so I value what you say, but as I raised the issue I felt I had to defend the usefulness, userbase and the need.

> Typically double mounts are done via bind mounts (not really double
> mounted, just the device showing someplace else). Or one would do a
> mount -o remount,rw and remount it rw so you could write to it.

> A real double mount, where the kernel fs module manages both mounts as
> if it was a separate device, won't work reliably, as the filesystem
> module/caching assumes it has total control. So writing to a
> read-write device that has a separate read-only mount with some data
> in the read cache will not return the true data in all cases. 2
> read-write (done any way--except with a clustering fs module) are
> going to cause corruption of the underlying filesystem.

I want to clarify: even though ventoy uses the word 'remount' to describe the feature, the host file system isn't mounted twice. There is no loopback-fs to bind mount, and dmsetup status shows linear sequences of blocks allocated to the devmapper device.

For this feature to work, the Linux being booted up creates the devmapper device after having first somehow determined the sectors occupied by the file in the filesystem. It then mounts the partitions inside the devmapper device, pivots to the discovered rootfs and continues booting.

So what I think you are saying is that a mount of /dev/sdc1 and /dev/mapper/ventoy are both claiming to use the same hard-disk sectors, asking the kernel to consider them as part of its disk-sector caching mechanism.

Booting virtual disks this way is also called nativeboot. The way this nativeboot works so far has a little danger. Three safeguards are to be followed by self-contract:
1) The virtual-disk-file must be a fixed size; it cannot be allowed to grow or shrink.
2) The virtual-disk-file must not be manipulated/touched/read from the host-partition.
3) The filesystem driver shouldn't defrag, scrub or relocate the virtual-disk-file.

This is so that the file entry in the outside fs remains untouched. Usually, as this is done by root, such an administrative user knows what he is doing, so it is not so much of a problem. If one adheres to the above self-contract, the filesystem code for partitions inside the dm-device does not interfere with the filesystem code for the outside partition.

An idea just came to me. There may be a way around the disk-sector caching. Would this work, or be safer? What if, in addition to /dev/mapper/ventoy, a second dm device /dev/mapper/antiventoy, the same size as the host partition /dev/sdc1, is created by stitching together the other remaining sectors of the host partition /dev/sdc1, with the vdisk sectors swapped for null sectors. Then the two dm-created disks: /dev/mapper/ventoy and /dev/mapper/antiventoy can be mounted independently, without overlap of disk sectors, separating their caching. The self-contract will still be needed, to not alter the location/size of the fs entry. I'll suggest the above to ventoy. Your thoughts will be helpful.

> Given that the use case for the module is dangerous and the use case
> is of questionable usefulness I would think that is no point of the
> module. The module's intent seems to be to get around the exclusive
> locks that the filesystem (for good reason) is putting on the device.

I believe the danger can be mitigated with a good idea and proper coding, but the door shouldn't be shut. Its usefulness and user base are really there. The use case is really important:
1) to those users who dual-boot windows/linux or multi-boot other OSes and juggle between them for want of SSD space,
2) to multi-booting alternate OSes with limited on-machine disk space,
3) to mounting different live ISOs often, which are often re-downloaded due to updates,
4) to those keeping a host of recovery ISOs at hand for emergencies, like WINRE, WIN-installer, linux-installer, HBCD-PE, gparted, clonezilla, rescuezilla, seabios, samsung-wizard.

Why not a VM? VMs are nice but a bit slower than nativeboot. Many things cannot be done inside a VM, such as getting the full native experience of a live ISO, GPU support and all. Some system-level recovery and repair tools must be booted natively.

In the old days hard disks, USB drives and ISO files were small; vdisks were nonexistent. One had to burn live ISOs to CD/DVD. Disks have grown larger, and burning DVDs is such a waste now. At one point I considered having a small number of 8GB microsd cards to function just like tiny DVDs/floppies, but managing them is also a hassle, as they are stored externally.

Disadvantages of flashing a USB drive:
* Flashing a USB drive, which say is 32GB, with a tiny <3GB ISO file can result in wasted space, as it creates a small partition, limiting the drive's usefulness.
* One doesn't want too many USB drives lying around to boot different ISOs.
* In
[dm-devel] A hang bug of dm on s390x
Hi guys,

I encountered a hang issue on a s390x system. The tested kernel is not preemptible and boots with "nr_cpus=1".

The test steps:
  umount /home
  lvremove /dev/rhel_s390x-kvm-011/home
  ## uncomment "snapshot_autoextend_threshold = 70" and
  ## "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
  systemctl enable lvm2-monitor.service
  systemctl start lvm2-monitor.service
  lvremove -y rhel_s390x-kvm-011/thinp
  lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
  lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
  mkfs.ext4 /dev/rhel_s390x-kvm-011/src
  mount /dev/rhel_s390x-kvm-011/src /mnt
  for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img bs=100M count=1; done

And the system hangs with the console log [1].

The related kernel config:

  CONFIG_PREEMPT_NONE_BUILD=y
  CONFIG_PREEMPT_NONE=y
  CONFIG_PREEMPT_COUNT=y
  CONFIG_SCHED_CORE=y

It turns out that when hanging, the kernel is stuck in the dead loop in the function dm_wq_work():

  while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
          spin_lock_irq(&md->deferred_lock);
          bio = bio_list_pop(&md->deferred);
          spin_unlock_irq(&md->deferred_lock);

          if (!bio)
                  break;
          thread_cpu = smp_processor_id();
          submit_bio_noacct(bio);
  }

where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue() keeps generating new bios, so the condition "if (!bio)" can never be met.

After applying the following patch, the issue is gone:

  diff --git a/drivers/md/dm.c b/drivers/md/dm.c
  index e1ea3a7bd9d9..95c9cb07a42f 100644
  --- a/drivers/md/dm.c
  +++ b/drivers/md/dm.c
  @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
                          break;

                  submit_bio_noacct(bio);
  +               cond_resched();
          }
   }

But I think it is not a proper solution. And without this patch, if removing nr_cpus=1 (the system has two cpus), the issue can not be triggered. That says that with more than one cpu, the above loop can exit via the condition "if (!bio)".

Any ideas?

Thanks,
Pingfan

[1]: the console log when hanging

[ 2062.321473] device-mapper: thin: 253:4: reached low water mark for data device: sending event.
[ 2062.353217] dm-3: detected capacity change from 122880 to 147456
[-- MARK -- Wed Dec 14 08:45:00 2022]
[ 2062.376690] device-mapper: thin: 253:4: switching pool to out-of-data-space (queue IO) mode
[ 2122.393998] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2122.394011] (detected by 0, t=6002 jiffies, g=36205, q=624 ncpus=1)
[ 2122.394014] rcu: All QSes seen, last rcu_sched kthread activity 6002 (4295149593-4295143591), jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 2122.394017] rcu: rcu_sched kthread starved for 6002 jiffies! g36205 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 2122.394019] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 2122.394020] rcu: RCU grace-period kthread stack dump:
[ 2122.394022] task:rcu_sched state:R running task stack: 0 pid: 15 ppid: 2 flags:0x
[ 2122.394027] Call Trace:
[ 2122.394030] [<0001001fd5b0>] __schedule+0x300/0x6c0
[ 2122.394040] [<0001001fd9d2>] schedule+0x62/0xf0
[ 2122.394042] [<00010020360c>] schedule_timeout+0x8c/0x170
[ 2122.394045] [] rcu_gp_fqs_loop+0x30e/0x3d0
[ 2122.394053] [] rcu_gp_kthread+0x132/0x1b0
[ 2122.394054] [] kthread+0x108/0x110
[ 2122.394058] [] __ret_from_fork+0x3c/0x60
[ 2122.394061] [<00010020455a>] ret_from_fork+0xa/0x30
[ 2122.394064] rcu: Stack dump where RCU GP kthread last ran:
[ 2122.394064] Task dump for CPU 0:
[ 2122.394066] task:kworker/0:2 state:R running task stack: 0 pid:16943 ppid: 2 flags:0x0004
[ 2122.394070] Workqueue: kdmflush/253:6 dm_wq_work [dm_mod]
[ 2122.394100] Call Trace:
[ 2122.394100] [] sched_show_task.part.0+0xf8/0x130
[ 2122.394106] [<0001001efc9a>] rcu_check_gp_kthread_starvation+0x172/0x190
[ 2122.394111] [] print_other_cpu_stall+0x2de/0x370
[ 2122.394113] [] check_cpu_stall+0x1d0/0x270
[ 2122.394114] [] rcu_sched_clock_irq+0x82/0x230
[ 2122.394117] [] update_process_times+0xba/0xf0
[ 2122.394121] [] tick_sched_handle+0x4a/0x70
[ 2122.394124] [] tick_sched_timer+0x5e/0xc0
[ 2122.394126] [] __hrtimer_run_queues+0x136/0x290
[ 2122.394128] [] hrtimer_interrupt+0x150/0x2d0
[ 2122.394130] [] do_IRQ+0x56/0x70
[ 2122.394133] [] do_irq_async+0x56/0xb0
[ 2122.394135] [<0001001f6786>] do_ext_irq+0x96/0x160
[ 2122.394138] [<0001002047bc>] ext_int_handler+0xdc/0x110
[ 2122.394140] [<03ff8005cf48>]
Re: [dm-devel] [v2 PATCH 0/5] Introduce mempool pages bulk allocator and use it in dm-crypt
On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka wrote:
>
> On Tue, 14 Feb 2023, Yang Shi wrote:
>
> > Changelog:
> > RFC -> v2:
> >   * Added callback variant for page bulk allocator and mempool bulk
> >     allocator per Mel Gorman.
> >   * Used the callback version in dm-crypt driver.
> >   * Some code cleanup and refactoring to reduce duplicate code.
> >
> > rfc:
> > https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828...@gmail.com/
>
> Hi
>
> This seems like unneeded complication to me. We have alloc_pages(), it can
> allocate multiple pages efficiently, so why not use it?

alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't need contiguous pages at all. This may incur unnecessary compaction overhead in the dm-crypt layer when memory is fragmented. The bulk allocator is a good fit for this use case, since it allocates multiple order-0 pages. In addition, filesystem writeback doesn't guarantee power-of-2 pages every time IIUC, but alloc_pages() can only allocate power-of-2 pages.

> I suggest modifying crypt_alloc_buffer() to use alloc_pages(), and if
> alloc_pages() fails (either because the system is low on memory or because
> memory is too fragmented), fall back to the existing code that does
> mempool_alloc().

My PoC patches did just this, but called the bulk allocator. There may be other potential mempool users, as I listed in the cover letter, which may benefit from the bulk allocator. So introducing a new bulk mempool API seems better in the long run, even though we have just one user for now; it also makes it easier for other users to gain the benefit by just calling the new API.

> Mikulas
Re: [dm-devel] RAID4 with no striping mode request
> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin wrote:
> > I think he is wanting the parity across the data blocks on the
> > separate filesystems (some sort of parity across fs[1-8]/block0 to
> > parity/block0).

Correct.

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin wrote:
> > It is not clear to me what this setup would be enough better than the
> > current setups. Given that one could have 8 spinning disks + 1 SSD, or
> > 12 spinning disks, for the same price. And two 6-disk RAID6s would have
> > the same usable space, and be pretty safe (can lose any 2 of the 6 and
> > lose no data).

They're not the same price, though. Remember these disks are mixed sizes and various ages, exposing their entire data value (4d+8d+12d+12p gives you 24T of usable storage), all protected by the single parity disk. Yes, higher levels of RAID will always be better. However, that's not how these millions of appliances are developed by a number of manufacturers and sold at your local retailer. The proposal (and ask for help) that I've raised is to have an open-source solution to these proprietary MD implementations, as opposed to being trapped with buggy MD drivers on firmware that's glitchy and breaks other aspects of the kernel.

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin wrote:
> > And given the separate-filesystems requirement, that would require some
> > software above the filesystems to manage spreading the disks across
> > multiple filesystems. The risk of another disk going bad (while one was
> > failed) and losing a disk's worth of data would push me to use the
> > 6-disk raid6.

This is long solved by a number of FUSE filesystems, as well as overlayfs (which would be nice if it could gradually spool data down into layers, but that's another ball of wax). Hopefully that makes sense. The only thing coming closer to this is bcachefs, but that's still looking like a multi-year road (while the above has been deployed in homes since the early 2000s).
On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin wrote:
>
> I think he is wanting the parity across the data blocks on the
> separate filesystems (some sort of parity across fs[1-8]/block0 to
> parity/block0).
>
> It is not clear to me what this setup would be enough better than the
> current setups. Given that one could have 8 spinning disks + 1 SSD, or
> 12 spinning disks, for the same price. And two 6-disk RAID6s would have
> the same usable space, and be pretty safe (can lose any 2 of the 6 and
> lose no data). And given the separate-filesystems requirement, that would
> require some software above the filesystems to manage spreading the
> disks across multiple filesystems. The risk of another disk going
> bad (while one was failed) and losing a disk's worth of data would
> push me to use the 6-disk raid6.
>
> WOL: current SSDs are rated for around 1000-2000 writes, so a 1 TB
> disk can sustain 1000-2000 TB of total writes. And writes to
> filesystem blocks get re-written more often than data blocks.
> How well it would work depends on how often the data is deleted
> and re-written. If the disks are some sort of long-term storage, then
> the SSD is not going to get used up. And I am not sure the rated
> wear really means anything unless you are using a STUPID enterprise
> controller that proactively disables/kills the SSD when it says the
> rated writes have happened. I have a 500GB SSD in a mirror that
> "FAILED" according to SMART 2 years ago and so far is still fully
> functional, and it is "GOOD" again because the counters used to
> determine total writes seem to have rolled over.
>
> On Tue, Feb 14, 2023 at 8:23 PM Heinz Mauelshagen wrote:
> >
> > Roger,
> >
> > as any of the currently implemented 'parity' algorithms (block
> > xor/P-/Q-Syndrome) provided by DM/MD RAID
> > have to have at least two data blocks to calculate: are you, apart from
> > the filesystem thoughts you bring up, thinking
> > about running those on e.g.
pairs of disks of the mentioned even-numbered set
> > of 8?
> >
> > Heinz
> >
> > On Tue, Feb 14, 2023 at 11:28 PM Roger Heflin wrote:
> >>
> >> On Tue, Feb 14, 2023 at 3:27 PM Heinz Mauelshagen
> >> wrote:
> >> >
> >> > ...which is RAID1 plus a parity disk, which seems superfluous as you
> >> > already achieve (N-1) resilience against single device failures
> >> > without the latter.
> >> >
> >> > What would you need such a parity disk for?
> >> >
> >> > Heinz
> >>
> >> I thought that at first too, but threw that idea out as it did not
> >> make much sense.
> >>
> >> What he appears to want is 8 linear non-striped data disks + a parity
> >> disk, such that you can lose any one data disk and parity can rebuild
> >> that disk. And if you lose several data disks, then you still have
> >> intact non-striped data on the remaining disks.
> >>
> >> It would almost seem that you would need to put a separate filesystem
> >> on each data disk/section (or have a filesystem that is redundant
> >> enough to survive), otherwise losing an