[dm-devel] [PATCH 1/1] dm: remove unnecessary (void*) conversions

2023-02-16 Thread XU pengfei
Pointer variables of void * type do not require a type cast when assigned.

Signed-off-by: XU pengfei 
---
 drivers/md/dm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b424a6ee27ba..669f2821376a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2139,7 +2139,7 @@ static void event_callback(void *context)
 {
unsigned long flags;
LIST_HEAD(uevents);
-   struct mapped_device *md = (struct mapped_device *) context;
+   struct mapped_device *md = context;
 
	spin_lock_irqsave(&md->uevent_lock, flags);
	list_splice_init(&md->uevent_list, &uevents);
-- 
2.18.2
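
For reference, a minimal standalone illustration of the C rule this cleanup
relies on: a void * value converts implicitly to any object pointer type, so
the explicit cast adds nothing (a hypothetical example, not dm code):

	#include <stdio.h>

	struct mapped_device_example { int id; };	/* stand-in type */

	static void event_callback_example(void *context)
	{
		/* Implicit conversion from void *: no cast is needed in C
		 * (C++ would require one, but the kernel is C). */
		struct mapped_device_example *md = context;

		printf("event for device %d\n", md->id);
	}

	int main(void)
	{
		struct mapped_device_example md = { .id = 42 };

		event_callback_example(&md);
		return 0;
	}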




Re: [dm-devel] On allowing remounting the partition containing dm-mapped boot disk as read-write

2023-02-16 Thread Ganapathi Kamath
Firstly, thank you for your reply. However, I still think you are mistaken:
"There is no double re-mounting of the filesystem".

I will try again, but in case my writing isn't convincing, let's agree to
disagree until I have more conclusive proof, either confirming what you say or
what I say.

I have captured some logs from the initramfs; they are included near the end
of this post. I also tried my antiventoy idea, and I believe it worked! It is
included at the end of this post.
The logs help to see what's going on.

> Besides caching at the dm layer, there is caching at the filesystem
> layer (files, directories and other stuff).  The caching at the DM
> layer is not the issue.
If there is an issue at all, it's at the DM layer. But if you say there is no
issue at the DM layer, then maybe there is no issue at all.
The two filesystems in question, one that sits inside a partition of the
dm-mapped vdisk-file and one that sits in the outside partition,
do not touch/interfere with one another's operation and metadata structures
as they change and modify their own filesystem structures. Any 'block write'
from the dm-mapped disk does not go through the outside filesystem driver; it
goes directly to the blocks on disk.
>
> If rootfs mounts the dm-device someplace as a
> (ntfs)filesystem(readonly), and then later you also mount that same
> (ntfs)filesystem (read-write) someplace then you have mounted the
> filesystem twice (not a shared bind mount that appears to be 2 mounts
> but in reality is only a single mount), and the filesystem code is not
> safe when there are 2 unique separate mount instances, the fs caching
> is not shared.
This is not what is happening.
The same filesystem is not being re-mounted.
In fact, by using rd.break=pre-mount in the initramfs, before pivot, I have
confirmed that no mounts have been done. Other than the initramfs filesystem,
no disk filesystem exists.
But, using a block table, a device map has been created directly onto the
blocks on disk, so one can issue 'dmsetup table /dev/mapper/ventoy' and see
the block map. The blocks correspond to the exact block locations of the file
on disk.
After pivot, the filesystem in one of the partitions inside /dev/mapper/ventoy
(specifically /dev/mapper/ventoy4) becomes the rootfs of the OS being booted
up. The outside /dev/sdc1 is still unmounted.
But after booting up and logging in, it cannot be mounted because, I think,
the fs driver cannot get exclusive access to all the blocks that constitute
/dev/sdc1, since the device mapper is holding onto some of them.
That's what, I think, the ventoy dm_patch overrides.
(You say it's because of a clash with a prior filesystem mount, which is
incorrect.)
> 

> ie if the filesystem is already mounted as a filesystem someplace
It is not already mounted as a filesystem someplace.
> (mapping a dm device is not mounting) and you mount it a 2nd time then
> you are at risk of corruption.
As the filesystems are separate, there won't be corruption for that reason. It
is not a remount.

> 
> For the mount command to get a busy indicates that something has
> already mounted it someplace and has locked the device(to prevent a
> 2nd corrupting mount from happening), and the entire purpose of the
> module appears to be to disable the safety provided by this locking
> and allow one to mount the fs again.
I've grabbed /proc/mounts to be sure of what I am saying.
I think this mount-busy is not due to a filesystem-vs-filesystem clash but to
a dm-map-vs-filesystem clash. My guess is that the dm_patch is a crude way to
make the device mapper treat the blocks as still available.

> And note that with double mounts (usually I have seen/debugged the
> crash/outage from them with shared SAN storage)  they can "APPEAR" to
> kind of work if you do not use them too much, but eventually the 2
> instances of filesystem code step on each other and it ends badly
> (kernel crash, data corruption, loss of the filesystem).
I know how terrible a double mount can get. Not exactly a double mount, but I
once experienced something similar. I had a VirtualBox VM in Windows accessing
a btrfs partition on a disk partition (pass-through). I put the VM to
sleep/suspend (not shutdown), forgot about it, shut down Windows, and rebooted
into Linux. There were probably uncommitted writes in its journal. In Linux, a
simple mount failed and Linux warned me, but I unwittingly force-mounted the
btrfs partition, which remounted the partition from some earlier state, and
each successive modification put the filesystem in a more inconsistent state,
until it was a complete mess. I don't remember the specific btrfs terminology,
but I had to sift through a binary dump of the btrfs fs to find superblocks of
the older btrfs root tree of that previous state and then use special btrfs
commands to extract the files out.

This ventoy situation is more like the filesystem on the outside partition
having full control of itself and all its files and folders, with no one
touching or interfering.

Re: [dm-devel] [v2 PATCH 0/5] Introduce mempool pages bulk allocator and use it in dm-crypt

2023-02-16 Thread Yang Shi
On Thu, Feb 16, 2023 at 9:45 AM Mikulas Patocka  wrote:
>
>
>
> On Wed, 15 Feb 2023, Yang Shi wrote:
>
> > On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka  wrote:
> > >
> > >
> > >
> > > On Tue, 14 Feb 2023, Yang Shi wrote:
> > >
> > > >
> > > > Changelog:
> > > > RFC -> v2:
> > > >   * Added callback variant for page bulk allocator and mempool bulk 
> > > > allocator
> > > > per Mel Gorman.
> > > >   * Used the callback version in dm-crypt driver.
> > > >   * Some code cleanup and refactor to reduce duplicate code.
> > > >
> > > > rfc: 
> > > > https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828...@gmail.com/
> > >
> > > Hi
> > >
> > > This seems like unneeded complication to me. We have alloc_pages(), it can
> > > allocate multiple pages efficiently, so why not use it?
> >
> > The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
> > need contiguous pages at all. This may incur unnecessary compaction
>
> It doesn't hurt that the pages are contiguous - and allocating and freeing
> a few compound pages is even faster than allocating and freeing many
> 0-order pages.

If "allocating many order-0 pages" means calling alloc_page() multiple
times, just like what the dm-crypt code does before this patchset,
yeah, allocating a compound page may be faster.

But it may not be true with the bulk allocator. It also depends on how bad
the fragmentation is and how contended the zone lock is. When allocating
order-0 pages, the bulk allocator can take the pages from the pcp list within
one call. And the pcp list can actually hold a lot of pages; on my test
machine each pcp list could have more than 1000 pages.
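
For reference, a sketch of how the order-0 page bulk interface is used
(assuming the alloc_pages_bulk_array() API available since roughly 5.13; an
illustration only, not the mempool variant proposed in this series):

	#include <linux/gfp.h>

	/* Sketch: try to grab nr order-0 pages in one call. The bulk
	 * allocator fills NULL slots of pages[] from the per-CPU (pcp)
	 * free lists under a single lock acquisition and returns how
	 * many slots are populated afterwards. Assumes pages[] is
	 * zeroed on entry; error unwinding is omitted for brevity. */
	static int example_bulk_alloc(struct page **pages, unsigned long nr)
	{
		unsigned long filled = alloc_pages_bulk_array(GFP_NOIO, nr, pages);

		/* Fall back to single-page allocation for any shortfall. */
		while (filled < nr) {
			pages[filled] = alloc_page(GFP_NOIO);
			if (!pages[filled])
				return -ENOMEM;
			filled++;
		}
		return 0;
	}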

>
> > overhead to the dm-crypt layer when memory is fragmented.
>
> The compaction overhead may be suppressed by the GFP flags (i.e. don't use
> __GFP_DIRECT_RECLAIM).

You could, but you may pressure the mempool much more often when light memory
pressure and fragmentation exist. The bulk allocator may just succeed with
light reclamation, without allocating from the mempool.

>
> > The bulk allocator is a good fit to this usecase, which allocates
> > multiple order-0 pages.
> >
> > In addition, filesystem writeback doesn't guarantee power-of-2 pages
> > every time IIUC. But alloc_pages() just can allocate power-of-2 pages.
>
> So, we can allocate more compound pages for the non-power-of-2 case - see
> the next patch that I'm sending.

Thanks for the patch. If the callers are willing to handle the
complexity (calculating the proper orders, dealing with the compound
pages, etc), it is definitely an option for them.

>
> > >
> > > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> > > alloc_pages() fails (either because the system is low on memory or because
> > > memory is too fragmented), fall back to the existing code that does
> > > mempool_alloc().
> >
> > My PoC patches just did this way, but called bulk allocator. There may
> > be other potential mempool users as I listed in this cover letter,
> > which may get benefits from bulk allocator. So introducing a new bulk
> > mempool API seems better for long run although we just have one user
> > for now. And it makes other uses easier to gain the benefit by just
> > calling the new API.
>
> This mempool bulk refactoring just makes the code bigger. And it is not
> needed - dm-crypt can fall back to non-bulk mempool allocations.

Do you mean the mempool code? Some growth may be inevitable when adding a new
API, but it is not significantly bigger. And the API hides all the details and
complexity from the callers, as well as handling all the allocation corner
cases in the mm layer. It would make the users' lives much easier. Of course,
if the callers are happy to handle all the complexity by themselves, they
don't have to call the API.

>
> In the next email, I'm sending a patch that is noticeably smaller and that
> uses alloc_pages()/__free_pages().

Thanks for the patch. But if other potential users would like to do the same
optimization, the code has to be duplicated everywhere. Maybe not everyone is
happy to handle this by themselves.

>
> Mikulas
>




Re: [dm-devel] A hang bug of dm on s390x

2023-02-16 Thread Pingfan Liu
Hi Ming,

Thank you for looking into this.

let me loop in Alasdair, Mike and Zdenek for further comment on LVM stuff


Thanks,

Pingfan

On Thu, Feb 16, 2023 at 8:08 AM Ming Lei  wrote:
>
> On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> > Hi guys,
> >
> > I encountered  a hang issue on a s390x system.  The tested kernel is
> > not preemptible and booting with "nr_cpus=1"
> >
> > The test steps:
> >   umount /home
> >   lvremove /dev/rhel_s390x-kvm-011/home
> >   ## uncomment "snapshot_autoextend_threshold = 70" and
> >   "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> >
> >   systemctl enable lvm2-monitor.service
> >   systemctl start lvm2-monitor.service
> >
> >   lvremove -y rhel_s390x-kvm-011/thinp
> >   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
> >   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
> >   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
> >   mount /dev/rhel_s390x-kvm-011/src /mnt
> >   for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img
> > bs=100M count=1; done
> >
> > And the system hangs with the console log [1]
> >
> > The related kernel config
> >
> > CONFIG_PREEMPT_NONE_BUILD=y
> > CONFIG_PREEMPT_NONE=y
> > CONFIG_PREEMPT_COUNT=y
> > CONFIG_SCHED_CORE=y
> >
> > It turns out that when hanging, the kernel is stuck in the dead-loop
> > in the function dm_wq_work()
> > while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > spin_lock_irq(&md->deferred_lock);
> > bio = bio_list_pop(&md->deferred);
> > spin_unlock_irq(&md->deferred_lock);
> >
> > if (!bio)
> > break;
> > thread_cpu = smp_processor_id();
> > submit_bio_noacct(bio);
> > }
> > where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> > keeps generating new bios, and the condition "if (!bio)" cannot be
> > met.
> >
> >
> > After applying the following patch, the issue is gone.
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index e1ea3a7bd9d9..95c9cb07a42f 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
> > break;
> >
> > submit_bio_noacct(bio);
> > +   cond_resched();
> > }
> >  }
> >
> > But I think it is not a proper solution. And without this patch, if
> > nr_cpus=1 is removed (the system has two CPUs), the issue cannot be
> > triggered. That is to say, with more than one CPU, the above loop can exit
> > via the condition "if (!bio)".
> >
> > Any ideas?
>
> I think the patch is correct.
>
> For a kernel built without CONFIG_PREEMPT, in the single-CPU-core case,
> if the dm target (such as dm-thin) needs another wq or kthread for
> handling IO, then the dm target side is blocked because dm_wq_work()
> holds the single CPU; sooner or later, the dm target may have no
> resources to handle new IO from dm core and returns REQUEUE.
>
> Then dm_wq_work() becomes a dead loop.
>
>
> Thanks,
> Ming
>
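
As a generic illustration of the failure mode Ming describes (a sketch with
hypothetical helpers, not the actual dm code): a long-running work handler on
a non-preemptible uniprocessor must yield, or it starves the very kthreads
that would let it finish:

	#include <linux/sched.h>
	#include <linux/types.h>
	#include <linux/workqueue.h>

	/* Hypothetical helpers standing in for "pop the next deferred
	 * bio" and "submit it"; not real kernel APIs. */
	extern bool have_more_items(void);
	extern void process_one_item(void);

	static void example_work_fn(struct work_struct *work)
	{
		while (have_more_items()) {
			process_one_item();
			/* Without this, on a non-preemptible single-CPU
			 * kernel no other kthread (e.g. the dm-thin worker)
			 * can run, so the REQUEUEs never drain. */
			cond_resched();
		}
	}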




Re: [dm-devel] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

2023-02-16 Thread Mike Snitzer
On Thu, Feb 16 2023 at  2:27P -0500,
Uday Shankar  wrote:

> On Wed, Feb 15, 2023 at 10:09:36PM -0800, Christoph Hellwig wrote:
> > I'd just remove the debug check entirely
> 
> Older kernels have these checks in a separate function called
> blk_cloned_rq_check_limits, which carries the following comment:
> 
> /**
>  * blk_cloned_rq_check_limits - Helper function to check a cloned request
>  *  for the new queue limits
>  * @q:  the queue
>  * @rq: the request being checked
>  *
>  * Description:
>  *@rq may have been made based on weaker limitations of upper-level queues
>  *in request stacking drivers, and it may violate the limitation of @q.
>  *Since the block layer and the underlying device driver trust @rq
>  *after it is inserted to @q, it should be checked against @q before
>  *the insertion using this generic function.
>  *
>  *Request stacking drivers like request-based dm may change the queue
>  *limits when retrying requests on other queues. Those requests need
>  *to be checked against the new queue limits again during dispatch.
>  */.
> 
> Is this concern no longer relevant?

Still relevant.




[dm-devel] [PATCH v2] dm-crypt: allocate compound pages if possible

2023-02-16 Thread Mikulas Patocka
It was reported that allocating pages for the write buffer in dm-crypt
causes measurable overhead [1].

This patch changes dm-crypt to allocate compound pages if they are
available. If not, we fall back to the mempool.

[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-crypt.c |   48 +---
 1 file changed, 33 insertions(+), 15 deletions(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2023-02-16 20:40:15.000000000 +0100
+++ linux-2.6/drivers/md/dm-crypt.c	2023-02-16 21:56:34.000000000 +0100
@@ -1657,6 +1657,9 @@ static void crypt_free_buffer_pages(stru
  * In order to not degrade performance with excessive locking, we try
  * non-blocking allocations without a mutex first but on failure we fallback
  * to blocking allocations with a mutex.
+ *
+ * In order to reduce allocation overhead, we try to allocate compound pages in
+ * the first pass. If they are not available, we fall back to the mempool.
  */
 static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 {
@@ -1664,8 +1667,8 @@ static struct bio *crypt_alloc_buffer(st
struct bio *clone;
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
-   unsigned i, len, remaining_size;
-   struct page *page;
+   unsigned remaining_size;
+   unsigned order = MAX_ORDER - 1;
 
 retry:
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1678,20 +1681,34 @@ retry:
 
remaining_size = size;
 
-   for (i = 0; i < nr_iovecs; i++) {
-   page = mempool_alloc(&cc->page_pool, gfp_mask);
-   if (!page) {
+   while (remaining_size) {
+   struct page *pages;
+   unsigned size_to_add;
+   unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
+   order = min(order, remaining_order);
+
+   while (order > 0) {
+   pages = alloc_pages(gfp_mask
+   | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_COMP,
+   order);
+   if (likely(pages != NULL))
+   goto have_pages;
+   order--;
+   }
+
+   pages = mempool_alloc(&cc->page_pool, gfp_mask);
+   if (!pages) {
crypt_free_buffer_pages(cc, clone);
bio_put(clone);
gfp_mask |= __GFP_DIRECT_RECLAIM;
+   order = 0;
goto retry;
}
 
-   len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
-
-   bio_add_page(clone, page, len, 0);
-
-   remaining_size -= len;
+have_pages:
+   size_to_add = min((unsigned)PAGE_SIZE << order, remaining_size);
+   bio_add_page(clone, pages, size_to_add, 0);
+   remaining_size -= size_to_add;
}
 
/* Allocate space for integrity tags */
@@ -1709,12 +1726,13 @@ retry:
 
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
-   struct bio_vec *bv;
-   struct bvec_iter_all iter_all;
+   struct folio_iter fi;
 
-   bio_for_each_segment_all(bv, clone, iter_all) {
-   BUG_ON(!bv->bv_page);
-   mempool_free(bv->bv_page, &cc->page_pool);
+   bio_for_each_folio_all(fi, clone) {
+   if (folio_test_large(fi.folio))
+   folio_put(fi.folio);
+   else
+   mempool_free(&fi.folio->page, &cc->page_pool);
}
 }
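
As an aside, the order computation in crypt_alloc_buffer() above picks the
largest power-of-two block that does not overshoot what is still needed. A
standalone sketch of the same arithmetic (an illustration only; assumes
<linux/bitops.h> and <linux/mm.h>):

	/* Highest allocation order that fits in the remaining buffer size.
	 * E.g. with 4 KiB pages and remaining_size = 20 KiB: pages = 5,
	 * __fls(5) = 2, so an order-2 (16 KiB) block is tried first and a
	 * single order-0 page then covers the remainder. */
	static unsigned int order_for_remaining(unsigned int remaining_size)
	{
		unsigned int pages = (remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT;

		return __fls(pages);
	}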
 




Re: [dm-devel] [PATCH] dm-crypt: allocate compound pages if possible

2023-02-16 Thread Mikulas Patocka



On Thu, 16 Feb 2023, Matthew Wilcox wrote:

> > -   len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
> > -
> > -   bio_add_page(clone, page, len, 0);
> > +have_pages:
> > +   page->compound_order = order;
> 
> No.  You'll corrupt the next page if page is order-0, which it is if it
> came from the mempool.  Also we've deleted page->compound_order in -next
> so you can't make this mistake.  Using __GFP_COMP will set this field
> for you, so you can just drop this line.

OK

> > -   remaining_size -= len;
> > +   for (o = 0; o < 1U << order; o++) {
> > +   unsigned len = min((unsigned)PAGE_SIZE, remaining_size);
> > +   bio_add_page(clone, page, len, 0);
> > +   remaining_size -= len;
> > +   page++;
> 
> You can add multiple pages at once, whether they're compound or not.  So
> replace this entire loop with:
> 
>   bio_add_page(clone, page, remaining_size, 0);

This should be min((unsigned)PAGE_SIZE << order, remaining_size), because 
we may allocate less than remaining_size.

> > @@ -1711,10 +1732,23 @@ static void crypt_free_buffer_pages(stru
> >  {
> > struct bio_vec *bv;
> > struct bvec_iter_all iter_all;
> > +   unsigned skip_entries = 0;
> >  
> > bio_for_each_segment_all(bv, clone, iter_all) {
> > -   BUG_ON(!bv->bv_page);
> > -   mempool_free(bv->bv_page, &cc->page_pool);
> > +   unsigned order;
> > +   struct page *page = bv->bv_page;
> > +   BUG_ON(!page);
> > +   if (skip_entries) {
> > +   skip_entries--;
> > +   continue;
> > +   }
> > +   order = page->compound_order;
> > +   if (order) {
> > +   __free_pages(page, order);
> > +   skip_entries = (1U << order) - 1;
> > +   } else {
> > +   mempool_free(page, &cc->page_pool);
> > +   }
> 
> You can simplify this by using the folio code.
> 
>   struct folio_iter fi;
> 
>   bio_for_each_folio_all(fi, bio) {
>   if (folio_test_large(folio))
>   folio_put(folio);
>   else
>   mempool_free(&folio->page, &cc->page_pool);
>   }

OK. I'm sending version 2 of the patch.

> (further work would actually convert this driver to use folios instead
> of pages)

Mikulas



Re: [dm-devel] [PATCH] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

2023-02-16 Thread Uday Shankar
On Wed, Feb 15, 2023 at 10:09:36PM -0800, Christoph Hellwig wrote:
> I'd just remove the debug check entirely

Older kernels have these checks in a separate function called
blk_cloned_rq_check_limits, which carries the following comment:

/**
 * blk_cloned_rq_check_limits - Helper function to check a cloned request
 *  for the new queue limits
 * @q:  the queue
 * @rq: the request being checked
 *
 * Description:
 *@rq may have been made based on weaker limitations of upper-level queues
 *in request stacking drivers, and it may violate the limitation of @q.
 *Since the block layer and the underlying device driver trust @rq
 *after it is inserted to @q, it should be checked against @q before
 *the insertion using this generic function.
 *
 *Request stacking drivers like request-based dm may change the queue
 *limits when retrying requests on other queues. Those requests need
 *to be checked against the new queue limits again during dispatch.
 */.

Is this concern no longer relevant?




Re: [dm-devel] [PATCH] dm-crypt: allocate compound pages if possible

2023-02-16 Thread Matthew Wilcox
On Thu, Feb 16, 2023 at 12:47:08PM -0500, Mikulas Patocka wrote:
> + while (order > 0) {
> + page = alloc_pages(gfp_mask
> + | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN, order);

... | __GFP_COMP

>   page = mempool_alloc(&cc->page_pool, gfp_mask);
>   if (!page) {
>   crypt_free_buffer_pages(cc, clone);
>   bio_put(clone);
>   gfp_mask |= __GFP_DIRECT_RECLAIM;
> + order = 0;
>   goto retry;
>   }
>  
> - len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
> -
> - bio_add_page(clone, page, len, 0);
> +have_pages:
> + page->compound_order = order;

No.  You'll corrupt the next page if page is order-0, which it is if it
came from the mempool.  Also we've deleted page->compound_order in -next
so you can't make this mistake.  Using __GFP_COMP will set this field
for you, so you can just drop this line.
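
For illustration, a sketch of the difference in practice (compound_order()
reads back the order that the allocator recorded, so no manual bookkeeping
is needed; hypothetical example code):

	#include <linux/gfp.h>
	#include <linux/mm.h>

	static void example_gfp_comp(void)
	{
		/* __GFP_COMP makes the allocator record the order in the
		 * compound head instead of us writing page fields by hand. */
		struct page *p = alloc_pages(GFP_KERNEL | __GFP_COMP, 2);

		if (p) {
			unsigned int order = compound_order(p);	/* == 2 */

			__free_pages(p, order);
		}
	}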

> - remaining_size -= len;
> + for (o = 0; o < 1U << order; o++) {
> + unsigned len = min((unsigned)PAGE_SIZE, remaining_size);
> + bio_add_page(clone, page, len, 0);
> + remaining_size -= len;
> + page++;

You can add multiple pages at once, whether they're compound or not.  So
replace this entire loop with:

bio_add_page(clone, page, remaining_size, 0);

> @@ -1711,10 +1732,23 @@ static void crypt_free_buffer_pages(stru
>  {
>   struct bio_vec *bv;
>   struct bvec_iter_all iter_all;
> + unsigned skip_entries = 0;
>  
>   bio_for_each_segment_all(bv, clone, iter_all) {
> - BUG_ON(!bv->bv_page);
> - mempool_free(bv->bv_page, &cc->page_pool);
> + unsigned order;
> + struct page *page = bv->bv_page;
> + BUG_ON(!page);
> + if (skip_entries) {
> + skip_entries--;
> + continue;
> + }
> + order = page->compound_order;
> + if (order) {
> + __free_pages(page, order);
> + skip_entries = (1U << order) - 1;
> + } else {
> + mempool_free(page, &cc->page_pool);
> + }

You can simplify this by using the folio code.

struct folio_iter fi;

bio_for_each_folio_all(fi, bio) {
if (folio_test_large(folio))
folio_put(folio);
else
mempool_free(&folio->page, &cc->page_pool);
}

(further work would actually convert this driver to use folios instead
of pages)




[dm-devel] [PATCH] dm-crypt: allocate compound pages if possible

2023-02-16 Thread Mikulas Patocka
It was reported that allocating pages for the write buffer in dm-crypt
causes measurable overhead [1].

This patch changes dm-crypt to allocate compound pages if they are
available. If not, we fall back to the mempool.

[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-crypt.c |   50 ++
 1 file changed, 42 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2023-01-20 13:22:38.000000000 +0100
+++ linux-2.6/drivers/md/dm-crypt.c	2023-02-16 18:33:42.000000000 +0100
@@ -1657,6 +1657,9 @@ static void crypt_free_buffer_pages(stru
  * In order to not degrade performance with excessive locking, we try
  * non-blocking allocations without a mutex first but on failure we fallback
  * to blocking allocations with a mutex.
+ *
+ * In order to reduce allocation overhead, we try to allocate compound pages in
+ * the first pass. If they are not available, we fall back to the mempool.
  */
 static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 {
@@ -1664,8 +1667,9 @@ static struct bio *crypt_alloc_buffer(st
struct bio *clone;
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
-   unsigned i, len, remaining_size;
+   unsigned remaining_size;
struct page *page;
+   unsigned order = MAX_ORDER - 1;
 
 retry:
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -1678,20 +1682,37 @@ retry:
 
remaining_size = size;
 
-   for (i = 0; i < nr_iovecs; i++) {
+   while (remaining_size) {
+   unsigned o;
+   unsigned remaining_order = __fls((remaining_size + PAGE_SIZE - 1) >> PAGE_SHIFT);
+   order = min(order, remaining_order);
+
+   while (order > 0) {
+   page = alloc_pages(gfp_mask
+   | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN, order);
+   if (likely(page != NULL))
+   goto have_pages;
+   order--;
+   }
+
page = mempool_alloc(&cc->page_pool, gfp_mask);
if (!page) {
crypt_free_buffer_pages(cc, clone);
bio_put(clone);
gfp_mask |= __GFP_DIRECT_RECLAIM;
+   order = 0;
goto retry;
}
 
-   len = (remaining_size > PAGE_SIZE) ? PAGE_SIZE : remaining_size;
-
-   bio_add_page(clone, page, len, 0);
+have_pages:
+   page->compound_order = order;
 
-   remaining_size -= len;
+   for (o = 0; o < 1U << order; o++) {
+   unsigned len = min((unsigned)PAGE_SIZE, remaining_size);
+   bio_add_page(clone, page, len, 0);
+   remaining_size -= len;
+   page++;
+   }
}
 
/* Allocate space for integrity tags */
@@ -1711,10 +1732,23 @@ static void crypt_free_buffer_pages(stru
 {
struct bio_vec *bv;
struct bvec_iter_all iter_all;
+   unsigned skip_entries = 0;
 
bio_for_each_segment_all(bv, clone, iter_all) {
-   BUG_ON(!bv->bv_page);
-   mempool_free(bv->bv_page, &cc->page_pool);
+   unsigned order;
+   struct page *page = bv->bv_page;
+   BUG_ON(!page);
+   if (skip_entries) {
+   skip_entries--;
+   continue;
+   }
+   order = page->compound_order;
+   if (order) {
+   __free_pages(page, order);
+   skip_entries = (1U << order) - 1;
+   } else {
+   mempool_free(page, &cc->page_pool);
+   }
}
 }
 




Re: [dm-devel] [v2 PATCH 0/5] Introduce mempool pages bulk allocator and use it in dm-crypt

2023-02-16 Thread Mikulas Patocka



On Wed, 15 Feb 2023, Yang Shi wrote:

> On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka  wrote:
> >
> >
> >
> > On Tue, 14 Feb 2023, Yang Shi wrote:
> >
> > >
> > > Changelog:
> > > RFC -> v2:
> > >   * Added callback variant for page bulk allocator and mempool bulk 
> > > allocator
> > > per Mel Gorman.
> > >   * Used the callback version in dm-crypt driver.
> > >   * Some code cleanup and refactor to reduce duplicate code.
> > >
> > > rfc: 
> > > https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828...@gmail.com/
> >
> > Hi
> >
> > This seems like unneeded complication to me. We have alloc_pages(), it can
> > allocate multiple pages efficiently, so why not use it?
> 
> The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
> need contiguous pages at all. This may incur unnecessary compaction

It doesn't hurt that the pages are contiguous - and allocating and freeing 
a few compound pages is even faster than allocating and freeing many 
0-order pages.

> overhead to the dm-crypt layer when memory is fragmented.

The compaction overhead may be suppressed by the GFP flags (i.e. don't use 
__GFP_DIRECT_RECLAIM).

> The bulk allocator is a good fit to this usecase, which allocates 
> multiple order-0 pages.
> 
> In addition, filesystem writeback doesn't guarantee power-of-2 pages
> every time IIUC. But alloc_pages() just can allocate power-of-2 pages.

So, we can allocate more compound pages for the non-power-of-2 case - see 
the next patch that I'm sending.

> >
> > I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> > alloc_pages() fails (either because the system is low on memory or because
> > memory is too fragmented), fall back to the existing code that does
> > mempool_alloc().
> 
> My PoC patches just did this way, but called bulk allocator. There may
> be other potential mempool users as I listed in this cover letter,
> which may get benefits from bulk allocator. So introducing a new bulk
> mempool API seems better for long run although we just have one user
> for now. And it makes other uses easier to gain the benefit by just
> calling the new API.

This mempool bulk refactoring just makes the code bigger. And it is not 
needed - dm-crypt can fall back to non-bulk mempool allocations.

In the next email, I'm sending a patch that is noticeably smaller and that 
uses alloc_pages()/__free_pages().

Mikulas



Re: [dm-devel] A hang bug of dm on s390x

2023-02-16 Thread Mike Snitzer


[Top-posting but please don't...]

I've staged this fix for 6.3 inclusion and marked it for stable@:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.3=0ca44fcef241768fd25ee763b3d203b9852f269b

Ming, I also staged this similar fix (I haven't reasoned through a scenario
where dm_wq_requeue_work would actually loop endlessly, but it's good
practice to include cond_resched() in such a workqueue while loop):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-6.3=f77692d65d54665d81815349cc727baa85e8b71d

Thanks,
Mike

On Thu, Feb 16 2023 at  3:30P -0500,
Pingfan Liu  wrote:

> Hi Ming,
> 
> Thank you for looking into this.
> 
> let me loop in Alasdair, Mike and Zdenek for further comment on LVM stuff
> 
> 
> Thanks,
> 
> Pingfan
> 
> On Thu, Feb 16, 2023 at 8:08 AM Ming Lei  wrote:
> >
> > On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:
> > > Hi guys,
> > >
> > > I encountered  a hang issue on a s390x system.  The tested kernel is
> > > not preemptible and booting with "nr_cpus=1"
> > >
> > > The test steps:
> > >   umount /home
> > >   lvremove /dev/rhel_s390x-kvm-011/home
> > >   ## uncomment "snapshot_autoextend_threshold = 70" and
> > >   "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf
> > >
> > >   systemctl enable lvm2-monitor.service
> > >   systemctl start lvm2-monitor.service
> > >
> > >   lvremove -y rhel_s390x-kvm-011/thinp
> > >   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
> > >   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
> > >   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
> > >   mount /dev/rhel_s390x-kvm-011/src /mnt
> > >   for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img
> > > bs=100M count=1; done
> > >
> > > And the system hangs with the console log [1]
> > >
> > > The related kernel config
> > >
> > > CONFIG_PREEMPT_NONE_BUILD=y
> > > CONFIG_PREEMPT_NONE=y
> > > CONFIG_PREEMPT_COUNT=y
> > > CONFIG_SCHED_CORE=y
> > >
> > > It turns out that when hanging, the kernel is stuck in the dead-loop
> > > in the function dm_wq_work()
> > > while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > > spin_lock_irq(&md->deferred_lock);
> > > bio = bio_list_pop(&md->deferred);
> > > spin_unlock_irq(&md->deferred_lock);
> > >
> > > if (!bio)
> > > break;
> > > thread_cpu = smp_processor_id();
> > > submit_bio_noacct(bio);
> > > }
> > > where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
> > > keeps generating new bios, and the condition "if (!bio)" cannot be
> > > met.
> > >
> > >
> > > After applying the following patch, the issue is gone.
> > >
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index e1ea3a7bd9d9..95c9cb07a42f 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
> > > break;
> > >
> > > submit_bio_noacct(bio);
> > > +   cond_resched();
> > > }
> > >  }
> > >
> > > But I think it is not a proper solution. And without this patch, if
> > > nr_cpus=1 is removed (the system has two CPUs), the issue cannot be
> > > triggered. That is to say, with more than one CPU, the above loop can
> > > exit via the condition "if (!bio)".
> > >
> > > Any ideas?
> >
> > I think the patch is correct.
> >
> > For a kernel built without CONFIG_PREEMPT, in the single-CPU-core case,
> > if the dm target (such as dm-thin) needs another wq or kthread for
> > handling IO, then the dm target side is blocked because dm_wq_work()
> > holds the single CPU; sooner or later, the dm target may have no
> > resources to handle new IO from dm core and returns REQUEUE.
> >
> > Then dm_wq_work() becomes a dead loop.
> >
> >
> > Thanks,
> > Ming
> >
> 




Re: [dm-devel] [PATCH] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

2023-02-16 Thread kernel test robot
Hi Uday,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 6bea9ac7c6481c09eb2b61d7cd844fc64a526e3e]

url:
https://github.com/intel-lab-lkp/linux/commits/Uday-Shankar/blk-mq-enforce-op-specific-segment-limits-in-blk_insert_cloned_request/20230216-041718
base:   6bea9ac7c6481c09eb2b61d7cd844fc64a526e3e
patch link:
https://lore.kernel.org/r/20230215201507.494152-1-ushankar%40purestorage.com
patch subject: [PATCH] blk-mq: enforce op-specific segment limits in 
blk_insert_cloned_request
config: i386-randconfig-a004-20230213 
(https://download.01.org/0day-ci/archive/20230216/202302162040.fai25ul2-...@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project 
f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/intel-lab-lkp/linux/commit/2cd958b09698bedea1a5a5f6298f0d25ec522bf9
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review 
Uday-Shankar/blk-mq-enforce-op-specific-segment-limits-in-blk_insert_cloned_request/20230216-041718
git checkout 2cd958b09698bedea1a5a5f6298f0d25ec522bf9
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot 
| Link: 
https://lore.kernel.org/oe-kbuild-all/202302162040.fai25ul2-...@intel.com/

All warnings (new ones prefixed by >>):

>> block/blk-mq.c:3032:36: warning: format specifies type 'unsigned short' but 
>> the argument has type 'unsigned int' [-Wformat]
   __func__, rq->nr_phys_segments, max_segments);
   ^~~~
   include/linux/printk.h:457:60: note: expanded from macro 'printk'
   #define printk(fmt, ...) printk_index_wrap(_printk, fmt, ##__VA_ARGS__)
   ~~~^~~
   include/linux/printk.h:429:19: note: expanded from macro 'printk_index_wrap'
   _p_func(_fmt, ##__VA_ARGS__);   \
   ^~~
   1 warning generated.
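
One way to silence the warning (a sketch, not a submitted fix):
rq->nr_phys_segments is an unsigned short that is promoted to int in the
varargs call, while max_segments is an unsigned int here, so printing both
with %u keeps the format string and the arguments consistent:

	printk(KERN_ERR "%s: over max segments limit. (%u > %u)\n",
	       __func__, rq->nr_phys_segments, max_segments);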


vim +3032 block/blk-mq.c

  2993  
  2994  #ifdef CONFIG_BLK_MQ_STACKING
  2995  /**
  2996   * blk_insert_cloned_request - Helper for stacking drivers to submit a 
request
  2997   * @rq: the request being queued
  2998   */
  2999  blk_status_t blk_insert_cloned_request(struct request *rq)
  3000  {
  3001  struct request_queue *q = rq->q;
  3002  unsigned int max_sectors = blk_queue_get_max_sectors(q, 
req_op(rq));
  3003  unsigned int max_segments = blk_queue_get_max_segments(q, 
req_op(rq));
  3004  blk_status_t ret;
  3005  
  3006  if (blk_rq_sectors(rq) > max_sectors) {
  3007  /*
  3008   * SCSI device does not have a good way to return if
  3009   * Write Same/Zero is actually supported. If a device 
rejects
  3010   * a non-read/write command (discard, write same,etc.) 
the
  3011   * low-level device driver will set the relevant queue 
limit to
  3012   * 0 to prevent blk-lib from issuing more of the 
offending
  3013   * operations. Commands queued prior to the queue limit 
being
  3014   * reset need to be completed with BLK_STS_NOTSUPP to 
avoid I/O
  3015   * errors being propagated to upper layers.
  3016   */
  3017  if (max_sectors == 0)
  3018  return BLK_STS_NOTSUPP;
  3019  
  3020  printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
  3021  __func__, blk_rq_sectors(rq), max_sectors);
  3022  return BLK_STS_IOERR;
  3023  }
  3024  
  3025  /*
  3026   * The queue settings related to segment counting may differ 
from the
  3027   * original queue.
  3028   */
  3029  rq->nr_phys_segments = blk_recalc_rq_segments(rq);
  3030  if (rq->nr_phys_segments > max_segments) {
  3031  printk(KERN_ERR "%s: over max segments limit. (%hu > 
%hu)\n",
> 3032  __func__, rq->nr_phys_segments, max_segments);
  3033  return BLK_STS_IOERR;
  3034  }
  3035  
  3036  if (q->disk && should_fail_request(q->disk->part0, 
blk_rq_bytes(rq)))
  3037  re

Re: [dm-devel] A hang bug of dm on s390x

2023-02-16 Thread Zdenek Kabelac

On 16. 02. 23 at 9:30, Pingfan Liu wrote:

Hi Ming,

Thank you for looking into this.

let me loop in Alasdair, Mike and Zdenek for further comment on LVM stuff


Thanks,

Pingfan



Hi


From the lvm2 POV, a couple of clarifications. To let the thin-pool
auto-extend, the user has to configure:



thin_pool_autoextend_threshold = 70

thin_pool_autoextend_percent = 20


If the thin_pool_autoextend_threshold is left with the default value 100, 
there is no extension made to the thin-pool.


The default behavior of the thin-pool kernel target when it runs out of space
is to put all in-flight IO operations on hold for 60s (configurable by a
kernel parameter); then all such operations start to be errored and the
thin-pool goes to an out-of-space error state.


To get this state immediately, use '--errorwhenfull=y' with the thin pool
(lvcreate, lvconvert, lvchange); this avoids any delay if the user doesn't
want expansion of the thin-pool and wants to get the error ASAP.


But this all might be unrelated to the issue you are getting on your hw.


Regards

Zdenek



On Thu, Feb 16, 2023 at 8:08 AM Ming Lei  wrote:

On Wed, Feb 15, 2023 at 07:23:40PM +0800, Pingfan Liu wrote:

Hi guys,

I encountered  a hang issue on a s390x system.  The tested kernel is
not preemptible and booting with "nr_cpus=1"

The test steps:
   umount /home
   lvremove /dev/rhel_s390x-kvm-011/home
   ## uncomment "snapshot_autoextend_threshold = 70" and
   "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf

   systemctl enable lvm2-monitor.service
   systemctl start lvm2-monitor.service

   lvremove -y rhel_s390x-kvm-011/thinp
   lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
   lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
   mkfs.ext4 /dev/rhel_s390x-kvm-011/src
   mount /dev/rhel_s390x-kvm-011/src /mnt
   for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img
bs=100M count=1; done

And the system hangs with the console log [1]

The related kernel config

 CONFIG_PREEMPT_NONE_BUILD=y
 CONFIG_PREEMPT_NONE=y
 CONFIG_PREEMPT_COUNT=y
 CONFIG_SCHED_CORE=y

It turns out that when hanging, the kernel is stuck in the dead-loop
in the function dm_wq_work()
 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
 spin_lock_irq(&md->deferred_lock);
 bio = bio_list_pop(&md->deferred);
 spin_unlock_irq(&md->deferred_lock);

 if (!bio)
 break;
 thread_cpu = smp_processor_id();
 submit_bio_noacct(bio);
 }
where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
keeps generating new bios, and the condition "if (!bio)" cannot be
met.


After applying the following patch, the issue is gone.

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e1ea3a7bd9d9..95c9cb07a42f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
 break;

 submit_bio_noacct(bio);
+   cond_resched();
 }
  }

But I think it is not a proper solution. And without this patch, if
nr_cpus=1 is removed (the system has two CPUs), the issue cannot be
triggered. That is to say, with more than one CPU, the above loop can exit
via the condition "if (!bio)".

Any ideas?

I think the patch is correct.

For a kernel built without CONFIG_PREEMPT, in the single-CPU-core case,
if the dm target (such as dm-thin) needs another wq or kthread for
handling IO, then the dm target side is blocked because dm_wq_work()
holds the single CPU; sooner or later, the dm target may have no
resources to handle new IO from dm core and returns REQUEUE.

Then dm_wq_work() becomes a dead loop.


Thanks,
Ming





Re: [dm-devel] On allowing remounting the partition containing dm-mapped boot disk as read-write

2023-02-16 Thread Roger Heflin
Besides caching at the dm layer, there is caching at the filesystem
layer (files, directories and other stuff).  The caching at the DM
layer is not the issue.

If rootfs mounts the dm-device someplace as a
(ntfs)filesystem(readonly), and then later you also mount that same
(ntfs)filesystem (read-write) someplace then you have mounted the
filesystem twice (not a shared bind mount that appears to be 2 mounts
but in reality is only a single mount), and the filesystem code is not
safe when there are 2 unique separate mount instances, the fs caching
is not shared.

ie if the filesystem is already mounted as a filesystem someplace
(mapping a dm device is not mounting) and you mount it a 2nd time then
you are at risk of corruption.

For the mount command to get a busy indicates that something has
already mounted it someplace and has locked the device(to prevent a
2nd corrupting mount from happening), and the entire purpose of the
module appears to be to disable the safety provided by this locking
and allow one to mount the fs again.

And note that with double mounts (usually I have seen/debugged the
crash/outage from them with shared SAN storage)  they can "APPEAR" to
kind of work if you do not use them too much, but eventually the 2
instances of filesystem code step on each other and it ends badly
(kernel crash, data corruption, loss of the filesystem).

If you need this module so that you can use another (non-bind) mount
command someplace on the same device then you are double mounting.
The point of the busy device is that it is mounted someplace, and the
entire point of the module is to bypass the protection.

It is the filesystem code and filesystem caching that is not safe.
Both filesystems assume they have total control of the filesystem data
structures; at the very least the writing mount will change those
structures, and the read-only mount may read blocks that were part
of one file but now aren't part of that file, returning the wrong data.

On Wed, Feb 15, 2023 at 11:23 PM Ganapathi Kamath  wrote:
>
>
> Firstly, thank you for your reply. I'm not a kernel expert, so I value what
> you say.
> but as I raised the issue I felt I had to defend the usefulness, userbase and 
> the need.
>
> > Typically double mounts are done via bind mounts (not really double
> > mounted, just the device showing someplace else).   Or one would do a
> > mount -o remount,rw <mountpoint> and remount it rw so you could write
> > to it.
>
> > A real double mount, where the kernel fs module manages both mounts as
> > if it was a separate device, won't work reliably, as the filesystem
> > module/caching assumes it has total control.   So writing to a
> > read-write device that has a separate read-only mount with some data
> > in the read cache will not return the true data in all cases.  2
> > read-write mounts (done any way--except with a clustering fs module) are
> > going to cause corruption of the underlying filesystem.
>
> I want to clarify: even though ventoy uses the word 'remount' to describe the
> feature, the host filesystem isn't mounted twice. There is no loopback-fs
> to bind mount, and dmsetup status shows linear sequences of blocks allocated
> to the devmapper device.
>
> For this feature to work, the Linux being booted up creates the devmapper
> device after having first somehow determined the sectors occupied by
> the file in the filesystem. It then mounts the partitions inside the
> devmapper device, pivots to the discovered rootfs, and continues booting.
>
> So what I think you are saying is that a mount of /dev/sdc1 and
> /dev/mapper/ventoy would both claim the same hard-disk sectors, asking the
> kernel to consider them as part of its disk-sector caching mechanism.
>
> Booting virtual-disks this way is also called nativeboot.
> The way this nativeboot works so far has a little danger.
> Three safeguards are to be followed by self-contract:
> 1) The virtual-disk-file must be a fixed size; it cannot be allowed to grow
> or shrink.
> 2) The virtual-disk-file must not be manipulated/touched/read from the
> host-partition.
> 3) The filesystem driver shouldn't defrag, scrub or relocate the
> virtual-disk-file.
> This is so that the file entry in the outside fs remains untouched.
> Usually, as this is done by root, and such an administrative user knows what
> he is doing, it is not so much of a problem.
> If one adheres to the above self-contract, the filesystem code for partitions
> inside the dm-device does not interfere with the filesystem code for the
> outside partition.
>
> An idea just came to me. There may be a way around the disk sector caching.
> Will this work, or be safer?
> What if, in addition to /dev/mapper/ventoy, a second dm device
> /dev/mapper/antiventoy, of the same size as the host partition /dev/sdc1, is
> created by stitching together the other remaining sectors of the host
> partition /dev/sdc1, with the vdisk sectors swapped for null sectors. Then
> the two dm-created disks:
> 

Re: [dm-devel] On allowing remounting the partition containing dm-mapped boot disk as read-write

2023-02-16 Thread Ganapathi Kamath


Firstly, thank you for your reply. I'm not a kernel expert, so I value what
you say, but as I raised the issue I felt I had to defend the usefulness, the
user base and the need.

> Typically double mounts are done via bind mounts (not really double
> mounted, just the device showing someplace else).   Or one would do a
> mount -o remount,rw <mountpoint> and remount it rw so you could write
> to it.

> A real double mount, where the kernel fs module manages both mounts as
> if it was a separate device, won't work reliably, as the filesystem
> module/caching assumes it has total control.    So writing to a
> read-write device that has a separate read-only mount with some data
> in the read cache will not return the true data in all cases.    2
> read-write mounts (done any way--except with a clustering fs module) are
> going to cause corruption of the underlying filesystem.

I want to clarify: even though ventoy uses the word 'remount' to describe the
feature, the host filesystem isn't mounted twice. There is no loopback-fs
to bind mount, and dmsetup status shows linear sequences of blocks allocated
to the devmapper device.

For this feature to work, the Linux being booted up creates the devmapper
device after having first somehow determined the sectors occupied by
the file in the filesystem. It then mounts the partitions inside the devmapper
device, pivots to the discovered rootfs, and continues booting.

So what I think you are saying is that a mount of /dev/sdc1 and
/dev/mapper/ventoy would both claim the same hard-disk sectors, asking the
kernel to consider them as part of its disk-sector caching mechanism.

Booting virtual-disks this way is also called nativeboot.
The way this nativeboot works so far has a little danger.
Three safeguards are to be followed by self-contract:
1) The virtual-disk-file must be a fixed size; it cannot be allowed to grow or
shrink.
2) The virtual-disk-file must not be manipulated/touched/read from the
host-partition.
3) The filesystem driver shouldn't defrag, scrub or relocate the
virtual-disk-file.
This is so that the file entry in the outside fs remains untouched.
Usually, as this is done by root, and such an administrative user knows what
he is doing, it is not so much of a problem.
If one adheres to the above self-contract, the filesystem code for partitions
inside the dm-device does not interfere with the filesystem code for the
outside partition.

An idea just came to me. There may be a way around the disk sector caching.
Will this work, or be safer?
What if, in addition to /dev/mapper/ventoy, a second dm device
/dev/mapper/antiventoy, of the same size as the host partition /dev/sdc1, is
created by stitching together the other remaining sectors of the host
partition /dev/sdc1, with the vdisk sectors swapped for null sectors. Then the
two dm-created disks, /dev/mapper/ventoy and /dev/mapper/antiventoy, can be
mounted independently, without overlap of disk sectors, separating their
caching.
The self-contract will still be needed, to not alter the location/size of the
fs-entry.
I'll suggest the above to ventoy. Your thoughts will be helpful.

> Given that the use case for the module is dangerous and the use case
> is of questionable usefulness I would think that is no point of the
> module.  The module's intent seems to be to get around the exclusive
> locks that the filesystem (for good reason) is putting on the device.

I believe that the danger can be mitigated with a good idea and proper coding. 
But the door shouldn't be shut. 

Its usefulness and user base are really there. The use case is really
important:
1) to users who dual-boot Windows/Linux or multi-boot other OSes
and juggle between them for want of SSD space,
2) to multi-booting alternate OSes with limited on-machine disk space,
3) to mounting different live ISOs often, which are often re-downloaded due to
updates,
4) to those keeping a host of recovery ISOs at hand for emergencies, like
WINRE, WIN-installer, linux-installer, HBCD-PE, gparted, clonezilla,
rescuezilla, seabios, samsung-wizard.

Why not a VM?
VMs are nice but a bit slower than nativeboot. Many things cannot be done
inside a VM, such as getting the full native experience of a live ISO, GPU
support and all.
Some system-level recovery and repair tools must be booted natively.

In the old days, hard disks, USB drives and ISO files were small;
vdisks were nonexistent.
One had to burn live ISOs to CD/DVD. Disks have grown larger;
burning DVDs is such a waste now.

At one point I considered having a small number of 8GB microSD cards function
just like tiny DVDs/floppies. But managing them is also a hassle, as they are
stored externally.

Disadvantages of flashing a USB drive:
* Flashing a USB drive, which say is 32GB, with a tiny <3GB ISO file can
result in wasted space, as it creates a small partition, limiting the drive's
usefulness.
* One doesn't want too many USB drives lying around to boot different ISOs.
* In 

[dm-devel] A hang bug of dm on s390x

2023-02-16 Thread Pingfan Liu
Hi guys,

I encountered  a hang issue on a s390x system.  The tested kernel is
not preemptible and booting with "nr_cpus=1"

The test steps:
  umount /home
  lvremove /dev/rhel_s390x-kvm-011/home
  ## uncomment "snapshot_autoextend_threshold = 70" and
  "snapshot_autoextend_percent = 20" in /etc/lvm/lvm.conf

  systemctl enable lvm2-monitor.service
  systemctl start lvm2-monitor.service

  lvremove -y rhel_s390x-kvm-011/thinp
  lvcreate -L 10M -T rhel_s390x-kvm-011/thinp
  lvcreate -V 400M -T rhel_s390x-kvm-011/thinp -n src
  mkfs.ext4 /dev/rhel_s390x-kvm-011/src
  mount /dev/rhel_s390x-kvm-011/src /mnt
  for((i=0;i<4;i++)); do dd if=/dev/zero of=/mnt/test$i.img
bs=100M count=1; done

And the system hangs with the console log [1]

The related kernel config

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
CONFIG_PREEMPT_COUNT=y
CONFIG_SCHED_CORE=y

It turns out that when hanging, the kernel is stuck in the dead-loop
in the function dm_wq_work()
while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
spin_lock_irq(&md->deferred_lock);
bio = bio_list_pop(&md->deferred);
spin_unlock_irq(&md->deferred_lock);

if (!bio)
break;
thread_cpu = smp_processor_id();
submit_bio_noacct(bio);
}
where dm_wq_work()->__submit_bio_noacct()->...->dm_handle_requeue()
keeps generating new bios, and the condition "if (!bio)" cannot be
met.


After applying the following patch, the issue is gone.

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e1ea3a7bd9d9..95c9cb07a42f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2567,6 +2567,7 @@ static void dm_wq_work(struct work_struct *work)
break;

submit_bio_noacct(bio);
+   cond_resched();
}
 }

But I think it is not a proper solution. And without this patch, if
nr_cpus=1 is removed (the system has two CPUs), the issue cannot be
triggered. That is to say, with more than one CPU, the above loop can exit
via the condition "if (!bio)".

Any ideas?

Thanks,

Pingfan

[1]: the console log when hanging

 [ 2062.321473] device-mapper: thin: 253:4: reached low water mark for
data device: sending event.
 [ 2062.353217] dm-3: detected capacity change from 122880 to 147456
 [-- MARK -- Wed Dec 14 08:45:00 2022]
[ 2062.376690] device-mapper: thin: 253:4: switching pool to
out-of-data-space (queue IO) mode
 [ 2122.393998] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
 [ 2122.394011] (detected by 0, t=6002 jiffies, g=36205, q=624 ncpus=1)
 [ 2122.394014] rcu: All QSes seen, last rcu_sched kthread activity
6002 (4295149593-4295143591), jiffies_till_next_fqs=1, root ->qsmask
0x0
 [ 2122.394017] rcu: rcu_sched kthread starved for 6002 jiffies!
g36205 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
 [ 2122.394019] rcu: Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
 [ 2122.394020] rcu: RCU grace-period kthread stack dump:
 [ 2122.394022] task:rcu_sched   state:R  running task stack:
  0 pid:   15 ppid: 2 flags:0x
 [ 2122.394027] Call Trace:
 [ 2122.394030]  [<0001001fd5b0>] __schedule+0x300/0x6c0
 [ 2122.394040]  [<0001001fd9d2>] schedule+0x62/0xf0
 [ 2122.394042]  [<00010020360c>] schedule_timeout+0x8c/0x170
 [ 2122.394045]  [] rcu_gp_fqs_loop+0x30e/0x3d0
 [ 2122.394053]  [] rcu_gp_kthread+0x132/0x1b0
 [ 2122.394054]  [] kthread+0x108/0x110
 [ 2122.394058]  [] __ret_from_fork+0x3c/0x60
 [ 2122.394061]  [<00010020455a>] ret_from_fork+0xa/0x30
 [ 2122.394064] rcu: Stack dump where RCU GP kthread last ran:
 [ 2122.394064] Task dump for CPU 0:
 [ 2122.394066] task:kworker/0:2 state:R  running task stack:
  0 pid:16943 ppid: 2 flags:0x0004
 [ 2122.394070] Workqueue: kdmflush/253:6 dm_wq_work [dm_mod]
 [ 2122.394100] Call Trace:
 [ 2122.394100]  [] sched_show_task.part.0+0xf8/0x130
 [ 2122.394106]  [<0001001efc9a>]
rcu_check_gp_kthread_starvation+0x172/0x190
 [ 2122.394111]  [] print_other_cpu_stall+0x2de/0x370
 [ 2122.394113]  [] check_cpu_stall+0x1d0/0x270
 [ 2122.394114]  [] rcu_sched_clock_irq+0x82/0x230
 [ 2122.394117]  [] update_process_times+0xba/0xf0
 [ 2122.394121]  [] tick_sched_handle+0x4a/0x70
 [ 2122.394124]  [] tick_sched_timer+0x5e/0xc0
 [ 2122.394126]  [] __hrtimer_run_queues+0x136/0x290
 [ 2122.394128]  [] hrtimer_interrupt+0x150/0x2d0
 [ 2122.394130]  [] do_IRQ+0x56/0x70
 [ 2122.394133]  [] do_irq_async+0x56/0xb0
 [ 2122.394135]  [<0001001f6786>] do_ext_irq+0x96/0x160
 [ 2122.394138]  [<0001002047bc>] ext_int_handler+0xdc/0x110
 [ 2122.394140]  [<03ff8005cf48>]

Re: [dm-devel] [v2 PATCH 0/5] Introduce mempool pages bulk allocator and use it in dm-crypt

2023-02-16 Thread Yang Shi
On Wed, Feb 15, 2023 at 4:23 AM Mikulas Patocka  wrote:
>
>
>
> On Tue, 14 Feb 2023, Yang Shi wrote:
>
> >
> > Changelog:
> > RFC -> v2:
> >   * Added callback variant for page bulk allocator and mempool bulk 
> > allocator
> > per Mel Gorman.
> >   * Used the callback version in dm-crypt driver.
> >   * Some code cleanup and refactor to reduce duplicate code.
> >
> > rfc: 
> > https://lore.kernel.org/linux-mm/20221005180341.1738796-1-shy828...@gmail.com/
>
> Hi
>
> This seems like unneeded complication to me. We have alloc_pages(), it can
> allocate multiple pages efficiently, so why not use it?

The alloc_pages() allocates *contiguous* pages, but dm-crypt doesn't
need contiguous pages at all. This may incur unnecessary compaction
overhead in the dm-crypt layer when memory is fragmented. The bulk
allocator is a good fit for this use case, which allocates multiple
order-0 pages.

In addition, filesystem writeback doesn't guarantee power-of-2 pages
every time IIUC. But alloc_pages() just can allocate power-of-2 pages.

>
> I suggest to modify crypt_alloc_buffer() to use alloc_pages() and if
> alloc_pages() fails (either because the system is low on memory or because
> memory is too fragmented), fall back to the existing code that does
> mempool_alloc().

My PoC patches did just this, but called the bulk allocator. There may
be other potential mempool users, as I listed in this cover letter,
which may benefit from the bulk allocator. So introducing a new bulk
mempool API seems better for the long run, although we have just one user
for now. And it makes it easier for other users to gain the benefit by just
calling the new API.

>
> Mikulas
>




Re: [dm-devel] RAID4 with no striping mode request

2023-02-16 Thread Kyle Sanderson
> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin  wrote:
>
> I think he is wanting the parity across the data blocks on the
> separate filesystems (some sort of parity across fs[1-8]/block0 to
> parity/block0).

Correct.

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin  wrote:
> it is not clear to me what this setup would be enough better than what
> the current setups.  Given that one could have 8 spin + 1 ssd or 12
> spin for the same price.  And 2 6-disk raid6's would have the same
> usable space, and be pretty safe (can lose any 2 of the 6 and lose no
> data).

They're not the same price though. Remember these disks are mixed
sizes and various ages, exposing their entire data value
(4d+8d+12d+12p gives you 24T of usable storage) all protected by the
single parity disk.

Yes, higher levels of RAID will always be better. However, that's not
how these millions of appliances are developed by a number of
manufacturers and sold at your local retailer. The proposal (and ask
for help) that I've raised is to have an open-source solution to these
proprietary MD implementations, as opposed to being trapped with buggy
MD drivers on firmware that's glitchy and breaks other aspects of the
kernel.

> On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin  wrote:
> And given the separate filesystems requirement that would
> require some software above the filesystems to manage spreading the
> disks across multiple filesystems.   The risk of another disk going
> bad (while one was failed) and losing a disk's worth of data would
> push me to use the 6-disk raid6.

This is long solved by a number of FUSE filesystems, as well as
overlayfs (which would be nice if it could gradually spool data down
into layers, but that's another ball of wax).

Hopefully that makes sense. The only thing that's coming closer to
this is bcachefs, but that's still looking like a multi-year long road
(with the above being deployed in homes since the early 2000s).

On Wed, Feb 15, 2023 at 3:44 AM Roger Heflin  wrote:
>
> I think he is wanting the parity across the data blocks on the
> separate filesystems (some sort of parity across fs[1-8]/block0 to
> parity/block0).
>
> it is not clear to me what this setup would be enough better than what
> the current setups.  Given that one could have 8 spin + 1 ssd or 12
> spin for the same price.  And 2 6-disk raid6's would have the same
> usable space, and be pretty safe (can lose any 2 of the 6 and lose no
> data).  And given the separate filesystems requirement that would
> require some software above the filesystems to manage spreading the
> disks across multiple filesystems.   The risk of another disk going
> bad (while one was failed) and losing a disk's worth of data would
> push me to use the 6-disk raid6.
>
> WOL: current SSD's are rated for around 1000-2000 writes.  So a 1Tb
> disk can sustain 1000-2000TB of total writes.  And writes to
> filesystem blocks would get re-written more often than data blocks.
>  How well it would work would depend on how often the data is deleted
> and re-written.   If the disks are some sort of long term storage then
> the SSD is not going to get used up.   And I am not sure if the rated
> used up really means anything unless you are using a STUPID enterprise
> controller that proactively disables/kills the SSD when it says the
> rated writes have happened.   I have a 500GB ssd in a mirror that
> "FAILED" according to smart 2 years ago and so far is still fully
> functional, and it is "GOOD" again because the counters used to
> determine total writes seems to have rolled over.
>
> On Tue, Feb 14, 2023 at 8:23 PM Heinz Mauelshagen  wrote:
> >
> > Roger,
> >
> > as any of the currently implemented 'parity' algorithms (block 
> > xor/P-/Q-Syndrome) provided by DM/MD RAID
> > have to have at least two data blocks to calculate:  are you, apart from 
> > the filesystem thoughts you bring up, thinking
> > about running those on e.g. pairs of disks of the mentioned even-numbered
> > set of 8?
> >
> > Heinz
> >
> > On Tue, Feb 14, 2023 at 11:28 PM Roger Heflin  wrote:
> >>
> >> On Tue, Feb 14, 2023 at 3:27 PM Heinz Mauelshagen  
> >> wrote:
> >> >
> >>
> >> >
> >> >
> >> > ...which is RAID1 plus a parity disk which seems superfluous as you 
> >> > achieve (N-1)
> >> > resilience against single device failures already without the later.
> >> >
> >> > What would you need such parity disk for?
> >> >
> >> > Heinz
> >> >
> >>
> >> I thought that at first too, but threw that idea out as it did not
> >> make much sense.
> >>
> >> What he appears to want is 8 linear non-striped data disks + a parity disk.
> >>
> >> Such that you can lose any one data disk and parity can rebuild that
> >> disk.  And if you lose several data disks, then you have intact
> >> non-striped data for the remaining disks.
> >>
> >> It would almost seem that you would need to put a separate filesystem
> >> on each data disk/section (or have a filesystem that is redundant
> >> enough to survive) otherwise losing an