On Fri, Mar 22 2024 at 8:03 PM -0400,
Ming Lei <[email protected]> wrote:

> On Fri, Mar 22, 2024 at 11:30:33AM +0100, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 22 Mar 2024, Ming Lei wrote:
> > 
> > > On Thu, Mar 21, 2024 at 05:48:45PM +0100, Mikulas Patocka wrote:
> > > > It may be possible to set up dm-integrity with smaller sector size than
> > > > the logical sector size of the underlying device. In this situation,
> > > > dm-integrity guarantees that the outgoing bios have the same alignment as
> > > > incoming bios (so, if you create a filesystem with 4k block size,
> > > > dm-integrity would send 4k-aligned bios to the underlying device).
> > > > 
> > > > This guarantee was broken when integrity_recheck was implemented.
> > > > integrity_recheck sends bio that is aligned to ic->sectors_per_block. So
> > > > if we set up integrity with 512-byte sector size on a device with logical
> > > > block size 4k, we would be sending unaligned bio. This triggered a bug in
> > > > one of our internal tests.
> > > > 
> > > > This commit fixes it - it determines what's the actual alignment of the
> > > > incoming bio and then makes sure that the outgoing bio in
> > > > integrity_recheck has the same alignment.
> > > > 
> > > > Signed-off-by: Mikulas Patocka <[email protected]>
> > > > Fixes: c88f5e553fe3 ("dm-integrity: recheck the integrity tag after a failure")
> > > > Cc: [email protected]
> > > > 
> > > > ---
> > > >  drivers/md/dm-integrity.c |   12 ++++++++++--
> > > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > > 
> > > > Index: linux-2.6/drivers/md/dm-integrity.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/drivers/md/dm-integrity.c    2024-03-21 14:25:45.000000000 +0100
> > > > +++ linux-2.6/drivers/md/dm-integrity.c 2024-03-21 17:47:39.000000000 +0100
> > > > @@ -1699,7 +1699,6 @@ static noinline void integrity_recheck(s
> > > >         struct bio_vec bv;
> > > >         sector_t sector, logical_sector, area, offset;
> > > >         struct page *page;
> > > > -       void *buffer;
> > > >  
> > > >         get_area_and_offset(ic, dio->range.logical_sector, &area, &offset);
> > > >         dio->metadata_block = get_metadata_sector_and_offset(ic, area, offset,
> > > > @@ -1708,13 +1707,14 @@ static noinline void integrity_recheck(s
> > > >         logical_sector = dio->range.logical_sector;
> > > >  
> > > >         page = mempool_alloc(&ic->recheck_pool, GFP_NOIO);
> > > > -       buffer = page_to_virt(page);
> > > >  
> > > >         __bio_for_each_segment(bv, bio, iter, dio->bio_details.bi_iter) {
> > > >                 unsigned pos = 0;
> > > >  
> > > >                 do {
> > > > +                       sector_t alignment;
> > > >                         char *mem;
> > > > +                       char *buffer = page_to_virt(page);
> > > >                         int r;
> > > >                         struct dm_io_request io_req;
> > > >                         struct dm_io_region io_loc;
> > > > @@ -1727,6 +1727,14 @@ static noinline void integrity_recheck(s
> > > >                         io_loc.sector = sector;
> > > >                         io_loc.count = ic->sectors_per_block;
> > > >  
> > > > +                       /* Align the bio to logical block size */
> > > > +                       alignment = dio->range.logical_sector | bio_sectors(bio) | (PAGE_SIZE >> SECTOR_SHIFT);
> > > > +                       alignment &= -alignment;
> > > 
> > > The above is less readable, :-(
> > 
> > It isolates the lowest bit from dio->range.logical_sector, 
> > bio_sectors(bio) and (PAGE_SIZE >> SECTOR_SHIFT).
> > 
> > See for example this https://www.felixcloutier.com/x86/blsi
> 
> Fine, but I have to say such usage isn't popular.

Yeah, at a minimum it should have a comment explaining the
optimization of combining the terms and then isolating the lsbit.

The non-ffs() optimized gcd() uses the same but comments it:
        /* Isolate lsbit of r */
        r &= -r;

> > > > +                       io_loc.sector = round_down(io_loc.sector, alignment);
> > > > +                       io_loc.count += sector - io_loc.sector;
> > > > +                       buffer += (sector - io_loc.sector) << SECTOR_SHIFT;
> > > > +                       io_loc.count = round_up(io_loc.count, alignment);
> > > 
> > > I feel the above code isn't very reliable, what we need actually is to
> > > make sure that io's sector & size is aligned with dm's
> > > bdev_logical_block_size(bdev).
> > 
> > I thought about using bdev_logical_block_size. But it may be wrong if the 
> > device stack is reconfigured. So, I concluded that taking the alignment 
> > from the bio would be better.
> 
> If logical block becomes mismatched by reconfiguration, the whole DM stack 
> can't work:
> 
> - at the beginning, DM is over NVMe(512 bs), DM & NVMe lbs is 512
> - later, nvme is reconfigured and its lbs becomes 4k, but DM's lbs can't
> be updated
> - then unaligned IO is submitted to NVMe
> 
> So DM _never_ works with mis-matched logical block size because of
> reconfigure, and same with MD.

At some point we need to trust the queue_limits and DM takes
considerable pain to validate the alignment when a dm-table is
(re)loaded.

But we could get into problems with deep(er) device stacks where an
underlying DM device is reloaded but the upper level devices'
queue_limits aren't restacked.  Thankfully, in practice that generally
doesn't occur!

If it were to become a prevalent issue DM would need to grow
validation that DM devices aren't changing their logical_block_size and
overall alignment during runtime.

> > > Yeah, so far the max logical block size is 4k, but it may be increased
> > > in the future (see the recent LSFMM proposal), so can we force it to be
> > > aligned with bdev_logical_block_size(bdev) here?
> > > 
> > > Also can the above change work efficiently in case of 64K PAGE_SIZE?
> > 
> > It doesn't work efficiently at all - this piece of code is only run in a
> > pathological case where the user writes into a buffer while reading it (or
> > reads multiple blocks into the same buffer), so I optimized it for
> > size, not for performance.
> > 
> > But yes, it works with 64K PAGE_SIZE.
> 
> Fine, but I still think PAGE_SIZE is harder to follow than logical block
> size.

Thanks for your review.  I shared many of your review concerns (the
math isn't approachable, and why not just use logical_block_size from
queue_limits?).

That said, I'm OK with the code as-is because it has been tested to
fix the reported misalignment issue.  BUT, I would like to see
follow-on cleanup in a separate commit, at a minimum there should be
some helpful comments (to address the math and assumptions made,
e.g. this recheck code is not fast path).

Mike
