On 12/10/25 14:07, Benjamin Marzinski wrote:
> On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote:
>> From: Yongpeng Yang <[email protected]>
>>
>> Currently, the max_hw_discard_sectors of a stripe target is set to the
>> minimum max_hw_discard_sectors among all sub devices. When the discard
>> bio is larger than max_hw_discard_sectors, this may cause the stripe
>> device to split discard bios unnecessarily, because the value of
>> max_hw_discard_sectors affects max_discard_sectors, which is equal to
>> min(max_hw_discard_sectors, max_user_discard_sectors).
>>
>> For example:
>> root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev
>> root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes
>> 536870912
>> root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes
>> 536870912
>> root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev
>>
>> dm-1 is the stripe device, and its discard_max_bytes is equal to
>> each sub device’s discard_max_bytes. Since the requested discard
>> length exceeds discard_max_bytes, the block layer splits the discard bio:
>>
>> block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard]
>> block_split: 252,1 DS 0 / 1048576 [blkdiscard]
>> block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard]
>> block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard]
>>
>> However, both vdd and vde can actually handle a discard bio of 536870912
>> bytes, so this split is not necessary.
>>
>> This patch updates the stripe target’s q->limits.max_hw_discard_sectors
>> to be the minimum max_hw_discard_sectors of the sub devices multiplied
>> by the number of stripes. This enables the stripe device to handle
>> larger discard bios without incurring unnecessary splitting.
>>
>> Signed-off-by: Yongpeng Yang <[email protected]>
>> ---
>>  drivers/md/dm-stripe.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
>> index 1461dc740dae..799d6def699b 100644
>> --- a/drivers/md/dm-stripe.c
>> +++ b/drivers/md/dm-stripe.c
>> @@ -38,6 +38,9 @@ struct stripe_c {
>>      uint32_t chunk_size;
>>      int chunk_size_shift;
>>  
>> +    /* The minimum max_hw_discard_sectors of all sub devices. */
>> +    unsigned int max_hw_discard_sectors;
>> +
>>      /* Needed for handling events */
>>      struct dm_target *ti;
>>  
>> @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>>       * Get the stripe destinations.
>>       */
>>      for (i = 0; i < stripes; i++) {
>> +            struct request_queue *q;
>> +
>>              argv += 2;
>>  
>>              r = get_stripe(ti, sc, i, argv);
>> @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>>                      return r;
>>              }
>>              atomic_set(&(sc->stripe[i].error_count), 0);
>> +
>> +            q = bdev_get_queue(sc->stripe[i].dev->bdev);
>> +            if (i == 0)
>> +                    sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors;
>> +            else
>> +                    sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors,
>> +                                    q->limits.max_hw_discard_sectors);
> 
> I don't think any of the above is necessary. When stripe_io_hints() is
> called, dm_set_device_limits() will already have been called on all the
> underlying stripe devices to combine the limits. So
> limits->max_hw_discard_sectors should already be set to the same value
> as you're computing for sc->max_hw_discard_sectors. Right?

dm_set_device_limits() populates the stack-local ti_limits defined in
dm_calculate_queue_limits(), and stripe_io_hints() modifies that same
variable. Neither update reaches the stripe device's q->limits until
dm_calculate_queue_limits() later calls blk_stack_limits(). The
q->limits.max_hw_discard_sectors of each sub device in a stripe target
is never modified, so it is still necessary to record the minimum
q->limits.max_hw_discard_sectors across all sub devices.
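
To make the flow concrete, this is roughly what dm_calculate_queue_limits()
does per target (a paraphrased sketch of drivers/md/dm-table.c, not the
exact upstream code):

	struct queue_limits ti_limits;	/* stack-local, one per target */

	/* start from neutral stacking defaults */
	blk_set_stacking_limits(&ti_limits);
	/* fold each sub device's q->limits into ti_limits */
	ti->type->iterate_devices(ti, dm_set_device_limits, &ti_limits);
	/* stripe_io_hints() also only ever sees ti_limits */
	if (ti->type->io_hints)
		ti->type->io_hints(ti, &ti_limits);
	/* only now does ti_limits feed into the table-wide limits */
	blk_stack_limits(limits, &ti_limits, 0);

The sub devices' own q->limits are only read here, never written.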

> 
>>      }
>>  
>>      ti->private = sc;
>> @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti,
>>                          struct queue_limits *limits)
>>  {
>>      struct stripe_c *sc = ti->private;
>> -    unsigned int io_min, io_opt;
>> +    unsigned int io_min, io_opt, max_hw_discard_sectors;
>>  
>>      limits->chunk_sectors = sc->chunk_size;
>>  
>> @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti,
>>              limits->io_min = io_min;
>>              limits->io_opt = io_opt;
>>      }
>> +    if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors))
> 
> sc->max_hw_discard_sectors should be the same as
> limits->max_hw_discard_sectors here 
> 
>> +            limits->max_hw_discard_sectors = max_hw_discard_sectors;
> 
> I see a couple of issues with this calculation. First, this only works
> if max_hw_discard_sectors is greater than sc->chunk_size. But more than
> that, before you multiply the original limit by the number of stripes,
> you must round it down to a multiple of chunk_size. Otherwise, you can
> end up writing too much to some of the underlying devices. To show this
> with simple numbers, imagine you had 3 stripes with a chunk_size of
> 4 and all the underlying devices had a max_hw_discard_sectors of 5. You
> would set max_hw_discard_sectors to 5 * 3 = 15. But if you discarded 15
> sectors from the beginning of the device, you would end up discarding
> the first 4 sectors of each underlying device, and then loop around and
> discard 3 more sectors of the first device. This means that you would
> discard 7 from the first device, instead of the 5 that it could handle.
> Rounding max_hw_discard_sectors down to a multiple of chunk_size will
> fix that.

The splitting of discard bios is based on limits->max_discard_sectors,
which is independent of the stripe geometry. As a result, the discard
bio size passed through stripe_map() into stripe_map_range() may exceed
sc->chunk_size * sc->stripes. In the example above, stripe_map_range()
is invoked once per stripe (three times in total), and each invocation
trims bio->bi_iter.bi_size down to 5 sectors for each sub device.
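
For reference, the trimming happens in stripe_map_range() (paraphrased
from drivers/md/dm-stripe.c from memory, comments added):

	static int stripe_map_range(struct stripe_c *sc, struct bio *bio,
				    uint32_t target_stripe)
	{
		sector_t begin, end;

		/* map the clone's start and end onto this stripe's range */
		stripe_map_range_sector(sc, bio->bi_iter.bi_sector,
					target_stripe, &begin);
		stripe_map_range_sector(sc, bio_end_sector(bio),
					target_stripe, &end);
		if (begin < end) {
			bio_set_dev(bio, sc->stripe[target_stripe].dev->bdev);
			bio->bi_iter.bi_sector = begin +
				sc->stripe[target_stripe].physical_start;
			/* the clone is cut down to this stripe's share */
			bio->bi_iter.bi_size = to_bytes(end - begin);
			return DM_MAPIO_REMAPPED;
		}
		/* the range doesn't map to this stripe at all */
		bio_endio(bio);
		return DM_MAPIO_SUBMITTED;
	}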

> 
> Lastly, if you do overflow when multiplying max_hw_discard_sectors by
> the number of stripes, you should probably just set
> limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of
> leaving it at what it was.

Yes, I overlooked that. I’ll fix it in v2.
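
For v2 I am thinking of something along these lines in stripe_io_hints()
(untested sketch):

	unsigned int max_hw_discard_sectors;

	/* fall back to the block layer maximum if the multiply overflows */
	if (check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes,
			       &max_hw_discard_sectors))
		max_hw_discard_sectors = UINT_MAX >> SECTOR_SHIFT;
	limits->max_hw_discard_sectors = max_hw_discard_sectors;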

Thanks,
Yongpeng

> 
> -Ben
> 
>>  }
>>  
>>  static struct target_type stripe_target = {
>> -- 
>> 2.43.0
> 

