On Fri, Apr 18, 2025 at 9:17 AM Yu Kuai <yuku...@huaweicloud.com> wrote:
>
> From: Yu Kuai <yuku...@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflight sync IO
> will be limited if the array is not idle.
>
> However, when mkfs.ext4 is run on a large raid5 array while recovery
> is in progress, it's found that sync_speed is already above speed_min
> while lots of stripes are used for sync IO, causing a long delay for
> mkfs.ext4.
>
> The root cause is the following check in is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2 = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> As a consequence, the more sync IO is issued, the less likely the
> check will pass. And when completed normal IO is more than issued
> sync IO, the condition will finally pass and is_mddev_idle() will
> return false; however, last_events will be updated, hence
> is_mddev_idle() can only return false once in a while.
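
To make the failure mode concrete, here is a worked example (the
numbers are invented purely for illustration):

    t1: completed IO = 1000, issued sync IO = 2000 -> events1 = -1000
    t2: completed IO = 1500, issued sync IO = 2600 -> events2 = -1100

    events2 - events1 = -100, which is not > 64, so the array still
    looks idle even though normal IO (e.g. mkfs) is queued behind the
    resync stripes.

Since every issued sync IO is subtracted before it completes, heavy
resync keeps the delta below the threshold almost permanently.
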
>
> Fix this problem by changing the check as follows:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disk is a partition, all other partitions don't
>    have IO completed.
>
> Signed-off-by: Yu Kuai <yuku...@huawei.com>
> ---
>  drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
>  drivers/md/md.h |  3 +-
>  2 files changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 52cadfce7e8d..dfd85a5d6112 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
>         put_cluster_ops(mddev);
>  }
>
> -static int is_mddev_idle(struct mddev *mddev, int init)
> +static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
>  {
> +       unsigned long last_events = rdev->last_events;
> +
> +       if (!bdev_is_partition(rdev->bdev))
> +               return true;
> +
> +       /*
> +        * If rdev is a partition, and the user doesn't issue IO to the
> +        * array, the array is still not idle if the user issues IO to
> +        * other partitions.
> +        */
> +       rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
> +                                                sectors) -
> +                           part_stat_read_accum(rdev->bdev, sectors);
> +
> +       if (!init && rdev->last_events > last_events)
> +               return false;
> +
> +       return true;
> +}
> +
> +/*
> + * mddev is idle if the following conditions match since the last check:
> + * 1) mddev doesn't have normal IO completed;
> + * 2) mddev doesn't have inflight normal IO;
> + * 3) if any member disk is a partition, the other partitions don't
> + *    have IO completed;
> + *
> + * Note that this check relies on IO accounting being enabled.
> + */
> +static bool is_mddev_idle(struct mddev *mddev, int init)
> +{
> +       unsigned long last_events = mddev->last_events;
> +       struct gendisk *disk;
>         struct md_rdev *rdev;
> -       int idle;
> -       int curr_events;
> +       bool idle = true;
>
> -       idle = 1;
> -       rcu_read_lock();
> -       rdev_for_each_rcu(rdev, mddev) {
> -               struct gendisk *disk = rdev->bdev->bd_disk;
> +       disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
> +       if (!disk)
> +               return true;
>
> -               if (!init && !blk_queue_io_stat(disk->queue))
> -                       continue;
> +       mddev->last_events = part_stat_read_accum(disk->part0, sectors);
> +       if (!init && (mddev->last_events > last_events ||
> +                     bdev_count_inflight(disk->part0)))
> +               idle = false;
>
> -               curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
> -                       atomic_read(&disk->sync_io);
> -               /* sync IO will cause sync_io to increase before the disk_stats
> -                * as sync_io is counted when a request starts, and
> -                * disk_stats is counted when it completes.
> -                * So resync activity will cause curr_events to be smaller than
> -                * when there was no such activity.
> -                * non-sync IO will cause disk_stat to increase without
> -                * increasing sync_io so curr_events will (eventually)
> -                * be larger than it was before. Once it becomes
> -                * substantially larger, the test below will cause
> -                * the array to appear non-idle, and resync will slow
> -                * down.
> -                * If there is a lot of outstanding resync activity when
> -                * we set last_event to curr_events, then all that activity
> -                * completing might cause the array to appear non-idle
> -                * and resync will be slowed down even though there might
> -                * not have been non-resync activity. This will only
> -                * happen once though. 'last_events' will soon reflect
> -                * the state where there is little or no outstanding
> -                * resync requests, and further resync activity will
> -                * always make curr_events less than last_events.
> -                *
> -                */
> -               if (init || curr_events - rdev->last_events > 64) {
> -                       rdev->last_events = curr_events;
> -                       idle = 0;
> -               }
> -       }
> +       rcu_read_lock();
> +       rdev_for_each_rcu(rdev, mddev)
> +               if (!is_rdev_holder_idle(rdev, init))
> +                       idle = false;
>         rcu_read_unlock();
> +
>         return idle;
>  }
>
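
As an aside, the partition accounting in is_rdev_holder_idle() works
out like this (numbers invented purely for illustration):

    sectors completed on the whole holder disk (part0) = 5000
    sectors completed on the rdev partition itself     = 3000
    rdev->last_events = 5000 - 3000                    = 2000

So last_events counts IO completed on the *other* partitions of the
holder disk, and any growth there since the previous check correctly
keeps the array from being treated as idle.
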
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b57842188f18..1d51c2405d3d 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -132,7 +132,7 @@ struct md_rdev {
>
>         sector_t sectors;               /* Device size (in 512bytes sectors) */
>         struct mddev *mddev;            /* RAID array if running */
> -       int last_events;                /* IO event timestamp */
> +       unsigned long last_events;      /* IO event timestamp */
>
>         /*
>          * If meta_bdev is non-NULL, it means that a separate device is
> @@ -520,6 +520,7 @@ struct mddev {
>                                                  * adding a spare
>                                                  */
>
> +       unsigned long last_events;      /* IO event timestamp */

Can we use another name? mddev already has an 'events' counter, and
this name can easily be confused with that counter.

Regards
Xiao

>         atomic_t recovery_active;       /* blocks scheduled, but not written */
>         wait_queue_head_t recovery_wait;
>         sector_t recovery_cp;
> --
> 2.39.2
>