From: Yu Kuai <yuku...@huawei.com>
If sync_speed is above speed_min, then is_mddev_idle() will be
called
for each sync IO to check if the array is idle, and inflihgt
sync_io
will be limited if the array is not idle.
However, while mkfs.ext4 for a large raid5 array while
recovery is in
progress, it's found that sync_speed is already above
speed_min while
lots of stripes are used for sync IO, causing long delay for
mkfs.ext4.
Root cause is the following checking from is_mddev_idle():
t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync
IO
if (events2 - events1 > 64)
For consequence, the more sync IO issued, the less likely
checking will
pass. And when completed normal IO is more than issued sync
IO, the
condition will finally pass and is_mddev_idle() will return
false,
however, last_events will be updated hence is_mddev_idle() can
only
return false once in a while.
Fix this problem by changing the checking as following:
1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disks is partition, and all other partitions
doesn't
have IO completed.
Signed-off-by: Yu Kuai <yuku...@huawei.com>
---
drivers/md/md.c | 84
+++++++++++++++++++++++++++----------------------
drivers/md/md.h | 3 +-
2 files changed, 48 insertions(+), 39 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 52cadfce7e8d..dfd85a5d6112 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev
*mddev)
put_cluster_ops(mddev);
}
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_holder_idle(struct md_rdev *rdev, bool
init)
{
+ unsigned long last_events = rdev->last_events;
+
+ if (!bdev_is_partition(rdev->bdev))
+ return true;
+
+ /*
+ * If rdev is partition, and user doesn't issue IO to the
array, the
+ * array is still not idle if user issues IO to other
partitions.
+ */
+ rdev->last_events =
part_stat_read_accum(rdev->bdev->bd_disk->part0,
+ sectors) -
+ part_stat_read_accum(rdev->bdev, sectors);
+
+ if (!init && rdev->last_events > last_events)
+ return false;
+
+ return true;
+}
+
+/*
+ * mddev is idle if following conditions are match since last
check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is partition, and other partitions
doesn't have IO
+ * completed;
+ *
+ * Noted this checking rely on IO accounting is enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
+{
+ unsigned long last_events = mddev->last_events;
+ struct gendisk *disk;
struct md_rdev *rdev;
- int idle;
- int curr_events;
+ bool idle = true;
- idle = 1;
- rcu_read_lock();
- rdev_for_each_rcu(rdev, mddev) {
- struct gendisk *disk = rdev->bdev->bd_disk;
+ disk = mddev_is_dm(mddev) ? mddev->dm_gendisk :
mddev->gendisk;
+ if (!disk)
+ return true;
- if (!init && !blk_queue_io_stat(disk->queue))
- continue;
+ mddev->last_events = part_stat_read_accum(disk->part0,
sectors);
+ if (!init && (mddev->last_events > last_events ||
+ bdev_count_inflight(disk->part0)))
+ idle = false;