On 22/1/21 1:52 am, David Sterba wrote:
On Thu, Jan 21, 2021 at 06:10:36PM +0800, Anand Jain wrote:


On 20/1/21 8:14 pm, David Sterba wrote:
On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
The read policy type latency routes the read IO based on the historical
average wait-time experienced by the read IOs through the individual
device. This patch obtains the historical read IO stats from the kernel
block layer and calculates its average.

This does not say how the stripe is selected using the gathered numbers.
I.e. what the criteria is, like minimum average time; "based on" is too
vague.



Could you please add the following to the changelog? Hope this will
suffice.

----------
This patch adds a new read policy, latency. This policy routes read
I/Os based on the device's average wait time for read requests.

'wait time' means the time from io submission to completion

 Yes, at the block layer.

The average is calculated by dividing the total wait time for read
requests by the total read I/Os processed by the device.

So this is based on numbers from the entire lifetime of the device?

 No, the kernel stats are in memory only, so they only cover the time
 since boot.
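 For reference, the read counters are also visible from user space in
 /sys/block/<dev>/stat (read IOs and read ticks, the latter in
 milliseconds), and the average is simply total read time divided by
 read IOs. A minimal user-space sketch of that arithmetic only, not the
 kernel patch itself; "sda" is just an example device, field layout per
 Documentation/block/stat.rst:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long rd_ios, rd_merges, rd_sectors, rd_ticks_ms;
            FILE *f = fopen("/sys/block/sda/stat", "r");

            if (!f)
                    return 1;
            if (fscanf(f, "%llu %llu %llu %llu",
                       &rd_ios, &rd_merges, &rd_sectors, &rd_ticks_ms) != 4) {
                    fclose(f);
                    return 1;
            }
            fclose(f);

            /* since-boot average: total read wait time / read IOs */
            if (rd_ios)
                    printf("avg read wait: %.3f ms over %llu read IOs\n",
                           (double)rd_ticks_ms / rd_ios, rd_ios);
            return 0;
    }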

 The
numbers are IMHO not a reliable source. If unrelated writes increase the
read wait time then the device will not be selected until the average
is lower than that of the other devices.

 I think it is fair, because the comparison is between the performance of

 1. disk-type-A   VS    disk-type-A

 OR

 2. disk-type-A    VS   disk-type-B

 In scenario #1 above, it does not matter which disk is chosen, as both
 of them provide the same performance (theoretically), which is the most
 common config.

 In scenario #2 the user can check the read I/O on the devices. If it
 is _not_ going to the theoretically best performing device, either a
 reboot or an iostat reset (which I think should exist) shall help. Or,
 if they can't reboot and an iostat reset is not available, switching to
 the read policy 'device' shall help until they can reboot, which is a
 better alternative than pid, which is unpredictable. Unfortunately,
 this switching is not automatic (more below).

 There are drawbacks to this policy. At any point in time a device may
 momentarily get too busy due to _external factors_ such as multiple
 partitions on the same device, multiple LUNs on the same HBA, or the
 IRQ being shared by the disk's HBA and a gigabit network card (which
 has higher IRQ priority), so that whenever the network is busy the I/O
 on the disk slows down (I had an opportunity to investigate such an
 issue before). The latency policy will switch to the better performing
 device at such a time. But when the theoretically better performing
 device is back to its normal speed, the policy won't know, unless that
 device gets some read I/O from some operation (for example, scrub).
 This scenario is more crucial for config type #2 (above).

 Also, there may be a better alternative to the average wait time (for
 example, another type of mean value?), which I think can be tweaked in
 the long term when we understand the usage of this policy better. If we
 account for the in-flight commands, there will be more switching in
 config type #1 (above). More switching leads to fewer I/O merges and
 more cache misses (at the DMA and storage level), leading to poorer
 performance. So switching back and forth between devices is not good
 either; staying on the same device helps until it performs worse than
 its mirrored device.


The average can only decrease after there are some fast reads, which is
not guaranteed to happen and there's no good estimate of how long it
could take to happen.

 True. Also, there isn't any kernel part_stat reset. Not sure if the
 block layer will entertain such a patch, but it is worth a try IMO.
 What do you think?

 However, even with a reset, it's not guaranteed that temporarily bad
 stats cannot happen again. Also, it's a bit uncertain how to know when
 the theoretically better performing device will be back to its good
 performance.


The tests we all probably do are on a fresh mkfs and with a small
workload but the mirror selection logic must work long term.


 I totally agree. So I am not yet recommending this policy for the
 default. But it does solve some of the problems very well.

The part_stat numbers could be used but must reflect the time factor,
i.e. it needs to be some rolling average or a sample collected over the
last N seconds.

 But I think the problem here is to know when the theoretically better
 performing device will be back to its good performance. For that
 purpose, the theoretically better performing device must be probed
 periodically, and that has a cost.
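 That said, if we do adopt a time-bound number as you suggest, I
 imagine keeping something like a per-device exponentially weighted
 moving average inside btrfs rather than reading the since-boot
 part_stat totals. A minimal sketch of the arithmetic only (hypothetical,
 names made up, not part of this patch); old samples fade out over time:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LAT_EWMA_SHIFT 3   /* a new sample gets 1/8 of the weight */

    struct read_latency {
            uint64_t avg_ns;   /* smoothed read completion time */
    };

    /* avg = avg * 7/8 + sample * 1/8 */
    static void latency_update(struct read_latency *lat, uint64_t sample_ns)
    {
            if (!lat->avg_ns)
                    lat->avg_ns = sample_ns;   /* first sample */
            else
                    lat->avg_ns = lat->avg_ns - (lat->avg_ns >> LAT_EWMA_SHIFT)
                                  + (sample_ns >> LAT_EWMA_SHIFT);
    }

    int main(void)
    {
            struct read_latency lat = { 0 };
            uint64_t samples[] = { 200000, 180000, 900000, 190000, 185000 };

            /* one slow read bumps the average, then it decays again */
            for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                    latency_update(&lat, samples[i]);
                    printf("avg after sample %zu: %llu ns\n", i,
                           (unsigned long long)lat.avg_ns);
            }
            return 0;
    }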


Bear in mind that this is only a heuristic and we don't need perfect
results, nor do we want to replace io scheduling, so the amount of
collected data and the logic should be straightforward.

 Yeah. If part_stat could provide stats only for the past N minutes or
 so, it would be simpler. While working on this patch I looked into the
 part_stat code; it is not straightforward.


This policy uses the kernel disk stats to calculate the average, so it
needs the kernel stats to be enabled.

What is needed to enable it? I see it's always compiled in, in
block/blk-core.c.


 It is enabled by default, but the user may disable part_stat
 collection at run time:

   echo 0 > /sys/block/sdx/queue/iostats


If the kernel stats are disabled, the policy falls back to stripe 0.
This policy can be set through the read_policy sysfs interface as shown
below.

      $ echo latency > /sys/fs/btrfs/<uuid>/read_policy
      $ cat /sys/fs/btrfs/<uuid>/read_policy
           pid [latency] device roundrobin

This policy won't persist across a reboot or a mount/unmount cycle as
of now.

Below are a few performance test results comparing the latency policy
with the pid policy.

raid1 fio read 500m

500m is a really small data size for such a measurement


 Please see below about this.

-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
latency     | 2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s

Namely when the device bandwidth is 4x higher. The data size should be
scaled up so the whole run takes at least 30 seconds if not a few
minutes.
Other missing information about the load is the number of threads and
whether it's buffered or direct IO.


 The cover letter has the fio command used, and the output from the
 guest VM is there. From it, I notice the total I/O performed was
 ~16.8G. I can run the scripts again; please share if you have any
 ideas for testing.

READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s (91.2MB/s-91.2MB/s), io=15.6GiB (16.8GB), run=183884-183884msec


raid10 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s


raid1c3 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s


raid1c4 fio read 500m
-----------------------------------------------------
dev types   | nvme+ssd  nvme+ssd   all-nvme  all-nvme
read type   | random    sequential random    sequential
------------+------------------------------------------
pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s


In the given fio I/O workload above, it is found that there are fewer
I/O merges with latency as compared to pid. So in the case of all
homogeneous devices, pid performs a little better.

Yeah switching the device in the middle of a contiguous range could slow
it down but as long as it's not "too much", then it's ok.


 Yep.

The pid selection is good for multi-threaded workloads but we also want
to make it work with single-threaded reads, like a simple 'cp'.

I tested this policy and with a 2G file 'cat file' utilizes only one
device, so this is no improvement over the pid policy.

 In the 'cat file' test case above, all the read IOs will go to a
 single stripe id. But that does not mean they will go to the same
 device, because stripe 0 of different chunks can land on different
 devices. As of now, our chunk allocation is based on the devices' free
 space. So the better thing to do is to have raid1 on disks of
 different sizes, for example 50G and 100G. Then it is guaranteed that
 stripe 0 will always be on the 100G disk, and it is fair to measure
 the pid policy.
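 (To illustrate why a single-threaded read sticks to one stripe index:
 the pid policy picks the mirror roughly as pid % num_copies, something
 like the toy example below; the raid1 copy count is just an example.)

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int num_copies = 2;                 /* raid1: two mirrors */
            int stripe = getpid() % num_copies; /* constant for one process */

            printf("this process would always read stripe index %d\n", stripe);
            return 0;
    }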

 And still, the pid policy may perform better, as reading from a single
 disk is not a bad idea; the read_policy type 'device' proved it.

 Every policy's performance depends on the workload, and so does the
 pid policy's. But on top of that, the pid policy is non-deterministic,
 which makes it hard to say how it will behave for a known workload.

A policy based on read latency makes sense but the current
implementation does not cover enough workloads.


 Yeah. The performance of any policy here (including pid and
 round-robin) is workload-dependent. IMHO it can't be one-size-fits-all;
 it is meant to be tuned.

Thanks, Anand

