On Sun, Oct 29 2023 at 12:17P -0400,
Yarden Maymon <[email protected]> wrote:
> Running random write fio benchmarks on dm-thin with mapped data
> there is 50% degradation when using O_SYNC.
> * dm-thin without O_SYNC - 438k iops
> * dm-thin with O_SYNC on mapped data - 204k iops
> * directly on the underlying disk with O_SYNC - 451k iops, showing the
> problem is not the disk.
>
> The data is mapped so the same results are expected with O_SYNC.
>
> Currently, all O_SYNC IOs are routed to a slower path (deferred).
> This action is taken early in the procedure, prior to assessing other
> ongoing IOs or verifying if the IO is already mapped.
>
> Remove the early test, and move O_SYNC to the regular data path.
> O_SYNC io to a mapped space, that does not conflict with other inflight
> will be remapped and routed to the faster path.
> All the other O_SYNC io's behavior is maintained (deferred).
>
> The O_SYNC IO will be deferred if :
>
> * It is not mapped - dm_thin_find_block will return -ENODATA, the cell
> is deferred.
>
> * There is an inflight to the same virtual key - bio_detain will
> add the io to a cell and defer it.
>
> build_virtual_key(tc->td, block, &key);
> if (bio_detain(tc->pool, &key, bio, &virt_cell))
> return DM_MAPIO_SUBMITTED;
>
> * There is an inflight to the same physical key - bio_detain will
> add the io to a cell and defer it.
>
> build_data_key(tc->td, result.block, &key);
> if (bio_detain(tc->pool, &key, bio, &data_cell)) {
> cell_defer_no_holder(tc, virt_cell);
> return DM_MAPIO_SUBMITTED;
> }
One of thinp's biggest reasons for deferring flushes is to coalesce
associated thin-pool metadata commits.
Your change undermines bio_triggers_commit() -- given that a flush
will only trigger a commit if routed through process_deferred_bios().
This raises doubts about the safety of your change (despite your
documented testing).
>
> -----------------------------------------------------
>
> Benchmark results :
>
> The benchmark was done on top of ubuntu's 6.2.0-1008 with commit
> 450e8dee51aa ("dm bufio: improve concurrent IO performance") backported.
>
> fio params: --bs=4k --direct=1 --iodepth=32 --numjobs=8m --time_based
> --runtime=5m.
> dm-thin chunksize is 128k and allocation/thin_pool_zero=0 is set.
> The results are in IOPs and represented as: avg_iops (max_iops).
>
> Performance test on the underlying nvme device for baseline:
> +-------------------+-----------------------+
> | randwrite | 446k (455k) |
> | randwrite sync | 451k (455k) |
> | randrw 50/50 | 227k/227k (300k/300k) |
> | randrw sync 50/50 | 227k/227k (300k/300k) |
> | randread | 773k (866k) |
> | randread sync | 773k (861k) |
> +-------------------+-----------------------+
>
> dm-thin blkdev with all data allocated (16GiB):
> +-------------------+-----------------------+-----------------------+
> | | Pre Patch | Post Patch |
> +-------------------+-----------------------+-----------------------+
> | randwrite | 438k (442k) | 450k (453k) |
> | randwrite sync | 204k (228k) | 450k (454k) |
> | randrw 50/50 | 224k/224k (236k/235k) | 225k/225k (234k/234k) |
> | randrw sync 50/50 | 191k/191k (199k/197k) | 225k/225k (235k/235k) |
> | randread | 650k (703k) | 661k (705k) |
> | randread sync | 659k (705k) | 661k (707k) |
> +-------------------+-----------------------+-----------------------+
> There's a notable enhancement in random write performance with sync
> compared to previous results. In the 50/50 sync test, there's also a
> boost in random read due to the availability of extra resources for
> reading. Furthermore, no other aspects appear to be impacted.
Not sure what you mean by "boost in random read due to the
availability of extra resources for reading." -- how does your change
make extra resources for reading?
> dm-thin blkdev without allocated data with capacity of 1.6TB (to
> increase the random chance of hitting a non allocated block):
> +-------------------+-------------------------+------------------------+
> | | Pre Patch | Post Patch |
> +-------------------+-------------------------+------------------------+
> | randwrite | 116k (253k) | 112k (240k) |
> | randwrite sync | 100k (121k) | 182k (266k) |
> | randrw 50/50 | 66.7k/66.7k (109k/109k) | 67k/67k (109k/109k) |
> | randrw sync 50/50 | 76.9k/76.8k (101k/101k) | 77.6k/77.6k (122k/122k)|
> | randread | 336k (349k) | 335k (352k) |
> | randread sync | 334k (351k) | 336k (348k) |
> +-------------------+-------------------------+------------------------+
> In this case, there isn't a marked difference, with the exception of
> random write sync, since the unmapped data path has stayed the same. The
> boost in random write sync performance can be explained from random IOs
> hitting the same space twice within the test (The second time they are
> already mapped).
>
> -----------------------------------------------------
>
> Tests:
> I have ran thin tests of https://github.com/jthornber/dmtest-python.
> I have ran xfstests on top of thin lvm https://github.com/kdave/xfstests
>
> I conducted a manual data integrity test :
> * Constructed a layout with nvme target -> dm-thin -> nvme device.
> * Using vdbench from an initiator host writing to this remote nvme
> device, using journal to a local drive.
> * Initiated a reboot on the media host.
> * Verified the data using vdbench once the reboot process finished.
>
> Signed-off-by: Yarden Maymon <[email protected]>
> ---
> drivers/md/dm-thin.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 07c7f9795b10..ecd429260bee 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -2743,7 +2743,7 @@ static int thin_bio_map(struct dm_target *ti, struct
> bio *bio)
> return DM_MAPIO_SUBMITTED;
> }
>
> - if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> + if (bio_op(bio) == REQ_OP_DISCARD) {
> thin_defer_bio_with_throttle(tc, bio);
> return DM_MAPIO_SUBMITTED;
> }
> --
> 2.25.1
>
Did you explore still deferring flush bios _but_ without imposing a
throttle?
Doing so would still punt to the workqueue to handle the flushes
(which could be a source of delay) but it avoids completely missing
thinp metadata commits.
Please revert your patch and try something like this instead:
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index f88646f9f81f..704e9bb75b0f 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2748,7 +2748,10 @@ static int thin_bio_map(struct dm_target *ti, struct bio
*bio)
}
if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
- thin_defer_bio_with_throttle(tc, bio);
+ if (bio->bi_opf & REQ_SYNC)
+ thin_defer_bio(tc, bio);
+ else
+ thin_defer_bio_with_throttle(tc, bio);
return DM_MAPIO_SUBMITTED;
}