Hi,

There are a bunch of heuristics mentioned in the following proposed commit:

On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
> Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
>
> io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
> once the IO depth is a bit larger. That heuristic is important when doing
> buffered IO from the kernel page cache, to allow parallelizing of the memory
> copy, as otherwise io_method=io_uring would be a lot slower than
> io_method=worker in that case.
>
> An upcoming commit will make read_stream.c only increase the read-ahead
> distance if we needed to wait for IO to complete. If to-be-read data is in the
> kernel page cache, io_uring will synchronously execute IO, unless the IO is
> flagged as async.  Therefore the aforementioned change in read_stream.c
> heuristic would lead to a substantial performance regression with io_uring
> when data is in the page cache, as we would never reach a deep enough queue to
> actually trigger the existing heuristic.
>
> Parallelizing the copy from the page cache is mainly important when doing a
> lot of IO, which commonly is only possible when doing largely sequential IO.
>
> The reason we don't just mark all io_uring IOs as asynchronous is that the
> dispatch to a kernel thread has overhead. This overhead is mostly noticeable
> with small random IOs with a low queue depth, as in that case the gain from
> parallelizing the memory copy is small and the latency cost high.
>
> The facts from the two prior paragraphs show a way out: Use the size of the IO
> in addition to the depth of the queue to trigger asynchronous processing.
>
> One might think that just using the IO size might be enough, but
> experimentation has shown that not to be the case - with deep look-ahead
> distances being able to parallelize the memory copy is important even with
> smaller IOs.

> +/*
> + * io_uring executes IO in process context if possible. That's generally good,
> + * as it reduces context switching. When performing a lot of buffered IO that
> + * means that copying between page cache and userspace memory happens in the
> + * foreground, as it can't be offloaded to DMA hardware as is possible when
> + * using direct IO. When executing a lot of buffered IO this causes io_uring
> + * to be slower than worker mode, as worker mode parallelizes the
> + * copying. io_uring can be told to offload work to worker threads instead.
> + *
> + * If the IOs are small, we only benefit from forcing things into the
> + * background if there is a lot of IO, as otherwise the overhead from context
> + * switching is higher than the gain.
> + *
> + * If IOs are large, there is benefit from asynchronous processing at lower
> + * queue depths, as IO latency is less of a crucial factor and parallelizing
> + * memory copies is more important.  In addition, it is important to trigger
> + * asynchronous processing even at low queue depth, as with foreground
> + * processing we might never actually reach deep enough IO depths to trigger
> + * asynchronous processing, which in turn would deprive readahead control
> + * logic of information about whether a deeper look-ahead distance would be
> + * advantageous.
> + *
> + * We have done some basic benchmarking to validate the thresholds used, but
> + * it's quite plausible that there are better values.

Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.

Previously I'd just manually run fio with different options; I've now made it a
bit more systematic with the attached (only halfway hand-written) script.

I attached two different sets of results, one when allowing access to multiple
cores, and one with a single core (simulating a very busy machine).
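
The message does not say how the single-core run was restricted. One common way
on Linux, and purely an assumption here, is to prefix the fio command with
`taskset -c <cpu-list>`. A minimal sketch (`pin_to_cpus` is a hypothetical
helper, not part of the attached script):

```python
import shlex


def pin_to_cpus(cmd, cpus):
    """Return cmd prefixed with `taskset -c <cpu-list>` (hypothetical helper)."""
    cpu_list = ",".join(str(c) for c in cpus)
    return ["taskset", "-c", cpu_list, *cmd]


# Example: restrict an fio invocation to CPU 0 only.
fio_cmd = ["fio", "--name=read", "--ioengine=io_uring", "--force_async=1"]
print(shlex.join(pin_to_cpus(fio_cmd, [0])))
```

The helper only builds the argument list, so it could be applied to the `cmd`
list in `run_fio()` before calling `subprocess.run()`.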

(nblocks is in multiples of 8KB)

Multi-core:

nblocks iod     async   bw_gib_s        lat_usec
1       1       0       4.2075  1.5802
1       1       1       1.0462  6.9652
1       2       0       4.1362  3.4533
1       2       1       1.9284  7.6040
1       4       0       4.0030  7.3720
1       4       1       4.2713  6.9086
1       8       0       4.1653  14.4072
1       8       1       4.3301  13.8365
1       16      0       4.1829  28.9216
1       16      1       4.3006  28.1261
1       32      0       4.0735  59.6232
1       32      1       4.3248  56.1614

I.e. at nblocks=1, there's pretty much no gain from async, and the latency
increases markedly at the low end and just about catches up at the high end.

Around an iodepth of 4 the loss from async is nonexistent or minimal.


2       1       0       5.7289  2.4261
2       1       1       1.8708  7.7466
2       2       0       5.7964  5.0144
2       2       1       3.3749  8.7417
2       4       0       5.8434  10.2023
2       4       1       7.9783  7.3977
2       8       0       5.8166  20.7226
2       8       1       8.2545  14.5431
2       16      0       5.8215  41.6613
2       16      1       8.2354  29.3879
2       32      0       5.6530  86.0286
2       32      1       8.3218  58.3826

With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited.  Latency already starts to be better at iodepth 4.


4       1       0       7.4131  3.8807
4       1       1       3.2133  9.1827
4       2       0       7.3150  8.0854
4       2       1       5.4983  10.8039
4       4       0       7.2784  16.5097
4       4       1       11.2717 10.5699
4       8       0       7.2873  33.2331
4       8       1       16.6299 14.4164
4       16      0       7.1606  67.8777
4       16      1       16.9794 28.4981
4       32      0       6.2954  154.6834
4       32      1       16.3686 59.3610

With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.


8       1       0       8.0403  7.3503
8       1       1       4.6038  12.7202
8       2       0       8.0052  14.9161
8       2       1       8.5176  13.9987
8       4       0       8.1519  29.6698
8       4       1       14.8211 16.1640
8       8       0       7.8525  61.8612
8       8       1       27.5860 17.4434
8       16      0       6.8887  141.3268
8       16      1       34.1307 28.3463
8       32      0       6.9031  282.2350
8       32      1       38.2430 50.7700

With nblocks=8, async is faster already at iodepth 2.


64      1       0       9.1983  52.6768
64      1       1       8.1505  59.5486

128     1       0       7.5442  128.8704
128     1       1       7.3481  132.2355

Somewhere between nblocks=64 and 128, we reach the point where there's basically
no loss at iodepth 1.
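
To make the per-size observations above easier to check, here is a small
sketch that looks up the smallest measured iodepth at which async bandwidth
exceeds sync bandwidth. The numbers are copied from the multi-core tables
above (iodepth 1 to 4 only); `crossover_iodepth` is a name made up for this
illustration.

```python
# (nblocks, iodepth) -> (sync_bw, async_bw) in GiB/s, from the multi-core tables
MULTI_CORE_BW = {
    (1, 1): (4.2075, 1.0462), (1, 2): (4.1362, 1.9284), (1, 4): (4.0030, 4.2713),
    (2, 1): (5.7289, 1.8708), (2, 2): (5.7964, 3.3749), (2, 4): (5.8434, 7.9783),
    (4, 1): (7.4131, 3.2133), (4, 2): (7.3150, 5.4983), (4, 4): (7.2784, 11.2717),
    (8, 1): (8.0403, 4.6038), (8, 2): (8.0052, 8.5176), (8, 4): (8.1519, 14.8211),
}


def crossover_iodepth(results, nblocks):
    """Smallest measured iodepth where async beats sync for nblocks, else None."""
    for (nb, iod), (sync_bw, async_bw) in sorted(results.items()):
        if nb == nblocks and async_bw > sync_bw:
            return iod
    return None


for nb in (1, 2, 4, 8):
    print(f"nblocks={nb}: async faster from iodepth {crossover_iodepth(MULTI_CORE_BW, nb)}")
```

With these rows the crossover is at iodepth 4 for nblocks 1, 2 and 4, and at
iodepth 2 for nblocks=8; the size of the gain at the crossover differs a lot
between block sizes, which this lookup does not show.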


This seems to validate setting IOSQE_ASYNC around a block size of >= 4 and a
queue depth of > 4. I guess it could make sense to reduce it from > 4 to >= 4
based on these numbers, but I don't think it matters terribly.
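
As an illustration only, the cutoff described in the previous paragraph can be
written as a predicate. This is a literal reading of that one sentence, not
the code from the patch, which per the commit message also keeps the existing
depth-based trigger for smaller IOs. `wants_async` and its parameters are
hypothetical names.

```python
def wants_async(nblocks, queue_depth, min_blocks=4, min_depth=4, inclusive=False):
    """Literal reading of the cutoff: IO of >= min_blocks blocks, depth > min_depth.

    inclusive=True models the ">= 4" variant floated in the paragraph above.
    """
    deep_enough = queue_depth >= min_depth if inclusive else queue_depth > min_depth
    return nblocks >= min_blocks and deep_enough


# A few multi-core rows from the tables above: (nblocks, iodepth, sync_bw, async_bw)
ROWS = [
    (4, 2, 7.3150, 5.4983),
    (4, 4, 7.2784, 11.2717),
    (4, 8, 7.2873, 16.6299),
    (8, 8, 7.8525, 27.5860),
]

for nb, iod, sync_bw, async_bw in ROWS:
    print(nb, iod,
          "strict:", wants_async(nb, iod),
          "inclusive:", wants_async(nb, iod, inclusive=True),
          "async faster:", async_bw > sync_bw)
```

The nblocks=4, iodepth=4 row is the one where the two variants disagree, and
async was faster in that measurement.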



Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.

At iodepth 1 (worst case, a context switch for every IO):

nblocks iod     async   bw_gib_s        lat_usec
1       1       0       4.2324  1.5692
1       1       1       1.7883  3.9574
2.36x bw regression

2       1       0       5.7914  2.4004
2       1       1       2.9585  4.8417
1.96x bw regression

4       1       0       7.3171  3.9242
4       1       1       4.2450  6.8171
1.7x bw regression

8       1       0       8.1162  7.2674
8       1       1       5.7536  10.2948
1.4x bw regression

16      1       0       8.8559  13.5212
16      1       1       7.1163  16.8277
1.24x bw regression


But the IO depth would not stay at 1 in the case of postgres with the proposed
changes, it'd ramp up due to needing to wait for the kernel to complete those
IOs asynchronously.

Therefore I'm comparing that to a deeper IO depth:

nblocks iod     async   bw_gib_s        lat_usec
1       16      0       4.1094  29.4339
1       16      1       3.3922  35.7044
1.21x bw regression

2       16      0       5.8381  41.5402
2       16      1       4.8104  50.4571
1.21x bw regression

4       16      0       7.1204  68.2612
4       16      1       5.6479  86.0973
1.26x bw regression

8       16      0       7.0780  137.5520
8       16      1       6.1687  157.8805
1.14x bw regression

16      16      0       7.4523  261.4281
16      16      1       6.7192  290.0837
1.10x bw regression


This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.
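
The "bw regression" figures above appear to be the sync bandwidth divided by
the async bandwidth for the same nblocks/iodepth pair. A minimal sketch of
that arithmetic, using a few rows copied from the single-core tables
(`bw_regression` is a name made up here; the figures quoted in the text may
differ in the last digit because of rounding):

```python
def bw_regression(sync_bw, async_bw):
    """Factor by which bandwidth drops when forcing async (values > 1 are a loss)."""
    return sync_bw / async_bw


# (nblocks, iodepth) -> (sync_bw, async_bw) in GiB/s, from the single-core tables
ONE_CPU_BW = {
    (1, 1): (4.2324, 1.7883),
    (1, 16): (4.1094, 3.3922),
    (16, 16): (7.4523, 6.7192),
}

for (nblocks, iodepth), (sync_bw, async_bw) in sorted(ONE_CPU_BW.items()):
    factor = bw_regression(sync_bw, async_bw)
    print(f"nblocks={nblocks} iod={iodepth}: {factor:.2f}x bw regression")
```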


I don't think there are very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single block IOs, but we
aren't.  With the default io_combine_limit=16 and effective_io_concurrency=16,
we'd end up with a 1.10x regression in the extreme case of only having a single
core available (but that one fully!) and doing nothing other than IO.

Seems ok to me.


I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).

Zen 4 Laptop:
nblocks iod     async   bw_gib_s        lat_usec
1       1       0       6.0989  1.1408
1       1       1       1.4477  5.1246
1       2       0       6.9600  2.0827
1       2       1       2.8750  5.1711
1       4       0       7.0283  4.2307
1       4       1       8.9174  3.3169

Surprisingly bigger difference between sync/async at iod=1, but it's again
similar around iod=4.


4       1       0       14.5638 1.9616
4       1       1       5.1245  5.8016
4       2       0       14.8867 3.9607
4       2       1       12.1841 4.8662
4       4       0       14.8678 8.0764
4       4       1       21.5077 5.5417

Similar.


16      1       0       21.0754 5.5891
16      1       1       12.6180 9.4753
16      2       0       20.2770 11.8353
16      2       1       24.3277 9.8172

At nblocks=16, async already starts to be faster at iod=2.



Greetings,

Andres Freund
#!/usr/bin/env python3
import argparse
import json
import subprocess
import sys


def run_fio(directory, nblocks, iodepth, force_async,
            size,
            runtime):
    bs = nblocks * 8 * 1024
    cmd = [
        "fio",
        f"--directory={directory}",
        f"--size={size}",
        "--name=read",
        "--invalidate=0",
        "--rw=read",
        "--direct=0",
        "--buffered=1",
        "--time_based=1",
        f"--runtime={runtime}",
        "--ioengine=io_uring",
        f"--iodepth={iodepth}",
        f"--force_async={force_async}",
        f"--bs={bs}",
        "--output-format=json",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def extract_metrics(data):
    """
    Extract bandwidth (GiB/s) and average latency (µs) from fio JSON.
    fio JSON reports bandwidth in KiB/s and latency in nanoseconds.
    """
    read_stats = data["jobs"][0]["read"]

    bw_kibs = read_stats["bw"]              # KiB/s
    bw_gibs = bw_kibs / 1024**2            # KiB/s → GiB/s

    lat_ns = read_stats["lat_ns"]["mean"]  # nanoseconds
    lat_usec = lat_ns / 1000.0             # → µs

    return bw_gibs, lat_usec


def main():
    parser = argparse.ArgumentParser(
        description="Run fio sequential read benchmarks across parameter combos."
    )
    parser.add_argument(
        "--directory", default="/srv/fio",
        help="fio test directory (default: /srv/fio)",
    )
    parser.add_argument(
        "--size", default="4GiB",
        help="fio file size (default: 4GiB)",
    )
    parser.add_argument(
        "--runtime", type=int, default=1,
        help="Seconds per test (default: 1)",
    )
    parser.add_argument(
        "--nblocks", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32, 64, 128],
        help="Block-count values to test (bs = nblocks * 8 KiB)",
    )
    parser.add_argument(
        "--iodepths", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32],
        help="iodepth values to test",
    )
    args = parser.parse_args()

    print("nblocks\tiod\tasync\tbw_gib_s\tlat_usec")

    for nblocks in args.nblocks:
        for iodepth in args.iodepths:
            for force_async in [0, 1]:
                try:
                    data = run_fio(
                        directory=args.directory,
                        nblocks=nblocks,
                        iodepth=iodepth,
                        force_async=force_async,
                        size=args.size,
                        runtime=args.runtime,
                    )
                    bw_gibs, lat_usec = extract_metrics(data)
                    print(f"{nblocks}\t{iodepth}\t{force_async}\t{bw_gibs:.4f}\t{lat_usec:.4f}")
                    sys.stdout.flush()
                except subprocess.CalledProcessError as exc:
                    print(f"# ERROR nblocks={nblocks} iod={iodepth} async={force_async}: {exc}",
                          file=sys.stderr)
                    sys.exit(1)


if __name__ == "__main__":
    main()
nblocks iod     async   bw_gib_s        lat_usec
1       1       0       4.2075  1.5802
1       1       1       1.0462  6.9652
1       2       0       4.1362  3.4533
1       2       1       1.9284  7.6040
1       4       0       4.0030  7.3720
1       4       1       4.2713  6.9086
1       8       0       4.1653  14.4072
1       8       1       4.3301  13.8365
1       16      0       4.1829  28.9216
1       16      1       4.3006  28.1261
1       32      0       4.0735  59.6232
1       32      1       4.3248  56.1614
2       1       0       5.7289  2.4261
2       1       1       1.8708  7.7466
2       2       0       5.7964  5.0144
2       2       1       3.3749  8.7417
2       4       0       5.8434  10.2023
2       4       1       7.9783  7.3977
2       8       0       5.8166  20.7226
2       8       1       8.2545  14.5431
2       16      0       5.8215  41.6613
2       16      1       8.2354  29.3879
2       32      0       5.6530  86.0286
2       32      1       8.3218  58.3826
4       1       0       7.4131  3.8807
4       1       1       3.2133  9.1827
4       2       0       7.3150  8.0854
4       2       1       5.4983  10.8039
4       4       0       7.2784  16.5097
4       4       1       11.2717 10.5699
4       8       0       7.2873  33.2331
4       8       1       16.6299 14.4164
4       16      0       7.1606  67.8777
4       16      1       16.9794 28.4981
4       32      0       6.2954  154.6834
4       32      1       16.3686 59.3610
8       1       0       8.0403  7.3503
8       1       1       4.6038  12.7202
8       2       0       8.0052  14.9161
8       2       1       8.5176  13.9987
8       4       0       8.1519  29.6698
8       4       1       14.8211 16.1640
8       8       0       7.8525  61.8612
8       8       1       27.5860 17.4434
8       16      0       6.8887  141.3268
8       16      1       34.1307 28.3463
8       32      0       6.9031  282.2350
8       32      1       38.2430 50.7700
16      1       0       8.8933  13.4650
16      1       1       6.2728  18.9827
16      2       0       8.9076  27.1436
16      2       1       11.7220 20.5300
16      4       0       8.7233  55.6745
16      4       1       18.7624 25.7505
16      8       0       7.4987  129.8232
16      8       1       34.5575 27.9686
16      16      0       7.3837  263.8333
16      16      1       41.0612 47.2465
16      32      0       7.3259  531.7938
16      32      1       39.9109 97.4890
32      1       0       9.2683  26.0485
32      1       1       7.6606  31.5179
32      2       0       9.0020  53.9343
32      2       1       13.1658 36.4441
32      4       0       7.5486  128.9595
32      4       1       22.4968 43.0776
32      8       0       7.6493  254.6875
32      8       1       39.0059 49.6149
32      16      0       7.5547  515.7824
32      16      1       41.4617 93.7583
32      32      0       6.6947  1162.7013
32      32      1       36.8926 211.2120
64      1       0       9.1983  52.6768
64      1       1       8.1505  59.5486
64      2       0       7.6384  127.3833
64      2       1       13.9811 69.1716
64      4       0       7.5413  258.3375
64      4       1       25.5018 76.2306
64      8       0       7.3730  528.3678
64      8       1       41.5893 93.4699
64      16      0       6.7242  1157.6784
64      16      1       35.1358 221.7170
64      32      0       5.2491  2958.7474
64      32      1       29.6408 526.3284
128     1       0       7.5442  128.8704
128     1       1       7.3481  132.2355
128     2       0       7.5959  256.3192
128     2       1       14.3860 135.3077
128     4       0       7.5891  513.4345
128     4       1       26.2082 148.3515
128     8       0       6.7218  1158.2197
128     8       1       39.5513 196.9944
128     16      0       5.1950  2990.2587
128     16      1       28.7749 542.1595
128     32      0       4.8389  6388.4904
128     32      1       27.5857 1131.2934
nblocks iod     async   bw_gib_s        lat_usec
1       1       0       4.2324  1.5692
1       1       1       1.7883  3.9574
1       2       0       4.0756  3.5073
1       2       1       2.0574  7.1336
1       4       0       4.0813  7.2245
1       4       1       2.5805  11.5444
1       8       0       4.1485  14.4645
1       8       1       3.1191  19.3035
1       16      0       4.1094  29.4339
1       16      1       3.3922  35.7044
1       32      0       4.1652  58.3173
1       32      1       3.5813  67.8628
2       1       0       5.7914  2.4004
2       1       1       2.9585  4.8417
2       2       0       5.8205  5.0033
2       2       1       3.2484  9.0866
2       4       0       5.8692  10.1507
2       4       1       4.1805  14.3272
2       8       0       5.8241  20.7047
2       8       1       4.5100  26.7865
2       16      0       5.8381  41.5402
2       16      1       4.8104  50.4571
2       32      0       5.7680  84.3214
2       32      1       4.8923  99.4498
4       1       0       7.3171  3.9242
4       1       1       4.2450  6.8171
4       2       0       7.3149  8.0876
4       2       1       4.6114  12.9498
4       4       0       7.3564  16.3417
4       4       1       5.2204  23.0800
4       8       0       7.3753  32.8332
4       8       1       5.5436  43.7378
4       16      0       7.1204  68.2612
4       16      1       5.6479  86.0973
4       32      0       6.2542  155.6801
4       32      1       5.4395  179.0695
8       1       0       8.1162  7.2674
8       1       1       5.7536  10.2948
8       2       0       8.1180  14.7826
8       2       1       5.7472  20.9051
8       4       0       8.0499  30.0124
8       4       1       6.2692  38.6020
8       8       0       7.9290  61.2775
8       8       1       6.3000  77.1385
8       16      0       7.0780  137.5520
8       16      1       6.1687  157.8805
8       32      0       6.9722  279.4301
8       32      1       6.2175  313.5523
16      1       0       8.8559  13.5212
16      1       1       7.1163  16.8277
16      2       0       8.8395  27.3402
16      2       1       7.1646  33.7138
16      4       0       8.6280  56.2576
16      4       1       6.8501  70.9189
16      8       0       7.5552  128.8521
16      8       1       6.5925  147.6890
16      16      0       7.4523  261.4281
16      16      1       6.7192  290.0837
16      32      0       7.3669  528.8536
16      32      1       6.7170  580.7891
32      1       0       9.2169  26.2060
32      1       1       8.1352  29.6627
32      2       0       8.9783  54.0723
32      2       1       7.4881  64.8356
32      4       0       7.7601  125.4378
32      4       1       7.0137  138.7156
32      8       0       7.7013  252.9703
32      8       1       7.0259  277.4011
32      16      0       7.6469  509.4715
32      16      1       7.0901  550.0196
32      32      0       6.7442  1154.2708
32      32      1       6.4801  1204.9072
64      1       0       9.1398  53.0560
64      1       1       8.4876  57.0746
64      2       0       7.7871  124.9611
64      2       1       7.2915  133.3649
64      4       0       7.7413  251.6239
64      4       1       7.2876  267.2993
64      8       0       7.7091  505.3741
64      8       1       7.1725  543.5188
64      16      0       6.8242  1140.7509
64      16      1       6.6159  1179.8322
64      32      0       5.1921  2991.8800
64      32      1       5.3160  2934.9295
128     1       0       7.7871  124.8521
128     1       1       7.4671  130.0443
128     2       0       7.6681  253.8789
128     2       1       7.3989  263.0230
128     4       0       7.6174  511.4664
128     4       1       7.3589  529.6611
128     8       0       6.8233  1141.0293
128     8       1       6.6625  1170.2110
128     16      0       5.1618  3009.7684
128     16      1       5.4797  2848.2109
128     32      0       4.8204  6413.3049
128     32      1       4.9751  6274.0151
