Re: AIO / read stream heuristics adjustments for index prefetching

Andres Freund Fri, 03 Apr 2026 16:11:11 -0700

Hi,

There are a bunch of heuristics mentioned in the following proposed commit:


On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
> Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
>
> io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
> once the IO depth is a bit larger. That heuristic is important when doing
> buffered IO from the kernel page cache, to allow parallelizing of the memory
> copy, as otherwise io_method=io_uring would be a lot slower than
> io_method=worker in that case.
>
> An upcoming commit will make read_stream.c only increase the read-ahead
> distance if we needed to wait for IO to complete. If to-be-read data is in the
> kernel page cache, io_uring will synchronously execute IO, unless the IO is
> flagged as async.  Therefore the aforementioned change in read_stream.c
> heuristic would lead to a substantial performance regression with io_uring
> when data is in the page cache, as we would never reach a deep enough queue to
> actually trigger the existing heuristic.
>
> Parallelizing the copy from the page cache is mainly important when doing a
> lot of IO, which commonly is only possible when doing largely sequential IO.
>
> The reason we don't just mark all io_uring IOs as asynchronous is that the
> dispatch to a kernel thread has overhead. This overhead is mostly noticeable
> with small random IOs with a low queue depth, as in that case the gain from
> parallelizing the memory copy is small and the latency cost high.
>
> The facts from the two prior paragraphs show a way out: Use the size of the IO
> in addition to the depth of the queue to trigger asynchronous processing.
>
> One might think that just using the IO size might be enough, but
> experimentation has shown that not to be the case - with deep look-ahead
> distances being able to parallelize the memory copy is important even with
> smaller IOs.

> +/*
> + * io_uring executes IO in process context if possible. That's generally good,
> + * as it reduces context switching. When performing a lot of buffered IO that
> + * means that copying between page cache and userspace memory happens in the
> + * foreground, as it can't be offloaded to DMA hardware as is possible when
> + * using direct IO. When executing a lot of buffered IO this causes io_uring
> + * to be slower than worker mode, as worker mode parallelizes the
> + * copying. io_uring can be told to offload work to worker threads instead.
> + *
> + * If the IOs are small, we only benefit from forcing things into the
> + * background if there is a lot of IO, as otherwise the overhead from context
> + * switching is higher than the gain.
> + *
> + * If IOs are large, there is benefit from asynchronous processing at lower
> + * queue depths, as IO latency is less of a crucial factor and parallelizing
> + * memory copies is more important.  In addition, it is important to trigger
> + * asynchronous processing even at low queue depth, as with foreground
> + * processing we might never actually reach deep enough IO depths to trigger
> + * asynchronous processing, which in turn would deprive readahead control
> + * logic of information about whether a deeper look-ahead distance would be
> + * advantageous.
> + *
> + * We have done some basic benchmarking to validate the thresholds used, but
> + * it's quite plausible that there are better values.

Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.

Previously I'd just manually run fio with different options; I've now made it
a bit more systematic with the attached (only halfway hand-written) script.

I attached two different results, one when allowing access to multiple cores,
and one with a single core (simulating a very busy machine).

(nblocks is in multiples of 8KB)

Multi-core:

nblocks iod     async   bw_gib_s        lat_usec
1       1       0       4.2075  1.5802
1       1       1       1.0462  6.9652
1       2       0       4.1362  3.4533
1       2       1       1.9284  7.6040
1       4       0       4.0030  7.3720
1       4       1       4.2713  6.9086
1       8       0       4.1653  14.4072
1       8       1       4.3301  13.8365
1       16      0       4.1829  28.9216
1       16      1       4.3006  28.1261
1       32      0       4.0735  59.6232
1       32      1       4.3248  56.1614

I.e. at nblocks=1, there's pretty much no gain from async, and the latency
increases markedly at the low end and just about catches up at the high end.

Around an iodepth of 4 the loss from async is nonexistent or minimal.


2       1       0       5.7289  2.4261
2       1       1       1.8708  7.7466
2       2       0       5.7964  5.0144
2       2       1       3.3749  8.7417
2       4       0       5.8434  10.2023
2       4       1       7.9783  7.3977
2       8       0       5.8166  20.7226
2       8       1       8.2545  14.5431
2       16      0       5.8215  41.6613
2       16      1       8.2354  29.3879
2       32      0       5.6530  86.0286
2       32      1       8.3218  58.3826

With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited.  Latency already starts to be better at iodepth 4.


4       1       0       7.4131  3.8807
4       1       1       3.2133  9.1827
4       2       0       7.3150  8.0854
4       2       1       5.4983  10.8039
4       4       0       7.2784  16.5097
4       4       1       11.2717 10.5699
4       8       0       7.2873  33.2331
4       8       1       16.6299 14.4164
4       16      0       7.1606  67.8777
4       16      1       16.9794 28.4981
4       32      0       6.2954  154.6834
4       32      1       16.3686 59.3610

With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.


8       1       0       8.0403  7.3503
8       1       1       4.6038  12.7202
8       2       0       8.0052  14.9161
8       2       1       8.5176  13.9987
8       4       0       8.1519  29.6698
8       4       1       14.8211 16.1640
8       8       0       7.8525  61.8612
8       8       1       27.5860 17.4434
8       16      0       6.8887  141.3268
8       16      1       34.1307 28.3463
8       32      0       6.9031  282.2350
8       32      1       38.2430 50.7700

With nblocks=8, async is faster already at iodepth 2.
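
The crossover points called out for the tables above can be recomputed from
the tab-separated output with a few lines of Python. This is only a sketch:
crossover_iodepths is a name made up for illustration, and the embedded rows
are a subset copied from the multi-core results (iodepth 1, 2 and 4 only).

```python
import csv
import io

# Subset of the multi-core results (iodepth 1, 2, 4 only), tab-separated.
SAMPLE = """\
nblocks\tiod\tasync\tbw_gib_s\tlat_usec
1\t1\t0\t4.2075\t1.5802
1\t1\t1\t1.0462\t6.9652
1\t2\t0\t4.1362\t3.4533
1\t2\t1\t1.9284\t7.6040
1\t4\t0\t4.0030\t7.3720
1\t4\t1\t4.2713\t6.9086
2\t1\t0\t5.7289\t2.4261
2\t1\t1\t1.8708\t7.7466
2\t2\t0\t5.7964\t5.0144
2\t2\t1\t3.3749\t8.7417
2\t4\t0\t5.8434\t10.2023
2\t4\t1\t7.9783\t7.3977
4\t1\t0\t7.4131\t3.8807
4\t1\t1\t3.2133\t9.1827
4\t2\t0\t7.3150\t8.0854
4\t2\t1\t5.4983\t10.8039
4\t4\t0\t7.2784\t16.5097
4\t4\t1\t11.2717\t10.5699
8\t1\t0\t8.0403\t7.3503
8\t1\t1\t4.6038\t12.7202
8\t2\t0\t8.0052\t14.9161
8\t2\t1\t8.5176\t13.9987
8\t4\t0\t8.1519\t29.6698
8\t4\t1\t14.8211\t16.1640
"""


def crossover_iodepths(tsv_text):
    """Map each nblocks to the smallest iodepth where async bw >= sync bw.

    nblocks values for which async never catches up map to None.
    Assumes every (nblocks, iod) pair has both an async=0 and an async=1 row.
    """
    bw = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (int(row["nblocks"]), int(row["iod"]))
        bw.setdefault(key, {})[int(row["async"])] = float(row["bw_gib_s"])

    result = {}
    for nblocks, iod in sorted(bw):
        pair = bw[(nblocks, iod)]
        result.setdefault(nblocks, None)
        if result[nblocks] is None and pair[1] >= pair[0]:
            result[nblocks] = iod
    return result


if __name__ == "__main__":
    for nblocks, iod in crossover_iodepths(SAMPLE).items():
        print(f"nblocks={nblocks}: async catches up at iodepth {iod}")
```

On this subset the crossover is iodepth 4 for nblocks 1, 2 and 4, and iodepth 2
for nblocks=8. Feeding it the full attached output works the same way.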


64      1       0       9.1983  52.6768
64      1       1       8.1505  59.5486

128     1       0       7.5442  128.8704
128     1       1       7.3481  132.2355

Somewhere between nblocks=64 and 128, we reach the point where there's
basically no loss at iodepth 1.


This seems to validate setting IOSQE_ASYNC around a block size of >= 4 and a
queue depth of > 4. I guess it could make sense to reduce it from > 4 to >= 4
based on these numbers, but I don't think it matters terribly.
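
To see what moving from "> 4" to ">= 4" would change on the measured points,
here is a small Python sketch. The two predicates are a literal reading of the
sentence above, not the code from the patch, and all names are made up for
illustration. It also treats the fio iodepth as fixed, which ignores that the
depth in postgres ramps up dynamically.

```python
# (nblocks, iodepth) -> (sync GiB/s, async GiB/s), copied from the multi-core
# results above.
MEASURED = {
    (4, 2): (7.3150, 5.4983),
    (4, 4): (7.2784, 11.2717),
    (4, 8): (7.2873, 16.6299),
    (8, 2): (8.0052, 8.5176),
    (8, 4): (8.1519, 14.8211),
    (8, 8): (7.8525, 27.5860),
}


def rule_gt4(nblocks, iodepth):
    """Hypothetical cutoff: async for >= 4 blocks at queue depth > 4."""
    return nblocks >= 4 and iodepth > 4


def rule_ge4(nblocks, iodepth):
    """Hypothetical cutoff: async for >= 4 blocks at queue depth >= 4."""
    return nblocks >= 4 and iodepth >= 4


def bandwidth_under(rule):
    """Bandwidth each measured point would see if `rule` picked sync vs. async."""
    return {
        key: (async_bw if rule(*key) else sync_bw)
        for key, (sync_bw, async_bw) in MEASURED.items()
    }


if __name__ == "__main__":
    gt4 = bandwidth_under(rule_gt4)
    ge4 = bandwidth_under(rule_ge4)
    for nblocks, iod in sorted(MEASURED):
        print(f"nblocks={nblocks} iod={iod}: "
              f"'> 4' -> {gt4[(nblocks, iod)]:.2f} GiB/s, "
              f"'>= 4' -> {ge4[(nblocks, iod)]:.2f} GiB/s")
```

The two variants only differ at iodepth 4, where the ">= 4" variant picks the
faster async numbers for both nblocks=4 and nblocks=8.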



Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.

At iodepth 1 (worst case, a context switch for every IO):

nblocks iod     async   bw_gib_s        lat_usec
1       1       0       4.2324  1.5692
1       1       1       1.7883  3.9574
2.36x bw regression

2       1       0       5.7914  2.4004
2       1       1       2.9585  4.8417
1.96x bw regression

4       1       0       7.3171  3.9242
4       1       1       4.2450  6.8171
1.7x bw regression

8       1       0       8.1162  7.2674
8       1       1       5.7536  10.2948
1.4x bw regression

16      1       0       8.8559  13.5212
16      1       1       7.1163  16.8277
1.24x bw regression


But the IO depth would not stay at 1 in the case of postgres with the proposed
changes, it'd ramp up due to needing to wait for the kernel to complete those
IOs asynchronously.

Therefore, comparing that to a deeper IO depth:

nblocks iod     async   bw_gib_s        lat_usec
1       16      0       4.1094  29.4339
1       16      1       3.3922  35.7044
1.21x bw regression

2       16      0       5.8381  41.5402
2       16      1       4.8104  50.4571
1.21x bw regression

4       16      0       7.1204  68.2612
4       16      1       5.6479  86.0973
1.26x bw regression

8       16      0       7.0780  137.5520
8       16      1       6.1687  157.8805
1.14x bw regression

16      16      0       7.4523  261.4281
16      16      1       6.7192  290.0837
1.10x bw regression
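
The "bw regression" factors are simply sync bandwidth divided by async
bandwidth. A minimal Python sketch of that arithmetic follows; bw_regression is
a name made up for illustration, and the two sample rows are copied from the
single-core results.

```python
def bw_regression(sync_bw, async_bw):
    """Factor by which forcing async lowers bandwidth (> 1.0: async is slower)."""
    if async_bw <= 0:
        raise ValueError("async bandwidth must be positive")
    return sync_bw / async_bw


if __name__ == "__main__":
    # (nblocks, iodepth, sync GiB/s, async GiB/s) from the single-core results
    rows = [
        (1, 16, 4.1094, 3.3922),
        (4, 16, 7.1204, 5.6479),
    ]
    for nblocks, iod, sync_bw, async_bw in rows:
        factor = bw_regression(sync_bw, async_bw)
        print(f"nblocks={nblocks} iod={iod}: {factor:.2f}x bw regression")
```

For the two rows shown this prints 1.21x and 1.26x.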


This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.


I don't think there are very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single block IOs, but we
aren't doing that.  With the default io_combine_limit=16 and
effective_io_concurrency=16, we'd end up with a 1.10x regression in the extreme
case of only having a single core available (but that one fully!) and doing
nothing other than IO.

Seems ok to me.


I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).

Zen 4 Laptop:
nblocks iod     async   bw_gib_s        lat_usec
1       1       0       6.0989  1.1408
1       1       1       1.4477  5.1246
1       2       0       6.9600  2.0827
1       2       1       2.8750  5.1711
1       4       0       7.0283  4.2307
1       4       1       8.9174  3.3169

Surprisingly, there's a bigger difference between sync/async at iod=1, but it's
again similar around iod=4.


4       1       0       14.5638 1.9616
4       1       1       5.1245  5.8016
4       2       0       14.8867 3.9607
4       2       1       12.1841 4.8662
4       4       0       14.8678 8.0764
4       4       1       21.5077 5.5417

Similar.


16      1       0       21.0754 5.5891
16      1       1       12.6180 9.4753
16      2       0       20.2770 11.8353
16      2       1       24.3277 9.8172

At nblocks=16, async already starts to be faster at iod=2.



Greetings,

Andres Freund

#!/usr/bin/env python3
import argparse
import json
import subprocess
import sys


def run_fio(directory, nblocks, iodepth, force_async,
            size,
            runtime):
    bs = nblocks * 8 * 1024
    cmd = [
        "fio",
        f"--directory={directory}",
        f"--size={size}",
        "--name=read",
        "--invalidate=0",
        "--rw=read",
        "--direct=0",
        "--buffered=1",
        "--time_based=1",
        f"--runtime={runtime}",
        "--ioengine=io_uring",
        f"--iodepth={iodepth}",
        f"--force_async={force_async}",
        f"--bs={bs}",
        "--output-format=json",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def extract_metrics(data):
    """
    Extract bandwidth (GiB/s) and average latency (µs) from fio JSON.
    fio JSON reports bandwidth in KiB/s and mean latency in nanoseconds.
    """
    read_stats = data["jobs"][0]["read"]

    bw_kibs = read_stats["bw"]              # KiB/s
    bw_gibs = bw_kibs / 1024**2            # KiB/s → GiB/s

    lat_ns = read_stats["lat_ns"]["mean"]  # nanoseconds
    lat_usec = lat_ns / 1000.0             # → µs

    return bw_gibs, lat_usec


def main():
    parser = argparse.ArgumentParser(
        description="Run fio sequential read benchmarks across parameter combos."
    )
    parser.add_argument(
        "--directory", default="/srv/fio",
        help="fio test directory (default: /srv/fio)",
    )
    parser.add_argument(
        "--size", default="4GiB",
        help="fio file size (default: 4GiB)",
    )
    parser.add_argument(
        "--runtime", type=int, default=1,
        help="Seconds per test (default: 1)",
    )
    parser.add_argument(
        "--nblocks", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32, 64, 128],
        help="Block-count values to test (bs = nblocks * 8 KiB)",
    )
    parser.add_argument(
        "--iodepths", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32],
        help="iodepth values to test",
    )
    args = parser.parse_args()

    print("nblocks\tiod\tasync\tbw_gib_s\tlat_usec")

    for nblocks in args.nblocks:
        for iodepth in args.iodepths:
            for force_async in [0, 1]:
                try:
                    data = run_fio(
                        directory=args.directory,
                        nblocks=nblocks,
                        iodepth=iodepth,
                        force_async=force_async,
                        size=args.size,
                        runtime=args.runtime,
                    )
                    bw_gibs, lat_usec = extract_metrics(data)
                    print(f"{nblocks}\t{iodepth}\t{force_async}\t{bw_gibs:.4f}\t{lat_usec:.4f}")
                    sys.stdout.flush()
                except subprocess.CalledProcessError as exc:
                    print(f"# ERROR nblocks={nblocks} iod={iodepth} async={force_async}: {exc}",
                          file=sys.stderr)
                    sys.exit(1)


if __name__ == "__main__":
    main()

nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	4.2075	1.5802
1	1	1	1.0462	6.9652
1	2	0	4.1362	3.4533
1	2	1	1.9284	7.6040
1	4	0	4.0030	7.3720
1	4	1	4.2713	6.9086
1	8	0	4.1653	14.4072
1	8	1	4.3301	13.8365
1	16	0	4.1829	28.9216
1	16	1	4.3006	28.1261
1	32	0	4.0735	59.6232
1	32	1	4.3248	56.1614
2	1	0	5.7289	2.4261
2	1	1	1.8708	7.7466
2	2	0	5.7964	5.0144
2	2	1	3.3749	8.7417
2	4	0	5.8434	10.2023
2	4	1	7.9783	7.3977
2	8	0	5.8166	20.7226
2	8	1	8.2545	14.5431
2	16	0	5.8215	41.6613
2	16	1	8.2354	29.3879
2	32	0	5.6530	86.0286
2	32	1	8.3218	58.3826
4	1	0	7.4131	3.8807
4	1	1	3.2133	9.1827
4	2	0	7.3150	8.0854
4	2	1	5.4983	10.8039
4	4	0	7.2784	16.5097
4	4	1	11.2717	10.5699
4	8	0	7.2873	33.2331
4	8	1	16.6299	14.4164
4	16	0	7.1606	67.8777
4	16	1	16.9794	28.4981
4	32	0	6.2954	154.6834
4	32	1	16.3686	59.3610
8	1	0	8.0403	7.3503
8	1	1	4.6038	12.7202
8	2	0	8.0052	14.9161
8	2	1	8.5176	13.9987
8	4	0	8.1519	29.6698
8	4	1	14.8211	16.1640
8	8	0	7.8525	61.8612
8	8	1	27.5860	17.4434
8	16	0	6.8887	141.3268
8	16	1	34.1307	28.3463
8	32	0	6.9031	282.2350
8	32	1	38.2430	50.7700
16	1	0	8.8933	13.4650
16	1	1	6.2728	18.9827
16	2	0	8.9076	27.1436
16	2	1	11.7220	20.5300
16	4	0	8.7233	55.6745
16	4	1	18.7624	25.7505
16	8	0	7.4987	129.8232
16	8	1	34.5575	27.9686
16	16	0	7.3837	263.8333
16	16	1	41.0612	47.2465
16	32	0	7.3259	531.7938
16	32	1	39.9109	97.4890
32	1	0	9.2683	26.0485
32	1	1	7.6606	31.5179
32	2	0	9.0020	53.9343
32	2	1	13.1658	36.4441
32	4	0	7.5486	128.9595
32	4	1	22.4968	43.0776
32	8	0	7.6493	254.6875
32	8	1	39.0059	49.6149
32	16	0	7.5547	515.7824
32	16	1	41.4617	93.7583
32	32	0	6.6947	1162.7013
32	32	1	36.8926	211.2120
64	1	0	9.1983	52.6768
64	1	1	8.1505	59.5486
64	2	0	7.6384	127.3833
64	2	1	13.9811	69.1716
64	4	0	7.5413	258.3375
64	4	1	25.5018	76.2306
64	8	0	7.3730	528.3678
64	8	1	41.5893	93.4699
64	16	0	6.7242	1157.6784
64	16	1	35.1358	221.7170
64	32	0	5.2491	2958.7474
64	32	1	29.6408	526.3284
128	1	0	7.5442	128.8704
128	1	1	7.3481	132.2355
128	2	0	7.5959	256.3192
128	2	1	14.3860	135.3077
128	4	0	7.5891	513.4345
128	4	1	26.2082	148.3515
128	8	0	6.7218	1158.2197
128	8	1	39.5513	196.9944
128	16	0	5.1950	2990.2587
128	16	1	28.7749	542.1595
128	32	0	4.8389	6388.4904
128	32	1	27.5857	1131.2934

nblocks	iod	async	bw_gib_s	lat_usec
1	1	0	4.2324	1.5692
1	1	1	1.7883	3.9574
1	2	0	4.0756	3.5073
1	2	1	2.0574	7.1336
1	4	0	4.0813	7.2245
1	4	1	2.5805	11.5444
1	8	0	4.1485	14.4645
1	8	1	3.1191	19.3035
1	16	0	4.1094	29.4339
1	16	1	3.3922	35.7044
1	32	0	4.1652	58.3173
1	32	1	3.5813	67.8628
2	1	0	5.7914	2.4004
2	1	1	2.9585	4.8417
2	2	0	5.8205	5.0033
2	2	1	3.2484	9.0866
2	4	0	5.8692	10.1507
2	4	1	4.1805	14.3272
2	8	0	5.8241	20.7047
2	8	1	4.5100	26.7865
2	16	0	5.8381	41.5402
2	16	1	4.8104	50.4571
2	32	0	5.7680	84.3214
2	32	1	4.8923	99.4498
4	1	0	7.3171	3.9242
4	1	1	4.2450	6.8171
4	2	0	7.3149	8.0876
4	2	1	4.6114	12.9498
4	4	0	7.3564	16.3417
4	4	1	5.2204	23.0800
4	8	0	7.3753	32.8332
4	8	1	5.5436	43.7378
4	16	0	7.1204	68.2612
4	16	1	5.6479	86.0973
4	32	0	6.2542	155.6801
4	32	1	5.4395	179.0695
8	1	0	8.1162	7.2674
8	1	1	5.7536	10.2948
8	2	0	8.1180	14.7826
8	2	1	5.7472	20.9051
8	4	0	8.0499	30.0124
8	4	1	6.2692	38.6020
8	8	0	7.9290	61.2775
8	8	1	6.3000	77.1385
8	16	0	7.0780	137.5520
8	16	1	6.1687	157.8805
8	32	0	6.9722	279.4301
8	32	1	6.2175	313.5523
16	1	0	8.8559	13.5212
16	1	1	7.1163	16.8277
16	2	0	8.8395	27.3402
16	2	1	7.1646	33.7138
16	4	0	8.6280	56.2576
16	4	1	6.8501	70.9189
16	8	0	7.5552	128.8521
16	8	1	6.5925	147.6890
16	16	0	7.4523	261.4281
16	16	1	6.7192	290.0837
16	32	0	7.3669	528.8536
16	32	1	6.7170	580.7891
32	1	0	9.2169	26.2060
32	1	1	8.1352	29.6627
32	2	0	8.9783	54.0723
32	2	1	7.4881	64.8356
32	4	0	7.7601	125.4378
32	4	1	7.0137	138.7156
32	8	0	7.7013	252.9703
32	8	1	7.0259	277.4011
32	16	0	7.6469	509.4715
32	16	1	7.0901	550.0196
32	32	0	6.7442	1154.2708
32	32	1	6.4801	1204.9072
64	1	0	9.1398	53.0560
64	1	1	8.4876	57.0746
64	2	0	7.7871	124.9611
64	2	1	7.2915	133.3649
64	4	0	7.7413	251.6239
64	4	1	7.2876	267.2993
64	8	0	7.7091	505.3741
64	8	1	7.1725	543.5188
64	16	0	6.8242	1140.7509
64	16	1	6.6159	1179.8322
64	32	0	5.1921	2991.8800
64	32	1	5.3160	2934.9295
128	1	0	7.7871	124.8521
128	1	1	7.4671	130.0443
128	2	0	7.6681	253.8789
128	2	1	7.3989	263.0230
128	4	0	7.6174	511.4664
128	4	1	7.3589	529.6611
128	8	0	6.8233	1141.0293
128	8	1	6.6625	1170.2110
128	16	0	5.1618	3009.7684
128	16	1	5.4797	2848.2109
128	32	0	4.8204	6413.3049
128	32	1	4.9751	6274.0151
