On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]> wrote:
> Hi,
>
> I've decided to run a couple tests, trying to reproduce some of the
> behaviors described in your (Felipe's) messages.
>
Thank you, I will look into this data later. I am impressed by the number
of I/O workers you used; my tests typically ran with 3.
> I'm not trying to redo the tests exactly, because (a) I don't have a M1
> machine, and (b) there's not enough details about the hardware and
> configuration to actually redo it properly.
>
Well, I was running on an M1 because that is what I have in front of me,
but I know that any serious database will run on Linux.
> I've focused on quantifying the impact of a couple things mentioned in
> the previous message:
I will have a look into this later and compute the effect size.
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> There are literally only 4 cases where prefetching does worse than
> master, and those are for random data with distance limit 1. I claim
> this is irrelevant, because it literally disables prefetching while
> still paying the full cost (all 4 are for io_method=worker, where the
> signal overhead can be high, so it's not a surprise).
>
I agree with your claim. The idea of the distance limit was to isolate the
AIO overhead from the benefit of prefetching: I was seeing very similar
results, but once I controlled the distance the prefetch benefit became
visible. The gradation would also show whether the curve has a U shape, or
whether the larger the distance, the better the performance.
> It's a bit like buying a race horse, break its leg
> and then complain it's not running very fast
>
😂
> The overhead of the instrumentation seems relatively small, probably
> within 5% or so. That's a bit less than I expected, but I still don't
> understand what this is meant to tell us. It's measuring wall-time, and
> it's no surprise that in an I/O-bound workload most of the time is spent
> in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
> But it does not give any indication *why*.
>
Well, I was hoping to be able to create a self-balancing mechanism in
read_stream_next_buffer:

    /* Do we have to wait for an associated I/O first? */
    if (stream->ios_in_progress > 0 &&
        stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
    {
        /* prefetch and increase the distance while we wait here */
        WaitReadBuffers(&stream->ios[io_index].op);
        ...
    }
    ...

    /* this call could be removed if we prefetched earlier */
    read_stream_look_ahead(stream);
The same principle guided the "Don't wait for already in-progress IO"
patch. Here we should prioritise increasing the distance, and if that is
not possible (maybe we have consumed all the buffers), we could take the
opportunity to yield.
>
> multi-client test (multi-client.tgz)
> ------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * io_workers = (12 | 32)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> * number of parallel workers (1, 2, 4, 8)
>
Are parallel workers here clients issuing queries?
> This all seems perfectly fine to me. The bad behavior would be if the
> prefetching gets slower than master, because that would be a regression
> affecting users. But that happens only in 4 cells in the table.
And in this case we have to consider the other extreme, and run queries
where prefetching is not expected to help. In this sense I agree with Peter
that the yielding logic is important. We may be limiting the potential of
prefetching in some cases, but excessive reads are the highest risk in my
opinion.

You may know better than me, but in the workloads I have seen or worked
with, it is typically a high number of small queries, not these huge scans.
Large queries are rare, and when they come to our attention it is because
they used too much memory and started creating temporary files.
> (But I'm speculating, I haven't investigated this in detail yet.)
>
Fair enough.
> Moreover, io_uring does not have this issue. Which is another indication
> it's something about the signal overhead.
>
That is interesting.
> In any case, these results clearly show prefetching can be a huge
> even in environments with concurrent activity, etc.
>
>
> If you see something different on the Mac, you need to investigate why.
> It could be something in the OS, or maybe it's a hardware-specific
> thing (consumer SSDs can choke on too many requests). Hard to say. I
> don't even know what kind of M1 machine you have, what SSD etc.
>
My guess is that the cause is IPC. I don't know the async I/O
implementation well, but if it involves a different process, I think macOS
handles it less efficiently than Linux. But I don't know how to measure
that.
Regards,
Alexandre
On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]> wrote:
> Hi,
>
> I've decided to run a couple tests, trying to reproduce some of the
> behaviors described in your (Felipe's) messages.
>
> I'm not trying to redo the tests exactly, because (a) I don't have a M1
> machine, and (b) there's not enough details about the hardware and
> configuration to actually redo it properly.
>
> I've focused on quantifying the impact of a couple things mentioned in
> the previous message:
>
> 1) the distance limit
>
> 2) the profiling instrumentation
>
> 3) concurrency (multiple backends doing I/O)
>
> I wrote a couple scripts to run two benchmarks, one focusing on (1) and
> (2), and the second one focusing on (3).
>
> Both were ran on four builds:
>
> 1) master
> 2) patched (index prefetch v11)
> 3) patched-limit (patched + distance limit)
> 4) patched-limit-instrument (patched-limit + instrumentation)
>
> The scripts initialize an instance, vary a couple important parameters
> (shared buffers, io_method, direct I/O, ...) and run index scans on a
> table with either sequential or random data.
>
> I'm attaching the full scripts, raw results, and PDFs with a nicer
> version of the results.
>
>
> single-client test (single-client.tgz)
> --------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
>
> This was done on an old Xeon machine from ~2016, with a WD Ultrastar DC
> SN640 960GB NVMe SSD.
>
> The single-client.pdf shows the timings for different combinations of
> parameters, branches and distance limit values. There's also a table
> with timing relative to master (100% means the same as master, green =
> good, red = bad).
>
> There are literally only 4 cases where prefetching does worse than
> master, and those are for random data with distance limit 1. I claim
> this is irrelevant, because it literally disables prefetching while
> still paying the full cost (all 4 are for io_method=worker, where the
> signal overhead can be high, so it's not a surprise).
>
> We ramp up the distance exactly for this reason, that's the solution for
> this overhead problem. I refuse to consider these regressions with
> limit=1 a problem. It's a bit like buying a race horse, break its leg
> and then complain it's not running very fast.
>
> The overhead of the instrumentation seems relatively small, probably
> within 5% or so. That's a bit less than I expected, but I still don't
> understand what this is meant to tell us. It's measuring wall-time, and
> it's no surprise that in an I/O-bound workload most of the time is spent
> in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
> But it does not give any indication *why*.
>
>
> multi-client test (multi-client.tgz)
> ------------------------------------
>
> The test varies the following parameters:
>
> * buffered or direct I/O
> * io_method = (worker | io_uring)
> * io_workers = (12 | 32)
> * shared_buffers = (128MB | 8GB)
> * enable_indexscan_prefetch = (on | off)
> * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
> * sequential / random data (1M rows, 550MB, ~15 rows per page)
> * number of parallel workers (1, 2, 4, 8)
>
> This was done on a Ryzen 9 machine from ~2023, with 4x Samsung 990 PRO
> 1TB drives in RAID0.
>
> The test prepares a separate table for each worker, and then runs the
> index scans concurrently (and "syncs" the workers to start at the same
> time). It measures the duration, and we can compare it to the timing
> from master (without prefetching).
>
> The multi-client-full.pdf has detailed results for all parameters, but
> as I said I don't think the distance limit (particularly for limit 1) is
> interesting.
>
> The multi-client-simple.pdf shows only results for limit=0 (i.e. without
> limit), and is hopefully easier to understand. The first table shows
> timings for each combination, the second table shows timing relative to
> master (for the same number of workers etc.).
>
> The results are pretty positive. For random data (which is about the
> worst case for I/O), it's consistently faster than master. Yes, the
> gains with 8 workers are not as significant as with 1 worker. For
> example, it may look like this:
>
> master prefetch
> 1 worker: 2960 1898 64%
> 8 workers: 5585 5361 96%
>
> But that's not a huge surprise. The storage has a limited throughput,
> and at some point it gets saturated. Whether it's by prefetching, or by
> having multiple workers does not matter.
>
> For sequential data (which is what you did in your examples) it's much
> simpler. For buffered there's not much benefit, because page cache does
> read-ahead with mostly the same effect, while there's a nice consistent
> speedup for direct I/O.
>
> This all seems perfectly fine to me. The bad behavior would be if the
> prefetching gets slower than master, because that would be a regression
> affecting users. But that happens only in 4 cells in the table. My guess
> is it hits some limit on the number of signals the system can process.
> The random data set is not great for this, it's worse with more workers,
> and the 128MB buffers make that even worse. This is a bit of perfect
> storm, and it's already there - bitmap scans can hit that too, AFAICS.
>
> (But I'm speculating, I haven't investigated this in detail yet.)
>
> Moreover, io_uring does not have this issue. Which is another indication
> it's something about the signal overhead.
>
> In any case, these results clearly show prefetching can be a huge
> even in environments with concurrent activity, etc.
>
>
> If you see something different on the Mac, you need to investigate why.
> It could be something in the OS, or maybe it's a hardware-specific
> thing (consumer SSDs can choke on too many requests). Hard to say. I
> don't even know what kind of M1 machine you have, what SSD etc.
>
>
> regards
>
> --
> Tomas Vondra
>