On 3/1/26 23:32, Alexandre Felipe wrote:
> 
> On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Hi,
> 
>     I've decided to run a couple tests, trying to reproduce some of the
>     behaviors described in your (Felipe's) messages.
> 
> 
> Thank you,
> I will look into this data later. I am impressed with the number of I/O
> workers you used; my tests typically used 3.
> 

3 is extremely low for an I/O-bound system. It's our tradition to pick
defaults that work even on tiny systems, but that need tuning on actual
non-toy systems :-(

The number of workers limits how many I/Os can be in flight at once. So
with 3 workers there will be only 3 concurrent I/Os, which is not a lot.
At that point it can be better to keep doing synchronous I/O from
multiple backends.

It's a trade-off. Having more workers is not "free", so we can't have an
infinite number of them. io_uring doesn't have this limitation.
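For illustration only, a starting point on a non-toy box might look
something like this (the 32 is a hypothetical number to tune against
your hardware, not a recommendation; only the default of 3 comes from
the discussion above):

```
# postgresql.conf -- hypothetical starting point for an I/O-bound system
io_method = worker
io_workers = 32          # the default of 3 is sized for tiny systems

# or, where the build/platform supports it:
# io_method = io_uring
```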

>     I'm not trying to redo the tests exactly, because (a) I don't have a M1
>     machine, and (b) there's not enough details about the hardware and
>     configuration to actually redo it properly.
> 
> 
> Well, I was running on an M1 because that's what I have in front of me,
> but I know that any serious database will run on Linux.
>  

Sure, but even then it'd be better to have more details about the
hardware. "M1" doesn't really say much (especially to people who don't
use Apple stuff very much). There's a range of M1-based systems, from
MacBook Air/Pro, Mini, Studio, ...

> 
>     I've focused on quantifying the impact of a couple things mentioned in
>     the previous message:
> 
>  
> I will have a look into this later and compute the effect size.
> 
>     The test varies the following parameters:
> 
>     * buffered or direct I/O
>     * io_method = (worker | io_uring)
>     * shared_buffers = (128MB | 8GB)
>     * enable_indexscan_prefetch = (on | off)
>     * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
>     * sequential / random data (1M rows, 550MB, ~15 rows per page)
> 
> 
> 
>     There are literally only 4 cases where prefetching does worse than
>     master, and those are for random data with distance limit 1. I claim
>     this is irrelevant, because it literally disables prefetching while
>     still paying the full cost (all 4 are for io_method=worker, where the
>     signal overhead can be high, so it's not a surprise).
> 
>  
> I agree with your claim. The idea of the distance limit was to isolate
> the AIO overhead without the benefit of prefetch: I was seeing very
> similar results, and only when I controlled the distance did the
> prefetch benefit become visible. The gradation would also show whether
> this has a U shape, or whether the larger the distance, the better the
> performance.
> 

OK, understood. I guess you could do a very similar thing by adjusting
effective_io_concurrency (although I admit it's not exactly the same thing).

>     It's a bit like buying a race horse, break its leg
>     and then complain it's not running very fast
> 
>  😂
> 
> 
>     The overhead of the instrumentation seems relatively small, probably
>     within 5% or so. That's a bit less than I expected, but I still don't
>     understand what this is meant to tell us. It's measuring wall-time, and
>     it's no surprise that in an I/O-bound workload most of the time is spent
>     in functions doing (and waiting for) I/O. Like read_stream_next_buffer.
>     But it does not give any indication *why*.
> 
> 
> Well, I was hoping to be able to create a self-balancing mechanism
> in read_stream_next_buffer:
> 
>     /* Do we have to wait for an associated I/O first? */
>     if (stream->ios_in_progress > 0 &&
>         stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
>     {
>         /* prefetch and increase the distance while we wait here */
>         WaitReadBuffers(&stream->ios[io_index].op);
>         ...
>     }
>     ...
>     /* this call could be removed if we prefetched earlier */
>     read_stream_look_ahead(stream);
> 
> 
> The same principle that guided the
>> Don't wait for already in-progress IO
> patch applies here: we should prioritise increasing the distance, and if
> that is not possible (maybe we consumed all the buffers), we could take
> the opportunity to yield.
> 
> 

IIUC the idea would be to (automatically) increase the distance just
enough so that the IOs complete right before we actually need the
buffers? That reminds me of the "IO cost model" I mentioned a couple of
days ago [1]. But it's not clear to me how this profiling helps with it.

> 
>     multi-client test (multi-client.tgz)
>     ------------------------------------
> 
>     The test varies the following parameters:
> 
>     * buffered or direct I/O
>     * io_method = (worker | io_uring)
>     * io_workers = (12 | 32)
>     * shared_buffers = (128MB | 8GB)
>     * enable_indexscan_prefetch = (on | off)
>     * indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
>     * sequential / random data (1M rows, 550MB, ~15 rows per page)
>     * number of parallel workers (1, 2, 4, 8)
> 
>  
>  Are parallel workers here clients issuing queries?
> 

Yes, clients running queries. But workers in a parallel query have
mostly the same effect, really.


>     This all seems perfectly fine to me. The bad behavior would be if the
>     prefetching gets slower than master, because that would be a regression
>     affecting users. But that happens only in 4 cells in the table.
> 
> 
> And in this case we have to take the other extremum, and run the queries
> where the prefetch is not expected to help. In this sense I agree with Peter
> that the yielding logic is important. We may be limiting the potential of
> the prefetch in some cases, but excessive reads are the highest risk in my
> opinion. You may know better than me, but in the workloads I have seen or
> worked with it is typically a high number of small queries, not these huge
> scans. Large queries are rare, and when they come to our attention it is
> because they used too much memory and started to create temporary files.
> 

You mean in the 4 cells that show regressions? Maybe. Or maybe we could
look for ways to make the communication / signalling cheaper, to not
overwhelm the workers like this.

The problem with addressing this by yielding is that the query has no
idea whether it runs alone, or with many other queries. There's no such
context available in the read_stream, so it can't consider this when
yielding.

Moreover, why would yielding help with this? Yielding just schedules
some of the I/O later, so how would that not make it *even slower*?
Notice that in multi-client-full.pdf the regression happens even with
limit=1, at which point you *can't* yield anymore.

>     (But I'm speculating, I haven't investigated this in detail yet.)
> 
> 
> Fair enough.
> 
>     Moreover, io_uring does not have this issue. Which is another indication
>     it's something about the signal overhead.
> 
>  
> That is interesting.
>  
> 
>     In any case, these results clearly show prefetching can be a huge
>     improvement even in environments with concurrent activity, etc.
> 
> 
>     If you see something different on the Mac, you need to investigate why.
>     It could be something in the OS, or maybe it's a hardware-specific
>     thing (consumer SSDs can choke on too many requests). Hard to say. I
>     don't even know what kind of M1 machine you have, what SSD etc.
> 
> 
> My guess is that the cause is IPC. I don't know well how the
> async IO works, but if it involves a different process I think that
> macOS is less efficient than Linux. But I don't know how to measure that.
> 

I wrote about the signal overhead a couple months ago, with a simple
benchmark simulating it [2]. I also wrote a brief explanation [3] about
the AIO in PG18, which mentions that too.

[1]
https://www.postgresql.org/message-id/433c7359-3c5d-4fe3-a488-93a17f44d0bf%40vondra.me

[2]
https://www.postgresql.org/message-id/1c9302da-c834-4773-a527-1c1a7029c5a3%40vondra.me

[3] https://vondra.me/posts/tuning-aio-in-postgresql-18/


regards

-- 
Tomas Vondra


