Hi, On 2025-04-07 16:28:20 -0400, Andres Freund wrote: > On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote: > > On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <and...@anarazel.de> wrote: > > > > > > I think we should consider increasing BAS_BULKREAD TO something like > > > Min(256, io_combine_limit * (effective_io_concurrency + 1)) > > > > Do you mean Max? If so, this basically makes sense to me. > > Err, yes.
In the attached I implemented the above idea, with some small additional refinements: - To allow sync seqscans to work at all, we should only *add* to the 256kB that we currently have - otherwise all buffers in a ring will be undergoing IO, never allowing two synchronizing scans to actually use the buffers that the other scan has already read in. This also kind of obsoletes the + 1 in the formula above, although that is arguable, particularly for effective_io_concurrency=0. - If the backend has a PinLimit() that won't allow io_combine_limit * effective_io_concurrency buffers to undergo IO, it doesn't make sense to make the ring bigger. At best it would waste space for the ring, at worst it'd make "ring escapes" inevitable - victim buffer search would always replace buffers that we have in the ring. - the multiplication by (BLCKSZ / 1024) that I omitted above is actually included :) I unfortunately think we do need *something* to address $subject for 18 - the performance regression when increasing relation sizes is otherwise just too big - it's trivial to find queries getting slower by more than 4x. On local, low-latency NVMe storage - on network storage the regression will often be bigger. If we don't do something for 18, only consolation would be that the performance when using the 256kB BAS_BULKREAD is rather close to the performance one gets in 17, with or without without a strategy. But I don't think that would make it less surprising that once your table grows sufficient to use a strategy your IO throughput craters. I've some local prototype for the 17/18 "strategy escape" issue, will work on polishing that soon, unless you have something for that Thomas? Greetings, Andres Freund
>From e49f8201413b3b143291ec97d4bea4b1015bcd13 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Tue, 18 Mar 2025 14:40:06 -0400 Subject: [PATCH v2] Increase BAS_BULKREAD based on effective_io_concurrency Before, BAS_BULKREAD was always of size 256kB. With the default io_combine_limit of 16, that only allowed 1-2 IOs to be in flight - insufficient even on very low latency storage. We don't just want to increase the size to much larger hardcoded value, as very large rings (10s of MBs of of buffers), appear to have negative performance effects when reading in data that the OS has cached. To address this, increase the size of BAS_BULKREAD to allow for io_combine_limit * effective_io_concurrency buffers getting read in. To prevent the ring being much larger than useful, limit the increased size with GetPinLimit(). The formula outlined above keeps the ring size to sizes for which we have not observed performance regressions, unless very large effective_io_concurrency values are used together with large shared_buffers setting. Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6 Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt --- src/backend/storage/buffer/freelist.c | 46 +++++++++++++++++++++++++-- 1 file changed, 44 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 336715b6c63..e1f8e5e97bd 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -555,8 +555,50 @@ GetAccessStrategy(BufferAccessStrategyType btype) return NULL; case BAS_BULKREAD: - ring_size_kb = 256; - break; + { + int ring_max_kb; + + /* + * The ring always needs to be large enough to allow some + * separation in time between providing a buffer to the user + * of the strategy and that buffer being reused. Otherwise the + * user's pin will prevent reuse of the buffer, even without + * concurrent activity. + * + * We also need to ensure the ring always is large enough for + * SYNC_SCAN_REPORT_INTERVAL, as noted above. + * + * Thus we start out a minimal size and increase the size + * further if appropriate. + */ + ring_size_kb = 256; + + /* + * There's no point in a larger ring if we won't be allowed to + * pin sufficiently many buffers. But we never limit to less + * than the minimal size above. + */ + ring_max_kb = GetPinLimit() * (BLCKSZ / 1024); + ring_max_kb = Max(ring_size_kb, ring_max_kb); + + /* + * We would like the ring to additionally have space for the + * the configured degree of IO concurrency. While being read + * in, buffers can obviously not yet be reused. + * + * Each IO can be up to io_combine_limit blocks large, and we + * want to start up to effective_io_concurrency IOs. + * + * Note that effective_io_concurrency may be 0, which disables + * AIO. + */ + ring_size_kb += (BLCKSZ / 1024) * + io_combine_limit * effective_io_concurrency; + + if (ring_size_kb > ring_max_kb) + ring_size_kb = ring_max_kb; + break; + } case BAS_BULKWRITE: ring_size_kb = 16 * 1024; break; -- 2.48.1.76.g4e746b1a31.dirty