Hi,

On 2025-04-07 16:28:20 -0400, Andres Freund wrote:
> On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:
> > On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <and...@anarazel.de> wrote:
> > >
> > > I think we should consider increasing BAS_BULKREAD TO something like
> > >   Min(256, io_combine_limit * (effective_io_concurrency + 1))
> >
> > Do you mean Max? If so, this basically makes sense to me.
>
> Err, yes.

In the attached I implemented the above idea, with some small additional
refinements:

- To allow sync seqscans to work at all, we should only *add* to the 256kB
  that we currently have - otherwise all buffers in a ring will be undergoing
  IO, never allowing two synchronizing scans to actually use the buffers that
  the other scan has already read in.

  This also kind of obsoletes the + 1 in the formula above, although that is
  arguable, particularly for effective_io_concurrency=0.

- If the backend has a PinLimit() that won't allow io_combine_limit *
  effective_io_concurrency buffers to undergo IO, it doesn't make sense to
  make the ring bigger. At best it would waste space for the ring, at worst
  it'd make "ring escapes" inevitable - victim buffer search would always
  replace buffers that we have in the ring.

- the multiplication by (BLCKSZ / 1024) that I omitted above is actually
  included :)


I unfortunately think we do need *something* to address $subject for 18 - the
performance regression when increasing relation sizes is otherwise just too
big - it's trivial to find queries getting slower by more than 4x. On local,
low-latency NVMe storage - on network storage the regression will often be
bigger.

If we don't do something for 18, only consolation would be that the
performance when using the 256kB BAS_BULKREAD is rather close to the
performance one gets in 17, with or without without a strategy. But I don't
think that would make it less surprising that once your table grows sufficient
to use a strategy your IO throughput craters.


I've some local prototype for the 17/18 "strategy escape" issue, will work on
polishing that soon, unless you have something for that Thomas?

Greetings,

Andres Freund
>From e49f8201413b3b143291ec97d4bea4b1015bcd13 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Tue, 18 Mar 2025 14:40:06 -0400
Subject: [PATCH v2] Increase BAS_BULKREAD based on effective_io_concurrency

Before, BAS_BULKREAD was always of size 256kB. With the default
io_combine_limit of 16, that only allowed 1-2 IOs to be in flight -
insufficient even on very low latency storage.

We don't just want to increase the size to much larger hardcoded value, as
very large rings (10s of MBs of of buffers), appear to have negative
performance effects when reading in data that the OS has cached.

To address this, increase the size of BAS_BULKREAD to allow for
io_combine_limit * effective_io_concurrency buffers getting read in. To
prevent the ring being much larger than useful, limit the increased size with
GetPinLimit().

The formula outlined above keeps the ring size to sizes for which we have not
observed performance regressions, unless very large effective_io_concurrency
values are used together with large shared_buffers setting.

Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
---
 src/backend/storage/buffer/freelist.c | 46 +++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 336715b6c63..e1f8e5e97bd 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,8 +555,50 @@ GetAccessStrategy(BufferAccessStrategyType btype)
 			return NULL;
 
 		case BAS_BULKREAD:
-			ring_size_kb = 256;
-			break;
+			{
+				int			ring_max_kb;
+
+				/*
+				 * The ring always needs to be large enough to allow some
+				 * separation in time between providing a buffer to the user
+				 * of the strategy and that buffer being reused. Otherwise the
+				 * user's pin will prevent reuse of the buffer, even without
+				 * concurrent activity.
+				 *
+				 * We also need to ensure the ring always is large enough for
+				 * SYNC_SCAN_REPORT_INTERVAL, as noted above.
+				 *
+				 * Thus we start out a minimal size and increase the size
+				 * further if appropriate.
+				 */
+				ring_size_kb = 256;
+
+				/*
+				 * There's no point in a larger ring if we won't be allowed to
+				 * pin sufficiently many buffers.  But we never limit to less
+				 * than the minimal size above.
+				 */
+				ring_max_kb = GetPinLimit() * (BLCKSZ / 1024);
+				ring_max_kb = Max(ring_size_kb, ring_max_kb);
+
+				/*
+				 * We would like the ring to additionally have space for the
+				 * the configured degree of IO concurrency. While being read
+				 * in, buffers can obviously not yet be reused.
+				 *
+				 * Each IO can be up to io_combine_limit blocks large, and we
+				 * want to start up to effective_io_concurrency IOs.
+				 *
+				 * Note that effective_io_concurrency may be 0, which disables
+				 * AIO.
+				 */
+				ring_size_kb += (BLCKSZ / 1024) *
+					io_combine_limit * effective_io_concurrency;
+
+				if (ring_size_kb > ring_max_kb)
+					ring_size_kb = ring_max_kb;
+				break;
+			}
 		case BAS_BULKWRITE:
 			ring_size_kb = 16 * 1024;
 			break;
-- 
2.48.1.76.g4e746b1a31.dirty

Reply via email to