Hi Tomas,

Thanks for the thorough benchmarking.

On Sun, Mar 1, 2026 at 9:22 PM Tomas Vondra <[email protected]> wrote:
> On 2/28/26 08:08, Amit Langote wrote:
> > Tomas Vondra also tested with an I/O-intensive workload (dataset
> > larger than shared_buffers, combined with his and Peter Geoghegan's
> > I/O prefetching patches) and confirmed that the batching + SAOP
> > approach helps there too, not just in the CPU-bound / memory-resident
> > case.  In fact he showed that the patches here don't make a big dent
> > when the main bottleneck is I/O as shown in numbers that he shared in
> > an off-list email:
> >
> > master: 161617 ms
> > ri-check (0001..0004): 149446 ms  (1.08x)
> > ri-check + i/o prefetching: 50885 ms  (3.2x)
> >
> > So the RI patches alone only give ~8% here since most time is waiting
> > on reads.  But the batching gives the prefetch machinery a window of
> > upcoming probes to issue readahead against, so the two together yield
> > 3.2x.
> >
>
> I tested this (with the index prefetching v11 patch), because I wanted
> to check if the revised API works fine for other use cases, not just the
> regular index scans. Turns out the answer is "yes", the necessary tweaks
> to the FK batching patch were pretty minimal, and at the same time it
> did help quite a bit for cases bottle-necked on I/O.

Are those changes to the FK batching patch only needed to make it work
with your prefetching patch, or are they generally applicable and
worth including in the set here?

> FWIW I wonder how difficult would it be to do something like this for
> inserts into indexes. It's an orthogonal issue to FK checks (especially
> for the CPU-bound cases this thread focuses on), but it's a bit similar
> to the I/O-bound case. In fact, I now realize I actually did a PoC for
> that in 2023-11 [1], but it went stale ...

Interesting. I hadn't seen your earlier PoC. Does the current I/O
prefetching infrastructure simplify that approach, or are they
independent paths? The old patch calls PrefetchBuffer() directly on
the leaf, which seems orthogonal to the scan-side prefetching. Either
way, it would be nice to see more paths benefit from batching.
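
To make the lookahead effect concrete, here's a toy Python model
(nothing to do with the actual executor code; the function and its
batching scheme are invented for illustration) of how many upcoming
probe keys are known, and therefore available for readahead, at each
point. Row-at-a-time checking learns each key only after the previous
check completes, so there is never anything to read ahead:

```python
# Toy model (not PostgreSQL code): why batching enables prefetching.

def probes_visible_for_readahead(keys, batch_size):
    """For each probe, return how many upcoming probes in the same
    batch are already known (and could have I/O issued for them)."""
    window = []
    for i, _ in enumerate(keys):
        pos_in_batch = i % batch_size
        # Keys later in the same batch are known in advance.
        window.append(batch_size - pos_in_batch - 1)
    return window

keys = list(range(8))
print(probes_visible_for_readahead(keys, 1))  # row-at-a-time: [0]*8
print(probes_visible_for_readahead(keys, 4))  # batched: [3, 2, 1, 0, 3, 2, 1, 0]
```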

> benchmarks
> ----------
>
> Anyway, thinking about the CPU-bound case, I decided to do a bit of
> testing on my own. I was wondering about three things:
>
> (a) how does the improvement depend on data distribution
> (b) could it cause regressions for small inserts
> (c) how sensitive is the batch size
>
> So I devised two simple benchmarks:
>
> 1) run-pattern.sh - Inserts batches of values into a table, both the
> batch and table can be either random or sequential. It's either 100k or
> 1M rows, logged or unlogged, etc.
>
> 2) run-pgbench.sh - Runs short pgbench inserting data into a table,
> similar to (1), but with very few rows - so the timing approach is not
> suitable to measure this.
>
> Both scripts run against master, and then patched branch with three
> batch sizes (default 64, 16 and 256).
>
>
> results
> -------
>
> The results are very positive - see the attached PDF files comparing the
> patched builds to master.
>
> I have not found a single case where the batching causes regressions.
> This surprised me a bit, I've expected small regressions for single-row
> inserts in the pgbench test, but even that shows a small (~5%) gain.
> Even just 2-row inserts show +25% improvement in pgbench throughput.

This is reassuring. I too was half-expecting the batching
infrastructure to add measurable overhead for single-row inserts, but
it looks like the SPI bypass alone more than covers it.

> There are a couple cases where it matches master, I assume that's for
> I/O bound cases where the CPU optimizations do not really matter. That's
> expected, of course.
>
> I don't see much sensitivity on the batch size. The 256 batches seem to
> be a bit slower, but there's little difference between 16 and 64. So I'd
> say 64 seems reasonable.

Agreed. Interesting that 16 is consistently a little better than 64 in
the patterns benchmark. I'd guess that's the cost of the linear scan
over the batch for each PK index match showing up, since that scan is
O(batch_size). 256 being noticeably worse fits that picture. 64 seems
like a good middle ground, since the pgbench numbers show virtually no
difference between 16 and 64.
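
A crude cost model (the constants are made up; only the shape matters)
reproduces that pattern: the amortized per-batch setup falls off as
1/batch_size while the per-match scan grows linearly, so the total is
nearly flat around the optimum and climbs steeply past it:

```python
# Back-of-the-envelope cost model (illustrative constants, not
# measurements): per-row cost of FK checking as a function of batch size.

def cost_per_row(batch_size, setup=100.0, scan_step=0.1):
    # setup / batch_size: per-batch fixed cost, amortized over its rows
    # scan_step * batch_size: O(batch_size) linear scan per PK match
    return setup / batch_size + scan_step * batch_size

for b in (16, 64, 256):
    print(b, round(cost_per_row(b), 2))
```

With these (arbitrary) constants, 16 and 64 land within a couple of
percent of each other while 256 is several times worse, which is the
qualitative behavior the benchmarks show.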

The best-case numbers are striking -- when both the PK table and the
FK values being inserted are in sequential order, the unlogged
patterns case hits 4-5x, wow. I guess that makes sense because
sequential FK values turn into a sorted SAOP array that walks
consecutive leaf pages, so it's essentially a single sequential scan
of the relevant index portion.
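
Here's a toy model (again not real code; KEYS_PER_PAGE and the
key-to-page mapping are invented) of why probe order matters so much:
with sorted keys each leaf page is visited once, while a random order
lands on a different page on almost every probe:

```python
# Toy model (not PostgreSQL code): leaf-page locality of a probe batch,
# assuming KEYS_PER_PAGE consecutive key values fit on one leaf page.
import random

KEYS_PER_PAGE = 100

def leaf_page_loads(probe_keys):
    """Count page (re)loads: a new load happens whenever consecutive
    probes land on different pages; sorted probes visit each page once."""
    loads = 0
    last_page = None
    for key in probe_keys:
        page = key // KEYS_PER_PAGE
        if page != last_page:
            loads += 1
            last_page = page
    return loads

random.seed(1)
keys = list(range(10_000))
sequential = leaf_page_loads(keys)                          # 100 pages, once each
shuffled = leaf_page_loads(random.sample(keys, len(keys)))  # nearly one per probe
print(sequential, shuffled)
```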

> Overall, I think these results looks quite good. I haven't looked at the
> code very closely, not beyond adjusting it to work with index prefetch.

If you get a chance, I'd welcome a closer look. Your memory context
catch was a real bug that I'd missed entirely. The area that would
benefit most from a second pair of eyes is the snapshot and permission
caching semantics in 0002. I think the argument for why it's safe to
reuse the snapshot and check permissions once per batch rather than
per row is sound, but the effects are global and hard to validate by
testing alone.

-- 
Thanks, Amit Langote

