On Mon, Apr 6, 2026 at 9:02 PM Amit Langote <[email protected]> wrote: > On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <[email protected]> wrote: > > Here is a significantly revised version of the patch series. A lot has > > changed since the January submission, so I want to summarize the > > design changes before getting into the patches. I think it does > > address the points in the two reviews that landed since v5 but maybe a > > bunch of points became moot after my rewrite of the relevant portions > > (thanks Junwang and ChangAo for the review in any case). > > > > At this point it might be better to think of this as targeting v20, > > except that if there is review bandwidth in the remaining two weeks > > before the v19 feature freeze, the rs_vistuples[] change described > > below as a standalone improvement to the existing pagemode scan path > > could be considered for v19, though that too is an optimistic > > scenario. > > > > It is also worth noting that Andres identified a number of > > inefficiencies in the existing scan path in: > > > > Re: unnecessary executor overheads around seqscans > > https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw > > > > that are worth fixing independently of batching. Some of those fixes > > may be better pursued first, both because they benefit all scan paths > > and because they would make batching's gains more honest. > > > > Separately, after looking at the previous version, Andres pointed out > > offlist two fundamental issues with the patch's design: > > > > * The heapam implementation (in a version of the patch I didn't post > > to the thread) duplicated heap_prepare_pagescan() logic in a separate > > batch-specific code path, which is not acceptable as changes should > > benefit the existing slot interface too. Code duplication is not good > > either from a future maintainability aspect. The v5 version of that > > code is not great in that respect either; it instead duplicated > > heapggettup_pagemode() to slap batching on it. > > > > * Allocating executor_batch_rows slots on the executor side to receive > > rows from the AM adds significant overhead for slot initialization and > > management, and for non-row-organized AMs that do not produce > > individual rows at all, those slots would never be meaningfully > > populated. > > > > In any case, he just wasn't a fan of the slot-array approach the > > moment I mentioned it. The previous version had two slot arrays, > > inslots and outslots, of TTSOpsHeapTuple type (not > > TTSOpsBufferHeapTuple because buffer pins were managed by the batch > > code, which has its own modularity/correctness issues), populated via > > a materialize_all callback. A batch qual evaluator would copy > > qualifying tuples into outslots, with an activeslots pointer switching > > between the two depending on whether batch qual evaluation was used. > > > > The new design addresses both issues and differs from the previous > > version in several other ways: > > > > * Single slot instead of slot arrays: there is a single > > TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type > > was already determined by the AM via table_slot_callbacks(). The slot > > is re-pointed to each HeapTuple in the current buffer page via a new > > repoint_slot AM callback, with no materialization or copying. Tuples > > are returned one by one from the executor's perspective, but the AM > > serves them in page-sized batches from pre-built HeapTupleData > > descriptors in rs_vistuples[], avoiding repeated descent into heapam > > per tuple. This is heapam's implementation of the batch interface; > > there is no intention to force other AMs into the same row-oriented > > model. > > > > * Batch qual evaluator not included: with the single-slot model, > > quals are evaluated per tuple via the existing ExecQual path after > > each repoint_slot call. A natural next step would be a new opcode > > (EEOP) that calls repoint_slot() internally within expression > > evaluation, allowing ExecQual to advance through multiple tuples from > > the same batch without returning to the scan node each time, with qual > > results accumulated in a bitmask in ExprState. The details of that > > will be worked out in a follow-on series. > > > > * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes > > HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which > > allows heap_getnextbatch() to simply advance a slice pointer into that > > array without any additional copying or re-entering heap code, making > > a separate batch-specific scan function unnecessary. > > > > * TupleBatch renamed to RowBatch: "row batch" is more natural > > terminology for this concept and also consistent with how similar > > abstractions are named in columnar and OLAP systems. > > > > * AM callbacks now take RowBatch directly: previously > > heap_getnextbatch() returned a void pointer that the executor would > > store into RowBatch.am_payload, because only the executor knew the > > internals of RowBatch. Now the AM receives RowBatch directly as a > > parameter and can populate it without the executor acting as an > > intermediary. This is also why RowBatch is introduced in its own > > patch ahead of the AM API addition, so the struct definition is > > available to both sides. > > > > Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries > > instead of OffsetNumbers, as a standalone improvement to the existing > > pagemode scan path. Measured on a pg_prewarm'd (also vaccum freeze'd > > in the all-visible case) table with 1M/5M/10M rows: > > > > query all-visible not-all-visible > > count(*) -0.2% to +0.9% -0.4% to +0.5% > > count(*) WHERE id % 10 = 0 -1.1% to +3.4% +0.2% to +1.5% > > SELECT * LIMIT 1 OFFSET N -2.2% to -0.6% -0.9% to +6.6% > > SELECT * WHERE id%10=0 LIMIT -0.8% to +3.9% +0.9% to +9.6% > > > > No significant regression on either page type. The structural > > improvement is most visible on not-all-visible pages where > > HeapTupleSatisfiesMVCCBatch() already reads every tuple header during > > visibility checks, so persisting the result into rs_vistuples[] > > eliminates the downstream re-read (in heapgettupe_pagemode()) with no > > measurable overhead. That said, these numbers are somewhat noisy on > > my machine. Results on other machines would be welcome. > > > > Patches 0002-0005 add the RowBatch infrastructure, the batch AM API > > and heapam implementation including seqscan variants that use the new > > scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support, > > respectively. With batching enabled (executor_batch_rows=300, > > ~MaxHeapTuplesPerPage): > > > > query all-visible not-all-visible > > count(*) +11 to +15% +9 to +13% > > count(*) WHERE id % 10 = 0 +6 to +11% +10 to +14% > > SELECT * LIMIT 1 OFFSET N +16 to +19% +16 to +22% > > SELECT * WHERE id%10=0 LIMIT +8 to +10% +8 to +13% > > > > With executor_batch_rows=0, results are within noise of master across > > all query types and sizes, confirming no regression from the > > infrastructure changes themselves. The not-all-visible results tend > > to show slightly higher gains than the all-visible case. This is > > likely because the existing heapam code is more optimized for the > > all-visible path, so the not-all-visible path, which goes through > > HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has > > more headroom that batching can exploit. > > > > Setting aside the current series for a moment, there are some broader > > design questions worth raising while we have attention on this area. > > Some of these echo points Tomas raised in his first reply on this > > thread, and I am reiterating them deliberately since I have not > > managed to fully address them on my own or I simply didn't need to for > > the TAM-to-scan-node batching and think they would benefit from wider > > input rather than just my own iteration. > > > > We should also start thinking about other ways the executor can > > consume batch rows, not always assuming they are presented as > > HeapTupleData. For instance, an AM could expose decoded column arrays > > directly to operators that can consume them, bypassing slot-based > > deform entirely, or a columnar AM could implement scan_getnextbatch by > > decoding column strips directly into the batch without going through > > per-tuple HeapTupleData at all. Feedback on whether the current > > RowBatch design and the choices made in the scan_getnextbatch and > > RowBatchOps API make that sort of thing harder than it needs to be > > would be appreciated. For example, heapam's implementation of > > scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed > > to HeapTupleData entries one at a time via repoint_slot in > > RowBatchHeapOps. That works for heapam but a columnar AM could > > implement scan_getnextbatch to decode column strips directly into > > arrays in the batch, with no per-row repoint step needed at all. Any > > adjustments that would make RowBatch more AM-agnostic are worth > > discussing now before the design hardens. > > > > There are also broader open questions about how far the batch model > > can extend beyond the scan node. Qual pushdown into the AM has been > > discussed in nearby threads and would be one way to allow expression > > evaluation to happen before data reaches the executor proper, though > > that is a separate effort. For the purposes of this series, expression > > evaluation still happens in the executor after scan_getnextbatch > > returns. If the scan node does not project, the buffer heap slot is > > passed directly to the parent node, which calls slot callbacks to > > deform as needed. But once a node above projects, aggregates, or > > joins, the notion of a page-sized batch from a single AM loses its > > meaning and virtual slots take over. Whether RowBatch is usable or > > meaningful beyond the scan/TAM boundary in any form, and whether the > > core executor will ever have non-HeapTupleData batch consumption paths > > or leave that entirely to extensions, are open questions worth > > discussing. > > > > For RowBatch to eventually play the role that TupleTableSlot plays for > > row-at-a-time execution, something inside it would need to serve as > > the common currency for batch data, analogous to TupleTableSlot's > > datum/isnull arrays. Column arrays are the obvious direction, but even > > that leaves open the question of representation. PostgreSQL's Datum is > > a pointer-sized abstraction that boxes everything, whereas vectorized > > systems use typed packed arrays of native types with validity > > bitmasks, which is a significant part of why tight vectorized loops > > are fast there. Whether column arrays of Datum would be good enough, > > or whether going further toward typed packed arrays would be necessary > > to get meaningful vectorization, is a deeper design question that this > > series deliberately does not try to answer. > > > > Even though the focus is on getting batching working at the scan/TAM > > boundary first, thoughts on any of these points would be welcome. > > Rebased.
Just a beginning-of-CF note: I'm working on a significantly revised version (as described in my pgconf.dev talk) of this set that I will post here by EOW. Apologies to anyone who spent time reviewing v7. -- Thanks, Amit Langote
