Re: Batching in executor

Amit Langote Wed, 01 Jul 2026 02:19:27 -0700

On Mon, Apr 6, 2026 at 9:02 PM Amit Langote <[email protected]> wrote:
> On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <[email protected]> wrote:
> > Here is a significantly revised version of the patch series. A lot has
> > changed since the January submission, so I want to summarize the
> > design changes before getting into the patches.  I think it does
> > address the points in the two reviews that landed since v5 but maybe a
> > bunch of points became moot after my rewrite of the relevant portions
> > (thanks Junwang and ChangAo for the review in any case).
> >
> > At this point it might be better to think of this as targeting v20,
> > except that if there is review bandwidth in the remaining two weeks
> > before the v19 feature freeze, the rs_vistuples[] change described
> > below as a standalone improvement to the existing pagemode scan path
> > could be considered for v19, though that too is an optimistic
> > scenario.
> >
> > It is also worth noting that Andres identified a number of
> > inefficiencies in the existing scan path in:
> >
> > Re: unnecessary executor overheads around seqscans
> > https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
> >
> > that are worth fixing independently of batching. Some of those fixes
> > may be better pursued first, both because they benefit all scan paths
> > and because they would make batching's gains more honest.
> >
> > Separately, after looking at the previous version, Andres pointed out
> > offlist two fundamental issues with the patch's design:
> >
> > * The heapam implementation (in a version of the patch I didn't post
> > to the thread) duplicated heap_prepare_pagescan() logic in a separate
> > batch-specific code path, which is not acceptable as changes should
> > benefit the existing slot interface too.  Code duplication is not good
> > either from a future maintainability aspect. The v5 version of that
> > code is not great in that respect either; it instead duplicated
> > heapgettup_pagemode() to slap batching on it.
> >
> > * Allocating executor_batch_rows slots on the executor side to receive
> > rows from the AM adds significant overhead for slot initialization and
> > management, and for non-row-organized AMs that do not produce
> > individual rows at all, those slots would never be meaningfully
> > populated.
> >
> > In any case, he just wasn't a fan of the slot-array approach the
> > moment I mentioned it. The previous version had two slot arrays,
> > inslots and outslots, of TTSOpsHeapTuple type (not
> > TTSOpsBufferHeapTuple because buffer pins were managed by the batch
> > code, which has its own modularity/correctness issues), populated via
> > a materialize_all callback. A batch qual evaluator would copy
> > qualifying tuples into outslots, with an activeslots pointer switching
> > between the two depending on whether batch qual evaluation was used.
> >
> > The new design addresses both issues and differs from the previous
> > version in several other ways:
> >
> >  * Single slot instead of slot arrays: there is a single
> > TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
> > was already determined by the AM via table_slot_callbacks().  The slot
> > is re-pointed to each HeapTuple in the current buffer page via a new
> > repoint_slot AM callback, with no materialization or copying.  Tuples
> > are returned one by one from the executor's perspective, but the AM
> > serves them in page-sized batches from pre-built HeapTupleData
> > descriptors in rs_vistuples[], avoiding repeated descent into heapam
> > per tuple.  This is heapam's implementation of the batch interface;
> > there is no intention to force other AMs into the same row-oriented
> > model.
> >
> >  * Batch qual evaluator not included: with the single-slot model,
> > quals are evaluated per tuple via the existing ExecQual path after
> > each repoint_slot call.  A natural next step would be a new opcode
> > (EEOP) that calls repoint_slot() internally within expression
> > evaluation, allowing ExecQual to advance through multiple tuples from
> > the same batch without returning to the scan node each time, with qual
> > results accumulated in a bitmask in ExprState.  The details of that
> > will be worked out in a follow-on series.
> >
> >  * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
> > HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
> > allows heap_getnextbatch() to simply advance a slice pointer into that
> > array without any additional copying or re-entering heap code, making
> > a separate batch-specific scan function unnecessary.
> >
> >  * TupleBatch renamed to RowBatch: "row batch" is more natural
> > terminology for this concept and also consistent with how similar
> > abstractions are named in columnar and OLAP systems.
> >
> >  * AM callbacks now take RowBatch directly: previously
> > heap_getnextbatch() returned a void pointer that the executor would
> > store into RowBatch.am_payload, because only the executor knew the
> > internals of RowBatch.  Now the AM receives RowBatch directly as a
> > parameter and can populate it without the executor acting as an
> > intermediary.  This is also why RowBatch is introduced in its own
> > patch ahead of the AM API addition, so the struct definition is
> > available to both sides.
> >
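
To make the single-slot model in the list above a bit more concrete, here
is a minimal standalone sketch in C.  It is not code from the patch: every
type and function name (SketchTuple, SketchBatch, SketchScan, SketchSlot,
sketch_getnextbatch, sketch_repoint_slot) is a hypothetical stand-in, and
the "tuple" is reduced to a single int.  It only illustrates the shape of
the idea: the AM hands out a slice of pre-built descriptors, and the
executor re-points one slot at each entry instead of copying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pre-built tuple descriptor (HeapTupleData). */
typedef struct SketchTuple
{
    int         id;             /* stand-in for the tuple's contents */
} SketchTuple;

/* Hypothetical stand-in for RowBatch: a slice into the AM's page array. */
typedef struct SketchBatch
{
    const SketchTuple *rows;    /* first tuple of the slice, not a copy */
    int         nrows;          /* number of tuples in the slice */
} SketchBatch;

/* Hypothetical stand-in for a scan positioned on one prepared page. */
typedef struct SketchScan
{
    const SketchTuple *vistuples;   /* visible tuples of the current page */
    int         ntuples;            /* number of entries in vistuples */
    int         next;               /* first entry not yet handed out */
} SketchScan;

/* The single slot: it points at a tuple rather than holding a copy. */
typedef struct SketchSlot
{
    const SketchTuple *tuple;
} SketchSlot;

/* AM side: hand out up to max_rows tuples by advancing a slice pointer. */
static int
sketch_getnextbatch(SketchScan *scan, SketchBatch *batch, int max_rows)
{
    int         remaining = scan->ntuples - scan->next;
    int         n = remaining < max_rows ? remaining : max_rows;

    if (n <= 0)
    {
        batch->rows = NULL;
        batch->nrows = 0;
        return 0;               /* page exhausted */
    }
    batch->rows = scan->vistuples + scan->next;
    batch->nrows = n;
    scan->next += n;
    return 1;
}

/* AM callback: re-point the slot at entry idx of the batch, no copying. */
static void
sketch_repoint_slot(SketchSlot *slot, const SketchBatch *batch, int idx)
{
    slot->tuple = &batch->rows[idx];
}

/* Executor side: evaluate "id % 10 = 0" per tuple through the one slot. */
static int
sketch_count_matching(SketchScan *scan, int max_rows)
{
    SketchBatch batch;
    SketchSlot  slot;
    int         matches = 0;

    while (sketch_getnextbatch(scan, &batch, max_rows))
    {
        for (int i = 0; i < batch.nrows; i++)
        {
            sketch_repoint_slot(&slot, &batch, i);
            if (slot.tuple->id % 10 == 0)
                matches++;
        }
    }
    return matches;
}
```

The sketch covers a single page only; buffer pinning, moving to the next
page, and the real slot machinery are deliberately left out.
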
> > Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
> > instead of OffsetNumbers, as a standalone improvement to the existing
> > pagemode scan path. Measured on a pg_prewarm'd (also vacuum freeze'd
> > in the all-visible case) table with 1M/5M/10M rows:
> >
> >   query                          all-visible      not-all-visible
> >   count(*)                       -0.2% to +0.9%   -0.4% to +0.5%
> >   count(*) WHERE id % 10 = 0     -1.1% to +3.4%   +0.2% to +1.5%
> >   SELECT * LIMIT 1 OFFSET N      -2.2% to -0.6%   -0.9% to +6.6%
> >   SELECT * WHERE id%10=0 LIMIT   -0.8% to +3.9%   +0.9% to +9.6%
> >
> > No significant regression on either page type. The structural
> > improvement is most visible on not-all-visible pages where
> > HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
> > visibility checks, so persisting the result into rs_vistuples[]
> > eliminates the downstream re-read (in heapgettup_pagemode()) with no
> > measurable overhead.  That said, these numbers are somewhat noisy on
> > my machine.  Results on other machines would be welcome.
> >
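
As a rough illustration of the rs_vistuples[] change described above, the
following standalone C sketch contrasts the two layouts.  It is an
assumption-laden model, not heapam code: SketchPage, SketchItemId and
SketchTupleDesc are hypothetical stand-ins for the page, its line pointers
and HeapTupleData, and the "len > 0" test is only a placeholder for the
real visibility check.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_ITEMS 8

/* Hypothetical line pointer: where an item starts and how long it is. */
typedef struct SketchItemId
{
    uint16_t    off;
    uint16_t    len;
} SketchItemId;

/* Hypothetical page: an item array plus a data area. */
typedef struct SketchPage
{
    SketchItemId items[SKETCH_MAX_ITEMS];
    int         nitems;
    unsigned char data[64];
} SketchPage;

/* Hypothetical tuple descriptor standing in for HeapTupleData. */
typedef struct SketchTupleDesc
{
    const unsigned char *data;
    uint16_t    len;
} SketchTupleDesc;

/* Old layout: page preparation remembers only the item indexes. */
static int
sketch_prepare_offsets(const SketchPage *page, uint16_t *vis)
{
    int         n = 0;

    for (int i = 0; i < page->nitems; i++)
        if (page->items[i].len > 0)     /* placeholder visibility check */
            vis[n++] = (uint16_t) i;
    return n;
}

/* Old layout: each fetch goes back to the line pointer to rebuild it. */
static SketchTupleDesc
sketch_fetch_by_offset(const SketchPage *page, uint16_t idx)
{
    SketchTupleDesc t;

    t.data = page->data + page->items[idx].off;
    t.len = page->items[idx].len;
    return t;
}

/* New layout: page preparation stores full descriptors once. */
static int
sketch_prepare_tuples(const SketchPage *page, SketchTupleDesc *vis)
{
    int         n = 0;

    for (int i = 0; i < page->nitems; i++)
    {
        if (page->items[i].len > 0)     /* placeholder visibility check */
        {
            vis[n].data = page->data + page->items[i].off;
            vis[n].len = page->items[i].len;
            n++;
        }
    }
    return n;
}
```

With the second layout, a later fetch is just an array read of vis[i], and
a contiguous run of entries can be handed out as a batch without touching
the page again.
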
> > Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
> > and heapam implementation including seqscan variants that use the new
> > scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
> > respectively. With batching enabled (executor_batch_rows=300,
> > ~MaxHeapTuplesPerPage):
> >
> >   query                          all-visible    not-all-visible
> >   count(*)                       +11 to +15%    +9 to +13%
> >   count(*) WHERE id % 10 = 0     +6 to +11%     +10 to +14%
> >   SELECT * LIMIT 1 OFFSET N      +16 to +19%    +16 to +22%
> >   SELECT * WHERE id%10=0 LIMIT   +8 to +10%     +8 to +13%
> >
> > With executor_batch_rows=0, results are within noise of master across
> > all query types and sizes, confirming no regression from the
> > infrastructure changes themselves.  The not-all-visible results tend
> > to show slightly higher gains than the all-visible case. This is
> > likely because the existing heapam code is more optimized for the
> > all-visible path, so the not-all-visible path, which goes through
> > HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
> > more headroom that batching can exploit.
> >
> > Setting aside the current series for a moment, there are some broader
> > design questions worth raising while we have attention on this area.
> > Some of these echo points Tomas raised in his first reply on this
> > thread.  I am reiterating them deliberately: I either have not
> > managed to fully address them on my own or simply did not need to for
> > the TAM-to-scan-node batching, and I think they would benefit from
> > wider input rather than just my own iteration.
> >
> > We should also start thinking about other ways the executor can
> > consume batch rows, not always assuming they are presented as
> > HeapTupleData. For instance, an AM could expose decoded column arrays
> > directly to operators that can consume them, bypassing slot-based
> > deform entirely, or a columnar AM could implement scan_getnextbatch by
> > decoding column strips directly into the batch without going through
> > per-tuple HeapTupleData at all. Feedback on whether the current
> > RowBatch design and the choices made in the scan_getnextbatch and
> > RowBatchOps API make that sort of thing harder than it needs to be
> > would be appreciated. For example, heapam's implementation of
> > scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
> > to HeapTupleData entries one at a time via repoint_slot in
> > RowBatchHeapOps. That works for heapam but a columnar AM could
> > implement scan_getnextbatch to decode column strips directly into
> > arrays in the batch, with no per-row repoint step needed at all. Any
> > adjustments that would make RowBatch more AM-agnostic are worth
> > discussing now before the design hardens.
> >
> > There are also broader open questions about how far the batch model
> > can extend beyond the scan node. Qual pushdown into the AM has been
> > discussed in nearby threads and would be one way to allow expression
> > evaluation to happen before data reaches the executor proper, though
> > that is a separate effort. For the purposes of this series, expression
> > evaluation still happens in the executor after scan_getnextbatch
> > returns. If the scan node does not project, the buffer heap slot is
> > passed directly to the parent node, which calls slot callbacks to
> > deform as needed. But once a node above projects, aggregates, or
> > joins, the notion of a page-sized batch from a single AM loses its
> > meaning and virtual slots take over. Whether RowBatch is usable or
> > meaningful beyond the scan/TAM boundary in any form, and whether the
> > core executor will ever have non-HeapTupleData batch consumption paths
> > or leave that entirely to extensions, are open questions worth
> > discussing.
> >
> > For RowBatch to eventually play the role that TupleTableSlot plays for
> > row-at-a-time execution, something inside it would need to serve as
> > the common currency for batch data, analogous to TupleTableSlot's
> > datum/isnull arrays. Column arrays are the obvious direction, but even
> > that leaves open the question of representation. PostgreSQL's Datum is
> > a pointer-sized abstraction that boxes everything, whereas vectorized
> > systems use typed packed arrays of native types with validity
> > bitmasks, which is a significant part of why tight vectorized loops
> > are fast there. Whether column arrays of Datum would be good enough,
> > or whether going further toward typed packed arrays would be necessary
> > to get meaningful vectorization, is a deeper design question that this
> > series deliberately does not try to answer.
> >
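
For readers less familiar with the two representations contrasted above,
here is a small standalone C sketch of each, summing one int32 column.
Nothing in it comes from the patch series: SketchDatum and
SketchInt32Column are hypothetical names, the boxed variant only imitates
the values/isnull array pairing, and the validity-bit convention (bit set
means not NULL, least significant bit first) is an assumption chosen for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pointer-sized boxed value, in the spirit of Datum. */
typedef uintptr_t SketchDatum;

/* Boxed form: one pointer-sized value and one isnull flag per row. */
static int64_t
sketch_sum_boxed(const SketchDatum *values, const bool *isnull, int nrows)
{
    int64_t     sum = 0;

    for (int i = 0; i < nrows; i++)
    {
        if (!isnull[i])
            sum += (int32_t) values[i];     /* unbox to the native type */
    }
    return sum;
}

/* Hypothetical typed column: packed int32 values plus validity bits. */
typedef struct SketchInt32Column
{
    const int32_t *values;      /* densely packed native values */
    const uint8_t *valid;       /* bit i set means row i is not NULL */
    int         nrows;
} SketchInt32Column;

/* Packed form: a branch-free loop over native values and validity bits. */
static int64_t
sketch_sum_packed(const SketchInt32Column *col)
{
    int64_t     sum = 0;

    for (int i = 0; i < col->nrows; i++)
    {
        int64_t     is_valid = (col->valid[i >> 3] >> (i & 7)) & 1;

        sum += is_valid * col->values[i];
    }
    return sum;
}
```

Both functions compute the same result; the packed loop reads a quarter of
the bytes per value on a 64-bit machine and has no per-row branch, which is
the kind of difference the paragraph above alludes to.  Whether that
matters enough in practice is exactly the open question.
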
> > Even though the focus is on getting batching working at the scan/TAM
> > boundary first, thoughts on any of these points would be welcome.
>
> Rebased.


Just a beginning-of-CF note: I'm working on a significantly revised
version (as described in my pgconf.dev talk) of this set that I will
post here by EOW.  Apologies to anyone who spent time reviewing v7.

-- 
Thanks, Amit Langote
