Hi,

Here is v5 of the patch series.
Patches 0001-0003 add the core batching infrastructure:

- 0001 adds the batch table AM API with a heapam implementation,
- 0002 wires up SeqScan to use it (still returning one slot at a time),
- 0003 adds EXPLAIN (BATCHES).

I'd love to hear people's thoughts on the TupleBatch structure added in
0002. I considered making it a separate patch, so that 0002 would keep
populating the single ScanState.ss_scanTupleSlot, but then we'd still
have to call a TAM callback per tuple to copy it from the TAM's batch
struct into the slot, defeating the whole point. With TupleBatch,
executor_batch_rows slots are filled by a single TAM callback
(materialize_all), so I decided to keep TupleBatch and the related
machinery in 0002.

For scans without quals, batching shows a 20-30% improvement, with no
visible regression when batching is disabled (batch_rows=0):

SELECT * FROM t LIMIT n (no qual)

Rows        Master    batch=0  %diff   batch=64  %diff
------   ---------  ---------  -----  ---------  -----
1M        12.42 ms   11.96 ms   3.7%    8.56 ms  31.0%
3M        38.95 ms   38.92 ms   0.1%   28.59 ms  26.6%
10M      153.64 ms  150.28 ms   2.2%  112.95 ms  26.5%

(%diff: positive = faster than master, negative = slower)

Patches 0004-0005 add batched qual evaluation and are more experimental
(see below for why 0005 exists). For quals referencing early columns,
the improvement is significant:

SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st column)

Rows        Master   batch=64  %diff
------   ---------  ---------  -----
1M        30.19 ms   15.55 ms  48.5%
3M        92.47 ms   50.01 ms  45.9%
10M      325.58 ms  211.83 ms  34.9%

However, for quals on later columns (e.g., the 15th), batching provides
no benefit: deformation dominates, and batching doesn't help:

SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th column)

Rows        Master   batch=64  %diff
------   ---------  ---------  -----
1M        44.14 ms   44.56 ms  -0.9%
3M       133.89 ms  137.77 ms  -2.9%
10M      503.33 ms  528.88 ms  -5.1%

I don't have a satisfactory explanation for why batching doesn't help
the deform-heavy case at all.
One would expect at least some benefit from reduced per-tuple
overhead, but that's not materializing.

I've also been struggling to understand why 0004 affects the per-tuple
path even when batch_rows=0. For quals with 0% selectivity (all rows
fail the qual), perf shows ExecInterpExpr is noticeably hotter in the
patched code than in master, even though batching is disabled:

SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)

Rows        Master    batch=0   %diff   batch=64  %diff
------   ---------  ---------  ------  ---------  -----
1M        24.37 ms   28.67 ms  -17.6%   12.46 ms  48.9%
3M        73.95 ms   85.07 ms  -15.0%   41.64 ms  43.7%
10M      287.63 ms  316.81 ms  -10.1%  188.01 ms  34.6%

Compare that to 100% selectivity (all rows pass), where there's no
regression:

SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)

Rows        Master    batch=0  %diff   batch=64  %diff
------   ---------  ---------  -----  ---------  -----
1M        29.44 ms   29.10 ms   1.2%   16.61 ms  43.6%
3M        91.22 ms   90.28 ms   1.0%   54.10 ms  40.7%
10M      360.77 ms  331.25 ms   8.2%  224.00 ms  37.9%

Suspecting register pressure or jump-table effects from the extra
cases in ExecInterpExpr's switch, I tried moving the batch opcodes
into a separate interpreter (0005). With 0005, the generated assembly
for ExecInterpExpr looks identical to master's (same stack frame size,
same epilogue), yet the performance still differs: the ldp instruction
in the function epilogue shows 53% hotness in the patched build vs.
35% in master. We still need placeholder entries in the dispatch
table, so it's unclear whether this fully isolates the per-tuple path.
I'll continue looking at perf, but I'm at a bit of a loss here and
would appreciate any insights.

Other changes worth noting:

- I removed the BatchVector intermediate representation that copied
  Datums into columnar arrays before qual evaluation (it used to be in
  the batched qual patch, 0004). Quals now access the batch slots'
  tts_values directly. This simplifies the code, and the copy overhead
  wasn't paying off.
  If we pursue serious vectorization later, this may need to be
  revisited, but removing it doesn't degrade performance.

--
Thanks,
Amit Langote
v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch
Description: Binary data
v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch
Description: Binary data
v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch
Description: Binary data
v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch
Description: Binary data
v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch
Description: Binary data
