Hi Amit and Junwang,

Thanks for the latest patch. I think the overall direction makes sense, and the single-column SK_SEARCHARRAY path looks like one of the most valuable optimizations here. The patch also seems to cover several important cases, including deferred constraints, duplicate FK values, and multi-column fallback behavior.
After reading through the patch, I have one major comment and a few smaller ones.

1. TopTransactionContext usage during batched flush may be too coarse-grained

My biggest concern is the use of TopTransactionContext around the batched flush path. As written, ri_FastPathBatchFlush() switches to TopTransactionContext before calling ri_FastPathFlushArray() / ri_FastPathFlushLoop(). That seems broad enough that temporary allocations made during the flush may end up there.

In particular, in ri_FastPathFlushArray(), I think the objects worth checking carefully are the pass-by-reference Datums returned by the per-element cast call and stored in search_vals[], e.g.

```
search_vals[i] = FunctionCall3(&entry->cast_func_finfo, ...);
```

If those cast results are separately allocated in the current memory context, then pfree(arr) only frees the constructed array object itself; it does not obviously free those intermediate cast results. If so, those allocations could survive until the end of the transaction rather than just until the end of the current flush. Maybe this is harmless in practice, but I think it needs a closer look. It might be better to use a dedicated short-lived context for per-flush temporary allocations and reset it after each flush, or otherwise separate allocations that really need transaction lifetime from those that are only needed transiently during batched processing.

2. RI_FastPathEntry comment mentions the wrong function name

The comment above RI_FastPathEntry says it contains resources needed by ri_FastPathFlushBatch(), but the function is actually named ri_FastPathBatchFlush().

3. RI_FASTPATH_BATCH_SIZE needs some rationale

RI_FASTPATH_BATCH_SIZE = 64 may well be a reasonable compromise, but right now it reads like a magic number. The choice seems especially relevant because the patch has two opposing effects: larger batches should amortize the array-scan work better,
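To make the suggestion in point 1 a bit more concrete, the kind of dedicated per-flush context I have in mind would look roughly like the following. This is only a sketch, not tested against the patch: the flush_cxt field is a name I made up, and the exact call placement would depend on how ri_FastPathBatchFlush() is structured.

```c
/*
 * Sketch only: keep the long-lived batch resources (scandesc, snapshot,
 * etc.) in TopTransactionContext as the patch already does, but run the
 * flush itself in a dedicated context that is reset afterwards.
 * "flush_cxt" is a hypothetical field on RI_FastPathEntry.
 */
if (entry->flush_cxt == NULL)
    entry->flush_cxt = AllocSetContextCreate(TopTransactionContext,
                                             "RI fast-path flush",
                                             ALLOCSET_SMALL_SIZES);

oldcxt = MemoryContextSwitchTo(entry->flush_cxt);
ri_FastPathFlushArray(entry);   /* cast results, arr, etc. land here */
MemoryContextSwitchTo(oldcxt);

/* Discard all per-flush temporaries in one go; no pfree bookkeeping. */
MemoryContextReset(entry->flush_cxt);
```

One wrinkle: per Tomas's earlier report in this thread, allocations that the btree AM defers to the first getnext call have to live as long as the cached scandesc, so those would still need to happen in TopTransactionContext; only the genuinely per-flush temporaries belong in the reset context.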
but the matched[] bookkeeping in ri_FastPathFlushArray() is O(batch_size^2) in the worst case. So I think it would help to include at least a brief rationale in a comment or in the commit message.

4. Commit message says the entry stashes fk_relid, but the code actually stashes riinfo

The commit message says the entry stashes fk_relid and can reopen the relation if needed. Unless I am misreading it, the code actually stores riinfo and later uses riinfo->fk_relid. The distinction is small, but I think the wording should match the implementation more closely.

Thanks again for working on this.

Best regards,
Haibo Yan

> On Mar 10, 2026, at 5:28 AM, Junwang Zhao <[email protected]> wrote:
>
> Hi,
>
> On Mon, Mar 2, 2026 at 11:30 PM Junwang Zhao <[email protected]> wrote:
>>
>> On Sat, Feb 28, 2026 at 3:08 PM Amit Langote <[email protected]> wrote:
>>>
>>> Hi Junwang,
>>>
>>> On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <[email protected]> wrote:
>>>> On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <[email protected]>
>>>> wrote:
>>>>> I re-ran the benchmarks (same test as yours, different machine):
>>>>>
>>>>> create table pk (a numeric primary key);
>>>>> create table fk (a bigint references pk);
>>>>> insert into pk select generate_series(1, 2000000);
>>>>> insert into fk select generate_series(1, 2000000, 2);
>>>>>
>>>>> master: 2444 ms (median of 3 runs)
>>>>> 0001: 1382 ms (43% faster)
>>>>> 0001+0002: 1202 ms (51% faster, 13% over 0001 alone)
>>>>
>>>> I can get similar improvement on my old mac intel chip:
>>>>
>>>> master: 12963.993 ms
>>>> 0001: 6641.692 ms, 48.8% faster
>>>> 0001+0002: 5771.703 ms, 55.5% faster
>>>>>
>>>>> Also, with int PK / int FK (1M rows):
>>>>>
>>>>> create table pk (a int primary key);
>>>>> create table fk (a int references pk);
>>>>> insert into pk select generate_series(1, 1000000);
>>>>> insert into fk select generate_series(1, 1000000);
>>>>>
>>>>> master: 1000 ms
>>>>> 0001: 520 ms (48% faster)
>>>>> 0001+0002: 432 ms (57% faster, 17%
over 0001 alone)
>>>>
>>>> master: 11134.583 ms
>>>> 0001: 5240.298 ms, 52.9% faster
>>>> 0001+0002: 4554.215 ms, 59.1% faster
>>>
>>> Thanks for testing, good to see similar numbers. I had forgotten to
>>> note that these results are when these PK index probes don't do any
>>> I/O, though you might be aware of that. Below, I report some numbers
>>> that Tomas Vondra shared with me off-list where the probes do have to
>>> perform I/O and there the benefits from only this patch set are only
>>> marginal.
>>>
>>>> I don't have any additional comments on the patch except one minor nit,
>>>> maybe merge the following two if conditions into one, not a strong opinion
>>>> though.
>>>>
>>>> if (use_cache)
>>>> {
>>>> /*
>>>> * The snapshot was registered once when the cache entry was created.
>>>> * We just patch curcid to reflect the new command counter.
>>>> * SnapshotSetCommandId() only patches process-global statics, not
>>>> * registered copies, so we do it directly.
>>>> *
>>>> * The xmin/xmax/xip fields don't need refreshing: within a single
>>>> * statement batch, only curcid changes between rows.
>>>> */
>>>> Assert(fpentry && fpentry->snapshot != NULL);
>>>> snapshot = fpentry->snapshot;
>>>> snapshot->curcid = GetCurrentCommandId(false);
>>>> }
>>>> else
>>>> snapshot = RegisterSnapshot(GetLatestSnapshot());
>>>>
>>>> if (use_cache)
>>>> {
>>>> pk_rel = fpentry->pk_rel;
>>>> idx_rel = fpentry->idx_rel;
>>>> scandesc = fpentry->scandesc;
>>>> slot = fpentry->slot;
>>>> }
>>>> else
>>>> {
>>>> pk_rel = table_open(riinfo->pk_relid, RowShareLock);
>>>> idx_rel = index_open(riinfo->conindid, AccessShareLock);
>>>> scandesc = index_beginscan(pk_rel, idx_rel,
>>>> snapshot, NULL,
>>>> riinfo->nkeys, 0);
>>>> slot = table_slot_create(pk_rel, NULL);
>>>> }
>>>
>>> Good idea, done.
>>>
>>> While polishing 0002, I revisited the snapshot caching semantics. The
>>> previous commit message hand-waved about only curcid changing between
>>> rows, but GetLatestSnapshot() also reflects other backends' commits,
>>> so reusing the snapshot is a deliberate semantic change from the SPI
>>> path. I think it's safe because curcid is all we need for
>>> intra-statement visibility, concurrent commits either already happened
>>> before our snapshot (and are visible) or are racing with our statement
>>> and wouldn't be seen reliably even with per-row snapshots since the
>>> order in which FK rows are checked is nondeterministic, and
>>> LockTupleKeyShare prevents the PK row from disappearing regardless. In
>>> essence, we're treating all the FK checks within a trigger-firing
>>> cycle as a single plan execution that happens to scan N rows, rather
>>> than N independent SPI queries each taking a fresh snapshot. That's
>>> the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
>>> re-take GetLatestSnapshot() between rows either.
>>>
>>> Similarly, the permission check (schema USAGE + table SELECT) is now
>>> done once at cache entry creation in ri_FastPathGetEntry() rather than
>>> on every flush.
>>
>> nice improvement.
>>
>>> The RI check runs as the PK table owner, so we're
>>> verifying that the owner can access their own table -- a condition
>>> that won't change unless someone explicitly revokes from the owner,
>>> which would also break the SPI path.
>>>
>>>>> David Rowley mentioned off-list that it might be worth batching
>>>>> multiple FK values into a single index probe, leveraging the
>>>>> ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
>>>>> to buffer FK values across trigger invocations in the per-constraint
>>>>> cache (0002 already has the right structure for this), build a
>>>>> SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
>>>>> pages in one sorted traversal instead of one tree descent per row. The
>>>>> locking and recheck would still be per-tuple, but the index traversal
>>>>> cost drops significantly. Single-column FKs are the obvious starting
>>>>> point. That seems worth exploring but can be done as a separate patch
>>>>> on top of this.
>>>>
>>>> I will take a look at this in the following weeks.
>>>
>>> I ended up going ahead with the batching and SAOP idea that David
>>> mentioned -- I had a proof-of-concept working shortly after posting v3
>>> and kept iterating on it. So attached set is now:
>>>
>>> 0001 - Core fast path (your 0001+0002 reworked, as before)
>>>
>>> 0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)
>>>
>>> 0003 - FK row buffering: materialize FK tuples into a per-constraint
>>> batch buffer (64 rows), flush when full or at batch end
>>>
>>> 0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
>>> buffered FK values and do one index scan instead of 64 separate tree
>>> descents. Multi-column FKs fall back to a per-row loop.
>>>
>>> 0003 is pure infrastructure -- it doesn't improve performance on its
>>> own because the per-row index descent still dominates. The payoff
>>> comes in 0004.
>>>
>>> Numbers (same machine as before, median of 3 runs):
>>>
>>> numeric PK / bigint FK, 1M rows:
>>> master: 2487 ms
>>> 0001..0004: 1168 ms (2.1x)
>>>
>>> int PK / int FK, 500K rows:
>>> master: 1043 ms
>>> 0001..0004: 335 ms (3.1x)
>>>
>>> The int/int case benefits most because the per-row cost is lower, so
>>> the SAOP traversal savings are a larger fraction of the total. The
>>> numeric/bigint case still sees a solid improvement despite the
>>> cross-type cast overhead.
>>>
>>> Tomas Vondra also tested with an I/O-intensive workload (dataset
>>> larger than shared_buffers, combined with his and Peter Geoghegan's
>>> I/O prefetching patches) and confirmed that the batching + SAOP
>>> approach helps there too, not just in the CPU-bound / memory-resident
>>> case. In fact he showed that the patches here don't make a big dent
>>> when the main bottleneck is I/O as shown in numbers that he shared in
>>> an off-list email:
>>>
>>> master: 161617 ms
>>> ri-check (0001..0004): 149446 ms (1.08x)
>>> ri-check + i/o prefetching: 50885 ms (3.2x)
>>>
>>> So the RI patches alone only give ~8% here since most time is waiting
>>> on reads. But the batching gives the prefetch machinery a window of
>>> upcoming probes to issue readahead against, so the two together yield
>>> 3.2x.
>>
>> impressive!
>>
>>>
>>> Tomas also caught a memory context bug in the batch flush path: the
>>> cached scandesc lives in TopTransactionContext, but the btree AM
>>> defers _bt_preprocess_keys allocation to the first getnext call, which
>>> pallocs into CurrentMemoryContext. If that's a short-lived
>>> per-trigger-row context, the scandesc has dangling pointers on the
>>> next rescan. Fixed by switching to TopTransactionContext before the
>>> probe loop.
>>>
>>> Finally, I've fixed a number of other small and not-so-small bugs
>>> found while polishing the old patches and made other stylistic
>>> improvements. One notable change is that I introduced a FastPathMeta
>>
>> Yeah, this is much better than the fpmeta_valid field.
>>
>>> struct to store the fast path metadata instead of dumping those arrays
>>> in the RI_ConstraintInfo. It's allocated lazily on first use and holds
>>> the per-key compare entries, operator procedures, and index strategy
>>> info needed by the scan key construction, so RI_ConstraintInfo doesn't
>>> pay for them when the fast path isn't used.
>>>
>>>
>>> On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <[email protected]> wrote:
>>>>
>>>> Hi Amit,
>>>>
>>>> On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi Junwang,
>>>>>
>>>>> On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <[email protected]> wrote:
>>>>>> As Amit has already stated, we are approaching a hybrid "fast-path +
>>>>>> fallback"
>>>>>> design.
>>>>>>
>>>>>> 0001 adds a fast path optimization for foreign key constraint checks
>>>>>> that bypasses the SPI executor, the fast path applies when the referenced
>>>>>> table is not partitioned, and the constraint does not involve temporal
>>>>>> semantics.
>>>>>>
>>>>>> With the following test:
>>>>>>
>>>>>> create table pk (a numeric primary key);
>>>>>> create table fk (a bigint references pk);
>>>>>> insert into pk select generate_series(1, 2000000);
>>>>>>
>>>>>> head:
>>>>>>
>>>>>> [local] zhjwpku@postgres:5432-90419=# insert into fk select
>>>>>> generate_series(1, 2000000, 2);
>>>>>> INSERT 0 1000000
>>>>>> Time: 13516.177 ms (00:13.516)
>>>>>>
>>>>>> [local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
>>>>>> UPDATE 1000000
>>>>>> Time: 15057.638 ms (00:15.058)
>>>>>>
>>>>>> patched:
>>>>>>
>>>>>> [local] zhjwpku@postgres:5432-98673=# insert into fk select
>>>>>> generate_series(1, 2000000, 2);
>>>>>> INSERT 0 1000000
>>>>>> Time: 8248.777 ms (00:08.249)
>>>>>>
>>>>>> [local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
>>>>>> UPDATE 1000000
>>>>>> Time: 10117.002 ms (00:10.117)
>>>>>>
>>>>>> 0002 cache fast-path metadata used by the index probe, at the current
>>>>>> time only comparison operator hash entries, operator function OIDs
>>>>>> and strategy numbers and subtypes for index scans. But this cache
>>>>>> doesn't buy any performance improvement.
>>>>>>
>>>>>> Caching additional metadata should improve performance for foreign key
>>>>>> checks.
>>>>>>
>>>>>> Amit suggested introducing a mechanism for ri_triggers.c to register a
>>>>>> cleanup callback in the EState, which AfterTriggerEndQuery() could then
>>>>>> invoke to release per-statement cached metadata (such as the
>>>>>> IndexScanDesc).
>>>>>> However, I haven't been able to implement this mechanism yet.
>>>>>
>>>>> Thanks for working on this. I've taken your patches as a starting
>>>>> point and reworked the series into two patches (attached): 1st is your
>>>>> 0001+0002 as the core patch that adds a gated fast-path alternative to
>>>>> SPI and 2nd where I added per-statement resource caching. Doing the
>>>>> latter turned out to be not so hard thanks to the structure you chose
>>>>> to build the core fast path. Good call on adding the RLS and ACL test
>>>>> cases, btw.
>>>>>
>>>>> So, 0001 is a functionally complete fast path: concurrency handling,
>>>>> REPEATABLE READ crosscheck, cross-type operators, security context,
>>>>> and metadata caching. 0002 implements the per-statement resource
>>>>> caching we discussed, though instead of sharing the EState between
>>>>> trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
>>>>> mechanism that fires at the end of each trigger-firing cycle
>>>>> (per-statement for immediate constraints, or until COMMIT for deferred
>>>>> ones). It layers resource caching on top so that the PK relation,
>>>>> index, scan descriptor, and snapshot stay open across all FK trigger
>>>>> invocations within a single trigger-firing cycle rather than being
>>>>> opened and closed per row.
>>>>>
>>>>> Note that phe previous 0002 (metadata caching) is folded into 0001,
>>>>> and most of the new fast-path logic added in 0001 now lives in
>>>>> ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
>>>>> RI_FKey_check diff is just the gating call and SPI fallback.
>>>>>
>>>>> I re-ran the benchmarks (same test as yours, different machine):
>>>>>
>>>>> create table pk (a numeric primary key);
>>>>> create table fk (a bigint references pk);
>>>>> insert into pk select generate_series(1, 2000000);
>>>>> insert into fk select generate_series(1, 2000000, 2);
>>>>>
>>>>> master: 2444 ms (median of 3 runs)
>>>>> 0001: 1382 ms (43% faster)
>>>>> 0001+0002: 1202 ms (51% faster, 13% over 0001 alone)
>>>>
>>>> I can get similar improvement on my old mac intel chip:
>>>>
>>>> master: 12963.993 ms
>>>> 0001: 6641.692 ms, 48.8% faster
>>>> 0001+0002: 5771.703 ms, 55.5% faster
>>>>
>>>>>
>>>>> Also, with int PK / int FK (1M rows):
>>>>>
>>>>> create table pk (a int primary key);
>>>>> create table fk (a int references pk);
>>>>> insert into pk select generate_series(1, 1000000);
>>>>> insert into fk select generate_series(1, 1000000);
>>>>>
>>>>> master: 1000 ms
>>>>> 0001: 520 ms (48% faster)
>>>>> 0001+0002: 432 ms (57% faster, 17% over 0001 alone)
>>>>
>>>> master: 11134.583 ms
>>>> 0001: 5240.298 ms, 52.9% faster
>>>> 0001+0002: 4554.215 ms, 59.1% faster
>>>>
>>>>>
>>>>> The incremental gain from 0002 comes from eliminating per-row relation
>>>>> open/close, scan begin/end, slot alloc/free, and replacing per-row
>>>>> GetSnapshotData() with only curcid adjustment on the registered
>>>>> snapshot copy in the cache.
>>>>>
>>>>> The two current limitations are partitioned referenced tables and
>>>>> temporal foreign keys. Partitioned PKs are relatively uncommon in
>>>>> practice, so the non-partitioned case should cover most FK workloads,
>>>>> so I'm not sure it's worth the added complexity to support them.
>>>>> Temporal FKs are inherently multi-row, so they're a poor fit for a
>>>>> single-probe fast path.
>>>>>
>>>>> David Rowley mentioned off-list that it might be worth batching
>>>>> multiple FK values into a single index probe, leveraging the
>>>>> ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
>>>>> to buffer FK values across trigger invocations in the per-constraint
>>>>> cache (0002 already has the right structure for this), build a
>>>>> SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
>>>>> pages in one sorted traversal instead of one tree descent per row. The
>>>>> locking and recheck would still be per-tuple, but the index traversal
>>>>> cost drops significantly. Single-column FKs are the obvious starting
>>>>> point. That seems worth exploring but can be done as a separate patch
>>>>> on top of this.
>>>>
>>>> I will take a look at this in the following weeks.
>>>>
>>>>>
>>>>> I think the series is in reasonable shape but would appreciate extra
>>>>> eyeballs, especially on the concurrency handling in ri_LockPKTuple()
>>>>> in 0001 and the snapshot lifecycle in 0002. Or anything else that
>>>>> catches one's eye.
>>>>>
>>>>> --
>>>>> Thanks, Amit Langote
>>>>
>>>> I don't have any additional comments on the patch except one minor nit,
>>>> maybe merge the following two if conditions into one, not a strong opinion
>>>> though.
>>>>
>>>> if (use_cache)
>>>> {
>>>> /*
>>>> * The snapshot was registered once when the cache entry was created.
>>>> * We just patch curcid to reflect the new command counter.
>>>> * SnapshotSetCommandId() only patches process-global statics, not
>>>> * registered copies, so we do it directly.
>>>> *
>>>> * The xmin/xmax/xip fields don't need refreshing: within a single
>>>> * statement batch, only curcid changes between rows.
>>>> */
>>>> Assert(fpentry && fpentry->snapshot != NULL);
>>>> snapshot = fpentry->snapshot;
>>>> snapshot->curcid = GetCurrentCommandId(false);
>>>> }
>>>> else
>>>> snapshot = RegisterSnapshot(GetLatestSnapshot());
>>>>
>>>> if (use_cache)
>>>> {
>>>> pk_rel = fpentry->pk_rel;
>>>> idx_rel = fpentry->idx_rel;
>>>> scandesc = fpentry->scandesc;
>>>> slot = fpentry->slot;
>>>> }
>>>> else
>>>> {
>>>> pk_rel = table_open(riinfo->pk_relid, RowShareLock);
>>>> idx_rel = index_open(riinfo->conindid, AccessShareLock);
>>>> scandesc = index_beginscan(pk_rel, idx_rel,
>>>> snapshot, NULL,
>>>> riinfo->nkeys, 0);
>>>> slot = table_slot_create(pk_rel, NULL);
>>>> }
>>>>
>>>> --
>>>> Regards
>>>> Junwang Zhao
>>>
>>>
>>>
>>> --
>>> Thanks, Amit Langote
>>
>>
>>
>> --
>> Regards
>> Junwang Zhao
>
> I had an offline discussion with Amit today. There were a few small things
> that could be improved, so I posted a new version of the patch set.
>
> 1.
>
> + if (ri_fastpath_is_applicable(riinfo))
> + {
> + bool found = ri_FastPathCheck(riinfo, fk_rel, newslot);
> +
> + if (found)
> + return PointerGetDatum(NULL);
> +
> + /*
> + * ri_FastPathCheck opens pk_rel internally; we need it for
> + * ri_ReportViolation. Re-open briefly.
> + */
> + pk_rel = table_open(riinfo->pk_relid, RowShareLock);
> + ri_ReportViolation(riinfo, pk_rel, fk_rel,
> + newslot, NULL,
> + RI_PLAN_CHECK_LOOKUPPK, false, false);
> + }
>
> Move ri_ReportViolation into ri_FastPathCheck, so table_open is no
> longer needed, and ri_FastPathCheck now returns void. Since Amit
> agreed this is the right approach, I included it directly in v5-0001.
>
> 2.
>
> After adding the batch fast path, the original ri_FastPathCheck is only
> used by the ALTER TABLE validation path. This path cannot use the
> cache because the registered AfterTriggerBatch callback will never run.
> Therefore, the use_cache branch can be removed.
>
> I made this change in v5-0004 and also updated some related comments.
> Once we agree the changes are correct, it can be merged into v5-0003.
>
> 3.
>
> + fk_slot = MakeSingleTupleTableSlot(RelationGetDescr(fk_rel),
> + &TTSOpsHeapTuple);
>
> ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
> RI_FastPathEntry. I tried caching it in v5-0006 and ran some benchmarks,
> it didn't show much improvement. This might be because the slot creation
> function is called once per batch rather than once per row, so the overall
> impact is minimal. I'm posting this here for Amit to take a look and decide
> whether we should adopt it or drop it, since I mentioned the idea to
> him earlier.
>
> 4.
>
> ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
> single-column checks. I asked whether this could be extended to support
> multi-column cases, and Amit encouraged me to look into it.
>
> After a brief investigation, it seems that ScanKeyEntryInitialize only allows
> passing a single subtype/collation/procedure, which makes it difficult to
> handle multiple types. Based on this, my current understanding is that
> SK_SEARCHARRAY may not work for multi-column checks.
>
> --
> Regards
> Junwang Zhao
> <v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patch><v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patch><v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch><v5-0004-Refine-fast-path-FK-validation-path.patch><v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patch><v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patch>
