On Sat, Feb 28, 2026 at 3:08 PM Amit Langote <[email protected]> wrote: > > Hi Junwang, > > On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <[email protected]> wrote: > > On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <[email protected]> > > wrote: > > > I re-ran the benchmarks (same test as yours, different machine): > > > > > > create table pk (a numeric primary key); > > > create table fk (a bigint references pk); > > > insert into pk select generate_series(1, 2000000); > > > insert into fk select generate_series(1, 2000000, 2); > > > > > > master: 2444 ms (median of 3 runs) > > > 0001: 1382 ms (43% faster) > > > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone) > > > > I can get similar improvement on my old Intel Mac: > > > > master: 12963.993 ms > > 0001: 6641.692 ms, 48.8% faster > > 0001+0002: 5771.703 ms, 55.5% faster > > > > > > Also, with int PK / int FK (1M rows): > > > > > > create table pk (a int primary key); > > > create table fk (a int references pk); > > > insert into pk select generate_series(1, 1000000); > > > insert into fk select generate_series(1, 1000000); > > > > > > master: 1000 ms > > > 0001: 520 ms (48% faster) > > > 0001+0002: 432 ms (57% faster, 17% over 0001 alone) > > > > master: 11134.583 ms > > 0001: 5240.298 ms, 52.9% faster > > 0001+0002: 4554.215 ms, 59.1% faster > > Thanks for testing, good to see similar numbers. I had forgotten to > note that these results are from runs where the PK index probes don't do any > I/O, though you might be aware of that. Below, I report some numbers > that Tomas Vondra shared with me off-list where the probes do have to > perform I/O, and there the benefits from this patch set alone are only > marginal. > > > I don't have any additional comments on the patch except one minor nit, > > maybe merge the following two if conditions into one, not a strong opinion > > though. > > > > if (use_cache) > > { > > /* > > * The snapshot was registered once when the cache entry was created. 
> > * We just patch curcid to reflect the new command counter. > > * SnapshotSetCommandId() only patches process-global statics, not > > * registered copies, so we do it directly. > > * > > * The xmin/xmax/xip fields don't need refreshing: within a single > > * statement batch, only curcid changes between rows. > > */ > > Assert(fpentry && fpentry->snapshot != NULL); > > snapshot = fpentry->snapshot; > > snapshot->curcid = GetCurrentCommandId(false); > > } > > else > > snapshot = RegisterSnapshot(GetLatestSnapshot()); > > > > if (use_cache) > > { > > pk_rel = fpentry->pk_rel; > > idx_rel = fpentry->idx_rel; > > scandesc = fpentry->scandesc; > > slot = fpentry->slot; > > } > > else > > { > > pk_rel = table_open(riinfo->pk_relid, RowShareLock); > > idx_rel = index_open(riinfo->conindid, AccessShareLock); > > scandesc = index_beginscan(pk_rel, idx_rel, > > snapshot, NULL, > > riinfo->nkeys, 0); > > slot = table_slot_create(pk_rel, NULL); > > } > > Good idea, done. > > While polishing 0002, I revisited the snapshot caching semantics. The > previous commit message hand-waved about only curcid changing between > rows, but GetLatestSnapshot() also reflects other backends' commits, > so reusing the snapshot is a deliberate semantic change from the SPI > path. I think it's safe: curcid is all we need for > intra-statement visibility; concurrent commits either already happened > before our snapshot (and are visible) or are racing with our statement > and wouldn't be seen reliably even with per-row snapshots, since the > order in which FK rows are checked is nondeterministic; and > LockTupleKeyShare prevents the PK row from disappearing regardless. In > essence, we're treating all the FK checks within a trigger-firing > cycle as a single plan execution that happens to scan N rows, rather > than N independent SPI queries each taking a fresh snapshot. That's > the natural model -- a normal SELECT ... 
FOR KEY SHARE plan doesn't > re-take GetLatestSnapshot() between rows either. > > Similarly, the permission check (schema USAGE + table SELECT) is now > done once at cache entry creation in ri_FastPathGetEntry() rather than > on every flush.
nice improvement. > The RI check runs as the PK table owner, so we're > verifying that the owner can access their own table -- a condition > that won't change unless someone explicitly revokes from the owner, > which would also break the SPI path. > > > > David Rowley mentioned off-list that it might be worth batching > > > multiple FK values into a single index probe, leveraging the > > > ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be > > > to buffer FK values across trigger invocations in the per-constraint > > > cache (0002 already has the right structure for this), build a > > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf > > > pages in one sorted traversal instead of one tree descent per row. The > > > locking and recheck would still be per-tuple, but the index traversal > > > cost drops significantly. Single-column FKs are the obvious starting > > > point. That seems worth exploring but can be done as a separate patch > > > on top of this. > > > > I will take a look at this in the following weeks. > > I ended up going ahead with the batching and SAOP idea that David > mentioned -- I had a proof-of-concept working shortly after posting v3 > and kept iterating on it. So attached set is now: > > 0001 - Core fast path (your 0001+0002 reworked, as before) > > 0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot) > > 0003 - FK row buffering: materialize FK tuples into a per-constraint > batch buffer (64 rows), flush when full or at batch end > > 0004 - SK_SEARCHARRAY for single-column FKs: build an array from the > buffered FK values and do one index scan instead of 64 separate tree > descents. Multi-column FKs fall back to a per-row loop. > > 0003 is pure infrastructure -- it doesn't improve performance on its > own because the per-row index descent still dominates. The payoff > comes in 0004. 
> > Numbers (same machine as before, median of 3 runs): > > numeric PK / bigint FK, 1M rows: > master: 2487 ms > 0001..0004: 1168 ms (2.1x) > > int PK / int FK, 500K rows: > master: 1043 ms > 0001..0004: 335 ms (3.1x) > > The int/int case benefits most because the per-row cost is lower, so > the SAOP traversal savings are a larger fraction of the total. The > numeric/bigint case still sees a solid improvement despite the > cross-type cast overhead. > > Tomas Vondra also tested with an I/O-intensive workload (dataset > larger than shared_buffers, combined with his and Peter Geoghegan's > I/O prefetching patches) and confirmed that the batching + SAOP > approach helps there too, not just in the CPU-bound / memory-resident > case. That said, his numbers (shared in an off-list email) show that > the patches here alone don't make a big dent when the main bottleneck > is I/O: > > master: 161617 ms > ri-check (0001..0004): 149446 ms (1.08x) > ri-check + i/o prefetching: 50885 ms (3.2x) > > So the RI patches alone only give ~8% here since most time is waiting > on reads. But the batching gives the prefetch machinery a window of > upcoming probes to issue readahead against, so the two together yield > 3.2x. impressive! > > Tomas also caught a memory context bug in the batch flush path: the > cached scandesc lives in TopTransactionContext, but the btree AM > defers _bt_preprocess_keys allocation to the first getnext call, which > pallocs into CurrentMemoryContext. If that's a short-lived > per-trigger-row context, the scandesc has dangling pointers on the > next rescan. Fixed by switching to TopTransactionContext before the > probe loop. > > Finally, I've fixed a number of other small and not-so-small bugs > found while polishing the old patches and made other stylistic > improvements. One notable change is that I introduced a FastPathMeta Yeah, this is much better than the fpmeta_valid field. 
> struct to store the fast path metadata instead of dumping those arrays > in the RI_ConstraintInfo. It's allocated lazily on first use and holds > the per-key compare entries, operator procedures, and index strategy > info needed by the scan key construction, so RI_ConstraintInfo doesn't > pay for them when the fast path isn't used. > > > On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <[email protected]> wrote: > > > > Hi Amit, > > > > On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <[email protected]> > > wrote: > > > > > > Hi Junwang, > > > > > > On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <[email protected]> wrote: > > > > As Amit has already stated, we are approaching a hybrid "fast-path + > > > > fallback" > > > > design. > > > > > > > > 0001 adds a fast path optimization for foreign key constraint checks > > > > that bypasses the SPI executor, the fast path applies when the > > > > referenced > > > > table is not partitioned, and the constraint does not involve temporal > > > > semantics. 
> > > > > > > > With the following test: > > > > > > > > create table pk (a numeric primary key); > > > > create table fk (a bigint references pk); > > > > insert into pk select generate_series(1, 2000000); > > > > > > > > head: > > > > > > > > [local] zhjwpku@postgres:5432-90419=# insert into fk select > > > > generate_series(1, 2000000, 2); > > > > INSERT 0 1000000 > > > > Time: 13516.177 ms (00:13.516) > > > > > > > > [local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1; > > > > UPDATE 1000000 > > > > Time: 15057.638 ms (00:15.058) > > > > > > > > patched: > > > > > > > > [local] zhjwpku@postgres:5432-98673=# insert into fk select > > > > generate_series(1, 2000000, 2); > > > > INSERT 0 1000000 > > > > Time: 8248.777 ms (00:08.249) > > > > > > > > [local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1; > > > > UPDATE 1000000 > > > > Time: 10117.002 ms (00:10.117) > > > > > > > > 0002 caches fast-path metadata used by the index probe; at the current > > > > time only comparison operator hash entries, operator function OIDs, > > > > and strategy numbers and subtypes for index scans. But this cache > > > > doesn't buy any performance improvement. > > > > > > > > Caching additional metadata should improve performance for foreign key > > > > checks. > > > > > > > > Amit suggested introducing a mechanism for ri_triggers.c to register a > > > > cleanup callback in the EState, which AfterTriggerEndQuery() could then > > > > invoke to release per-statement cached metadata (such as the > > > > IndexScanDesc). > > > > However, I haven't been able to implement this mechanism yet. > > > > > > Thanks for working on this. I've taken your patches as a starting > > > point and reworked the series into two patches (attached): 1st is your > > > 0001+0002 as the core patch that adds a gated fast-path alternative to > > > SPI and 2nd where I added per-statement resource caching. 
Doing the > > > latter turned out to be not so hard thanks to the structure you chose > > > to build the core fast path. Good call on adding the RLS and ACL test > > > cases, btw. > > > > > > So, 0001 is a functionally complete fast path: concurrency handling, > > > REPEATABLE READ crosscheck, cross-type operators, security context, > > > and metadata caching. 0002 implements the per-statement resource > > > caching we discussed, though instead of sharing the EState between > > > trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback > > > mechanism that fires at the end of each trigger-firing cycle > > > (per-statement for immediate constraints, or until COMMIT for deferred > > > ones). It layers resource caching on top so that the PK relation, > > > index, scan descriptor, and snapshot stay open across all FK trigger > > > invocations within a single trigger-firing cycle rather than being > > > opened and closed per row. > > > > > > Note that the previous 0002 (metadata caching) is folded into 0001, > > > and most of the new fast-path logic added in 0001 now lives in > > > ri_FastPathCheck() rather than inline in RI_FKey_check(), so the > > > RI_FKey_check diff is just the gating call and SPI fallback. 
> > > > > > I re-ran the benchmarks (same test as yours, different machine): > > > > > > create table pk (a numeric primary key); > > > create table fk (a bigint references pk); > > > insert into pk select generate_series(1, 2000000); > > > insert into fk select generate_series(1, 2000000, 2); > > > > > > master: 2444 ms (median of 3 runs) > > > 0001: 1382 ms (43% faster) > > > 0001+0002: 1202 ms (51% faster, 13% over 0001 alone) > > > > I can get similar improvement on my old mac intel chip: > > > > master: 12963.993 ms > > 0001: 6641.692 ms, 48.8% faster > > 0001+0002: 5771.703 ms, 55.5% faster > > > > > > > > Also, with int PK / int FK (1M rows): > > > > > > create table pk (a int primary key); > > > create table fk (a int references pk); > > > insert into pk select generate_series(1, 1000000); > > > insert into fk select generate_series(1, 1000000); > > > > > > master: 1000 ms > > > 0001: 520 ms (48% faster) > > > 0001+0002: 432 ms (57% faster, 17% over 0001 alone) > > > > master: 11134.583 ms > > 0001: 5240.298 ms, 52.9% faster > > 0001+0002: 4554.215 ms, 59.1% faster > > > > > > > > The incremental gain from 0002 comes from eliminating per-row relation > > > open/close, scan begin/end, slot alloc/free, and replacing per-row > > > GetSnapshotData() with only curcid adjustment on the registered > > > snapshot copy in the cache. > > > > > > The two current limitations are partitioned referenced tables and > > > temporal foreign keys. Partitioned PKs are relatively uncommon in > > > practice, so the non-partitioned case should cover most FK workloads, > > > so I'm not sure it's worth the added complexity to support them. > > > Temporal FKs are inherently multi-row, so they're a poor fit for a > > > single-probe fast path. > > > > > > David Rowley mentioned off-list that it might be worth batching > > > multiple FK values into a single index probe, leveraging the > > > ScalarArrayOp btree improvements from PostgreSQL 17. 
The idea would be > > > to buffer FK values across trigger invocations in the per-constraint > > > cache (0002 already has the right structure for this), build a > > > SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf > > > pages in one sorted traversal instead of one tree descent per row. The > > > locking and recheck would still be per-tuple, but the index traversal > > > cost drops significantly. Single-column FKs are the obvious starting > > > point. That seems worth exploring but can be done as a separate patch > > > on top of this. > > > > I will take a look at this in the following weeks. > > > > > > > > I think the series is in reasonable shape but would appreciate extra > > > eyeballs, especially on the concurrency handling in ri_LockPKTuple() > > > in 0001 and the snapshot lifecycle in 0002. Or anything else that > > > catches one's eye. > > > > > > -- > > > Thanks, Amit Langote > > > > I don't have any additional comments on the patch except one minor nit, > > maybe merge the following two if conditions into one, not a strong opinion > > though. > > > > if (use_cache) > > { > > /* > > * The snapshot was registered once when the cache entry was created. > > * We just patch curcid to reflect the new command counter. > > * SnapshotSetCommandId() only patches process-global statics, not > > * registered copies, so we do it directly. > > * > > * The xmin/xmax/xip fields don't need refreshing: within a single > > * statement batch, only curcid changes between rows. 
> > */ > > Assert(fpentry && fpentry->snapshot != NULL); > > snapshot = fpentry->snapshot; > > snapshot->curcid = GetCurrentCommandId(false); > > } > > else > > snapshot = RegisterSnapshot(GetLatestSnapshot()); > > > > if (use_cache) > > { > > pk_rel = fpentry->pk_rel; > > idx_rel = fpentry->idx_rel; > > scandesc = fpentry->scandesc; > > slot = fpentry->slot; > > } > > else > > { > > pk_rel = table_open(riinfo->pk_relid, RowShareLock); > > idx_rel = index_open(riinfo->conindid, AccessShareLock); > > scandesc = index_beginscan(pk_rel, idx_rel, > > snapshot, NULL, > > riinfo->nkeys, 0); > > slot = table_slot_create(pk_rel, NULL); > > } > > > > -- > > Regards > > Junwang Zhao > > > > -- > Thanks, Amit Langote -- Regards Junwang Zhao
