On Thu, Jul 31, 2025 at 09:53:20AM -0700, Paul A Jungwirth wrote:
> Ian Ilyasov and I reviewed this patch. We think it is ready to commit
> to back branches.
Thank you!

> We assume you will also un-revert 8e7e672cda ("WAL-log inplace update
> before revealing it to other sessions.")?  We didn't look closely at
> that patch, but it seems like there are no known problems with it.  It
> was just reverted because it depends on this patch.

Right.  I'm fine self-certifying that one.

> Is there any way to add more testing around non-transactional
> invalidations?  It is a new "feature" but it is not really tested
> anywhere.  I don't think we could do this with regress tests, but
> perhaps isolation tests would be suitable.

I think src/test/isolation/specs/inplace-inval.spec is testing it.

> Some of the comments felt a bit compressed.  They make sense in the
> context of this fix, but reading them cold seems like it will be
> challenging.  For example this took a lot of thinking to follow:
>
>  * Construct shared cache inval if necessary.  Because we pass a tuple
>  * version without our own inplace changes or inplace changes other
>  * sessions complete while we wait for locks, inplace update mustn't
>  * change catcache lookup keys.  But we aren't bothering with index
>  * updates either, so that's true a fortiori.

What do you think of the attached rewrite?  I also removed this part:

-	 * If we're mutating a tuple visible only to this transaction, there's an
-	 * equivalent transactional inval from the action that created the tuple,
-	 * and this inval is superfluous.

That would have needed s/this transaction/this command/ to be correct, and I
didn't feel it was saying something important enough to keep.  There are
plenty of ways for invals to be redundant.

> Or this:
>
>  * WAL contains likely-unnecessary commit-time invals from the
>  * CacheInvalidateHeapTuple() call in heap_inplace_update().
>
> Why likely-unnecessary?  I know you explain it at that callsite, but
> some hint might help here.

On further study, I think one could construct a logical decoding output
plugin for which it's necessary.  I've detailed that in the attached edits.
This was in the heap_decode() changes, which is the part of the patch I
understand the least.  I likely should have made it a separate patch.
Here's the surrounding change I made in 2024, in context diff format:

*** 508,530 ****
heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
  			/*
  			 * Inplace updates are only ever performed on catalog tuples and
! 			 * can, per definition, not change tuple visibility.  Since we
! 			 * don't decode catalog tuples, we're not interested in the
! 			 * record's contents.
  			 *
! 			 * In-place updates can be used either by XID-bearing transactions
! 			 * (e.g. in CREATE INDEX CONCURRENTLY) or by XID-less
! 			 * transactions (e.g. VACUUM).  In the former case, the commit
! 			 * record will include cache invalidations, so we mark the
! 			 * transaction as catalog modifying here.  Currently that's
! 			 * redundant because the commit will do that as well, but once we
! 			 * support decoding in-progress relations, this will be important.
  			 */
- 			if (!TransactionIdIsValid(xid))
- 				break;
-
- 			(void) SnapBuildProcessChange(builder, xid, buf->origptr);
- 			ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
  			break;

  		case XLOG_HEAP_CONFIRM:
--- 508,526 ----
  			/*
  			 * Inplace updates are only ever performed on catalog tuples and
! 			 * can, per definition, not change tuple visibility.  Inplace
! 			 * updates don't affect storage or interpretation of table rows,
! 			 * so they don't affect logicalrep_write_tuple() outcomes.  Hence,
! 			 * we don't process invalidations from the original operation.  If
! 			 * inplace updates did affect those things, invalidations wouldn't
! 			 * make it work, since there are no snapshot-specific versions of
! 			 * inplace-updated values.  Since we also don't decode catalog
! 			 * tuples, we're not interested in the record's contents.
  			 *
! 			 * WAL contains likely-unnecessary commit-time invals from the
! 			 * CacheInvalidateHeapTuple() call in heap_inplace_update().
! 			 * Excess invalidation is safe.
  			 */
  			break;

  		case XLOG_HEAP_CONFIRM:

This code had been unchanged since commit b89e1510 (v9.4) introduced logical
decoding.  I'm making the following assumptions about the old code:

- I guessed "decoding in-progress relations" was a typo for "decoding
  in-progress transactions", something we've supported since at least commit
  4648243 "Add support for streaming to built-in logical replication"
  (2020-09, v14).  If it's not a typo, I don't know what "in-progress
  relations" denoted here.

- It said "redundant because the commit will do that as well", but I didn't
  find such code.  I bet it referenced the DecodeCommit() lines removed in
  commit c55040c "WAL Log invalidations at command end with wal_level=logical"
  (2020-07, v14).  The same commit likely made the
  ReorderBufferXidSetCatalogChanges() call obsolete.

- I had no idea why we call SnapBuildProcessChange().  Every other caller
  uses its return value.  I guess there was a desire for its snapshot side
  effects, but I didn't follow that.  Nothing snapshot-relevant happens at
  XLOG_HEAP_INPLACE.  Removing this call doesn't break any test on v13 or on
  v9.4.  Similarly, no test fails after removing both this and the
  ReorderBufferXidSetCatalogChanges() call.

- In v9.4, this area of code (per-heapam-record-type decoding) had inval
  responsibilities.  That went away in v14, so there's no need for the
  comment here to keep discussing invals.

I now think it would be prudent to omit the non-comment heap_decode() changes
from the back-patch.  While the above assumptions argue against needing the
removed ReorderBufferXidSetCatalogChanges(), that's the sort of speculation I
should keep out of back-branch changes.

> It's a bit surprising that wrongly leaving relhasindex=t is safe (for
> example after BEGIN; CREATE INDEX; ROLLBACK;).  I guess this column is
> just to save us a lookup for tables with no index, and no harm is done
> if we do the lookup needlessly but find no indexes.  And vacuum can
> repair it later.  Still it's a little unnerving.

DROP INDEX doesn't clear it, either (see the psql sketch below).  That's
longstanding, and it doesn't involve narrow race conditions.  Hence, I'm not
worrying about it.  If it were broken, we'd have heard by now.

> On Thu, Oct 31, 2024 at 09:20:52PM -0700, Noah Misch wrote:
> > Here, one of the autovacuum workers had the guilty stack trace, appearing
> > at the end of this message.  heap_inplace_update_and_unlock() calls
> > CacheInvalidateHeapTupleInplace() while holding BUFFER_LOCK_EXCLUSIVE on a
> > buffer of pg_class.  CacheInvalidateHeapTupleInplace() may call
> > CatalogCacheInitializeCache(), which opens the cache's rel.  If there's
> > not a valid relcache entry for the catcache's rel, we scan pg_class to
> > make a valid relcache entry.  The ensuing hang makes sense.
>
> Personally I never expected that catcache could depend on relcache,
> since it seems lower-level.  But it makes sense that you need a
> relcache of pg_class at least, so their relationship is more
> complicated than just layers.

Yes, relcache.c is indeed ... eclectic.
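Returning to the relhasindex point: here's a minimal psql sketch of the stale
flag and of vacuum repairing it (table and index names are arbitrary, not
from the patch):

CREATE TABLE stale_demo (c int);
CREATE INDEX stale_demo_idx ON stale_demo (c);
DROP INDEX stale_demo_idx;
SELECT relhasindex FROM pg_class WHERE relname = 'stale_demo';  -- still t
VACUUM stale_demo;
SELECT relhasindex FROM pg_class WHERE relname = 'stale_demo';  -- now f

Consumers that see relhasindex=t still consult pg_index for the authoritative
list, and an empty list is a normal outcome.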
> PrepareToInvalidateCacheTuple(Relation relation,
> 							  HeapTuple tuple,
> 							  HeapTuple newtuple,
> -							  void (*function) (int, uint32, Oid))
> +							  void (*function) (int, uint32, Oid, void *),
> +							  void *context)
>
> It's a little odd that PrepareToInvalidateCacheTuple takes a callback
> function when it only has one caller, so it always calls
> RegisterCatcacheInvalidation.  Is it just to avoid adding dependencies
> to inval.c?  But it already #includes catcache.h and contains lots of
> knowledge about catcache specifics.  Maybe originally
> PrepareToInvalidateCacheTuple was built to take
> RegisterRelcacheInvalidation as well?  Is it worth still passing the
> callback?

Looking at the history, it did have two callbacks in Postgres95 1.01.  It was
down to one callback by the time it got the name
PrepareToInvalidateCacheTuple(), in commit 81d08fc of 2001-01.  I think the
main alternative to today's callback would be to make
RegisterCatcacheInvalidation() an extern for catcache.c to call.  All the
other Register* functions would remain static.  I left things unchanged,
since that would be cleaner in one way and dirtier in another.

> @@ -6511,6 +6544,7 @@ heap_inplace_unlock(Relation relation,
>  {
>  	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>  	UnlockTuple(relation, &oldtup->t_self, InplaceUpdateTupleLock);
> +	ForgetInplace_Inval();
>  }
>
> Is this the right place to add this?  We think on balance yes, but the
> question crossed my mind: Clearing the invals seems like a separate
> responsibility from unlocking the buffer & tuple.  After this patch,
> our only remaining caller of heap_inplace_unlock is
> systable_inplace_update_cancel, so perhaps it should call
> ForgetInplace_Inval itself?  OTOH we like that putting it here
> guarantees it gets called, as a complement to building the invals in
> heap_inplace_lock.

Style principles behind the placement:

- Inplace update puts some responsibilities in genam.c and others in
  heapam.c.  A given task, e.g. draining the inval queue, should
  consistently appear on the same side of that boundary.

- The heapam.c side of inplace update should cover roughly as much as the
  heapam.c side of heap_update().  heap_update() handles invals and the
  critical section, so the heapam.c side of inplace update should, too.

It wouldn't take a strong reason to override those principles.
postgr.es/m/20240831011043.2b.nmi...@google.com has an inventory of the ways
to assign inval responsibility to heapam.c vs. genam.c.
postgr.es/m/20241013004511.9c.nmi...@google.com has a related discussion,
about the need for AtInplace_Inval() to be in the critical section.

On Sun, Aug 03, 2025 at 12:32:31PM -0700, Paul A Jungwirth wrote:
[details of the hang]
> The concurrency, then, is analyzing pg_attribute in one worker and
> pg_class in another.  No locks would prevent that.  It doesn't have to
> be pg_attribute; it could be some user table.  But since the repro
> script adds & drops tables, pg_attribute is getting churned.

Thanks for that detailed write-up!  I agree with your findings.
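For anyone wanting to watch the XLOG_HEAP_INPLACE decode path discussed
above, here's a minimal sketch using the test_decoding contrib module
(assumes wal_level=logical; slot and table names are arbitrary, not from the
patch):

SELECT pg_create_logical_replication_slot('inplace_demo', 'test_decoding');
CREATE TABLE inplace_demo_tbl (c int);
INSERT INTO inplace_demo_tbl VALUES (1);
-- VACUUM (FREEZE) inplace-updates the table's pg_class row, emitting
-- XLOG_HEAP_INPLACE:
VACUUM (FREEZE) inplace_demo_tbl;
SELECT data FROM pg_logical_slot_get_changes('inplace_demo', NULL, NULL);
SELECT pg_drop_replication_slot('inplace_demo');

The get_changes output shows only the INSERT; the inplace record itself
decodes to nothing, matching the heap_decode() comment.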
From: Noah Misch <n...@leadboat.com>

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 843c2e5..16f7d78 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -199,3 +199,35 @@ under a reader holding a pin.  A reader of a heap_fetch() result tuple may
 witness a torn read.  Current inplace-updated fields are aligned and are no
 wider than four bytes, and current readers don't need consistency across
 fields.  Hence, they get by with just fetching each field once.
+
+During logical decoding, caches reflect an inplace update no later than the
+next XLOG_XACT_INVALIDATIONS.  That record witnesses the end of a command.
+Tuples of its cmin are then visible to decoding, as are inplace updates of any
+lower LSN.  Inplace updates of a higher LSN may also be visible, even if those
+updates would have been invisible to a non-historic snapshot matching
+decoding's historic snapshot.  (In other words, decoding may see inplace
+updates that were not visible to a similar snapshot taken during original
+transaction processing.)  That's a consequence of inplace update violating
+MVCC: there are no snapshot-specific versions of inplace-updated values.  This
+all makes it hard to reason about inplace-updated column reads during logical
+decoding, but the behavior does suffice for relhasindex.  A relhasindex=t in
+CREATE INDEX becomes visible no later than the new pg_index row.  While it may
+be visible earlier, that's harmless.  Finding zero indexes despite
+relhasindex=t is normal in more cases than this, e.g. after DROP INDEX.
+Example of a case that meaningfully reacts to the inplace inval:
+
+CREATE TABLE cat (c int) WITH (user_catalog_table = true);
+CREATE TABLE normal (d int);
+...
+CREATE INDEX ON cat (c)\; INSERT INTO normal VALUES (1);
+
+If the output plugin reads "cat" during decoding of the INSERT, it's fair to
+want that read to see relhasindex=t and use the new index.
+
+An alternative would be to have decoding of XLOG_HEAP_INPLACE immediately
+execute its invals.  That would behave more like invals during original
+transaction processing.  It would remove the decoding-specific delay in e.g. a
+decoding plugin witnessing a relfrozenxid change.  However, a good use case
+for that is unlikely, since the plugin would still witness relfrozenxid
+changes prematurely.  Hence, inplace update takes the trivial approach of
+delegating to XLOG_XACT_INVALIDATIONS.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7491cc3..a1ef191 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6371,13 +6371,28 @@ heap_inplace_lock(Relation relation,
 	Assert(BufferIsValid(buffer));
 
 	/*
-	 * Construct shared cache inval if necessary.  Because we pass a tuple
-	 * version without our own inplace changes or inplace changes other
-	 * sessions complete while we wait for locks, inplace update mustn't
-	 * change catcache lookup keys.  But we aren't bothering with index
-	 * updates either, so that's true a fortiori.  After LockBuffer(), it
-	 * would be too late, because this might reach a
-	 * CatalogCacheInitializeCache() that locks "buffer".
+	 * Register shared cache invals if necessary.  Our input to inval can be
+	 * weaker than heap_update() input to inval in these ways:
+	 *
+	 * - This passes only the old version of the tuple.  Inval reacts only to
+	 * catcache lookup key columns and pg_class.oid values stored in
+	 * relcache-relevant catalog columns.  All of those columns are indexed.
+	 * Inplace update mustn't be used for any operations that could change
+	 * those.  Hence, the new tuple would provide no additional inval-relevant
+	 * information.  Those facts also make it fine to skip updating indexes.
+	 *
+	 * - Other sessions may finish inplace updates of this tuple between this
+	 * step and LockTuple().  That's fine for the same reason: those inplace
+	 * updates mustn't be changing columns that affect inval decisions.
+	 *
+	 * - The xwait found below may COMMIT between now and this function
+	 * returning, making the tuple dead.  That can change inval decisions, so
+	 * we'll later react to it by forgetting the inval before returning.
+	 * While it's tempting to just register invals after we've confirmed no
+	 * xwait will COMMIT, the following obstacle precludes reordering steps
+	 * that way.  Registering invals might reach a CatalogCacheInitializeCache()
+	 * that locks "buffer".  That would hang indefinitely if running after
+	 * our own LockBuffer().  Hence, we must register invals before LockBuffer().
 	 */
 	CacheInvalidateHeapTupleInplace(relation, oldtup_ptr, NULL);
 
@@ -6617,10 +6632,6 @@ heap_inplace_update_and_unlock(Relation relation,
 	/*
 	 * Send invalidations to shared queue.  SearchSysCacheLocked1() assumes we
 	 * do this before UnlockTuple().
-	 *
-	 * If we're mutating a tuple visible only to this transaction, there's an
-	 * equivalent transactional inval from the action that created the tuple,
-	 * and this inval is superfluous.
 	 */
 	AtInplace_Inval();
 
@@ -6631,10 +6642,10 @@
 	AcceptInvalidationMessages();	/* local processing of just-sent inval */
 
 	/*
-	 * Queue a transactional inval.  The immediate invalidation we just sent
-	 * is the only one known to be necessary.  To reduce risk from the
-	 * transition to immediate invalidation, continue sending a transactional
-	 * invalidation like we've long done.  Third-party code might rely on it.
+	 * Queue a transactional inval, for logical decoding and for third-party
+	 * code that might have been relying on it since long before inplace
+	 * update adopted immediate invalidation.  See README.tuplock section
+	 * "Reading inplace-updated columns" for logical decoding details.
 	 */
 	if (!IsBootstrapProcessingMode())
 		CacheInvalidateHeapTuple(relation, tuple, NULL);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index cc03f07..5e15cb1 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -521,18 +521,9 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 			/*
 			 * Inplace updates are only ever performed on catalog tuples and
-			 * can, per definition, not change tuple visibility.  Inplace
-			 * updates don't affect storage or interpretation of table rows,
-			 * so they don't affect logicalrep_write_tuple() outcomes.  Hence,
-			 * we don't process invalidations from the original operation.  If
-			 * inplace updates did affect those things, invalidations wouldn't
-			 * make it work, since there are no snapshot-specific versions of
-			 * inplace-updated values.  Since we also don't decode catalog
-			 * tuples, we're not interested in the record's contents.
-			 *
-			 * WAL contains likely-unnecessary commit-time invals from the
-			 * CacheInvalidateHeapTuple() call in
-			 * heap_inplace_update_and_unlock().  Excess invalidation is safe.
+			 * can, per definition, not change tuple visibility.  Since we
+			 * also don't decode catalog tuples, we're not interested in the
+			 * record's contents.
 			 */
 			break;