I've been testing the problem reported by Dave Gould by running "make installcheck-parallel" together with a tight loop of "vacuum full pg_class":
    while true; do psql -c "vacuum full pg_class" regression; usleep 100000; done

Even after fixing the cache-reset-recovery-order problem I described yesterday, there are still occasional bizarre failures, for example in this bit of truncate.sql:

    BEGIN;
    SELECT * FROM trunc_f;
    TRUNCATE trunc_f;
    SELECT * FROM trunc_f;
    ROLLBACK;

    BEGIN;
    SELECT * FROM trunc_f;
    TRUNCATE ONLY trunc_f;
    ERROR: attempted to update invisible tuple

I'm not sure that there's only one bug remaining here, but I've identified this bug at least. In the first transaction above, the TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode value. Say the update invalidates the tuple at tid A and inserts the updated version at tid B. Then vacuum full gets control and rewrites pg_class. It correctly keeps both versions of the trunc_f tuple, but now they will be at new tid locations, say C and D.

The original transaction meanwhile was blocked trying to re-open pg_class to reload trunc_f's pg_class tuple into its catcaches. Now it successfully does that, and the tuple it reads has tid D. It finishes out the transaction, and rolls back as per script. In the rollback, we know that we have to flush catcache entries for tuple versions inserted by the failed transaction ... but what inval.c has stored is an inval event against pg_class tid B. There is no such entry in the catcaches, so nothing happens, and the entry with tid D stays valid. Ooops.

The relcache entry for trunc_f does get rebuilt correctly (without consulting the bogus catcache entry), so the next SELECT goes through all right. But when the second TRUNCATE wants to again update the pg_class tuple, it finds it in the catcaches ... and what it finds has tid D, so it attempts to heap_update that TID instead of the actually live tuple at tid C.

In this particular case, the incorrectly targeted tuple is the just-invalidated version of trunc_f, so heap_update sees it as XMIN_INVALID and fails with "attempted to update invisible tuple". The potential consequences are hugely worse than that, though: with a bit more time between the two operations, it'd be possible for someone else to reclaim the dead tuple and replace it with something else. As long as the TID is live when we get to it, heap_update will blindly replace it, whether or not it has anything to do with the tuple we think we're replacing. I suspect that some of the other bizarre cases I'm seeing are artifacts of that type of outcome, but they're hard to reproduce so I've not been able to pin them down yet.

In principle this same type of failure could have occurred before 9.0, since catcache inval has always been TID-based. It turns out we were saved by the fact that the old VACUUM FULL implementation would chicken out and not do any tuple moving if it found any INSERT_IN_PROGRESS or DELETE_IN_PROGRESS tuples. The pre-9.0 versions of CLUSTER reject such tuples too, so I don't think we need to do anything in those branches. But in 9.0 and up, we have a problem.

So far I've thought of two possible avenues to fix it:

1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC FULL or CLUSTER just finished on a system catalog), enter that message into the transaction's local inval list as well as executing it immediately. On transaction abort, redo the resulting cache flushes a second time. This would ensure we flushed any bad entries even though they were logged with obsolete TIDs elsewhere in the inval list. (We might need to do this on commit too, though right now I don't think so. Also, a cache reset event would a fortiori have to queue a similar entry to blow things away a second time at transaction end.)

2. Forget about targeting catcache invals by TID, and instead just use the key hash value to determine which cache entries to drop.

Approach #2 seems a lot less invasive and more trustworthy, but it has the disadvantage that cache invals would become more likely to blow away entries unnecessarily (because of chance hashvalue collisions), even without any VACUUM FULL being done. If we could make approach #1 work reliably, it would result in more overhead during VACUUM FULL but less at other times --- or at least we could hope so. In an environment where lots of sinval overflows and consequent resets happen, we might come out behind due to doubling the number of catcache flushes forced by a reset event.

Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach?
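To make #2 a bit more concrete, here is roughly the shape of the idea as a hand-wavy standalone sketch; the structure and function names below are made up for illustration and are not the real catcache code. The inval message carries the lookup key's hash value, and we kill every entry in the matching bucket whose stored hash agrees, without ever consulting a TID:

    /*
     * Illustrative only: a toy cache whose entries are keyed by the hash of
     * their lookup key.  Invalidation matches on that hash value rather than
     * on a stored TID, so it still finds the entries after VACUUM FULL has
     * moved the underlying catalog tuples to new TIDs.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BUCKETS 256

    typedef struct ToyCatCTup
    {
        uint32_t hashValue;         /* hash of the cache lookup key */
        bool     dead;              /* marked for removal */
        struct ToyCatCTup *next;    /* next entry in the same hash bucket */
    } ToyCatCTup;

    typedef struct ToyCatCache
    {
        ToyCatCTup *buckets[NUM_BUCKETS];
    } ToyCatCache;

    /*
     * Kill every entry whose key hash matches the hash carried in the inval
     * message.  A chance hash collision just costs an extra cache reload;
     * it can never make us miss an entry the way a stale TID can.
     */
    static void
    toy_catcache_invalidate(ToyCatCache *cache, uint32_t hashValue)
    {
        ToyCatCTup *ct;

        for (ct = cache->buckets[hashValue % NUM_BUCKETS]; ct != NULL; ct = ct->next)
        {
            if (ct->hashValue == hashValue)
                ct->dead = true;
        }
    }

    int
    main(void)
    {
        ToyCatCache cache = {{NULL}};
        ToyCatCTup  a = {42, false, NULL};
        ToyCatCTup  b = {42 + NUM_BUCKETS, false, &a};  /* same bucket, different key hash */

        cache.buckets[42 % NUM_BUCKETS] = &b;

        toy_catcache_invalidate(&cache, 42);

        /* prints "a dead: 1, b dead: 0" */
        printf("a dead: %d, b dead: %d\n", a.dead, b.dead);
        return 0;
    }

The point is just that the match key is derived from the tuple's contents rather than from where the tuple happens to live, so rewriting the catalog can't make it stale.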
BTW, going forward it might be interesting to think about invalidating the catcaches based on xmin/xmax rather than specific TIDs. That would reduce the sinval traffic to one message per transaction that had affected the catalogs, instead of one per TID. But that clearly is not going to lead to something we'd dare back-patch.

			regards, tom lane