I wrote:
> Buildfarm member skink failed a couple days ago:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-11-25%2017%3A50%3A01

Ah ... I can reproduce this with moderate reliability (one failure every
10 or so iterations of the regression tests) by inserting a delay just
before autovacuum's check of orphan status:

*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** do_autovacuum(void)
*** 2046,2051 ****
--- 2046,2053 ----
          {
              int         backendID;
  
+             pg_usleep(100000);
+ 
              backendID = GetTempNamespaceBackendId(classForm->relnamespace);
  
              /* We just ignore it if the owning backend is still active */

I think the sequence of events must be:

1. autovacuum starts its seqscan of pg_class, locking down the snapshot
it's going to use for that.

2. Some backend completes its session and drops some temp table(s).

3. autovacuum's scan arrives at the pg_class entry for one of these tables.
By now it's committed dead, but it's still visible according to the
seqscan's snapshot, so we make the above test. Assuming the owning
backend has vacated its sinval slot and no new session has occupied it,
we'll decide the table is orphan and record its OID for later deletion.

4. The later code that tries to drop the table is able to see that it's
gone by now. Kaboom.

In existing releases, it would be about impossible for this race condition
to persist long enough that we'd actually try to drop the table. It's
definitely possible that we'd try to print "found orphan temp table",
but guess what: the back-branch coding here is

    ereport(LOG,
            (errmsg("autovacuum: found orphan temp table \"%s\".\"%s\" in database \"%s\"",
                    get_namespace_name(classForm->relnamespace),
                    NameStr(classForm->relname),
                    get_database_name(MyDatabaseId))));

The only part of that that would be at risk of failure is the
get_namespace_name call, and since we don't ordinarily remove the
pg_namespace entries for temp schemas, it's not likely to fail either.
So the race condition does exist in released branches, but it would not
cause an autovacuum crash even if libc is unforgiving about printf'ing
a NULL string.
At most it would cause bogus log entries claiming that temp tables have
been orphaned when they haven't.

I went digging in the buildfarm logs and was able to find one single
instance of a "found orphan temp table" log entry that couldn't be blamed
on a prior backend crash; it's in this report:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tick&dt=2015-07-29%2003%3A37%3A52

So the problem seems to be confirmed to exist, but be of low probability
and low consequences, in back branches. I think we only need to fix it
in HEAD. The lock acquisition and status recheck that I proposed before
should be sufficient.

			regards, tom lane
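
For illustration only, here is a toy model of the four-step race described
above and of what a recheck before the drop buys. It is a sketch under
stated assumptions, not PostgreSQL code: ToyRel, ToyCatalog and every toy_*
function are hypothetical names, the "snapshot" is just an array copy, and
the lock acquisition is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 4

/* Hypothetical stand-in for a pg_class row; not the real catalog layout. */
typedef struct {
    unsigned oid;
    bool     is_temp;
    bool     exists;         /* false once the owning session dropped it */
    int      owner_backend;  /* hypothetical backend slot number */
} ToyRel;

typedef struct {
    ToyRel rels[TOY_MAX];
    size_t nrels;
    bool   backend_active[TOY_MAX];
} ToyCatalog;

/* Return the live row for an OID, or NULL if the relation is gone. */
ToyRel *toy_lookup(ToyCatalog *cat, unsigned oid)
{
    for (size_t i = 0; i < cat->nrels; i++)
        if (cat->rels[i].oid == oid && cat->rels[i].exists)
            return &cat->rels[i];
    return NULL;
}

/* Step 3: rows come from the scan's snapshot, backend status from live state. */
size_t toy_collect_orphans(const ToyRel *snap, size_t n,
                           const ToyCatalog *live, unsigned *out)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
    {
        assert(snap[i].owner_backend >= 0 && snap[i].owner_backend < TOY_MAX);
        if (snap[i].is_temp && snap[i].exists &&
            !live->backend_active[snap[i].owner_backend])
            out[found++] = snap[i].oid;
    }
    return found;
}

/* Step 4 without a recheck: -1 models the failure on a vanished table. */
int toy_drop_naive(ToyCatalog *live, unsigned oid)
{
    ToyRel *rel = toy_lookup(live, oid);
    if (rel == NULL)
        return -1;
    rel->exists = false;
    return 1;
}

/* Step 4 with a recheck: 0 means skipped, 1 means dropped. */
int toy_drop_rechecked(ToyCatalog *live, unsigned oid)
{
    /* Assumption: real code would take a lock on the relation here, so the
     * state examined below cannot change before the drop. */
    ToyRel *rel = toy_lookup(live, oid);
    if (rel == NULL || !rel->is_temp || live->backend_active[rel->owner_backend])
        return 0;
    rel->exists = false;
    return 1;
}

/* Replay steps 1 to 4 for one temp table owned by backend slot 2. */
int toy_run_scenario(bool recheck)
{
    ToyCatalog live = { .rels = {{ 16384u, true, true, 2 }}, .nrels = 1 };
    live.backend_active[2] = true;

    /* Step 1: the scan fixes its view of the catalog. */
    ToyRel snap[TOY_MAX];
    size_t nsnap = live.nrels;
    for (size_t i = 0; i < nsnap; i++)
        snap[i] = live.rels[i];

    /* Step 2: the session ends, dropping its temp table and freeing its slot. */
    live.rels[0].exists = false;
    live.backend_active[2] = false;

    /* Step 3: the stale row looks orphaned, so its OID is recorded. */
    unsigned orphans[TOY_MAX];
    if (toy_collect_orphans(snap, nsnap, &live, orphans) != 1)
        return -2;

    /* Step 4: attempt the deferred drop. */
    return recheck ? toy_drop_rechecked(&live, orphans[0])
                   : toy_drop_naive(&live, orphans[0]);
}
```

In this model toy_run_scenario(false) returns -1 (the drop trips over a
table that is already gone), while toy_run_scenario(true) returns 0 (the
recheck notices and skips it). A genuinely orphaned table, one that still
exists while its owning slot is inactive, is still dropped by
toy_drop_rechecked.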