I've been chasing the intermittent "cache lookup failed for access method 403" failure at session startup that's been seen lately in the buildfarm, for instance here:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-06-04%2019%3A22%3A46

(Axolotl has shown this 3 times in the last 90 days; not sure if any others have seen it.)

I hypothesized that this was triggered by the "VACUUM FULL pg_am" in the concurrently running vacuum.sql regression test, so I started running the regression tests in parallel with a shell script doing

    while sleep 0.1; do psql -c 'vacuum full pg_am' regression; done

and sure enough, I can reproduce it once in a while.

I've not tracked that down yet, but I have gotten pretty annoyed by an unrelated problem: every so often, the "DROP DATABASE regression" at the start of the regression test sequence will hang for 5 seconds and then fail, complaining that there's another session already in the DB.  Meanwhile, the current "psql -c" call also hangs for 5 seconds.

What is happening is a sort of deadlock between InitPostgres, which does this:

    /* Now we can mark our PGPROC entry with the database ID */
    /* (We assume this is an atomic store so no lock is needed) */
    MyProc->databaseId = MyDatabaseId;

and then this:

    if (!bootstrap)
        LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
                         RowExclusiveLock);

and dropdb/CountOtherDBBackends, which first takes the database lock and then looks to see whether any other sessions are advertising the target database OID in their PGPROC entries.  If such sessions exist and don't go away within 5 seconds, CountOtherDBBackends fails.

So, if an incoming connection sets MyProc->databaseId between the time that dropdb acquires the database lock and the time that CountOtherDBBackends looks, we have an effective deadlock, because that incoming session will then block on the database lock.

AFAICS, this is easy to fix by swapping the order of the above-mentioned operations, ie get the lock before setting MyProc->databaseId, as per the attached patch.
Processes examining the PGPROC array must acquire the database lock before looking for sessions in the target DB, so they'll still see any conflicting session.  On the other hand, the incoming session already has to recheck that the target DB is still there once it's got the lock, so there's no risk of problems on that side either.

Unless somebody can see a problem with this, I propose to apply and back-patch this change.  The behavior is infrequent, but it's pretty nasty when it does occur, since all incoming connections for the target database are hung up for 5 seconds.  The case of DROP DATABASE might not be a big deal, but this would also for example cause unwanted failures in CREATE DATABASE if there were short-term connections to the template database.

			regards, tom lane
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index aa67f75..a5d88c1 100644
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
*************** InitPostgres(const char *in_dbname, Oid
*** 848,865 ****
  		strcpy(out_dbname, dbname);
  	}
  
- 	/* Now we can mark our PGPROC entry with the database ID */
- 	/* (We assume this is an atomic store so no lock is needed) */
- 	MyProc->databaseId = MyDatabaseId;
- 
- 	/*
- 	 * We established a catalog snapshot while reading pg_authid and/or
- 	 * pg_database; but until we have set up MyDatabaseId, we won't react to
- 	 * incoming sinval messages for unshared catalogs, so we won't realize it
- 	 * if the snapshot has been invalidated.  Assume it's no good anymore.
- 	 */
- 	InvalidateCatalogSnapshot();
- 
  	/*
  	 * Now, take a writer's lock on the database we are trying to connect to.
  	 * If there is a concurrently running DROP DATABASE on that database, this
--- 848,853 ----
*************** InitPostgres(const char *in_dbname, Oid
*** 867,875 ****
  	 * pg_database).
  	 *
  	 * Note that the lock is not held long, only until the end of this startup
! 	 * transaction.  This is OK since we are already advertising our use of
! 	 * the database in the PGPROC array; anyone trying a DROP DATABASE after
! 	 * this point will see us there.
  	 *
  	 * Note: use of RowExclusiveLock here is reasonable because we envision
  	 * our session as being a concurrent writer of the database.  If we had a
--- 855,868 ----
  	 * pg_database).
  	 *
  	 * Note that the lock is not held long, only until the end of this startup
! 	 * transaction.  This is OK since we will advertise our use of the
! 	 * database in the PGPROC array before dropping the lock (in fact, that's
! 	 * the next thing to do).  Anyone trying a DROP DATABASE after this point
! 	 * will see us in PGPROC once they have the lock.
! 	 *
! 	 * Ordering is important here because we don't want to advertise ourselves
! 	 * as being in this database until we have the lock; otherwise we create
! 	 * what amounts to a deadlock with CountOtherDBBackends().
  	 *
  	 * Note: use of RowExclusiveLock here is reasonable because we envision
  	 * our session as being a concurrent writer of the database.  If we had a
*************** InitPostgres(const char *in_dbname, Oid
*** 881,886 ****
--- 874,891 ----
  	LockSharedObject(DatabaseRelationId, MyDatabaseId, 0,
  					 RowExclusiveLock);
  
+ 	/* Now we can mark our PGPROC entry with the database ID */
+ 	/* (We assume this is an atomic store so no lock is needed) */
+ 	MyProc->databaseId = MyDatabaseId;
+ 
+ 	/*
+ 	 * We established a catalog snapshot while reading pg_authid and/or
+ 	 * pg_database; but until we have set up MyDatabaseId, we won't react to
+ 	 * incoming sinval messages for unshared catalogs, so we won't realize it
+ 	 * if the snapshot has been invalidated.  Assume it's no good anymore.
+ 	 */
+ 	InvalidateCatalogSnapshot();
+ 
  	/*
  	 * Recheck pg_database to make sure the target database hasn't gone away.
  	 * If there was a concurrent DROP DATABASE, this ensures we will die
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers