Hi, We encountered a deadlock involving VACUUM FULL (surprise surprise! :)) in PG 8.3.13 (and still not fixed in 9.0 AFAICS although the window appears much smaller). The call spits out the following deadlock info:
ERROR: SQLSTATE 40P01: deadlock detected DETAIL: Process 12479 waits for AccessExclusiveLock on relation 2663 of database 16384; blocked by process 14827. Process 14827 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 12479. LOCATION: DeadLockReport, deadlock.c:918 It looked familiar, so I dug up the archives and found that Tom had committed a fix for a similar deadlock via git commitid: 715120e7 However this current deadlock involved an index with oid 2663, which is ClassNameNspIndexId. Clearly this was another case of locking the index directly without taking a lock on the parent catalog. Further sleuthing revealed that the culprit function was InitCatCachePhase2, which directly calls index_open in the process startup phase. Reproducing this was easy once you know the culprit, (excruciatingly difficult if you do not know the exact race window). I added a sleep inside the InitCatCachePhase2 function before calling index_open. Then I invoked a "VACUUM FULL pg_class" from another session, halting it in gdb just before taking the exclusive lock via try_relation_open. When a new PG process sleeps inside InitCatCachePhase2, we then take the lock in the VF process, waiting just after it. When the startup continues after the sleep, it will take the ClassNameNspIndexId share lock, but hang to take a share lock on pg_class in RelationReloadIndexInfo. Simply continue the VF process in gdb which will try to take the exclusive lock to vacuum the index. This will reproduce the deadlock in all its glory. The fix is similar to the earlier commit by Tom. I tested this fix against 8.3.13. We lock the parent catalog now before calling index_open. Patch against git HEAD attached with this mail. I guess we will backpatch this? Tom's last commit was backpatched till 8.2 I think. Regards, Nikhils
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d0e364e..c9386aa 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -26,6 +26,7 @@ #ifdef CATCACHE_STATS #include "storage/ipc.h" /* for on_proc_exit */ #endif +#include "storage/lmgr.h" #include "utils/builtins.h" #include "utils/fmgroids.h" #include "utils/inval.h" @@ -967,8 +968,16 @@ InitCatCachePhase2(CatCache *cache, bool touch_index) { Relation idesc; + /* + * We must lock the underlying catalog before locking the index to + * avoid deadlock, since RelationReloadIndexInfo might well need to + * read the catalog, and if anyone else is exclusive-locking this + * catalog and index they'll be doing it in that order. + */ + LockRelationOid(cache->cc_reloid, AccessShareLock); idesc = index_open(cache->cc_indexoid, AccessShareLock); index_close(idesc, AccessShareLock); + UnlockRelationOid(cache->cc_reloid, AccessShareLock); } }
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers