On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
> On Mon, Aug 22, 2011 at 3:31 AM, daveg <da...@sonic.net> wrote:
> > So far I've got:
> >
> >  - affects system tables
> >  - happens very soon after process startup
> >  - in 8.4.7 and 9.0.4
> >  - not likely to be hardware or OS related
> >  - happens in clusters for period of a few second to many minutes
> >
> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
> > hard to get downtime to pick up new builds. Any other ideas on getting to
> > the bottom of this?
>
> I've been thinking this one over, and doing a little testing. I'm
> still stumped, but I have a few thoughts. What that error message is
> really saying is that the LOCALLOCK bookkeeping doesn't match the
> PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
...
> My second thought is that perhaps a process is occasionally managing
> to exit without fully cleaning up the associated PROCLOCK entry. At
> first glance, it appears that this would explain the observed
> symptoms. A new backend gets the PGPROC belonging to the guy who
> didn't clean up after himself, hits the error, and disconnects,
> sticking himself right back on to the head of the SHM_QUEUE where the
> next connection will inherit the same PGPROC and hit the same problem.
> But it's not clear to me what could cause the system to get into this
> state in the first place, or how it would eventually right itself.
>
> It might be worth kludging up your system to add a test to
> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
> either NULL or empty, along the lines of the attached patch (which
> assumes that assertions are enabled; otherwise, put in an elog() of
> some sort). Actually, I wonder if we shouldn't move all the
> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
> doing it over again every time someone calls InitProcess(). Besides
> being a waste of cycles, it's probably less robust this way. If
> there somehow are leftovers in one of those queues, the next
> successful call to LockReleaseAll() ought to clean up the mess, but of
> course there's no chance of that working if we've nuked the queue
> pointers.

I did this in the elog() flavor, as we don't build production images with
asserts. It has been running on all hosts for a few days. Today it hit the
extra checks in InitProcess().

00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558)
00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
00:02:32.783  21535  LOG:  terminating any other active server processes
00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process

The patch that produced this is attached. If you can think of anything I
can add to it that would help, I'd be happy to do so.

Also, can I clean this up and continue somehow? Maybe clear the queue
instead of having to restart? Or is there a way to just pause this proc
here, maybe mark it not to be used and exit, or just sleep forever so I
can debug later?

Thanks

-dg

--
David Gould       da...@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

--- postgresql-9.0.4/src/backend/storage/lmgr/proc.c	2011-04-14 20:15:53.000000000 -0700
+++ postgresql-9.0.4.dg/src/backend/storage/lmgr/proc.c	2011-08-23 17:30:03.505176019 -0700
@@ -323,7 +323,15 @@
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+	{
+		SHM_QUEUE *queue = &(MyProc->myProcLocks[i]);
+		if (! (!queue->prev || queue->prev == queue ||
+			   !queue->next || queue->next == queue)
+			)
+			elog(FATAL, "Initprocess myProclocks[%d] not empty: queue %p (prev %p, next %p) ",
+				 i, queue, queue->prev, queue->next);
 		SHMQueueInit(&(MyProc->myProcLocks[i]));
+	}
 	MyProc->recoveryConflictPending = false;
 
 	/*
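
A note on reading the FATAL line: the reported prev and next are the same
address (0x2ae6b4a2b558) and differ from the queue head itself. In a
well-formed circular doubly-linked list, that pattern is what a head with
exactly one element looks like, which would fit a single leftover entry.
The standalone sketch below illustrates this. It is not PostgreSQL code:
DemoQueue and the demo_* functions are hypothetical stand-ins for SHM_QUEUE
and its helpers, and the reading assumes the pointers in shared memory are
valid rather than garbage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for SHM_QUEUE: a circular doubly-linked list node. */
typedef struct DemoQueue
{
	struct DemoQueue *prev;
	struct DemoQueue *next;
} DemoQueue;

/* An empty queue head points at itself in both directions. */
void
demo_queue_init(DemoQueue *head)
{
	head->prev = head;
	head->next = head;
}

/* Link elem in directly after head. */
void
demo_queue_insert_after(DemoQueue *head, DemoQueue *elem)
{
	elem->prev = head;
	elem->next = head->next;
	head->next->prev = elem;
	head->next = elem;
}

/*
 * Same predicate as the patch: a head counts as clean if either link is
 * NULL (never initialized) or points back at the head (empty).
 */
bool
demo_queue_looks_clean(const DemoQueue *head)
{
	return head->prev == NULL || head->prev == head ||
		   head->next == NULL || head->next == head;
}

/* Count elements by following next until back at head (initialized queues only). */
int
demo_queue_length(const DemoQueue *head)
{
	int			n = 0;
	const DemoQueue *p;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* One leftover element: the check trips, and prev == next != head. */
void
demo_single_leftover(void)
{
	DemoQueue	head;
	DemoQueue	leftover;

	demo_queue_init(&head);
	assert(demo_queue_looks_clean(&head));

	demo_queue_insert_after(&head, &leftover);
	assert(!demo_queue_looks_clean(&head));
	assert(head.prev == head.next);
	assert(head.next == &leftover);
	assert(demo_queue_length(&head) == 1);
}
```

Calling demo_single_leftover() from a small main() exercises the case. With
two or more elements linked in, head.prev and head.next would differ.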

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers