On Tue, Aug 23, 2011 at 12:15:23PM -0400, Robert Haas wrote:
> On Mon, Aug 22, 2011 at 3:31 AM, daveg <da...@sonic.net> wrote:
> > So far I've got:
> >
> >  - affects system tables
> >  - happens very soon after process startup
> >  - in 8.4.7 and 9.0.4
> >  - not likely to be hardware or OS related
> >  - happens in clusters, for periods of a few seconds to many minutes
> >
> > I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
> > hard to get downtime to pick up new builds. Any other ideas on getting to
> > the bottom of this?
> 
> I've been thinking this one over, and doing a little testing. I'm
> still stumped, but I have a few thoughts.  What that error message is
> really saying is that the LOCALLOCK bookkeeping doesn't match the
> PROCLOCK bookkeeping; it doesn't tell us which one is to blame.
... 
> My second thought is that perhaps a process is occasionally managing
> to exit without fully cleaning up the associated PROCLOCK entry.  At
> first glance, it appears that this would explain the observed
> symptoms.  A new backend gets the PGPROC belonging to the guy who
> didn't clean up after himself, hits the error, and disconnects,
> sticking himself right back on to the head of the SHM_QUEUE where the
> next connection will inherit the same PGPROC and hit the same problem.
>  But it's not clear to me what could cause the system to get into this
> state in the first place, or how it would eventually right itself.
> 
> It might be worth kludging up your system to add a test to
> InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
> either NULL or empty, along the lines of the attached patch (which
> assumes that assertions are enabled; otherwise, put in an elog() of
> some sort).  Actually, I wonder if we shouldn't move all the
> SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
> doing it over again every time someone calls InitProcess().  Besides
> being a waste of cycles, it's probably less robust this way.   If
> there somehow are leftovers in one of those queues, the next
> successful call to LockReleaseAll() ought to clean up the mess, but of
> course there's no chance of that working if we've nuked the queue
> pointers.

I did this in the elog flavor, as we don't build production images with
asserts. It has been running on all hosts for a few days. Today it hit the
extra checks in InitProcess().

00:02:32.782  8626  [unknown] [unknown]  LOG:  connection received: host=bk0 port=42700
00:02:32.783  8627  [unknown] [unknown]  LOG:  connection received: host=op2 port=45876
00:02:32.783  8627  d61 apps  FATAL:  Initprocess myProclocks[4] not empty: queue 0x2ae6b4b895f8 (prev 0x2ae6b4a2b558, next 0x2ae6b4a2b558) 
00:02:32.783  8626  d35 postgres  LOG:  connection authorized: user=postgres database=c35
00:02:32.783  21535  LOG:  server process (PID 8627) exited with exit code 1
00:02:32.783  21535  LOG:  terminating any other active server processes
00:02:32.783  8626  c35 postgres  WARNING:  terminating connection because of crash of another server process
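
As an aside on reading the pointers in that FATAL line: prev and next are
equal to each other but differ from the queue address, which in a circular
doubly linked list is what a header with exactly one entry linked into it
looks like. Below is a minimal standalone sketch of that reasoning. It is
not PostgreSQL code: Queue is only a stand-in for SHM_QUEUE, and every helper
name is hypothetical. It also shows, as an assumption about what one might
want rather than what the patch does, a stricter check that would flag a
half-cleared header (one link NULL, the other still pointing somewhere) that
the patch's OR-style test lets through.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for SHM_QUEUE: a node in a circular doubly linked list. */
typedef struct Queue
{
    struct Queue *prev;
    struct Queue *next;
} Queue;

/* Hypothetical helper: an empty list is a header pointing at itself. */
static void queue_init(Queue *q)
{
    q->prev = q;
    q->next = q;
}

/* Hypothetical helper: link elem immediately after head. */
static void queue_insert_after(Queue *head, Queue *elem)
{
    elem->prev = head;
    elem->next = head->next;
    head->next->prev = elem;
    head->next = elem;
}

/* Same predicate as the patch: any single clean-looking link passes. */
static int patch_check_passes(const Queue *q)
{
    return !q->prev || q->prev == q || !q->next || q->next == q;
}

/* Stricter variant (an assumption, not what was run): both links must be
 * NULL (never initialized) or both must point back at the header (empty). */
static int strictly_clean(const Queue *q)
{
    return (q->prev == NULL && q->next == NULL) ||
           (q->prev == q && q->next == q);
}

/* True when the header looks like it has exactly one entry linked in:
 * prev and next agree, are non-NULL, and are not the header itself. */
static int looks_like_one_entry(const Queue *q)
{
    return q->prev != NULL && q->prev == q->next && q->prev != q;
}

/* Walks through the three cases discussed above. */
static void check_model(void)
{
    Queue head, leftover;

    queue_init(&head);
    assert(patch_check_passes(&head) && strictly_clean(&head));

    /* One leftover entry: the shape reported in the FATAL line. */
    queue_insert_after(&head, &leftover);
    assert(!patch_check_passes(&head));
    assert(looks_like_one_entry(&head));

    /* Half-cleared header: the patch's test passes, the strict one fails. */
    head.prev = NULL;
    assert(patch_check_passes(&head) && !strictly_clean(&head));
}
```

Calling check_model() from a small main() exercises all three cases. Whether
the single entry is really a leftover PROCLOCK cannot be told from the
pointers alone; the sketch only shows that the logged values are consistent
with one entry remaining on the queue.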

The patch that produced this is attached. If you can think of anything I
can add to it that would help, I'd be happy to do so. Also, can I clean this
up and continue somehow? Maybe clear the queue instead of having to restart?
Or is there a way to just pause this proc here, maybe mark it not to be used
and exit, or just sleep forever so I can debug it later?

Thanks

-dg

-- 
David Gould       da...@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
--- postgresql-9.0.4/src/backend/storage/lmgr/proc.c    2011-04-14 20:15:53.000000000 -0700
+++ postgresql-9.0.4.dg/src/backend/storage/lmgr/proc.c 2011-08-23 17:30:03.505176019 -0700
@@ -323,7 +323,15 @@
        MyProc->waitLock = NULL;
        MyProc->waitProcLock = NULL;
        for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+       {
+               SHM_QUEUE *queue = &(MyProc->myProcLocks[i]);
+               if (! (!queue->prev || queue->prev == queue ||
+                      !queue->next || queue->next == queue)
+                   )
+                       elog(FATAL, "Initprocess myProclocks[%d] not empty: queue %p (prev %p, next %p) ",
+                               i, queue, queue->prev, queue->next);
                SHMQueueInit(&(MyProc->myProcLocks[i]));
+       }
        MyProc->recoveryConflictPending = false;
 
        /*