Hello,
we found this issue a year ago:
http://www.postgresql.org/message-id/cafj8prahvzupfbx+8ey-xhfwbo8bxvu_ynmbapsdj8w-ara...@mail.gmail.com
I tried to reproduce this error, but without success, so I prepared a patch
that should help identify the issue. The important part is a backport of
the process-startup code from 9.2. After applying it, we never detected
this issue again.
Regards
Pavel
2013/11/29 Tom Lane <[email protected]>
> Daniel Wood <[email protected]> writes:
> > ... Presuming your fix is putting PG_SETMASK(&UnBlockSig)
> > immediately before each of the 6 calls to ereport(ERROR,...) I've been
> > running the stress test with both this fix and the lock already held fix.
>
> I'm now planning to put it in error cleanup instead, but that's good
> enough for proving that the problem is what I thought it was.
>
> > I get plenty of lock timeout errors as expected. However, once in a
> > great while I get: sqlcode = -400, sqlstate = 57014, sqlerrmc = canceling
> > statement due to user request
> > My stress test certainly doesn't do a user cancel. Should this be
> > expected?
>
> I think I see what must be happening there: the lock timeout interrupt is
> happening at some point after the lock has been granted, but before
> ProcSleep reaches its disable_timeouts call. QueryCancelPending gets set,
> and will be honored next time something does CHECK_FOR_INTERRUPTS.
> But because ProcSleep told disable_timeouts to clear the LOCK_TIMEOUT
> indicator bit, ProcessInterrupts thinks the cancel must've been a plain
> user SIGINT, and reports it that way.
>
> What we should probably do about this is change ProcSleep to not clear the
> LOCK_TIMEOUT indicator bit, same as we already do in LockErrorCleanup,
> which is the less-race-condition-y path out of a lock timeout.
>
> (It would be cooler if the timeout handler had a way to realize that the
> lock is already granted, and not issue a query cancel in the first place.
> But having a signal handler poking at shared memory state is a little too
> scary for my taste.)
>
> It strikes me that this also means that places where we throw away pending
> cancels by clearing QueryCancelPending, such as the sigsetjmp stanza in
> postgres.c, had better reset the LOCK_TIMEOUT indicator bit. Otherwise,
> a thrown-away lock timeout cancel might cause a later SIGINT cancel to be
> misreported.
>
> regards, tom lane
>
>
> --
> Sent via pgsql-hackers mailing list ([email protected])
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
*** ./src/backend/storage/lmgr/proc.c.orig 2013-06-27 20:57:34.357713067 +0200
--- ./src/backend/storage/lmgr/proc.c 2013-06-27 21:19:04.917502553 +0200
***************
*** 179,185 ****
InitProcGlobal(void)
{
PGPROC *procs;
! int i;
bool found;
/* Create the ProcGlobal shared structure */
--- 179,186 ----
InitProcGlobal(void)
{
PGPROC *procs;
! int i,
! j;
bool found;
/* Create the ProcGlobal shared structure */
***************
*** 223,228 ****
--- 224,233 ----
InitSharedLatch(&procs[i].procLatch);
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+
+ /* Initialize the myProcLocks[] shared-memory queues. */
+ for (j = 0; j < NUM_LOCK_PARTITIONS; j++)
+ SHMQueueInit(&(procs[i].myProcLocks[j]));
}
/*
***************
*** 242,247 ****
--- 247,255 ----
InitSharedLatch(&procs[i].procLatch);
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+
+ for (j = 0; j < NUM_LOCK_PARTITIONS; j++)
+ SHMQueueInit(&(procs[i].myProcLocks[j]));
}
/*
***************
*** 253,258 ****
--- 261,269 ----
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
InitSharedLatch(&AuxiliaryProcs[i].procLatch);
+
+ for (j = 0; j < NUM_LOCK_PARTITIONS; j++)
+ SHMQueueInit(&(AuxiliaryProcs[i].myProcLocks[j]));
}
/* Create ProcStructLock spinlock, too */
***************
*** 268,274 ****
{
/* use volatile pointer to prevent code rearrangement */
volatile PROC_HDR *procglobal = ProcGlobal;
- int i;
/*
* ProcGlobal should be set up already (if we are a backend, we inherit
--- 279,284 ----
***************
*** 358,365 ****
MyProc->lwWaitLink = NULL;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
! for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
! SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
/* Initialize fields for sync rep */
--- 368,393 ----
MyProc->lwWaitLink = NULL;
MyProc->waitLock = NULL;
MyProc->waitProcLock = NULL;
!
! /*
! * Diagnostic patch
! *
! * We want to measure how often process startup finds orphaned
! * PROCLOCKs. Don't use ereport() here, because the process is
! * not fully initialized at this point.
! */
! {
! int i;
!
! for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
! {
! if (!SHMQueueEmpty(&(MyProc->myProcLocks[i])))
! {
! fprintf(stderr, "InitProcess: found orphaned PGPROCLOCK proc:%p, pid:%d\n", MyProc, MyProcPid);
! }
! }
! }
!
MyProc->recoveryConflictPending = false;
/* Initialize fields for sync rep */
***************
*** 729,734 ****
--- 757,807 ----
SyncRepCleanupAtProcExit();
/*
+ * Diagnostic patch
+ *
+ * We want to measure how often process exit finds orphaned
+ * PROCLOCKs. Don't use ereport() here, because the process is
+ * no longer fully operational at this point.
+ */
+ {
+ int i;
+ bool found_orphaned_locks = false;
+
+ for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+ {
+ if (!SHMQueueEmpty(&(MyProc->myProcLocks[i])))
+ {
+ fprintf(stderr, "ProcKill: found orphaned PGPROCLOCK proc:%p, pid:%d\n", MyProc, MyProcPid);
+ found_orphaned_locks = true;
+ break;
+ }
+ }
+
+ if (found_orphaned_locks)
+ {
+ LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+ /* Release transaction-level advisory locks */
+ LockReleaseAll(USER_LOCKMETHOD, false);
+ found_orphaned_locks = false;
+
+ for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+ {
+ if (!SHMQueueEmpty(&(MyProc->myProcLocks[i])))
+ {
+ fprintf(stderr, "ProcKill: found orphaned PGPROCLOCK proc:%p, pid:%d again\n", MyProc, MyProcPid);
+ found_orphaned_locks = true;
+ break;
+ }
+ }
+
+ if (!found_orphaned_locks)
+ fprintf(stderr, "ProcKill: orphaned PGPROCLOCK proc:%p, pid:%d was removed\n", MyProc, MyProcPid);
+ else
+ fprintf(stderr, "ProcKill: orphaned PGPROCLOCK proc:%p, pid:%d was not removed\n", MyProc, MyProcPid);
+ }
+ }
+
+ /*
* Release any LW locks I am holding. There really shouldn't be any, but
* it's cheap to check again before we cut the knees off the LWLock
* facility by releasing our PGPROC ...
*** ./src/include/miscadmin.h.orig 2013-06-27 21:15:39.477104870 +0200
--- ./src/include/miscadmin.h 2013-06-27 21:15:49.679124666 +0200
***************
*** 83,89 ****
--- 83,92 ----
#define CHECK_FOR_INTERRUPTS() \
do { \
if (InterruptPending) \
+ { \
+ fprintf(stderr, "CHECK_FOR_INTERRUPTS: func:%s file:%s line:%d, pid:%d\n", __func__, __FILE__, __LINE__, MyProcPid); \
ProcessInterrupts(); \
+ } \
} while(0)
#else /* WIN32 */