Re: [HACKERS] Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadl

Robert Haas Mon, 15 Feb 2016 16:42:07 -0800

On Sun, Feb 14, 2016 at 5:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Sun, Feb 14, 2016 at 1:33 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> pgprocno of the specific PGPROC, or of the group leader?  If it's
>>> the former, I'm pretty suspicious that there are deadlock and/or
>>> linked-list-corruption hazards here.  If it's the latter, at least
>>> the comments around this are misleading.
>
>> Leader.  The other way would be nuts.
>
> ... and btw, either BecomeLockGroupMember() has got this backwards, or
> I'm still fundamentally confused.


Yeah, you're right.  Attached is a draft patch that tries to clean up
that and a bunch of other things that you raised.  It hasn't really
been tested yet, so it might be buggy; I wrote it during my long plane
flight.  Fixes include:

1. It removes the groupLeader flag from PGPROC.  You're right: we don't need it.

2. It fixes this problem with BecomeLockGroupMember().  You're right:
the old way was backwards.

3. It fixes TopoSort() to be less inefficient by checking whether the
EDGE is for the correct lock before doing anything else.  I realized
this while updating comments related to EDGE.

4. It adds some text to the lmgr README.

5. It reflows the existing text in the lmgr README to fit within 80 columns.

6. It adjusts some other comments.

Possibly this should be broken up into multiple patches, but I'm just
sending it along all together for now.

> Also, after fixing that it would be good to add a comment explaining why
> it's not fundamentally unsafe for BecomeLockGroupMember() to examine
> leader->pgprocno without having any relevant lock.  AFAICS, that's only
> okay because the pgprocno fields are never changed after InitProcGlobal,
> even when a PGPROC is recycled.

I am sort of disinclined to think that this needs a comment.  Do we
really have a comment every other place that pgprocno is referenced
without a lock?  Or maybe there are none, but it seems like overkill
to me to comment on that at every site of use.  Better to comment on
the pgprocno definition itself and say "never changes".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index cb9c7d6..8eaa91c 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -591,26 +591,27 @@ Group Locking
 
 As if all of that weren't already complicated enough, PostgreSQL now supports
 parallelism (see src/backend/access/transam/README.parallel), which means that
-we might need to resolve deadlocks that occur between gangs of related processes
-rather than individual processes.  This doesn't change the basic deadlock
-detection algorithm very much, but it makes the bookkeeping more complicated.
+we might need to resolve deadlocks that occur between gangs of related
+processes rather than individual processes.  This doesn't change the basic
+deadlock detection algorithm very much, but it makes the bookkeeping more
+complicated.
 
 We choose to regard locks held by processes in the same parallel group as
-non-conflicting.  This means that two processes in a parallel group can hold
-a self-exclusive lock on the same relation at the same time, or one process
-can acquire an AccessShareLock while the other already holds AccessExclusiveLock.
+non-conflicting.  This means that two processes in a parallel group can hold a
+self-exclusive lock on the same relation at the same time, or one process can
+acquire an AccessShareLock while the other already holds AccessExclusiveLock.
 This might seem dangerous and could be in some cases (more on that below), but
 if we didn't do this then parallel query would be extremely prone to
 self-deadlock.  For example, a parallel query against a relation on which the
 leader had already AccessExclusiveLock would hang, because the workers would
-try to lock the same relation and be blocked by the leader; yet the leader can't
-finish until it receives completion indications from all workers.  An undetected
-deadlock results.  This is far from the only scenario where such a problem
-happens.  The same thing will occur if the leader holds only AccessShareLock,
-the worker seeks AccessShareLock, but between the time the leader attempts to
-acquire the lock and the time the worker attempts to acquire it, some other
-process queues up waiting for an AccessExclusiveLock.  In this case, too, an
-indefinite hang results.
+try to lock the same relation and be blocked by the leader; yet the leader
+can't finish until it receives completion indications from all workers.  An
+undetected deadlock results.  This is far from the only scenario where such a
+problem happens.  The same thing will occur if the leader holds only
+AccessShareLock, the worker seeks AccessShareLock, but between the time the
+leader attempts to acquire the lock and the time the worker attempts to
+acquire it, some other process queues up waiting for an AccessExclusiveLock.
+In this case, too, an indefinite hang results.
 
 It might seem that we could predict which locks the workers will attempt to
 acquire and ensure before going parallel that those locks would be acquired
@@ -618,7 +619,7 @@ successfully.  But this is very difficult to make work in a general way.  For
 example, a parallel worker's portion of the query plan could involve an
 SQL-callable function which generates a query dynamically, and that query
 might happen to hit a table on which the leader happens to hold
-AccessExcusiveLock.  By imposing enough restrictions on what workers can do,
+AccessExclusiveLock.  By imposing enough restrictions on what workers can do,
 we could eventually create a situation where their behavior can be adequately
 restricted, but these restrictions would be fairly onerous, and even then, the
 system required to decide whether the workers will succeed at acquiring the
@@ -627,27 +628,63 @@ necessary locks would be complex and possibly buggy.
 So, instead, we take the approach of deciding that locks within a lock group
 do not conflict.  This eliminates the possibility of an undetected deadlock,
 but also opens up some problem cases: if the leader and worker try to do some
-operation at the same time which would ordinarily be prevented by the heavyweight
-lock mechanism, undefined behavior might result.  In practice, the dangers are
-modest.  The leader and worker share the same transaction, snapshot, and combo
-CID hash, and neither can perform any DDL or, indeed, write any data at all.
-Thus, for either to read a table locked exclusively by the other is safe enough.
-Problems would occur if the leader initiated parallelism from a point in the
-code at which it had some backend-private state that made table access from
-another process unsafe, for example after calling SetReindexProcessing and
-before calling ResetReindexProcessing, catastrophe could ensue, because the
-worker won't have that state.  Similarly, problems could occur with certain
-kinds of non-relation locks, such as relation extension locks.  It's no safer
-for two related processes to extend the same relation at the time than for
-unrelated processes to do the same.  However, since parallel mode is strictly
-read-only at present, neither this nor most of the similar cases can arise at
-present.  To allow parallel writes, we'll either need to (1) further enhance
-the deadlock detector to handle those types of locks in a different way than
-other types; or (2) have parallel workers use some other mutual exclusion
-method for such cases; or (3) revise those cases so that they no longer use
-heavyweight locking in the first place (which is not a crazy idea, given that
-such lock acquisitions are not expected to deadlock and that heavyweight lock
-acquisition is fairly slow anyway).
+operation at the same time which would ordinarily be prevented by the
+heavyweight lock mechanism, undefined behavior might result.  In practice, the
+dangers are modest.  The leader and worker share the same transaction,
+snapshot, and combo CID hash, and neither can perform any DDL or, indeed,
+write any data at all.  Thus, for either to read a table locked exclusively by
+the other is safe enough.  Problems would occur if the leader initiated
+parallelism from a point in the code at which it had some backend-private
+state that made table access from another process unsafe, for example after
+calling SetReindexProcessing and before calling ResetReindexProcessing,
+catastrophe could ensue, because the worker won't have that state.  Similarly,
+problems could occur with certain kinds of non-relation locks, such as
+relation extension locks.  It's no safer for two related processes to extend
+the same relation at the time than for unrelated processes to do the same.
+However, since parallel mode is strictly read-only at present, neither this
+nor most of the similar cases can arise at present.  To allow parallel writes,
+we'll either need to (1) further enhance the deadlock detector to handle those
+types of locks in a different way than other types; or (2) have parallel
+workers use some other mutual exclusion method for such cases; or (3) revise
+those cases so that they no longer use heavyweight locking in the first place
+(which is not a crazy idea, given that such lock acquisitions are not expected
+to deadlock and that heavyweight lock acquisition is fairly slow anyway).
+
+Group locking adds four new members to each PGPROC: lockGroupLeaderIdentifier,
+lockGroupLeader, lockGroupMembers, and lockGroupLink. The first is simply a
+safety mechanism. A newly started parallel worker has to try to join the
+leader's lock group, but it has no guarantee that the group leader is still
+alive by the time it gets started. We try to ensure that the parallel leader
+dies after all workers in normal cases, but also that the system could survive
+relatively intact if that somehow fails to happen. This is one of the
+precautions against such a scenario: the leader relays its PGPROC and also its
+PID to the worker, and the worker fails to join the lock group unless the
+given PGPROC still has the same PID. We assume that PIDs are not recycled
+quickly enough for this interlock to fail.
+
+A PGPROC's lockGroupLeader is NULL for processes not involved in parallel
+query. When a process wants to cooperate with parallel workers, it becomes a
+lock group leader, which means setting this field back to point to its own
+PGPROC. When a parallel worker starts up, it points this field at the leader,
+with the above-mentioned interlock. The lockGroupMembers field is only used in
+the leader; it is a list of the workers. The lockGroupLink field is used to
+link the leader and all workers into the leader's list. All of these fields
+are protected by the lock manager locks; the lock manager lock that protects
+the lockGroupLeaderIdentifier, lockGroupLeader, and lockGroupMembers fields in
+a given PGPROC is chosen by taking pgprocno modulo the number of lock manager
+partitions. This unusual arrangement has a major advantage: the deadlock
+detector can count on the fact that no lockGroupLeader field can change while
+the deadlock detector is running, because it knows that it holds all the lock
+manager locks.  A PGPROC's lockGroupLink is protected by the lock manager
+partition lock for the group of which it is a part.  If it's not part of any
+group, this field is unused and can only be examined or modified by the
+process that owns the PGPROC.
+
+We institute a further coding rule that a process cannot join or leave a lock
+group while owning any PROCLOCK.  Therefore, given a lock manager lock
+sufficient to examine PROCLOCK *proclock, it also safe to examine
+proclock->tag.myProc->lockGroupLeader (but not the other fields mentioned in
+the previous paragraph).
 
 User Locks (Advisory Locks)
 ---------------------------
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 69f678b..d81746d 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -33,12 +33,21 @@
 #include "utils/memutils.h"
 
 
-/* One edge in the waits-for graph */
+/*
+ * One edge in the waits-for graph.
+ *
+ * waiter and blocker may or may not be members of a lock group, but if either
+ * is, it will be the leader rather than any other member of the lock group.
+ * The group leaders act as representatives of the whole group even though
+ * those particular processes need not be waiting at all.  There will be at
+ * least one member of the group on the wait queue for the given lock, maybe
+ * more.
+ */
 typedef struct
 {
-	PGPROC	   *waiter;			/* the waiting process */
-	PGPROC	   *blocker;		/* the process it is waiting for */
-	LOCK	   *lock;			/* the lock it is waiting for */
+	PGPROC	   *waiter;			/* the leader of the waiting lock group */
+	PGPROC	   *blocker;		/* the leader of the group it is waiting for */
+	LOCK	   *lock;			/* the lock being waited for */
 	int			pred;			/* workspace for TopoSort */
 	int			link;			/* workspace for TopoSort */
 } EDGE;
@@ -906,6 +915,10 @@ TopoSort(LOCK *lock,
 	MemSet(afterConstraints, 0, queue_size * sizeof(int));
 	for (i = 0; i < nConstraints; i++)
 	{
+		/* Skip constraints not related to this lock. */
+		if (constraints[i].lock != lock)
+			continue;
+
 		/*
 		 * Find a representative process that is on the lock queue and part of
 		 * the waiting lock group.  This may or may not be the leader, which
@@ -934,10 +947,7 @@ TopoSort(LOCK *lock,
 				break;
 			}
 		}
-
-		/* If no matching waiter, constraint is not relevant to this lock. */
-		if (jj < 0)
-			continue;
+		Assert(jj >= 0);
 
 		/*
 		 * Similarly, find a representative process that is on the lock queue
@@ -963,10 +973,7 @@ TopoSort(LOCK *lock,
 				}
 			}
 		}
-
-		/* If no matching blocker, constraint is not relevant to this lock. */
-		if (kk < 0)
-			continue;
+		Assert(kk >= 0);
 
 		beforeConstraints[jj]++;	/* waiter must come before */
 		/* add this constraint to list of after-constraints for blocker */
@@ -1006,8 +1013,8 @@ TopoSort(LOCK *lock,
 			return false;
 
 		/*
-		 * Output everything in the lock group.  There's no point in outputing
-		 * an ordering where members of the same lock group are not
+		 * Output everything in the lock group.  There's no point in
+		 * outputting an ordering where members of the same lock group are not
 		 * consecutive on the wait queue: if some other waiter is between two
 		 * requests that belong to the same group, then either it conflicts
 		 * with both of them and is certainly not a solution; or it conflicts
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index fef59a2..3ec440f 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -1137,18 +1137,6 @@ SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc,
 	{
 		uint32		partition = LockHashPartition(hashcode);
 
-		/*
-		 * It might seem unsafe to access proclock->groupLeader without a lock,
-		 * but it's not really.  Either we are initializing a proclock on our
-		 * own behalf, in which case our group leader isn't changing because
-		 * the group leader for a process can only ever be changed by the
-		 * process itself; or else we are transferring a fast-path lock to the
-		 * main lock table, in which case that process can't change it's lock
-		 * group leader without first releasing all of its locks (and in
-		 * particular the one we are currently transferring).
-		 */
-		proclock->groupLeader = proc->lockGroupLeader != NULL ?
-			proc->lockGroupLeader : proc;
 		proclock->holdMask = 0;
 		proclock->releaseMask = 0;
 		/* Add proclock to appropriate lists */
@@ -1287,6 +1275,8 @@ LockCheckConflicts(LockMethod lockMethodTable,
 	int			i;
 	SHM_QUEUE  *procLocks;
 	PROCLOCK   *otherproclock;
+	PGPROC	   *leader;
+
 
 	/*
 	 * first check for global conflicts: If no locks conflict with my request,
@@ -1330,10 +1320,16 @@ LockCheckConflicts(LockMethod lockMethodTable,
 		return STATUS_OK;
 	}
 
+	/*
+	 * A process cannot enter or leave a lock group while owning any PROCLOCK.
+	 * Therefore, if we hold a lock sufficient to examine the PROCLOCK at all,
+	 * we can also examine its lockGroupLeader.
+	 */
+	leader = proclock->tag.myProc->lockGroupLeader;
+
 	/* If no group locking, it's definitely a conflict. */
-	if (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)
+	if (leader == NULL)
 	{
-		Assert(proclock->tag.myProc == MyProc);
 		PROCLOCK_PRINT("LockCheckConflicts: conflicting (simple)",
 					   proclock);
 		return STATUS_FOUND;
@@ -1351,8 +1347,18 @@ LockCheckConflicts(LockMethod lockMethodTable,
 		SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
 	while (otherproclock != NULL)
 	{
-		if (proclock != otherproclock &&
-			proclock->groupLeader == otherproclock->groupLeader &&
+		PGPROC	   *otherLeader;
+
+		/* See comments for leader, above, to see why this is safe. */
+		otherLeader = otherproclock->tag.myProc;
+		if (otherLeader->lockGroupLeader != NULL)
+			otherLeader = otherLeader->lockGroupLeader;
+
+		/*
+		 * If it's some other PROCLOCK in the lock group, subtract out the
+		 * conflicts from our remaining conflicts.
+		 */
+		if (proclock != otherproclock && leader == otherLeader &&
 			(otherproclock->holdMask & conflictMask) != 0)
 		{
 			int	intersectMask = otherproclock->holdMask & conflictMask;
@@ -3312,13 +3318,6 @@ PostPrepare_Locks(TransactionId xid)
 			proclocktag.myProc = newproc;
 
 			/*
-			 * Update groupLeader pointer to point to the new proc.  (We'd
-			 * better not be a member of somebody else's lock group!)
-			 */
-			Assert(proclock->groupLeader == proclock->tag.myProc);
-			proclock->groupLeader = newproc;
-
-			/*
 			 * Update the proclock.  We should not find any existing entry for
 			 * the same hash key, since there can be only one entry for any
 			 * given lock with my own proc.
@@ -3867,7 +3866,6 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 	if (!found)
 	{
 		Assert(proc->lockGroupLeader == NULL);
-		proclock->groupLeader = proc;
 		proclock->holdMask = 0;
 		proclock->releaseMask = 0;
 		/* Add proclock to appropriate lists */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 844222e..cff8c08 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1016,7 +1016,11 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			SHMQueueNext(procLocks, procLocks, offsetof(PROCLOCK, lockLink));
 		while (otherproclock != NULL)
 		{
-			if (otherproclock->groupLeader == leader)
+			/*
+			 * It's safe to examine lockGroupLeader because it can't change
+			 * while that process owns otherproclock.
+			 */
+			if (otherproclock->tag.myProc->lockGroupLeader == leader)
 				myHeldLocks |= otherproclock->holdMask;
 			otherproclock = (PROCLOCK *)
 				SHMQueueNext(procLocks, &otherproclock->lockLink,
@@ -1050,7 +1054,8 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		{
 			/*
 			 * If we're part of the same locking group as this waiter, its
-			 * locks neither conflict with ours nor contribute to aheadRequsts.
+			 * locks neither conflict with ours nor contribute to
+			 * aheadRequests.
 			 */
 			if (leader != NULL && leader == proc->lockGroupLeader)
 			{
@@ -1798,7 +1803,7 @@ BecomeLockGroupMember(PGPROC *leader, int pid)
 	Assert(pid != 0);
 
 	/* Try to join the group. */
-	leader_lwlock = LockHashPartitionLockByProc(MyProc);
+	leader_lwlock = LockHashPartitionLockByProc(leader);
 	LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
 	if (leader->lockGroupLeaderIdentifier == pid)
 	{
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 6b4e365..a3e19e1 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -346,7 +346,6 @@ typedef struct PROCLOCK
 	PROCLOCKTAG tag;			/* unique identifier of proclock object */
 
 	/* data */
-	PGPROC	   *groupLeader;	/* group leader, or NULL if no lock group */
 	LOCKMASK	holdMask;		/* bitmask for lock types currently held */
 	LOCKMASK	releaseMask;	/* bitmask for lock types to be released */
 	SHM_QUEUE	lockLink;		/* list link in LOCK's list of proclocks */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Introduce group locking to prevent parallel processes from deadl

Reply via email to