On Sat, Feb 13, 2016 at 1:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <rh...@postgresql.org> writes:
>> Introduce group locking to prevent parallel processes from deadlocking.
> I'm fairly unhappy about this patch, because it has introduced a large
> amount of new complexity into the lock manager with effectively no
> documentation.  Yeah, there's several paragraphs added to lmgr/README,
> but they're just a handwavy apologia for the high-level decision to allow
> exclusive locks taken by parallel workers to not conflict.  I see nothing
> whatever explaining how it works, that is any useful docs about the new
> data structures or invariants.  For example, I can't figure out whether
> you have broken the pg_locks display (by causing relevant lock state to
> not get displayed, or get displayed in a useless fashion because one
> can no longer tell which entries block which other entries).  I can't
> even tell for sure if it works at all: the README says only that locks
> taken by members of the same parallel group don't conflict, but the
> implementation looks more like some operations get applied to the group
> leader rather than to followers, which is something completely different.

OK.  I don't know exactly what new documentation needs to be written,
but I'm happy to try to work to improve it.  Obviously, it was clear
enough to me, but that's not saying a lot.  Let me try to answer some
of your questions here via email, and then we can decide what of that
should go into comments, the lmgr README, or someplace else.

First, the overall concept here is that processes can either be a
member of a lock group or a member of no lock group.  The concept of a
lock group is formally separate from the concept of a parallel group
created by a ParallelContext, but it is not clear that there will ever
be any other context in which a lock group will be a good idea.  It is
not impossible to imagine: for example, suppose you had a group of
backends that all had TCP client connections, and those processes all
wanted to ingest data and stuff it in a table but without allowing any
unrelated process to touch the table, say because it would be
inconsistent during the operation and would remain so until the
indexes were rebuilt afterwards.  I don't have any plans to implement
anything like that, but I
felt it was better to keep the concept of a lock group - which is a
group of processes that cooperate so closely that their locks need not
conflict - separate from the concept of a parallel context - which is a
process that is most likely connected to a user plus a bunch of
ephemeral background workers that aren't.  That way, if somebody later
wants to try to reuse the lock grouping stuff for something else,
nothing will get in the way of that; if not, no harm done, but keeping
the two things decoupled is at least easier to understand, IMHO.

Second, I can review the data structure changes and the associated
invariants.  Each PGPROC has four new members:
lockGroupLeaderIdentifier, lockGroupLeader, lockGroupMembers, and
lockGroupLink.  The first is simply a safety mechanism.  A newly
started parallel worker has to try to join the leader's lock group,
but it has no guarantee that the group leader is still alive by the
time it gets started.  I tried to ensure that the parallel leader dies
after all workers in normal cases, but also that the system could
survive relatively intact if that somehow fails to happen.  This is
one of the precautions against such a scenario: the leader relays its
PGPROC and also its PID to the worker, and the worker fails to join
the lock group unless the given PGPROC still has the same PID.  I
assume that PIDs are not recycled quickly enough for this interlock to
be defeated.

lockGroupLeader is NULL for processes not involved in parallel query.
When a process wants to cooperate with parallel workers, it becomes a
lock group leader, which means setting this field back to point to its
own PGPROC.  When a parallel worker starts up, it points this field at
the leader, with the above-mentioned interlock.  The lockGroupMembers
field is only used in the leader; it is a list of the workers.  The
lockGroupLink field is used to link the leader and all workers into
the leader's list.  All of these fields are protected by the lock
manager locks; the lock manager lock that protects the fields in a
given PGPROC is chosen by taking pgprocno modulo the number of lock
manager partitions.  This unusual arrangement has a major advantage:
the deadlock detector can count on the fact that no lockGroupLeader
field can change while the deadlock detector is running, because it
knows that it holds all the lock manager locks.  Earlier versions of
the patch protected these fields with the PGPROC's backendLock, but it
was very hard to be confident that this wouldn't lead to deadlock
where one process acquires a lock manager lock and then a backendLock
and another process does the reverse.  Even if that didn't happen, it
was inconvenient not to be able to access the fields without
additional locking.

Each PROCLOCK now has a new groupLeader flag.  While a PGPROC's
lockGroupLeader may be NULL if that process is not involved in group
locking, a PROCLOCK's groupLeader always has a non-NULL value; it
points back to the PGPROC that acquired the lock.  There's no horribly
principled reason for this different contract; it just turned out to
be more convenient.  PGPROCs need to distinguish whether they are
involved in group locking or not, and using NULL as a sentinel value
for lockGroupLeader is a convenient way to do that; but it turns out
we don't really need to know that for a PROCLOCK, so sending
groupLeader equal to myProc when there's no group locking involved is
more convenient.

Within the deadlock detector, an EDGE now represents a constraint
between two lock groups rather than between two processes; therefore,
the waiter and blocker are the respective group leaders rather than
the processes actually waiting.  A LOCK * is added to the EDGE data
structure for this reason -- different processes in the lock group
might be waiting for different locks, and we need to be sure about
which lock we're constraining.   It's possible for there to be
multiple processes on the same wait queue for the same LOCK; in fact,
this is probably the most common scenario.  Imagine that the leader
acquires a lock, probably AccessShareLock, but then before the workers
start up, someone queues up for an AccessExclusiveLock.  The workers
will all be soft-blocked by that other process, but must go ahead of
it; the leader won't release its lock until some time after the query
finishes, and that won't happen until the workers finish.  TopoSort()
handles this by emitting all processes in the same lock group adjacent
to each other in the output wait ordering.  As the comment explains:

         * Output everything in the lock group.  There's no point in outputting
         * an ordering where members of the same lock group are not
         * consecutive on the wait queue: if some other waiter is between two
         * requests that belong to the same group, then either it conflicts
         * with both of them and is certainly not a solution; or it conflicts
         * with at most one of them and is thus isomorphic to an ordering
         * where the group members are consecutive.

There are a number of other new comments in TopoSort() which I believe
to be helpful for understanding how it goes about its mission.

With respect to pg_locks - and for that matter also pg_stat_activity -
I think you are right that improvement is needed.  I haven't really
done anything about those during the whole course of developing
parallel query, so the information that they show for parallel workers
is precisely the same information that they would show for any other
background worker.   Now, background workers do show up in both
places, but it's not straightforward to identify which background
workers go with which parallel leader or whether group locking is in
use.  The simplest thing we could do to make that easier is, in
pg_stat_activity, have parallel workers advertise the PID that
launched them in a new field; and in pg_locks, have members of a lock
group advertise the leader's PID in a new field.  I think that would
make it pretty simple to figure out what you want to know.  Although
such an improvement is probably necessary, I have my doubts about
whether it's entirely sufficient - I feel like there might be other
improvements we should also make, but I think somebody (or somebodys)
who is (or are) at arm's length from the project needs to try it out
and provide some input on what would make it more usable.  I don't
think it would be technically hard to implement any good design we
might agree on - but figuring out what a good design would look like
may not be so easy.

By the way, I'm pretty sure that you being stupid is not the problem
here.  At the same time, I did make an effort to document the things
that seemed to me to be the key points.  I think I've missed some
things and that should be fixed, but that doesn't mean that I didn't
make an effort.  It just means that code is always easier to
understand when you wrote it yourself, which for you is true of the
old code in this file and false of the new code.  I think that the
preexisting code here is pretty hard to understand as things stand -
it took me over a year of study to figure out exactly what to change
and how.  The README is basically just a cut-and-paste of an email you
posted to pgsql-hackers, and there is a heck of a lot of complexity
there that's not really explained anywhere; you just have to reverse
engineer it from what the code does.  Moreover, this code has been
around for ten years with zero test coverage.  If we get some
buildfarm members set up with force_parallel_mode=regress and
max_parallel_degree>0, the new code will have more test coverage as of
that instant than the existing code has ever had.  It's wholly unclear
how some of the code in deadlock.c has been tested by anyone,
anywhere, ever.  I wouldn't be surprised if the savedList = false case
in TestConfiguration() has been hit zero times outside of somebody
manually modifying the source tree to hit it.  If that's ever
happened, it was probably only on your machine during the development
of this code.  My point is that I'm completely willing to admit that I
have not done everything perfectly here, but I also don't want to
overstate the level of perfection coming in.  This code is
battle-tested and it will be really bad if I've broken that and we
don't find it before release, but the existing state of affairs did
not make avoiding breakage as straightforward as it might have been.

I hope this helps and please let me know what more you think I should do.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
