On Sat, Feb 13, 2016 at 1:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <rh...@postgresql.org> writes:
>> Introduce group locking to prevent parallel processes from deadlocking.
>
> I'm fairly unhappy about this patch, because it has introduced a large
> amount of new complexity into the lock manager with effectively no
> documentation.  Yeah, there's several paragraphs added to lmgr/README,
> but they're just a handwavy apologia for the high-level decision to allow
> exclusive locks taken by parallel workers to not conflict.  I see nothing
> whatever explaining how it works, that is any useful docs about the new
> data structures or invariants.  For example, I can't figure out whether
> you have broken the pg_locks display (by causing relevant lock state to
> not get displayed, or get displayed in a useless fashion because one
> can no longer tell which entries block which other entries).  I can't
> even tell for sure if it works at all: the README says only that locks
> taken by members of the same parallel group don't conflict, but the
> implementation looks more like some operations get applied to the group
> leader rather than to followers, which is something completely different.
OK.  I don't know exactly what new documentation needs to be written, but I'm happy to try to work on improving it.  Obviously, it was clear enough to me, but that's not saying a lot.  Let me try to answer some of your questions here via email, and then we can decide which parts of that should go into comments, the lmgr README, or someplace else.

First, the overall concept here is that a process can either be a member of a lock group or a member of no lock group.  The concept of a lock group is formally separate from the concept of a parallel group created by a ParallelContext, though it is not clear that there will ever be any other context in which a lock group will be a good idea.  It is not impossible to imagine: for example, suppose you had a group of backends that all had TCP client connections, and those processes all wanted to ingest data and stuff it in a table without allowing any unrelated process to touch the table, say because it was going to be inconsistent during the operation and until indexes were rebuilt afterwards.  I don't have any plans to implement anything like that, but I felt it was better to keep the concept of a lock group - a group of processes that cooperate so closely that their locks need not conflict - separate from the concept of a parallel context - a leader process that is most likely connected to a user, plus a bunch of ephemeral background workers that aren't.  That way, if somebody later wants to reuse the lock grouping machinery for something else, nothing will get in the way of that; if not, no harm done, and keeping the two things decoupled is at least easier to understand, IMHO.

Second, I can review the data structure changes and the associated invariants.  Each PGPROC has four new members: lockGroupLeaderIdentifier, lockGroupLeader, lockGroupMembers, and lockGroupLink.  The first is simply a safety mechanism.
A newly started parallel worker has to try to join the leader's lock group, but it has no guarantee that the group leader is still alive by the time it gets started.  I tried to ensure that the parallel leader dies after all of its workers in normal cases, but also that the system can survive relatively intact if that somehow fails to happen.  This is one of the precautions against such a scenario: the leader relays its PGPROC and also its PID to the worker, and the worker fails to join the lock group unless the given PGPROC still has the same PID.  I assume that PIDs are not recycled quickly enough for this interlock to fail.

lockGroupLeader is NULL for processes not involved in parallel query.  When a process wants to cooperate with parallel workers, it becomes a lock group leader, which means setting this field to point to its own PGPROC.  When a parallel worker starts up, it points this field at the leader, subject to the above-mentioned interlock.  The lockGroupMembers field is only used in the leader; it is a list of the workers.  The lockGroupLink field is used to link the leader and all workers into the leader's list.

All of these fields are protected by the lock manager locks; the lock manager lock that protects the fields in a given PGPROC is chosen by taking pgprocno modulo the number of lock manager partitions.  This unusual arrangement has a major advantage: the deadlock detector can count on the fact that no lockGroupLeader field can change while the deadlock detector is running, because it knows that it holds all the lock manager locks.  Earlier versions of the patch protected these fields with the PGPROC's backendLock, but it was very hard to be confident that this wouldn't lead to deadlock, with one process acquiring a lock manager lock and then a backendLock and another process doing the reverse.  Even if that didn't happen, it was inconvenient not to be able to access the fields without additional locking.

Each PROCLOCK now has a new groupLeader field.
While a PGPROC's lockGroupLeader may be NULL if that process is not involved in group locking, a PROCLOCK's groupLeader is always non-NULL; it points back to the PGPROC that acquired the lock.  There's no deeply principled reason for this difference in contract; it just turned out to be more convenient.  A PGPROC needs to distinguish whether it is involved in group locking or not, and using NULL as a sentinel value for lockGroupLeader is a convenient way to do that; but it turns out we don't really need to know that for a PROCLOCK, so setting groupLeader equal to myProc when there's no group locking involved is more convenient.

Within the deadlock detector, an EDGE now represents a constraint between two lock groups rather than between two processes; therefore, the waiter and blocker are the respective group leaders rather than the processes actually waiting.  A LOCK * is added to the EDGE data structure for this reason -- different processes in the lock group might be waiting for different locks, and we need to be sure about which lock we're constraining.

It's possible for there to be multiple processes on the same wait queue for the same LOCK; in fact, this is probably the most common scenario.  Imagine that the leader acquires a lock, probably AccessShareLock, and then, before the workers start up, someone queues up for an AccessExclusiveLock.  The workers will all be soft-blocked by that other process, but must go ahead of it: the leader won't release its lock until some time after the query finishes, and that won't happen until the workers finish.  TopoSort() handles this by emitting all processes in the same lock group adjacent to each other in the output wait ordering.  As the comment explains:

    /*
     * Output everything in the lock group.  There's no point in outputting
     * an ordering where members of the same lock group are not
     * consecutive on the wait queue: if some other waiter is between two
     * requests that belong to the same group, then either it conflicts
     * with both of them and is certainly not a solution; or it conflicts
     * with at most one of them and is thus isomorphic to an ordering
     * where the group members are consecutive.
     */

There are a number of other new comments in TopoSort() which I believe are helpful for understanding how it goes about its mission.

With respect to pg_locks - and for that matter also pg_stat_activity - I think you are right that improvement is needed.  I haven't really done anything about those during the whole course of developing parallel query, so the information they show for parallel workers is precisely the same information they would show for any other background worker.  Background workers do show up in both places, but it's not straightforward to identify which background workers go with which parallel leader, or whether group locking is in use.  The simplest thing we could do to make that easier is, in pg_stat_activity, have parallel workers advertise the PID that launched them in a new field; and in pg_locks, have members of a lock group advertise the leader's PID in a new field.  I think that would make it pretty simple to figure out what you want to know.  Although such an improvement is probably necessary, I have my doubts about whether it's entirely sufficient - I feel like there may be other improvements we should make as well, but I think somebody (or somebodys) at arm's length from the project needs to try it out and provide some input on what would make it more usable.  I don't think any good design we might agree on would be technically hard to implement - but figuring out what a good design looks like may not be so easy.

By the way, I'm pretty sure that you being stupid is not the problem here.
At the same time, I did make an effort to document the things that seemed to me to be the key points.  I think I've missed some things and that should be fixed, but that doesn't mean I didn't make an effort.  It just means that code is always easier to understand when you wrote it yourself - which, for you, is true of the old code in this file and false of the new code.

I think the preexisting code here is pretty hard to understand as things stand - it took me over a year of study to figure out exactly what to change and how.  The README is basically just a cut-and-paste of an email you posted to pgsql-hackers, and there is a heck of a lot of complexity that's not really explained anywhere; you just have to reverse-engineer it from what the code does.  Moreover, this code has been around for ten years with zero test coverage.  If we get some buildfarm members set up with force_parallel_mode=regress and max_parallel_degree>0, the new code will have more test coverage as of that instant than the existing code has ever had.  It's wholly unclear how some of the code in deadlock.c has been tested by anyone, anywhere, ever.  I wouldn't be surprised if the savedList = false case in TestConfiguration() has been hit zero times outside of somebody manually modifying the source tree to hit it; if that's ever happened, it was probably only on your machine during the development of this code.

My point is that I'm completely willing to admit that I haven't done everything perfectly here, but I also don't want to overstate the level of perfection coming in.  This code is battle-tested, and it will be really bad if I've broken that and we don't find out before release, but the existing state of affairs did not make avoiding breakage as straightforward as it might have been.

I hope this helps, and please let me know what more you think I should do.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers