On Mon, Oct 30, 2017 at 3:17 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote: > On Fri, Oct 27, 2017 at 12:03 AM, Robert Haas <robertmh...@gmail.com> wrote: >> On Thu, Oct 26, 2017 at 12:36 PM, Masahiko Sawada <sawada.m...@gmail.com> >> wrote: >>> Since the previous patch conflicts with current HEAD, I attached the >>> updated patch for next CF. >> >> I think we should back up here and ask ourselves a couple of questions: > > Thank you for summarizing of the purpose and discussion of this patch. > >> 1. What are we trying to accomplish here? >> >> 2. Is this the best way to accomplish it? >> >> To the first question, the problem as I understand it as follows: >> Heavyweight locks don't conflict between members of a parallel group. >> However, this is wrong for LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE, >> LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. Currently, those cases >> don't arise, because parallel operations are strictly read-only >> (except for inserts by the leader into a just-created table, when only >> one member of the group can be taking the lock anyway). However, once >> we allow writes, they become possible, so some solution is needed. >> >> To the second question, there are a couple of ways we could fix this. >> First, we could continue to allow these locks to be taken in the >> heavyweight lock manager, but make them conflict even between members >> of the same lock group. This is, however, complicated. A significant >> problem (or so I think) is that the deadlock detector logic, which is >> already quite hard to test, will become even more complicated, since >> wait edges between members of a lock group need to exist at some times >> and not other times. Moreover, to the best of my knowledge, the >> increased complexity would have no benefit, because it doesn't look to >> me like we ever take any other heavyweight lock while holding one of >> these four kinds of locks. Therefore, no deadlock can occur: if we're >> waiting for one of these locks, the process that holds it is not >> waiting for any other heavyweight lock. This gives rise to a second >> idea: move these locks out of the heavyweight lock manager and handle >> them with separate code that does not have deadlock detection and >> doesn't need as many lock modes. I think that idea is basically >> sound, although it's possibly not the only sound idea. > > I'm on the same page. > >> >> However, that makes me wonder whether we shouldn't be a bit more >> aggressive with this patch: why JUST relation extension locks? Why >> not all four types of locks listed above? Actually, tuple locks are a >> bit sticky, because they have four lock modes. The other three kinds >> are very similar -- all you can do is "take it" (implicitly, in >> exclusive mode), "try to take it" (again, implicitly, in exclusive >> mode), or "wait for it to be released" (i.e. share lock and then >> release). Another idea is to try to handle those three types and >> leave the tuple locking problem for another day. >> >> I suggest that a good thing to do more or less immediately, regardless >> of when this patch ends up being ready, would be to insert an >> insertion that LockAcquire() is never called while holding a lock of >> one of these types. If that assertion ever fails, then the whole >> theory that these lock types don't need deadlock detection is wrong, >> and we'd like to find out about that sooner or later. > > I understood. I'll check that first.
I've checked whether LockAcquire is called while holding a lock of one of four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE, LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. To summary, I think that we cannot move these four lock types together out of heavy-weight lock, but can move only the relation extension lock with tricks. Here is detail of the survey. * LOCKTAG_RELATION_EXTENSION There is a path that LockRelationForExtension() could be called while holding another relation extension lock. In brin_getinsertbuffer(), we acquire a relation extension lock for a index relation and could initialize a new buffer (brin_initailize_empty_new_buffer()). During initializing a new buffer, we call RecordPageWithFreeSpace() which eventually can call fsm_readbuf(rel, addr, true) where the third argument is "extend". We can process this problem by having the list (or local hash) of acquired locks and skip acquiring the lock if already had. For other call paths calling LockRelationForExtension, I don't see any problem. * LOCKTAG_PAGE, LOCKTAG_TUPLE, LOCKTAG_SPECULATIVE_INSERTION There is a path that we can acquire a relation extension lock while holding these lock. For LOCKTAG_PAGE, in ginInsertCleanup() we acquire a page lock for the meta page and process the pending list which could acquire a relation extension lock for a index relation. For LOCKTAG_TUPLE, in heap_update() we acquire a tuple lock and could call RelationGetBufferForTuple(). For LOCKTAG_SPECULATIVE_INSERTION, in ExecInsert() we acquire a speculative insertion lock and call heap_insert and ExecInsertIndexTuples(). The operation that is called while holding each lock type can acquire a relation extension lock. Also the following is the list of places where we call LockAcquire() with four lock types (result of git grep "XXX"). I've checked based on the following list. * LockRelationForExtension() contrib/bloom/blutils.c: LockRelationForExtension(index, ExclusiveLock); contrib/pgstattuple/pgstattuple.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/access/brin/brin_pageops.c: LockRelationForExtension(idxrel, ShareLock); src/backend/access/brin/brin_pageops.c: LockRelationForExtension(irel, ExclusiveLock); src/backend/access/brin/brin_revmap.c: LockRelationForExtension(irel, ExclusiveLock); src/backend/access/gin/ginutil.c: LockRelationForExtension(index, ExclusiveLock); src/backend/access/gin/ginvacuum.c: LockRelationForExtension(index, ExclusiveLock); src/backend/access/gin/ginvacuum.c: LockRelationForExtension(index, ExclusiveLock); src/backend/access/gist/gistutil.c: LockRelationForExtension(r, ExclusiveLock); src/backend/access/gist/gistvacuum.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/access/gist/gistvacuum.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/access/heap/hio.c: LockRelationForExtension(relation, ExclusiveLock); src/backend/access/heap/hio.c: LockRelationForExtension(relation, ExclusiveLock); src/backend/access/heap/visibilitymap.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/access/nbtree/nbtpage.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/access/nbtree/nbtree.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/access/spgist/spgutils.c: LockRelationForExtension(index, ExclusiveLock); src/backend/access/spgist/spgvacuum.c: LockRelationForExtension(index, ExclusiveLock); src/backend/commands/vacuumlazy.c: LockRelationForExtension(onerel, ExclusiveLock); src/backend/storage/freespace/freespace.c: LockRelationForExtension(rel, ExclusiveLock); src/backend/storage/lmgr/lmgr.c:LockRelationForExtension(Relation relation, LOCKMODE lockmode) * ConditionalLockRelationForExtension src/backend/access/heap/hio.c: else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock)) src/backend/storage/lmgr/lmgr.c:ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode) * LockPage src/backend/access/gin/ginfast.c: LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock); * ConditionalLockPage src/backend/access/gin/ginfast.c: if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock)) * LockTuple src/backend/access/heap/heapam.c: LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) * ConditionalLockTuple src/backend/access/heap/heapam.c: ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) src/backend/storage/lmgr/lmgr.c:ConditionalLockTuple(Relation relation, ItemPointer tid, LOCKMODE lockmode) * SpeculativeInsertionLockAcquire src/backend/executor/nodeModifyTable.c: specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId()); > If this direction has no problem > and we changed these three locks so that it uses new lock mechanism, > we'll not be able to use these locks at the same time. Since it also > means that we impose a limitation to the future we should think > carefully about it. We can implement the deadlock detection mechanism > for it again but it doesn't make sense. > >> >> On the details of the patch, it appears that RelExtLockAcquire() >> executes the wait-for-lock code with the partition lock held, and then >> continues to hold the partition lock for the entire time that the >> relation extension lock is held. That not only makes all code that >> runs while holding the lock non-interruptible but makes a lot of the >> rest of this code pointless. How is any of this atomics code going to >> be reached by more than one process at the same time if the entire >> bucket is exclusive-locked? I would guess that the concurrency is not >> very good here for the same reason. Of course, just releasing the >> bucket lock wouldn't be right either, because then ext_lock might go >> away while we've got a pointer to it, which wouldn't be good. I think >> you could make this work if each lock had both a locker count and a >> pin count, and the object can only be removed when the pin_count is 0. >> So the lock algorithm would look like this: >> >> - Acquire the partition LWLock. >> - Find the item of interest, creating it if necessary. If out of >> memory for more elements, sweep through the table and reclaim >> 0-pin-count entries, then retry. >> - Increment the pin count. >> - Attempt to acquire the lock atomically; if we succeed, release the >> partition lock and return. >> - If this was a conditional-acquire, then decrement the pin count, >> release the partition lock and return. >> - Release the partition lock. >> - Sleep on the condition variable until we manage to atomically >> acquire the lock. >> >> The unlock algorithm would just decrement the pin count and, if the >> resulting value is non-zero, broadcast on the condition variable. > > Thank you for the suggestion! > >> Although I think this will work, I'm not sure this is actually a great >> algorithm. Every lock acquisition has to take and release the >> partition lock, use at least two more atomic ops (to take the pin and >> the lock), and search a hash table. I don't think that's going to be >> staggeringly fast. Maybe it's OK. It's not that much worse, possibly >> not any worse, than what the main lock manager does now. However, >> especially if we implement a solution specific to relation locks, it >> seems like it would be better if we could somehow optimize based on >> the facts that (1) many relation locks will not conflict and (2) it's >> very common for the same backend to take and release the same >> extension lock over and over again. I don't have a specific proposal >> right now. > > Yeah, we can optimize based on the purpose of the solution. In either > case I should answer the above question first. > >> >> Whatever we end up with, I think we should write some kind of a test >> harness to benchmark the number of acquire/release cycles per second >> that we can do with the current relation extension lock system vs. the >> proposed new system. Ideally, we'd be faster, since we're proposing a >> more specialized mechanism. But at least we should not be slower. >> pgbench isn't a good test because the relation extension lock will >> barely be taken let alone contended; we need to check something like >> parallel copies into the same table to see any effect. >> > > I did a benchmark using a custom script that always updates the > primary key (disabling HOT updates). But parallel copies into the same > tale would also be good. Thank you. > Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers