On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I'll write another email with my thoughts about the rest of the patch.

I think that the README changes for this patch need a fairly large
amount of additional work.  Here are a few things I notice:

- The confusion between buckets and pages hasn't been completely
cleared up.  If you read the beginning of the README, the terminology
is clearly set forth.  It says:

>> A hash index consists of two or more "buckets", into which tuples are placed 
>> whenever their hash key maps to the bucket number.  Each bucket in the hash 
>> index comprises one or more index pages.  The bucket's first page is 
>> permanently assigned to it when the bucket is created. Additional pages, 
>> called "overflow pages", are added if the bucket receives too many tuples to 
>> fit in the primary bucket page."

But later on, you say:

>> Scan will take a lock in shared mode on the primary bucket or on one of the 
>> overflow page.

So the correct terminology here would be "primary bucket page" not
"primary bucket".

- In addition, notice that there are two English errors in this
sentence: the word "the" needs to be added to the beginning of the
sentence, and the last word needs to be "pages" rather than "page".
There are a considerable number of similar minor errors; if you can't
fix them, I'll make a pass over it and clean it up.

- The whole "lock definitions" section seems to me to be pretty loose
and imprecise about what is happening.  For example, it uses the term
"split-in-progress" without first defining it.  The sentence quoted
above says that scans take a lock in shared mode either on the primary
page or on one of the overflow pages, but it's not to document code by
saying that it will do either A or B without explaining which one!  In
fact, I think that a scan will take a content lock first on the
primary bucket page and then on each overflow page in sequence,
retaining a pin on the primary buffer page throughout the scan.  So it
is not one or the other but both in a particular sequence, and that
can and should be explained.

Another problem with this section is that even when it's precise about
what is going on, it's probably duplicating what is or should be in
the following sections where the algorithms for each operation are
explained.  In the original wording, this section explains what each
lock protects, and then the following sections explain the algorithms
in the context of those definitions.  Now, this section contains a
sketch of the algorithm, and then the following sections lay it out
again in more detail.  The question of what each lock protects has
been lost.  Here's an attempt at some text to replace what you have

Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks.   Here as elsewhere in
PostgreSQL, cleanup lock means that we hold an exclusive lock on the
buffer and have observed at some point after acquiring the lock that
we hold the only pin on that buffer.  For hash indexes, a cleanup lock
on a primary bucket page represents the right to perform an arbitrary
reorganization of the entire bucket, while a cleanup lock on an
overflow page represents the right to perform a reorganization of just
that page.  Therefore, scans retain a pin on both the primary bucket
page and the overflow page they are currently scanning, if any.
Splitting a bucket requires a cleanup lock on both the old and new
primary bucket pages.  VACUUM therefore takes a cleanup lock on every
bucket page in turn order to remove tuples.  It can also remove tuples
copied to a new bucket by any previous split operation, because the
cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress.
After cleaning each page individually, it attempts to take a cleanup
lock on the primary bucket page in order to "squeeze" the bucket down
to the minimum possible number of pages.

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem.  Suppose process A is performing a scan
of some hash index.  While the scan is suspended, it attempts to take
a lock and is blocked by process B.  Process B, meanwhile, is running
VACUUM on that hash index.  Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them.  This patch probably breaks that.  I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk.  Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.

By the way, since VACUUM must run in its own transaction, B can't be
holding arbitrary locks, but that doesn't seem quite sufficient to get
us out of the woods.  It will at least hold ShareUpdateExclusiveLock
on the relation being vacuumed, and process A could attempt to acquire
that same lock.

Also in regards to deadlock, I notice that you added a paragraph
saying that we lock higher-numbered buckets before lower-numbered
buckets.  That's fair enough, but what about the metapage?  The reader
algorithm suggests that the metapage must lock must be taken after the
bucket locks, because it tries to grab the bucket lock conditionally
after acquiring the metapage lock, but that's not documented here.

The reader algorithm itself seems to be a bit oddly explained.

      pin meta page and take buffer content lock in shared mode
+    compute bucket number for target hash key
+    read and pin the primary bucket page

So far, I'm with you.

+    conditionally get the buffer content lock in shared mode on
primary bucket page for search
+    if we didn't get the lock (need to wait for lock)

"didn't get the lock" and "wait for the lock" are saying the same
thing, so this is redundant, and the statement that it is "for search"
on the previous line is redundant with the introductory text
describing this as the reader algorithm.

+        release the buffer content lock on meta page
+        acquire buffer content lock on primary bucket page in shared mode
+        acquire the buffer content lock in shared mode on meta page


+        to check for possibility of split, we need to recompute the bucket and
+        verify, if it is a correct bucket; set the retry flag

This makes it sound like we set the retry flag whether it was the
correct bucket or not, which isn't sensible.

+    else if we get the lock, then we can skip the retry path

This line is totally redundant.  If we don't set the retry flag, then
of course we can skip the part guarded by if (retry).

+    if (retry)
+        loop:
+            compute bucket number for target hash key
+            release meta page buffer content lock
+            if (correct bucket page is already locked)
+                break
+            release any existing content lock on bucket page (if a
concurrent split happened)
+            pin primary bucket page and take shared buffer content lock
+            retake meta page buffer content lock in shared mode

This is the part I *really* don't understand.  It makes sense to me
that we need to loop until we get the correct bucket locked with no
concurrent splits, but why is this retry loop separate from the
previous bit of code that set the retry flag.  In other words, why is
not something like this?

pin the meta page and take shared content lock on it
compute bucket number for target hash key
if (we can't get a shared content lock on the target bucket without blocking)
        release meta page content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)

Another thing I don't quite understand about this algorithm is that in
order to conditionally lock the target primary bucket page, we'd first
need to read and pin it.  And that doesn't seem like a good thing to
do while we're holding a shared content lock on the metapage, because
of the principle that we don't want to hold content locks across I/O.

 -- then, per read request:
    release pin on metapage
-    read current page of bucket and take shared buffer content lock
-        step to next page if necessary (no chaining of locks)
+    if the split is in progress for current bucket and this is a new bucket
+        release the buffer content lock on current bucket page
+        pin and acquire the buffer content lock on old bucket in shared mode
+        release the buffer content lock on old bucket, but not pin
+        retake the buffer content lock on new bucket
+        mark the scan such that it skips the tuples that are marked
as moved by split

Aren't these steps done just once per scan?  If so, I think they
should appear before "-- then, per read request" which AIUI is
intended to imply a loop over tuples.

+    step to next page if necessary (no chaining of locks)
+        if the scan indicates moved by split, then move to old bucket
after the scan
+        of current bucket is finished
     get tuple
     release buffer content lock and pin on current page
 -- at scan shutdown:
-    release bucket share-lock

Don't we have a pin to release at scan shutdown in the new system?

Well, I was hoping to get through the whole patch in one email, but
I'm not even all the way through the README.  However, it's late, so
I'm stopping here for now.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to