On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmh...@gmail.com> wrote: > I'll write another email with my thoughts about the rest of the patch.
I think that the README changes for this patch need a fairly large amount of additional work. Here are a few things I notice: - The confusion between buckets and pages hasn't been completely cleared up. If you read the beginning of the README, the terminology is clearly set forth. It says: >> A hash index consists of two or more "buckets", into which tuples are placed >> whenever their hash key maps to the bucket number. Each bucket in the hash >> index comprises one or more index pages. The bucket's first page is >> permanently assigned to it when the bucket is created. Additional pages, >> called "overflow pages", are added if the bucket receives too many tuples to >> fit in the primary bucket page." But later on, you say: >> Scan will take a lock in shared mode on the primary bucket or on one of the >> overflow page. So the correct terminology here would be "primary bucket page" not "primary bucket". - In addition, notice that there are two English errors in this sentence: the word "the" needs to be added to the beginning of the sentence, and the last word needs to be "pages" rather than "page". There are a considerable number of similar minor errors; if you can't fix them, I'll make a pass over it and clean it up. - The whole "lock definitions" section seems to me to be pretty loose and imprecise about what is happening. For example, it uses the term "split-in-progress" without first defining it. The sentence quoted above says that scans take a lock in shared mode either on the primary page or on one of the overflow pages, but it's not to document code by saying that it will do either A or B without explaining which one! In fact, I think that a scan will take a content lock first on the primary bucket page and then on each overflow page in sequence, retaining a pin on the primary buffer page throughout the scan. So it is not one or the other but both in a particular sequence, and that can and should be explained. Another problem with this section is that even when it's precise about what is going on, it's probably duplicating what is or should be in the following sections where the algorithms for each operation are explained. In the original wording, this section explains what each lock protects, and then the following sections explain the algorithms in the context of those definitions. Now, this section contains a sketch of the algorithm, and then the following sections lay it out again in more detail. The question of what each lock protects has been lost. Here's an attempt at some text to replace what you have here: === Concurrency control for hash indexes is provided using buffer content locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL, cleanup lock means that we hold an exclusive lock on the buffer and have observed at some point after acquiring the lock that we hold the only pin on that buffer. For hash indexes, a cleanup lock on a primary bucket page represents the right to perform an arbitrary reorganization of the entire bucket, while a cleanup lock on an overflow page represents the right to perform a reorganization of just that page. Therefore, scans retain a pin on both the primary bucket page and the overflow page they are currently scanning, if any. Splitting a bucket requires a cleanup lock on both the old and new primary bucket pages. VACUUM therefore takes a cleanup lock on every bucket page in turn order to remove tuples. It can also remove tuples copied to a new bucket by any previous split operation, because the cleanup lock taken on the primary bucket page guarantees that no scans which started prior to the most recent split can still be in progress. After cleaning each page individually, it attempts to take a cleanup lock on the primary bucket page in order to "squeeze" the bucket down to the minimum possible number of pages. === As I was looking at the old text regarding deadlock risk, I realized what may be a serious problem. Suppose process A is performing a scan of some hash index. While the scan is suspended, it attempts to take a lock and is blocked by process B. Process B, meanwhile, is running VACUUM on that hash index. Eventually, it will do LockBufferForCleanup() on the hash bucket on which process A holds a buffer pin, resulting in an undetected deadlock. In the current coding, A would hold a heavyweight lock and B would attempt to acquire a conflicting heavyweight lock, and the deadlock detector would kill one of them. This patch probably breaks that. I notice that that's the only place where we attempt to acquire a buffer cleanup lock unconditionally; every place else, we acquire the lock conditionally, so there's no deadlock risk. Once we resolve this problem, the paragraph about deadlock risk in this section should be revised to explain whatever solution we come up with. By the way, since VACUUM must run in its own transaction, B can't be holding arbitrary locks, but that doesn't seem quite sufficient to get us out of the woods. It will at least hold ShareUpdateExclusiveLock on the relation being vacuumed, and process A could attempt to acquire that same lock. Also in regards to deadlock, I notice that you added a paragraph saying that we lock higher-numbered buckets before lower-numbered buckets. That's fair enough, but what about the metapage? The reader algorithm suggests that the metapage must lock must be taken after the bucket locks, because it tries to grab the bucket lock conditionally after acquiring the metapage lock, but that's not documented here. The reader algorithm itself seems to be a bit oddly explained. pin meta page and take buffer content lock in shared mode + compute bucket number for target hash key + read and pin the primary bucket page So far, I'm with you. + conditionally get the buffer content lock in shared mode on primary bucket page for search + if we didn't get the lock (need to wait for lock) "didn't get the lock" and "wait for the lock" are saying the same thing, so this is redundant, and the statement that it is "for search" on the previous line is redundant with the introductory text describing this as the reader algorithm. + release the buffer content lock on meta page + acquire buffer content lock on primary bucket page in shared mode + acquire the buffer content lock in shared mode on meta page OK... + to check for possibility of split, we need to recompute the bucket and + verify, if it is a correct bucket; set the retry flag This makes it sound like we set the retry flag whether it was the correct bucket or not, which isn't sensible. + else if we get the lock, then we can skip the retry path This line is totally redundant. If we don't set the retry flag, then of course we can skip the part guarded by if (retry). + if (retry) + loop: + compute bucket number for target hash key + release meta page buffer content lock + if (correct bucket page is already locked) + break + release any existing content lock on bucket page (if a concurrent split happened) + pin primary bucket page and take shared buffer content lock + retake meta page buffer content lock in shared mode This is the part I *really* don't understand. It makes sense to me that we need to loop until we get the correct bucket locked with no concurrent splits, but why is this retry loop separate from the previous bit of code that set the retry flag. In other words, why is not something like this? pin the meta page and take shared content lock on it compute bucket number for target hash key if (we can't get a shared content lock on the target bucket without blocking) loop: release meta page content lock take a shared content lock on the target primary bucket page take a shared content lock on the metapage if (previously-computed target bucket has not been split) break; Another thing I don't quite understand about this algorithm is that in order to conditionally lock the target primary bucket page, we'd first need to read and pin it. And that doesn't seem like a good thing to do while we're holding a shared content lock on the metapage, because of the principle that we don't want to hold content locks across I/O. -- then, per read request: release pin on metapage - read current page of bucket and take shared buffer content lock - step to next page if necessary (no chaining of locks) + if the split is in progress for current bucket and this is a new bucket + release the buffer content lock on current bucket page + pin and acquire the buffer content lock on old bucket in shared mode + release the buffer content lock on old bucket, but not pin + retake the buffer content lock on new bucket + mark the scan such that it skips the tuples that are marked as moved by split Aren't these steps done just once per scan? If so, I think they should appear before "-- then, per read request" which AIUI is intended to imply a loop over tuples. + step to next page if necessary (no chaining of locks) + if the scan indicates moved by split, then move to old bucket after the scan + of current bucket is finished get tuple release buffer content lock and pin on current page -- at scan shutdown: - release bucket share-lock Don't we have a pin to release at scan shutdown in the new system? Well, I was hoping to get through the whole patch in one email, but I'm not even all the way through the README. However, it's late, so I'm stopping here for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers