On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapil...@gmail.com> wrote: > [ new patches ]
Attached is yet another incremental patch with some suggested changes.

+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is reponsibility of caller to
+ * release that lock.

This is confusing, because it makes it sound like we retain the lock through the entire execution of the function, which isn't always true. I would say that the caller must acquire a cleanup lock on the target primary bucket page before calling this function, and that on return that page will again be write-locked. However, the lock might be temporarily released in the meantime, while visiting overflow pages. (Attached patch has a suggested rewrite.)

+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket. This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup. Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.

This comment says that it's bad if other scans can pass our cleanup scan, but it doesn't explain why. I think it's because we don't have page-at-a-time mode yet, and cleanup might decrease the TIDs for existing index entries. (Attached patch has a suggested rewrite, but it might need further adjustment if my understanding of the reasons is incomplete.)

+		if (delay)
+			vacuum_delay_point();

You don't really need "delay". If we're not in a cost-accounted VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(), which should be safe (and a good idea) regardless. (Fixed in attached.)

+			if (callback && callback(htup, callback_state))
+			{
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
+				if (tuples_removed)
+					*tuples_removed += 1;
+			}
+			else if (bucket_has_garbage)
+			{
+				/* delete the tuples that are moved by split. */
+				bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+											  maxbucket,
+											  highmask,
+											  lowmask);
+				/* mark the item for deletion */
+				if (bucket != cur_bucket)
+				{
+					/*
+					 * We expect tuples to either belong to curent bucket or
+					 * new_bucket. This is ensured because we don't allow
+					 * further splits from bucket that contains garbage. See
+					 * comments in _hash_expandtable.
+					 */
+					Assert(bucket == new_bucket);
+					deletable[ndeletable++] = offno;
+				}
+				else if (num_index_tuples)
+					*num_index_tuples += 1;
+			}
+			else if (num_index_tuples)
+				*num_index_tuples += 1;
+		}

OK, a couple things here. First, it seems like we could also delete any tuples where ItemIdIsDead, and that seems worth doing. In fact, I think we should check it prior to invoking the callback, because it's probably quite substantially cheaper than the callback. Second, repeating deletable[ndeletable++] = offno and *num_index_tuples += 1 doesn't seem very clean to me; I think we should introduce a new bool to track whether we're keeping the tuple or killing it, and then use that to drive which of those things we do. (Fixed in attached.)

+		if (H_HAS_GARBAGE(bucket_opaque) &&
+			!H_INCOMPLETE_SPLIT(bucket_opaque))

This is the only place in the entire patch that uses H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no garbage to be removed from the "new" bucket until the next split, when it will take on the role of the "old" bucket.

I think it would be a good idea to change this so that LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing, because those tuples are NOT garbage until the split is completed!
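To illustrate the ordering I have in mind, here is a toy sketch (the flag bit values, the uint16_t flag word, and the function names are made up for illustration; they are not the patch's actual definitions). Split completion clears the in-progress bit and sets the garbage bit in the same update of the flag word, so a reader can never observe HAS_GARBAGE while OLD_PAGE_SPLIT is still set:

```c
#include <stdint.h>

/* Hypothetical flag bits, standing in for the patch's hasho_flag bits. */
#define LH_BUCKET_OLD_PAGE_SPLIT	(1 << 4)
#define LH_BUCKET_PAGE_HAS_GARBAGE	(1 << 5)

/*
 * With the proposed ordering, finishing the split clears the in-progress
 * bit and sets the garbage bit in the same update of the flag word.
 */
static void
finish_split_flags(uint16_t *hasho_flag)
{
	*hasho_flag = (*hasho_flag & ~LH_BUCKET_OLD_PAGE_SPLIT) |
		LH_BUCKET_PAGE_HAS_GARBAGE;
}

/* Cleanup then needs only a single flag test, not two. */
static int
needs_split_cleanup(uint16_t hasho_flag)
{
	return (hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE) != 0;
}
```

With that invariant, a set garbage bit already implies the split is complete, so the two-flag test quoted above collapses to one.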
Moreover, both of the places that care about LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would actually simplify the code very slightly.

+#define H_OLD_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)	((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)

The code isn't consistent about the use of these macros, or at least not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you test it using the macro; when you care about LH_BUCKET_NEW_PAGE_SPLIT, you ignore the macro and test the flag directly. Either get rid of both macros and always test directly, or keep both macros and use both of them. Using a macro for one but not the other is strange.

I wonder if we should rename these flags and macros. Maybe LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED. I think that might be clearer. When LH_BEING_POPULATED is set, the bucket is being filled - that is, populated - from the old bucket. And maybe LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.

+	 * Copy bucket mapping info now; The comment in _hash_expandtable
+	 * where we copy this information and calls _hash_splitbucket explains
+	 * why this is OK.

After a semicolon, the next word should not be capitalized. There shouldn't be two spaces after a semicolon, either. Also, _hash_splitbucket appears to have a verb before it (calls) and a verb after it (explains), and I have no idea what that means.

+	for (;;)
+	{
+		mask = lowmask + 1;
+		new_bucket = old_bucket | mask;
+		if (new_bucket > metap->hashm_maxbucket)
+		{
+			lowmask = lowmask >> 1;
+			continue;
+		}
+		blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+		break;
+	}

I can't help feeling that it should be possible to do this without looping. Can we ever loop more than once? How? Can we just use an if-then instead of a for-loop?
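For what it's worth, a standalone sketch of the if-then version might look like this (the function name is made up, and it assumes the loop can never need more than one retry, i.e. that halving lowmask once always brings new_bucket back under maxbucket):

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the quoted loop: compute the "new" bucket that
 * was split off from old_bucket.  new_bucket is old_bucket with one extra
 * high bit set; if using the current lowmask overshoots maxbucket, the
 * split in question must have happened under the previous (halved) mask.
 */
static uint32_t
calc_new_bucket(uint32_t old_bucket, uint32_t lowmask, uint32_t maxbucket)
{
	uint32_t	new_bucket = old_bucket | (lowmask + 1);

	if (new_bucket > maxbucket)
	{
		lowmask = lowmask >> 1;
		new_bucket = old_bucket | (lowmask + 1);
	}
	return new_bucket;
}
```

For example, with lowmask = 7 and maxbucket = 10, old_bucket 2 gives 10 directly, while old_bucket 3 overshoots (11) and falls back to 7 under the halved mask.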
Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket instead of duplicating the logic?

I still don't like the names of these functions very much. If you said "get X from Y", it would be clear that you put in Y and you get out X. If you say "X 2 Y", it would be clear that you put in X and you get out Y. As it is, it's not very clear which is the input and which is the output.

+						   bool primary_buc_page)

I think we could just go with "primary_page" here. (Fixed in attached.)

+	/*
+	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+	 * there is no pending scan that has seen the flag after it is cleared.
+	 */
+	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	opage = BufferGetPage(bucket_obuf);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I don't understand the comment, because the code *isn't* acquiring a cleanup lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
more-hash-tweaks.patch
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers