On Thu, Mar 6, 2014 at 7:07 PM, Greg Stark <st...@mit.edu> wrote:
> On Thu, Mar 6, 2014 at 11:14 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> I've been tempted to implement a new type of hash index that allows both WAL
>>> and high concurrency, simply by disallowing bucket splits.  At the index
>>> creation time you use a storage parameter to specify the number of buckets,
>>> and that is that. If you mis-planned, build a new index with more buckets,
>>> possibly concurrently, and drop the too-small one.
>>
>> Yeah, we could certainly do something like that.  It sort of sucks,
>> though.  I mean, it's probably pretty easy to know that starting with
>> the default 2 buckets is not going to be enough; most people will at
>> least be smart enough to start with, say, 1024.  But are you going to
>> know whether you need 32768 or 1048576 or 33554432?  A lot of people
>> won't, and we have more than enough reasons for performance to degrade
>> over time as it is.
>
> The other thought I had was that you can do things lazily in vacuum.
> So when you probe you need to check multiple pages until vacuum comes
> along and rehashes everything.

I was thinking about this, too.  Cleaning up the old bucket after the
split is pretty similar to vacuuming.  And lo and behold, vacuum also
needs to lock the entire bucket.  AFAICT, there are two reasons for
this.  First, when we resume a suspended scan, we assume that the
next match, if any, will occur on the same page at an offset greater
than the offset where we found the previous match.  The code copes
with the possibility of concurrent insertions by refinding the last
item we returned and scanning forward from there, but assumes the
pointer we're refinding can't have moved to a lower offset.  I think
this could be easily fixed - at essentially no cost - by changing
hashgettuple so that, if the forward scan fails to re-find the item,
it tries scanning backward rather than just giving up.  Second,
vacuum compacts each bucket that it modifies using
_hash_squeezebucket, which scans from the two ends of the bucket
chain in toward the middle, filling any free space on earlier pages
by pulling tuples from the end of the chain.  This is a little
thornier.  We could change vacuum so that it
only removes TIDs from the individual pages, without actually trying
to free up pages, but that seems undesirable.
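
To make the first fix concrete, here's a rough, untested sketch of the
re-find logic I have in mind.  The helper name and signature are made
up for illustration; the real change would live inside hashgettuple's
existing re-find code:

/*
 * Hypothetical helper (name and signature invented for illustration):
 * re-find the heap TID we last returned on the current, share-locked
 * page.  The forward pass is what the existing code already does; the
 * backward pass is the proposed fallback for the case where the item
 * has moved to a lower offset.
 */
#include "postgres.h"

#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "storage/off.h"

static OffsetNumber
refind_scan_position(Page page, OffsetNumber lastoff, ItemPointer heaptid)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber off;

    /* Forward pass, as today: assume the item only moved rightward. */
    for (off = lastoff; off <= maxoff; off = OffsetNumberNext(off))
    {
        IndexTuple  itup = (IndexTuple)
            PageGetItem(page, PageGetItemId(page, off));

        if (ItemPointerEquals(&itup->t_tid, heaptid))
            return off;
    }

    /* Backward pass: proposed fallback instead of just giving up. */
    for (off = OffsetNumberPrev(lastoff); off >= FirstOffsetNumber;
         off = OffsetNumberPrev(off))
    {
        IndexTuple  itup = (IndexTuple)
            PageGetItem(page, PageGetItemId(page, off));

        if (ItemPointerEquals(&itup->t_tid, heaptid))
            return off;
    }

    return InvalidOffsetNumber; /* caller decides how to recover */
}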

However, I think we might be able to get by with making bucket
compaction less aggressive, without eliminating it altogether.
Suppose that whenever we remove items from a page, we consider
consolidating the page with its immediate predecessor and successor in
the bucket chain.  This means our utilization will be a little over
50% in the worst case where we have a full page, a page with one item,
another full page, another page with one item, and so on.  But that's
not all bad, because compacting a bucket chain to the minimum possible
size when it may well be about to suffer more inserts isn't
necessarily a good thing anyway.  Also, doing this means that we don't
need to lock out scans from the entire bucket chain in order to
compact.  It's sufficient to make sure that nobody's in the middle of
scanning the two pages we want to merge.
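
As a strawman for the "should we merge?" test, something like the
following seems sufficient (untested; the name is mine, and the real
thing would need to be more careful about free-space accounting and
would also have to actually move the tuples, free the emptied overflow
page, and WAL-log the whole thing):

/*
 * Illustrative only: decide whether every tuple on "src" would fit
 * into the free space of its neighbor "dst" in the same bucket chain.
 * If so, the caller could move the tuples and release the emptied
 * page.
 */
#include "postgres.h"

#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

static bool
hash_pages_mergeable(Page dst, Page src)
{
    Size        freespace = PageGetFreeSpace(dst);
    Size        needed = 0;
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(src);

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        IndexTuple  itup = (IndexTuple)
            PageGetItem(src, PageGetItemId(src, off));

        /* each moved tuple costs its aligned size plus a line pointer */
        needed += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
    }

    return needed <= freespace;
}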

Making sure of that turns out not to be possible right now.  When a scan is
suspended, we still hold a pin on the page we're scanning.  But when
_hash_readnext or _hash_readprev walks the bucket chain, it drops both
lock and pin on the current buffer and then pins and locks the new
buffer.  That, however, seems like it could easily be changed: drop
lock on current buffer, acquire pin on new buffer, drop pin on current
buffer, lock new buffer.  Then, if we want to merge two buffers, it's
enough to lock both of them for cleanup.  To avoid any risk of
deadlock, and also to avoid waiting for a long-running suspended scan
to wake up and complete, we can do this conditionally; if we fail to
get either cleanup lock, then just don't merge the pages: take only an
exclusive lock, remove whatever needs to be removed, and leave the
rest.  Merging pages is only a performance optimization, so if it
fails now and then, no real harm done.
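
In code, the reordered hand-off would look roughly like this (a
simplified sketch, not the actual _hash_readnext, which also has to
cope with invalid block numbers and the bucket page itself):

/*
 * Simplified sketch of the proposed pin hand-off: the scan always
 * holds a pin on at least one page of the bucket chain, because we
 * pin the next page before letting go of the current one.
 */
#include "postgres.h"

#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "utils/relcache.h"

static Buffer
step_to_next_page(Relation rel, Buffer curbuf, BlockNumber nextblkno)
{
    Buffer      nextbuf;

    /* 1. drop the content lock, but keep our pin on the current page */
    LockBuffer(curbuf, BUFFER_LOCK_UNLOCK);

    /* 2. pin the next page while the old pin is still held */
    nextbuf = ReadBuffer(rel, nextblkno);

    /* 3. only now let go of the page we're leaving */
    ReleaseBuffer(curbuf);

    /* 4. lock the new page and continue the scan there */
    LockBuffer(nextbuf, BUFFER_LOCK_SHARE);

    return nextbuf;
}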

(A side benefit of this approach is that we could opportunistically
attempt to compact pages containing dead items any time we can manage
a ConditionalLockBufferForCleanup() on the page, sort of like what HOT
pruning does for heap blocks.  We could even, after pruning away dead
items, attempt to merge the page with siblings, so that even without
vacuum the bucket chains can gradually shrink if the index tuples are
discovered to be pointing to dead heap tuples.)
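
For instance (sketch only; hash_prune_page is a placeholder for a
routine that doesn't exist today, which would remove LP_DEAD items and
perhaps attempt the sibling merge):

/*
 * Opportunistic cleanup in the spirit of HOT pruning: if we can get a
 * cleanup lock without waiting, prune the page; otherwise do nothing.
 */
#include "postgres.h"

#include "storage/buf.h"
#include "storage/bufmgr.h"

extern void hash_prune_page(Buffer buf);    /* hypothetical */

static void
maybe_prune_hash_page(Buffer buf)
{
    /* Someone else holds a pin?  Just skip it; this is best-effort. */
    if (!ConditionalLockBufferForCleanup(buf))
        return;

    hash_prune_page(buf);       /* remove dead items, maybe merge */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}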

With the above changes, vacuuming a bucket can happen without taking
the heavyweight lock in exclusive mode, leaving bucket splitting as
the only operation that requires it.  And we could perhaps use
basically the same algorithm to clean the tuples out of the old bucket
after the split.  The problem is that, when we're vacuuming, we know
that no scan currently in progress can still care about any of the
tuples we're removing.  I suppose a SnapshotAny scan might care, but
we don't support that for hash indexes, and I don't think we need to,
or at least not concurrently.  When we're cleaning up after a bucket
split, however, we don't know whether there are any pre-split scans
still in flight.  Given the locking changes described above, I had the
notion that we could solve that problem by starting a new scan of the
bucket, locking each buffer for cleanup.  If, contrary to present
practice, each scan always kept a pin, when we got to the end of the
bucket chain, we'd know that any remaining scans started after ours,
and thus had the latest information.  There are at least two problems
with this idea.  First, it might make post-split cleanup take a really
long time, if somebody's got a cursor paused in the middle of a bucket
chain somewhere.  Second, hash scans can be backed up, which is either
an undetected-deadlock hazard or a returning-wrong-answers hazard
depending on exactly how you implement this.
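
For clarity, the "wait out pre-split scans" walk I'm talking about
would be something like the loop below (sketch only), and both
problems above are problems with exactly this loop: LockBufferForCleanup
can sleep indefinitely behind a paused cursor, and a scan that backs up
behind us defeats the invariant the walk relies on.

/*
 * Sketch of the idea: walk the old bucket chain taking a cleanup lock
 * on each page.  If scans always keep a pin somewhere in the chain
 * (per the locking change above), then once we reach the end of the
 * chain, any scan still in the bucket must have started after we did.
 */
#include "postgres.h"

#include "access/hash.h"
#include "storage/bufmgr.h"

static void
wait_out_bucket_scans(Relation rel, BlockNumber bucket_blkno)
{
    BlockNumber blkno = bucket_blkno;

    while (BlockNumberIsValid(blkno))
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        Page        page;
        HashPageOpaque opaque;

        /* may sleep until every older scan lets go of its pin here */
        LockBufferForCleanup(buf);

        page = BufferGetPage(buf);
        opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        blkno = opaque->hasho_nextblkno;

        UnlockReleaseBuffer(buf);
    }
}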

So I'm still a bit stumped about how to implement the "wait out scans"
portion of the design I sketched out previously.  In essence, that's
what the heavyweight locking is implementing today, for both vacuum
and bucket splits, and shuffling the work around between those two
code paths in some way may be a good idea, but the core issue is that
if we're not going to use heavyweight locks for waiting out scans,
then we need some other mechanism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

