For making hash indexes usable in production systems, we need to improve
its concurrency and make them crash-safe by WAL logging them.  The first
problem I would like to tackle is improve the concurrency of hash
indexes.  First
advantage, I see with improving concurrency of hash indexes is that it has
the potential of out performing btree for "equal to" searches (with my WIP
patch attached with this mail, I could see hash index outperform btree
index by 20 to 30% for very simple cases which are mentioned later in this
e-mail).   Another advantage as explained by Robert [1] earlier is that if
we remove heavy weight locks under which we perform arbitrarily large
number of operations, it can help us to sensibly WAL log it.  With this
patch, I would also like to make hash indexes capable of completing the
incomplete_splits which can occur due to interrupts (like cancel) or errors
or crash.

I have studied the concurrency problems of hash index and some of the
solutions proposed for same previously and based on that came up with below
solution which is based on idea by Robert [1], community discussion on
thread [2] and some of my own thoughts.

Maintain a flag that can be set and cleared on the primary bucket page,
call it split-in-progress, and a flag that can optionally be set on
particular index tuples, call it moved-by-split. We will allow scans of all
buckets and insertions into all buckets while the split is in progress, but
(as now) we will not allow more than one split for a bucket to be in
progress at the same time.  We start the split by updating metapage to
incrementing the number of buckets and set the split-in-progress flag in
primary bucket pages for old and new buckets (lets number them as old
bucket - N+1/2; new bucket - N + 1 for the matter of discussion). While the
split-in-progress flag is set, any scans of N+1 will first scan that
bucket, ignoring any tuples flagged moved-by-split, and then ALSO scan
bucket N+1/2. To ensure that vacuum doesn't clean any tuples from old or
new buckets till this scan is in progress, maintain a pin on both of the
buckets (first pin on old bucket needs to be acquired). The moved-by-split
flag never has any effect except when scanning the new bucket that existed
at the start of that particular scan, and then only if
the split-in-progress flag was also set at that time.

Once the split operation has set the split-in-progress flag, it will begin
scanning bucket (N+1)/2.  Every time it finds a tuple that properly belongs
in bucket N+1, it will insert the tuple into bucket N+1 with the
moved-by-split flag set.  Tuples inserted by anything other than a split
operation will leave this flag clear, and tuples inserted while the split
is in progress will target the same bucket that they would hit if the split
were already complete.  Thus, bucket N+1 will end up with a mix
of moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples
coming from parallel insertion activity.  When the scan of bucket (N+1)/2
is complete, we know that bucket N+1 now contains all the tuples that are
supposed to be there, so we clear the split-in-progress flag on both
buckets.  Future scans of both buckets can proceed normally.  Split
operation needs to take a cleanup lock on primary bucket to ensure that it
doesn't start if there is any Insertion happening in the bucket.  It will
leave the lock on primary bucket, but not pin as it proceeds for next
overflow page.  Retaining pin on primary bucket will ensure that vacuum
doesn't start on this bucket till the split is finished.

Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't
happen, otherwise split might leave this tuple unaccounted.

Now for deletion of tuples from (N+1/2) bucket, we need to wait for the
completion of any scans that began before we finished populating bucket
N+1, because otherwise we might remove tuples that they're still expecting
to find in bucket (N+1)/2. The scan will always maintain a pin on primary
bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes
Exclusive lock on bucket and wait till all the pins on buffer becomes zero)
on primary bucket for the buffer.  I think we can relax the requirement for
vacuum to take cleanup lock (instead take Exclusive Lock on buckets where
no split has happened) with the additional flag has_garbage which will be
set on primary bucket, if any tuples have been moved from that bucket,
however I think for squeeze phase (in this phase, we try to move the tuples
from later overflow pages to earlier overflow pages in the bucket and then
if there are any empty overflow pages, then we move them to kind of a free
pool) of vacuum, we need a cleanup lock, otherwise scan results might get

Incomplete Splits
Incomplete splits can be completed either by vacuum or insert as both needs
exclusive lock on bucket.  If vacuum finds split-in-progress flag on a
bucket then it will complete the split operation, vacuum won't see this
flag if actually split is in progress on that bucket as vacuum needs
cleanup lock and split retains pin till end of operation.  To make it work
for Insert operation, one simple idea could be that if insert finds
split-in-progress flag, then it releases the current exclusive lock on
bucket and tries to acquire a cleanup lock on bucket, if it gets cleanup
lock, then it can complete the split and then the insertion of tuple, else
it will have a exclusive lock on bucket and just perform the insertion of
tuple.  The disadvantage of trying to complete the split in vacuum is that
split might require new pages and allocating new pages at time of vacuum is
not advisable.  The disadvantage of doing it at time of Insert is that
Insert might skip it even if there is some scan on the bucket is going on
as scan will also retain pin on the bucket, but I think that is not a big
deal.  The actual completion of split can be done in two ways: (a) scan the
new bucket and build a hash table with all of the TIDs you find there.
When copying tuples from the old bucket, first probe the hash table; if you
find a match, just skip that tuple (idea suggested by Robert Haas offlist)
 (b) delete all the tuples that are marked as moved_by_split in the new
bucket and perform the split operation from the beginning using old bucket.

Although, I don't think it is a very good idea to take any performance data
with WIP patch, still I couldn't resist myself from doing so and below are
the performance numbers.  To get the performance data, I have dropped the
primary key constraint on pgbench_accounts and created a hash index on aid
column as below.

alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);

Below data is for read-only pgbench test and is a median of 3 5-min runs.
The performance tests are executed on a power-8 m/c.

Data fits in shared buffers
scale_factor - 300
shared_buffers - 8GB

Patch_Ver/Client count 1 8 16 32 64 72 80 88 96 128
HEAD-Btree 19397 122488 194433 344524 519536 527365 597368 559381 614321
HEAD-Hindex 18539 141905 218635 363068 512067 522018 492103 484372 440265
Patch 22504 146937 235948 419268 637871 637595 674042 669278 683704 639967
% improvement between HEAD-Hash index vs Patch and HEAD-Btree index vs
Patch-Hash index is:

Head-Hash vs Patch 21.38 3.5 7.9 15.47 24.56 22.14 36.97 38.17 55.29 62.74
Head-Btree vs. Patch 16.01 19.96 21.35 21.69 22.77 20.9 12.83 19.64 11.29
This data shows that patch improves the performance of hash index upto
62.74 and it also makes hash-index faster than btree-index by ~20% (most
client counts show the performance improvement in the range of 15~20%.

For the matter of comparison with btree, I think the impact of performance
improvement of hash index will be more when the data doesn't fit shared
buffers and the performance data for same is as below:

Data doesn't fits in shared buffers
scale_factor - 3000
shared_buffers - 8GB

Client_Count/Patch 16 64 96
Head-Btree 170042 463721 520656
Patch-Hash 227528 603594 659287
% diff 33.8 30.16 26.62
The performance with hash-index is ~30% better than Btree.  Note, that for
now,  I have not taken the data for HEAD- Hash index.  I think there will
many more cases like when hash index is on char (20) column where the
performance of hash-index can be much better than btree-index for equal to

Note that this patch is a very-much WIP patch and I am posting it mainly to
facilitate the discussion.  Currently, it doesn't have any code to perform
incomplete splits, the logic for locking/pins during Insert is yet to be
done and many more things.

[1] -
[2] -

With Regards,
Amit Kapila.

Attachment: concurrent_hash_index_v1.patch
Description: Binary data

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Reply via email to