On Sun, 2007-09-02 at 13:04 -0500, Kenneth Marshall wrote:
> Dear PostgreSQL Hackers:
> After following the hackers mailing list for quite a while,
> I am going to start investigating what will need to be done
> to improve hash index performance. Below are the pieces of
> this project that I am currently considering:
> 1. Characterize the current hash index implementation against
> the BTree index, with a focus on space utilization and
> lookup performance against a collection of test data. This
> will give a baseline performance test to evaluate the impact
> of changes. I initially do not plan to bench the hash creation
> process since my initial focus will be on lookup performance.
> 2. Evaluate the performance of different hash index implementations
> and/or changes to the current implementation. My current plan is
> to keep the implementation as simple as possible and still provide
> the desired performance. Several hash index suggestions deal with
> changing the layout of the keys on a page to improve lookup
> performance, including reducing the bucket size to a fraction of
> a page or only storing the hash value on the page, instead of
> the index value itself. My goal in this phase is to produce one
> or more versions with better performance than the current BTree.
> 3. Look at build time and concurrency issues with the addition of
> some additional tests to the test bed.
> 4. Repeat as needed.
> This is the rough plan. Does anyone see anything critical that
> is missing at this point? Please send me any suggestions for test
> data and various performance test ideas, since I will be working
> on that first.
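As a thought experiment on item 2 above, storing only the hash value on the index page (rather than the indexed key itself) might look roughly like the sketch below. This is illustrative Python, not PostgreSQL's actual hash AM; the class and names are invented. The key point is that hash-only entries can collide, so every candidate must be rechecked against the heap:

```python
class HashOnlyIndex:
    """Toy index whose entries are (hash_value, heap_tid) pairs; the
    indexed key is never stored on the index page. Illustrative only."""

    def __init__(self, nbuckets=16):
        self.nbuckets = nbuckets
        self.buckets = [[] for _ in range(nbuckets)]

    def _hash(self, key):
        # Stand-in for a real 32-bit hash function such as hash_any().
        return hash(key) & 0xFFFFFFFF

    def insert(self, key, heap_tid):
        h = self._hash(key)
        self.buckets[h % self.nbuckets].append((h, heap_tid))

    def lookup(self, key, heap):
        # Matching on the hash alone can give false positives, so each
        # candidate TID is rechecked against the actual heap tuple.
        h = self._hash(key)
        return [tid for (hv, tid) in self.buckets[h % self.nbuckets]
                if hv == h and heap[tid] == key]

heap = {0: 'alice', 1: 'bob', 2: 'carol'}
idx = HashOnlyIndex()
for tid, key in heap.items():
    idx.insert(key, tid)
print(idx.lookup('bob', heap))  # [1]
```

The trade-off this makes visible: index entries shrink to a fixed size regardless of key length, at the cost of a mandatory heap recheck on every probe.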
I'd be particularly interested in large indexes, say ~0.5 - 2 GB. Various
effects are likely to become apparent as the indexes grow, and it would
be too easy to run all the tests against smaller indexes and miss them.
Other factors are:
My general experience is that hash-based indexes work best when the
range of inputs is relatively well known, allowing a fast lookup. If
that is the main benefit of hash indexes, a more flexible hashing scheme
may simply weaken the case for using them. If that's true, should the
index build process examine the key values in the data to determine the
best parameters to use? A kind of ANALYZE before build.
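To make that "ANALYZE before build" idea concrete: a pre-scan could sample the key values and size the bucket array up front, so the build avoids repeatedly splitting buckets as tuples arrive. The function name, the 0.75 fill factor, and the power-of-two rounding below are all assumptions for illustration, not anything the server does today:

```python
import math

def choose_nbuckets(sample_keys, target_fill=0.75):
    """Size the bucket array from a sample of key values so the build
    need not split buckets repeatedly. The 0.75 fill factor and the
    power-of-two rounding are assumptions, not server behaviour."""
    ndistinct = len(set(sample_keys))
    needed = max(1, math.ceil(ndistinct / target_fill))
    return 1 << (needed - 1).bit_length()  # round up to a power of two

# 300 distinct keys -> ceil(300 / 0.75) = 400 -> next power of two = 512
print(choose_nbuckets(list(range(300))))  # 512
```

A real version would presumably work from the planner's ndistinct estimate rather than an explicit sample, but the sizing arithmetic would be similar.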
My current feeling is that they ought to be very good at handling
read-mostly situations such as privilege checking or UPDATE-intensive
situations such as Customer-Current-Credit tracking, when the number of
customers is large.
It might also be worth looking at lossy hash indexes, i.e. the index
stores only the block numbers. That would need to be part of the
discussion around how lossy we will allow indexes to be.
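A lossy variant along those lines, where the index maps a key's hash to heap block numbers only, might behave like this toy sketch (invented names; the point is that a probe yields candidate blocks, each of which must be scanned and its keys rechecked):

```python
class LossyHashIndex:
    """Toy lossy index: each bucket holds heap block numbers only, so a
    lookup returns candidate blocks, not exact tuples. Illustrative only."""

    def __init__(self, nbuckets=8):
        self.nbuckets = nbuckets
        self.buckets = [set() for _ in range(nbuckets)]

    def insert(self, key, block_no):
        self.buckets[hash(key) % self.nbuckets].add(block_no)

    def candidate_blocks(self, key):
        return self.buckets[hash(key) % self.nbuckets]

def lookup(index, heap_blocks, key):
    # Scan every candidate block and recheck the actual key values,
    # since the index cannot distinguish tuples within a block.
    results = []
    for blk in sorted(index.candidate_blocks(key)):
        for tid, k in heap_blocks[blk]:
            if k == key:
                results.append(tid)
    return results

heap_blocks = {
    0: [(0, 'alice'), (1, 'bob')],
    1: [(2, 'carol'), (3, 'bob')],
}
idx = LossyHashIndex()
for blk, tuples in heap_blocks.items():
    for tid, k in tuples:
        idx.insert(k, blk)
print(lookup(idx, heap_blocks, 'bob'))  # [1, 3]
```

The attraction is entry size: a block number is smaller than a full TID list per key, and duplicates within a block collapse to one entry; the cost is a full scan of every candidate block per probe.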
We currently have two kinds of full text index with different
concurrency use cases, so it should be acceptable to have hash indexes
have a clear benefit in one use case but a clear loss in another.