On 10/01/2016 02:44 AM, Andres Freund wrote:
On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
Attached is a significantly improved version of this series. The main
- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
Interesting. Have you looked at cuckoo hashing? That seems to be the
go-to hashing in several database-related papers I've read recently, so
I guess there's a reason for that (IIRC it has constant worst case for
lookups, constant amortized time for inserts/deletes). Not sure how it
compares to robin-hood hashing regarding cache-friendliness though.
- Significant comment and correctness fixes, both in simplehash.h
- itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.
Unfortunately I'm running out battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
Well, I have rather bad experience with running benchmarks on laptops
anyway - a lot of noise due to power management, etc. What about running
a bigger benchmark - say, TPC-DS - and evaluating the impact?
I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places switched
to the new implementation (e.g. execGrouping merges the duplicates to a
single group, by definition).
This patch series currently consists out of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in , to improve peformance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates
While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.
So, is it the right time to do some benchmarking?
I think there's several possible additional users of simplehash.h,
e.g. catcache.c - which has an open coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription: