On 10/01/2016 09:59 PM, Andres Freund wrote:
Hi,
On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
On 10/01/2016 02:44 AM, Andres Freund wrote:
Hi,
On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
In the attached patch I've included simplehash.h, which can be
customized by a bunch of macros before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.
Attached is a significantly improved version of this series. The main
changes are:
- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
probing. (A minimal sketch of the insertion logic follows this list.)
Interesting. Have you looked at cuckoo hashing?
Yes. I don't think it's a good fit for modern CPUs: because it
doesn't probe linearly, you constantly have random, uncached memory
accesses.
I've played with a few schemes, and they all seemed to be slower
than linear-probing variants, unless you intentionally used a
wrong hash distribution, or intentionally "unbalanced" the hash space
by iterating over either end of the keyspace and moving the entries
around.
OK, understood.
- Significant comment and correctness fixes, both in simplehash.h
itself, and in 0003/0004.
- A lot of other random performance improvements for the hashing code.
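For readers unfamiliar with the scheme, here is a minimal sketch of
robin-hood insertion into a linear-probing table (all names are
hypothetical; this is not the actual simplehash.h code):

#include <stdbool.h>
#include <stdint.h>

typedef struct RHEntry
{
    uint32_t    hash;
    bool        used;
    int         value;
} RHEntry;

/*
 * Insert into a power-of-2 sized table (mask = size - 1).  On a
 * collision the incoming entry evicts any resident that sits closer to
 * its optimal bucket ("steal from the rich"), which bounds the variance
 * of probe lengths.  The caller must guarantee at least one free slot
 * (load factor < 1), otherwise this loops forever.
 */
static void
rh_insert(RHEntry *table, uint32_t mask, RHEntry entry)
{
    uint32_t    pos = entry.hash & mask;
    uint32_t    dist = 0;       /* probe distance of the entry in hand */

    for (;;)
    {
        if (!table[pos].used)
        {
            entry.used = true;
            table[pos] = entry;
            return;
        }

        /* how far has the resident moved from its optimal bucket? */
        uint32_t    resident_dist = (pos - (table[pos].hash & mask)) & mask;

        if (resident_dist < dist)
        {
            /* resident is "richer": take its slot and keep moving it */
            RHEntry     tmp = table[pos];

            table[pos] = entry;
            entry = tmp;
            dist = resident_dist;
        }

        pos = (pos + 1) & mask;
        dist++;
    }
}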
Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them, all the results after the patches were better than
before.
Well, I have a rather bad experience with running benchmarks on laptops anyway
- a lot of noise due to power management, etc.
Well, with improvements of a factor of ~2x, that's not really a big detractor.
Using a new CPU makes some forms of analysis easier (better PMUs).
True.
For long-running tests I agree.
What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?
Worthwhile, although obviously the impact will be a lot smaller,
since they're not entirely bottlenecked on hash-aggs and bitmap
scans.
Sure, the improvement won't be as significant as on the simple queries,
but it's interesting IMHO.
I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).
Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?
Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking
about hashjoins, but now that I think of it, that's fairly
specialized and tuned code. In any case, let's not complicate the
discussion for now.
This patch series currently consists of four patches:
- 0001 - add unlikely()/likely() macros. Their use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them. (A sketch of the typical definitions
follows this list.)
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, which is e.g. used by hash aggregates
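For reference, the likely()/unlikely() macros are typically defined
along the following lines on GCC-compatible compilers (a sketch; the
exact definitions in 0001 may differ):

/*
 * __builtin_expect() tells the compiler which way a branch usually
 * goes, so the hot path can be laid out as the fall-through.
 */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* usage: annotate rarely-taken paths, e.g. */
/* if (unlikely(ptr == NULL)) elog(ERROR, "out of memory"); */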
While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.
So, is it the right time to do some benchmarking?
That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...
Hmmm ... not sure. If one of the points is to get rid of function calls
determined at runtime (which make it impossible for the compiler to
optimize the code), then I can think of three options:
(a) copy-paste the code for each place
(b) use some templating
(c) use JIT
I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better option in plain C.
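To illustrate option (b): the caller defines the entry type and the key
operations as macros and then includes the template header, which
expands into a fully typed implementation, so there are no runtime
function pointers left to defeat inlining. The parameter names below
are assumptions for illustration and may not match simplehash.h exactly:

typedef struct IntHashEntry
{
    int32       key;
    int32       value;
    char        status;     /* slot bookkeeping required by the template */
} IntHashEntry;

#define SH_PREFIX inthash
#define SH_ELEMENT_TYPE IntHashEntry
#define SH_KEY_TYPE int32
#define SH_KEY key
#define SH_HASH_KEY(tb, key) hash_uint32((uint32) (key))
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/* the include above would generate inthash_create(), inthash_insert(), ... */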
A few comments, after quickly skimming through the first two patches:
1) SH_CREATE does this:
/* round up size to the next power of 2, eases lookups */
if (size < 2)
size = 2;
else
size = sh_pow2_int(size);
I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?
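For reference, the usual bit-twiddling way to round a 32-bit value up
to the next power of two, combined with such a lower bound (a sketch;
not necessarily how sh_pow2_int is implemented):

#include <stdint.h>

/*
 * Smear the highest set bit into all lower positions, then add one.
 * The caller must ensure v >= 2 and that the result fits in 32 bits.
 */
static inline uint32_t
next_pow2(uint32_t v)
{
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

/* with the lower bound suggested above: */
/* size = (size < 256) ? 256 : next_pow2(size); */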
2) tb->resize_above
I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined
as a constant somewhere (not sure if it makes sense to make it part of
SH_TYPE, so that hash tables may use different load factors).
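Something along these lines, say (hypothetical names, just to
illustrate the suggestion):

#include <stdint.h>

/* a named constant instead of a bare 0.8 literal */
#define SH_FILLFACTOR 0.8

typedef struct ExampleHash
{
    uint32_t    size;               /* number of buckets, a power of 2 */
    uint32_t    members;            /* entries currently stored */
    uint32_t    resize_threshold;   /* grow once members reaches this */
} ExampleHash;

/* recompute the threshold whenever the table is (re)sized */
static inline void
example_update_threshold(ExampleHash *tb)
{
    tb->resize_threshold = (uint32_t) (tb->size * SH_FILLFACTOR);
}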
3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
size in bytes' or 'number of entries'.
4) SH_CONTAINS sounds more like a function checking whether a hash table
contains a key, not as a 'type of hash table entry'. What about
SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.
5) SH_RESIZE
Do we expect to shrink hash tables? If not, SH_RESIZE should probably
check that newsize > oldsize. If yes, it should check that there's
enough space for all entries (with the load factor).
It's also not entirely clear why it is guaranteed that there's always an
element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.
6) SH_STAT
This bit looks suspicious:
uint32 max_collisions = 0;
...
/* single contained element is not a collision */
curcoll--;
total_collisions += curcoll;
if (curcoll > max_collisions)
max_collisions = curcoll - 1;
Shouldn't the last line be just "max_collisions = curcoll"?
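That is, since the decrement above already excluded the first element,
presumably it should read:

total_collisions += curcoll;
if (curcoll > max_collisions)
    max_collisions = curcoll;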
7) Should the hash agg size estimation for the planner consider the fillfactor?
I think it should - we should account for the memory actually
allocated, even if there's free space in the hash table. After
all, that's what hashjoins do.
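As a sketch of what I mean (a hypothetical function, not actual
planner code):

#include <stddef.h>

/*
 * Charge for the buckets that get allocated, not just for the live
 * entries, mirroring what the hashjoin estimates do.
 */
static size_t
estimate_hashagg_tablesize(double ngroups, size_t entrysize,
                           double fillfactor)
{
    double      nbuckets = ngroups / fillfactor;

    return (size_t) (nbuckets * entrysize);
}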
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services