On 10/01/2016 09:59 PM, Andres Freund wrote:
Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
On 10/01/2016 02:44 AM, Andres Freund wrote:
Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
In the attached patch I've added simplehash.h, which can be
customized by a bunch of macros before being inlined.  There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

Attached is a significantly improved version of this series.  The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
  commit message of 0002 (or simplehash.h) for an introduction. That
  significantly reduces the impact of "clustering" due to linear
  probing.

Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs: because it
doesn't probe linearly, you constantly incur random, uncached memory
accesses.

I've played with a few schemes, and they all seemed to be slower
than linear-probing variants, unless you intentionally used a
wrong hash distribution, or intentionally "unbalanced" the hash space
by iterating over either end of the keyspace and moving the entries
around.
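
For anyone not familiar with the scheme, a minimal sketch of robin-hood
insertion over a linearly probed table looks roughly like this
(illustrative types and names only, not the patch's actual code):

    #include <stdint.h>

    typedef struct
    {
        uint32_t hash;   /* cached hash of the key */
        uint32_t key;
        uint32_t value;
        char     used;   /* zeroed table means all slots empty */
    } Entry;

    /* how far a slot is from the entry's optimal, hash-derived position;
     * the table size is a power of two, so wrap-around is just a mask */
    static inline uint32_t
    displacement(uint32_t hash, uint32_t slot, uint32_t mask)
    {
        return (slot - hash) & mask;
    }

    /* insert into a table of (mask + 1) slots; assumes the table is
     * never full and that the key is not already present */
    static void
    rh_insert(Entry *table, uint32_t mask, Entry incoming)
    {
        uint32_t slot = incoming.hash & mask;

        for (;;)
        {
            Entry *cur = &table[slot];

            if (!cur->used)
            {
                *cur = incoming;        /* empty slot: done */
                return;
            }

            /*
             * Robin-hood step: steal the slot from a resident entry that
             * is closer to its optimal position, then keep probing with
             * the displaced entry.  This bounds the variance of probe
             * lengths.
             */
            if (displacement(cur->hash, slot, mask) <
                displacement(incoming.hash, slot, mask))
            {
                Entry tmp = *cur;
                *cur = incoming;
                incoming = tmp;
            }

            slot = (slot + 1) & mask;   /* linear probing */
        }
    }

The linear probe is what keeps the memory accesses sequential and
cache-friendly, in contrast to cuckoo hashing's independent random
probes.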

OK, understood.


- Significant comment and correctness fixes, both in simplehash.h itself
  and in 0003/0004.
- A lot of other random performance improvements to the hashing code.


Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them, all the results after the patches were better than
before.


Well, I have a rather bad experience with running benchmarks on laptops
anyway - there's a lot of noise due to power management, etc.

Well, with ~2x improvements that's not really a big detractor.
Using a newer CPU makes some forms of analysis easier (better PMUs).


True.

For long-running tests, I agree.


What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?

Worthwhile, although obviously the impact will be a lot smaller,
since those queries aren't bottlenecked solely on hash-aggs and
bitmap scans.


Sure, the improvement won't be as significant as on the simple queries, but it's interesting IMHO.


I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key.
Although that's probably not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges
duplicates into a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users store each key only once, and either ignore
later duplicates, or fold their data into the group's initial tuple.  I
can't think of any case where repeated keys would cause a problem for a
hash table.


Yep, that's pretty much what I suggested. I was wondering more about
the other places where this hash table might be used - I've been
thinking about hashjoins, but now that I think of it, that's fairly
specialized and heavily tuned code. In any case, let's not complicate
the discussion for now.
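
To make the "duplicates merge into a single group" behavior concrete,
here's a tiny self-contained sketch of the lookup-or-create pattern
(toy code, not execGrouping.c's actual API):

    #include <stdio.h>
    #include <string.h>

    #define NSLOTS 16           /* toy fixed-size table; real code grows */

    typedef struct { char key[8]; int count; char used; } Group;

    /* return the group for key, creating it if absent; sets *found */
    static Group *
    lookup_or_create(Group *tab, const char *key, int *found)
    {
        unsigned h = 0;

        for (const char *p = key; *p; p++)
            h = h * 31 + (unsigned char) *p;

        for (unsigned i = h % NSLOTS;; i = (i + 1) % NSLOTS)
        {
            if (!tab[i].used)
            {
                tab[i].used = 1;
                strncpy(tab[i].key, key, sizeof(tab[i].key) - 1);
                *found = 0;
                return &tab[i];
            }
            if (strcmp(tab[i].key, key) == 0)
            {
                *found = 1;
                return &tab[i];
            }
        }
    }

    int
    main(void)
    {
        Group       tab[NSLOTS] = {0};
        const char *input[] = {"a", "b", "a", "a", "b"};
        int         found;

        for (int i = 0; i < 5; i++)
        {
            Group *g = lookup_or_create(tab, input[i], &found);

            /* repeated keys don't add entries; they update the group */
            g->count = found ? g->count + 1 : 1;
        }
        printf("a=%d b=%d\n",       /* prints a=3 b=2 */
               lookup_or_create(tab, "a", &found)->count,
               lookup_or_create(tab, "b", &found)->count);
        return 0;
    }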


This patch series currently consists of four patches:
- 0001 - add unlikely()/likely() macros (see the sketch after this
  list). The use here is not strictly mandatory, but they do provide a
  quite consistent improvement. There are other threads where they
  proved to be beneficial, so I see little reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
  for the issue mentioned in [1], to improve performance in heavily lossy
  scenarios (otherwise we could regress in some extreme cases).
- 0004 - convert execGrouping.c, which is used by hash aggregates among
  other things.
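
As referenced above, such hint macros are conventionally defined along
these lines - a sketch only, with a hypothetical lookup_or_fail() as the
usage example; the patch's exact definitions may differ:

    /* Branch-prediction hints.  On GCC and Clang they map to
     * __builtin_expect; elsewhere they're no-ops. */
    #if defined(__GNUC__)
    #define likely(x)   __builtin_expect((x) != 0, 1)
    #define unlikely(x) __builtin_expect((x) != 0, 0)
    #else
    #define likely(x)   ((x) != 0)
    #define unlikely(x) ((x) != 0)
    #endif

    /* typical use: tell the compiler which branch is the common case,
     * so it lays the hot path out as straight-line code */
    static int
    lookup_or_fail(const int *entry)
    {
        if (unlikely(entry == NULL))
            return -1;              /* rarely-taken error path */
        return likely(*entry > 0) ? *entry : 0;
    }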


While not quite perfect yet, I do think this is at a state where input
is needed.  I don't want to go a lot deeper into this rabbit hole
before we have some agreement that this is a sensible course of action.


So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...


Hmmm ... not sure. If one of the goals is to get rid of function calls
determined at runtime (which prevent the compiler from optimizing the
code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better way to do this in C.
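
For the archives, option (b) in C usually means token pasting: a header
that, parameterized by a few macros, expands into a type-specialized
implementation each time it is included. A minimal sketch of the
pattern follows - illustrative names only, not the actual simplehash.h
macro interface:

    /* ---- template.h: a macro-templated dynamic array ----------------
     * The includer must define TPL_TYPE (element type) and TPL_NAME
     * (prefix for the generated type and functions).  Error handling
     * is omitted for brevity. */
    #include <stdlib.h>

    #define TPL_CAT_(a, b) a##b
    #define TPL_CAT(a, b)  TPL_CAT_(a, b)
    #define TPL_FN(name)   TPL_CAT(TPL_NAME, TPL_CAT(_, name))

    typedef struct { TPL_TYPE *data; int len; int cap; } TPL_NAME;

    static inline void
    TPL_FN(push)(TPL_NAME *v, TPL_TYPE x)
    {
        if (v->len == v->cap)
        {
            v->cap = v->cap ? v->cap * 2 : 8;
            v->data = realloc(v->data, v->cap * sizeof(TPL_TYPE));
        }
        /* everything is visible to the compiler: no function pointers,
         * so calls can be inlined and specialized per element type */
        v->data[v->len++] = x;
    }

    #undef TPL_TYPE
    #undef TPL_NAME
    /* ---- end of template.h ------------------------------------- */

    /* instantiation, in some .c file: */
    #define TPL_TYPE int
    #define TPL_NAME intvec
    #include "template.h"       /* generates intvec and intvec_push() */

    #define TPL_TYPE double
    #define TPL_NAME dblvec
    #include "template.h"       /* generates dblvec and dblvec_push() */

The patch's header presumably works the same way, just with more
parameters (e.g. SH_KEYTYPE and SH_CONTAINS, discussed below) and a
hash table instead of an array.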

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

        /* round up size to the next power of 2, eases lookups */
        if (size < 2)
                size = 2;
        else
                size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is useful. What about
using some reasonable lower bound - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?
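
(For reference, I assume sh_pow2_int is the usual bit-twiddling
round-up, something like the sketch below; the power-of-two size is
what lets lookups use a mask instead of a modulo.)

    #include <stdint.h>

    /* round v up to the next power of two, assuming v >= 1; powers of
     * two map to themselves (a guess at sh_pow2_int's semantics, not
     * the patch's code) */
    static uint64_t
    pow2_roundup(uint64_t v)
    {
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        v |= v >> 32;
        return v + 1;
    }

    /* with a power-of-two size, bucket selection is just a mask:
     * slot = hash & (size - 1); */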

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as a constant somewhere (not sure if it makes sense to make it part of SH_TYPE, so that hash tables may use different load factors).
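
That is, something along these lines (a sketch of the suggestion, not
the patch's code):

    #include <stdint.h>

    /* name the load factor instead of hard-coding 0.8 at the use site */
    #define SH_FILLFACTOR 0.8

    typedef struct
    {
        uint32_t size;              /* number of slots, a power of two */
        uint32_t members;           /* number of entries stored */
        uint32_t resize_threshold;  /* grow once members reaches this */
    } HashTableHeader;

    static void
    update_resize_threshold(HashTableHeader *tb)
    {
        tb->resize_threshold = (uint32_t) (tb->size * SH_FILLFACTOR);
    }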

3) 'size' is a rather poor name for a parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it means 'element size in bytes',
'total size in bytes' or 'number of entries'.

4) SH_CONTAINS sounds more like a function checking whether a hash
table contains a key than like a 'type of hash table entry'. What about
SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE, I guess.

5) SH_RESIZE

Do we expect to shrink hash tables? If not, SH_RESIZE should probably
check that newsize > oldsize. If it does, it should check that there's
enough space for all entries (given the load factor).

It's not entirely clear why it is guaranteed that there's always an
element at its optimal position (when the load factor is < 1). Adding
an explanation or a link to a paper would be helpful, I guess.

6) SH_STAT

This part looks a bit suspicious:

    uint32 max_collisions = 0;

    ...

    /* single contained element is not a collision */
    curcoll--;
    total_collisions += curcoll;
    if (curcoll > max_collisions)
        max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?
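
If so, the block would presumably end up as:

    /* single contained element is not a collision */
    curcoll--;
    total_collisions += curcoll;
    if (curcoll > max_collisions)
        max_collisions = curcoll;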

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for the fillfactor - that is, for the memory
actually allocated, even if some of it is still free space in the hash
table. After all, that's what hashjoins do.
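
To illustrate with made-up numbers (both the group count and the entry
size below are purely illustrative):

    #include <stdio.h>

    int
    main(void)
    {
        double ngroups = 1e6;       /* illustrative group count */
        double entrysize = 48;      /* illustrative bytes per slot */
        double fillfactor = 0.8;

        /* counting only the occupied entries underestimates the memory
         * actually allocated for the hash table's slot array */
        printf("entries only: %.0f MB\n", ngroups * entrysize / 1e6);
        printf("allocated:    %.0f MB\n",
               ngroups / fillfactor * entrysize / 1e6);
        return 0;
    }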

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

