On 2017-03-01 10:39:11 +0530, Kuntal Ghosh wrote:
> On Wed, Mar 1, 2017 at 9:33 AM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> 
> wrote:
> > On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <and...@anarazel.de> wrote:
> >> That's without the patch in
> >> http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
> >> ? With that patch it should complete without that change, right?
> >>
> > No, it's with the patch in the above link which is
> > 0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But, I've
> > kept the SH_FILLFACTOR to 0.8 and SH_MAX_FILLFACTOR to 0.95. I'll do
> > another round of testing with the constants provided by you.
> >
> I've tested with the patch
> 0001-Resize-simplehash.h-table-in-case-of-long-runs.patch and the
> results are same, i.e., hash table grows even when it is only 10-12%
> filled. I've attached the logfile for reference.

So, I observed similar things, where the *leader backend* reports a lot
of "long runs".

WARNING:  clumping, growing this thing
LOG:  size: 1048576, members: 158071, filled: 0.150748, total chain: 95622, max 
chain: 35, avg chain: 0.604931, total_collisions: 35909, max_collisions: 5, 
avg_collisions: 0.227170
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING:  clumping, growing this thing
LOG:  size: 2097152, members: 238992, filled: 0.113960, total chain: 1850141, 
max chain: 422, avg chain: 7.741435, total_collisions: 55363, max_collisions: 
5, avg_collisions: 0.231652
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING:  clumping, growing this thing
LOG:  size: 4194304, members: 881089, filled: 0.210068, total chain: 567907, 
max chain: 29, avg chain: 0.644551, total_collisions: 174681, max_collisions: 
6, avg_collisions: 0.198256
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING:  clumping, growing this thing
LOG:  size: 8388608, members: 5397649, filled: 0.643450, total chain: 5752690, 
max chain: 22, avg chain: 1.065777, total_collisions: 1437957, max_collisions: 
8, avg_collisions: 0.266404
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;

The pertinent part is that there's "forced" resizes due to long runs at
0.15%, 0.11%, 0.21%, 0.64%.  Note that there's a good number of
"ordinary" resizes, and at the end there's 7500000 tuples in the
hashtable.  I.e. there's a number of resizes after the last one due to
clumping, and all of those are entirely guided through fillfactor; the
final hash-table is entirely reasonably sized.

Trying to do the math, that just didn't sit well with me. There's just
no way there could be that many and such long runs, without something
broken.  And indeed:

Thinking about this even more I realized the reason this happens is that
b81b5a96f4 didn't actually address the problem of inserting in
hash-table order.   To understand, let's first look at the query plan:

Finalize HashAggregate  (cost=899227.75..903176.55 rows=394880 width=4) (actual 
time=6255.957..7415.152 rows=7500000 loops=1)
  Group Key: l_orderkey
  ->  Gather  (cost=652427.75..893304.55 rows=2369280 width=4) (actual 
time=1741.706..3699.248 rows=7940881 loops=1)
        Workers Planned: 6
        Workers Launched: 6
        ->  Partial HashAggregate  (cost=651427.75..655376.55 rows=394880 
width=4) (actual time=1731.468..1956.693 rows=11344
              Group Key: l_orderkey
              ->  Parallel Seq Scan on lineitem  (cost=0.00..638927.80 
rows=4999980 width=4) (actual time=0.025..711.771 rows
Planning time: 0.192 ms
Execution time: 7700.027 ms

The important part is the estimated rows=394880 and actual
rows=7500000. That means that the hash-table on the leader backend will
first be created suitably for 394880 rows.  But we obviously get a lot

The problem then is that even after introducing the hash_iv stuff
(b81b5a96f), we essentially still iterate over the tuples in
hash-order.  TupleHashTableHash() computes the hash for a single-column
group condition as:

    uint32 hashkey = hashtable->hash_iv;

    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
    attr = slot_getattr(slot, att, &isNull);
    if (!isNull)                        /* treat nulls as having hash key 0 */
        uint32 hkey;

        hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], attr));
        hashkey ^= hkey;

the resulting hash-values aren't actually meaningfully influenced by the
IV. Because we just xor with the IV, most hash-value that without the IV
would have fallen into a single hash-bucket, fall into a single
hash-bucket afterwards as well; just somewhere else in the hash-range.

Which then leads to the issue that we insert hundreds of rows into the
leader's hash-value that fall into the same hash-bucket (as the leader's
hashtable is quite small at this point, the hash-buckets on the worker
tables have *far* fewer entries per bucket / are far smaller on the
workers than the leaders).  Which then leads to super-long runs.

As soon as the table gets to it "real" size, the issue of the overlong
runs is gone, because now the hash-table has the appropriate size for
its contents.

So that explains why there's a there's a lot of resizes when the
fillfactor is low (as you've observed seen in the 10-20% range) - in
specific hash-ranges the fillfactor is extremely high, but in other
parts the table is effectively empty.

The question remains what to do about that... I think it's pretty clear
that we need a "defense" against such unbalanced hashtables - now that I
understand how we can get into the situation of these being so
unbalanced I'm a lot less uncomfortable adding the necessary logic.

In addition to that it seems quite worthwhile to provide an iterator
that's not vulnerable to this.  An approach that I am, seemingly
successfully, testing is to iterate the hashtable in multiple (in my
case 23, because why not) passes, accessing only every nth element. That
allows the data to be inserted in a lot less "dense" fashion.  But
that's more an optimization, so I'll just push something like the patch
mentioned in the thread already.

Makes some sense?

- Andres

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to