Hi,

On 16.2.2015 00:50, Andrew Gierth wrote:
>>>>>> "Tom" == Tom Lane <t...@sss.pgh.pa.us> writes:
>
> I've now tried the attached patch to correct the bucketsize
> estimates, and it does indeed stop the planner from considering the
> offending path (in this case it just does the join the other way
> round).
> 
> One thing I couldn't immediately see how to do was account for the
> case where there are a lot of nulls in the table but a strict qual
> (or an IS NOT NULL) filters them out; this patch will be overly
> pessimistic about such cases. Do estimates normally try and take
> things like this into account? I didn't find any other relevant
> examples.

Improving the estimates is always good, but it's not going to fix the
same problem with non-NULL values (it shouldn't be all that difficult to
construct such an example using a value whose hash starts with a bunch
of zeroes).

>  Tom> There may also be something we can do in the executor, but it
>  Tom> would take closer analysis to figure out what's going wrong.  I
>  Tom> don't think kluging the behavior for NULL in particular is the
>  Tom> answer.
> 
> The point with nulls is that a hash value of 0 is currently special
> in two distinct ways: it's always in batch 0 and bucket 0 regardless
> of how many batches and buckets there are, and it's the result of
> hashing a null. These two special cases interact in a worst-case
> manner, so it seems worthwhile to avoid that.

I think you're right that this is a flaw in general - all it takes is a
sufficiently common value whose hash falls into the first batch (i.e.
either 0, or starting with enough 0 bits to give batchno == 0).
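
For reference, the mapping from hash value to bucket/batch looks roughly
like this (a simplified, self-contained sketch of ExecHashGetBucketAndBatch
from nodeHash.c, with the hashtable fields turned into parameters) - it
shows why a hash value of 0 lands in bucket 0 of batch 0 no matter how many
times nbuckets or nbatch get doubled:

    #include <stdint.h>

    /*
     * Simplified sketch of ExecHashGetBucketAndBatch (nodeHash.c).
     * nbuckets and nbatch are powers of two, so modulo is a mask and
     * division is a shift. A hashvalue of 0 - or any value with enough
     * zero bits in these positions - always yields batchno == 0.
     */
    static void
    get_bucket_and_batch(uint32_t hashvalue,
                         int nbuckets, int log2_nbuckets, int nbatch,
                         int *bucketno, int *batchno)
    {
        *bucketno = hashvalue & (nbuckets - 1);

        if (nbatch > 1)
            *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
        else
            *batchno = 0;
    }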

I think this might be solved by relaxing the check a bit. Currently we
do this (see nodeHash.c:735):

    if (nfreed == 0 || nfreed == ninmemory)
    {
        hashtable->growEnabled = false;
    }

which only disables nbatch growth when *all* the tuples stay in the
batch (or all of them get moved out). That's rather strict - a single
tuple falling on the other side is enough to keep the growth enabled.

With each nbatch increase we expect roughly a 50:50 split, i.e. half the
tuples staying in the batch and half moving to the new one. A split much
more lopsided than that is suspicious and most likely means a lot of
tuples share the same hash value, so what if we did something like this:

    if ((nfreed >= 0.9 * ninmemory) ||
        (nfreed <= 0.1 * ninmemory))
    {
        hashtable->growEnabled = false;
    }

which disables nbatch growth if either >= 90% of the tuples stayed in
the batch being split, or >= 90% of them were moved out of it. The exact
thresholds might be set a bit differently, but I think 10:90 sounds
about right.
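
To make the thresholds concrete, here's a tiny standalone sketch (not
PostgreSQL code - the helper name and the tuple counts are made up) of how
the relaxed condition behaves for a few splits, assuming I read
ExecHashIncreaseNumBatches correctly in that ninmemory counts all tuples
scanned and nfreed those moved out:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * The condition from the patch, factored out so it's easy to play
     * with: ninmemory = tuples scanned in the batch, nfreed = tuples
     * moved out to other batch files.
     */
    static bool
    split_disables_growth(long nfreed, long ninmemory)
    {
        return (nfreed >= 0.9 * ninmemory) || (nfreed <= 0.1 * ninmemory);
    }

    int
    main(void)
    {
        /* 50:50 split - keep growing nbatch */
        printf("%d\n", split_disables_growth(50000, 100000));  /* 0 */

        /* 95% of the tuples moved out - lopsided, stop growing */
        printf("%d\n", split_disables_growth(95000, 100000));  /* 1 */

        /* a single tuple moved out - stop growing; the current
         * (nfreed == 0 || nfreed == ninmemory) check would keep going */
        printf("%d\n", split_disables_growth(1, 100000));      /* 1 */

        return 0;
    }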

A trivial patch implementing this is attached - with it the EXPLAIN
ANALYZE looks like this:

test=# explain analyze select status, count(*) from q3 left join m3 on
(m3.id=q3.id) group by status;

                               QUERY PLAN
----------------------------------------------------------------------
 HashAggregate  (cost=64717.63..64717.67 rows=4 width=4) (actual
time=514.619..514.630 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=18100.00..62217.63 rows=500000 width=4)
                    (actual time=75.260..467.911 rows=500108 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..18334.00 rows=1000000 width=37)
                        (actual time=0.003..91.799 rows=1000000 loops=1)
         ->  Hash  (cost=7943.00..7943.00 rows=500000 width=33)
                   (actual time=74.916..74.916 rows=500000 loops=1)
               Buckets: 65536 (originally 65536)
               Batches: 32 (originally 16)  Memory Usage: 8824kB
               ->  Seq Scan on q3
                   (cost=0.00..7943.00 rows=500000 width=33)
                   (actual time=0.005..27.395 rows=500000 loops=1)
 Planning time: 0.172 ms
 Execution time: 514.663 ms
(10 rows)

Without the patch this runs in ~240 seconds and the number of batches
explodes to ~131k.

In theory it might happen that there are just a few distinct hash
values, all of them identical in the first N bits (thus falling into the
same batch) but differing in bit N+1, so the next split would actually
help. But I think we can leave that up to probability, just like the
rest of the hashing code.


-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index abd70b3..6b3f471 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -732,7 +732,8 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * group any more finely. We have to just gut it out and hope the server
 	 * has enough RAM.
 	 */
-	if (nfreed == 0 || nfreed == ninmemory)
+    if ((nfreed >= 0.9 * ninmemory) ||
+        (nfreed <= 0.1 * ninmemory))
 	{
 		hashtable->growEnabled = false;
 #ifdef HJDEBUG