On Sun, Mar 1, 2026, at 22:12, Tom Lane wrote:
> Aside: you could argue that failing to consider stanullfrac is wrong,
> and maybe it is.  But the more I looked at this code the more
> convinced I got that it was only partially accounting for nulls
> anyway.  That seems like perhaps something to look into later.

How about adjusting estfract for the null fraction before clamping?

```diff
@@ -4461,20 +4473,27 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
    /*
     * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
     * number of buckets is less than the expected number of distinct values;
     * otherwise it is 1/ndistinct.
     */
    if (ndistinct > nbuckets)
        estfract = 1.0 / nbuckets;
    else
        estfract = 1.0 / ndistinct;

+   /*
+    * Adjust for null fraction.  NULL keys are not inserted into the hash
+    * table, but inner_path_rows in final_cost_hashjoin includes them, so we
+    * must discount estfract to compensate.
+    */
+   estfract *= (1.0 - stanullfrac);
+
    /*
     * Clamp the bucketsize fraction to be not less than the MCV frequency,
     * since whichever bucket the MCV values end up in will have at least that
     * size.  This has no effect if *mcv_freq is still zero.
     */
    estfract = Max(estfract, *mcv_freq);

    *bucketsize_frac = (Selectivity) estfract;

    ReleaseVariableStats(vardata);
```

Here is an extreme example where 98% of all hj_nullfrac_inner.val are NULL:

CREATE TABLE hj_nullfrac_inner (id int, val int);
INSERT INTO hj_nullfrac_inner
  SELECT g, CASE WHEN g <= 1000 THEN g ELSE NULL END
  FROM generate_series(1, 50000) g;
CREATE INDEX ON hj_nullfrac_inner(val);

CREATE TABLE hj_nullfrac_outer (id int, val int);
INSERT INTO hj_nullfrac_outer
  SELECT g, ((g-1) % 1000)+1
  FROM generate_series(1, 500000) g;

ANALYZE hj_nullfrac_inner;
ANALYZE hj_nullfrac_outer;

EXPLAIN ANALYZE
SELECT * FROM hj_nullfrac_outer o JOIN hj_nullfrac_inner i ON o.val = i.val;

Both master (HEAD) and v2 result in the same plan:

                                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.30..20022.95 rows=499834 width=16) (actual time=0.016..289.314 rows=500000.00 loops=1)
   Buffers: shared hit=5212 read=1
   ->  Seq Scan on hj_nullfrac_outer o  (cost=0.00..7213.00 rows=500000 width=8) (actual time=0.008..31.449 rows=500000.00 loops=1)
         Buffers: shared hit=2213
   ->  Memoize  (cost=0.30..0.32 rows=1 width=8) (actual time=0.000..0.000 rows=1.00 loops=500000)
         Cache Key: o.val
         Cache Mode: logical
         Estimates: capacity=1000 distinct keys=1000 lookups=500000 hit percent=99.80%
         Hits: 499000  Misses: 1000  Evictions: 0  Overflows: 0  Memory Usage: 106kB
         Buffers: shared hit=2999 read=1
         ->  Index Scan using hj_nullfrac_inner_val_idx on hj_nullfrac_inner i  (cost=0.29..0.31 rows=1 width=8) (actual time=0.002..0.002 rows=1.00 loops=1000)
               Index Cond: (val = o.val)
               Index Searches: 1000
               Buffers: shared hit=2999 read=1
 Planning:
   Buffers: shared hit=87 read=7
 Planning Time: 0.419 ms
 Execution Time: 303.107 ms
(18 rows)

With v2+stanullfrac adjustment, we get a Hash Join, which is faster:

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1347.00..15436.61 rows=500161 width=16) (actual time=6.054..153.595 rows=500000.00 loops=1)
   Hash Cond: (o.val = i.val)
   Buffers: shared hit=2435
   ->  Seq Scan on hj_nullfrac_outer o  (cost=0.00..7213.00 rows=500000 width=8) (actual time=0.008..31.597 rows=500000.00 loops=1)
         Buffers: shared hit=2213
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=6.041..6.042 rows=1000.00 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 552kB
         Buffers: shared hit=222
         ->  Seq Scan on hj_nullfrac_inner i  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.004..3.016 rows=50000.00 loops=1)
               Buffers: shared hit=222
               Buffers: shared hit=222
 Planning:
   Buffers: shared hit=6
 Planning Time: 0.130 ms
 Execution Time: 167.903 ms
(14 rows)

Here is elog debugging comparing v2 vs v2+stanullfrac adjustment, for
the above example:

v2:
ndistinct=1003.000000
nbuckets=65536.000000
stanullfrac=0.979933
mcv_freq=0.000020
estfract before clamping=0.000997
estfract after clamping=0.000997

v2+stanullfrac adjustment:
ndistinct=972.000000
nbuckets=65536.000000
stanullfrac=0.980567
mcv_freq=0.000020
estfract before stanullfrac adjustment=0.001029
estfract after stanullfrac adjustment=0.000020
estfract after clamping=0.000020
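The logged numbers can be reproduced with a minimal Python sketch of the estimate logic (not the PostgreSQL source, just the arithmetic from the patched code path):

```python
# Sketch of the bucketsize-fraction estimate in
# estimate_hash_bucket_stats, with the proposed stanullfrac
# adjustment applied before the MCV clamp.

def bucketsize_frac(ndistinct, nbuckets, stanullfrac, mcv_freq):
    # 1/nbuckets while buckets are scarcer than distinct values,
    # otherwise 1/ndistinct
    if ndistinct > nbuckets:
        estfract = 1.0 / nbuckets
    else:
        estfract = 1.0 / ndistinct

    # proposed adjustment: NULL keys never enter the hash table,
    # but the inner row count used by final_cost_hashjoin includes
    # them, so discount estfract accordingly
    estfract *= 1.0 - stanullfrac

    # clamp to the MCV frequency
    return max(estfract, mcv_freq)

# v2+adjustment numbers from the elog output above:
# 1/972 = 0.001029, times (1 - 0.980567) gives ~0.000020,
# which the MCV clamp then holds at 0.000020
print(bucketsize_frac(972, 65536, 0.980567, 0.000020))
```

With stanullfrac = 0, the adjustment is a no-op and the function reduces to the existing v2 behavior (estfract = 1/ndistinct = 0.000997 for the plain-v2 numbers above).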

/Joel

Attachment: fix-stanullfrac.patch