On 2023/6/16 23:39, Tomas Vondra wrote:


On 6/16/23 11:25, Quan Zongliang wrote:

We have a small table with only 23 rows and 21 distinct values.

The resulting MCV and histogram are as follows:
stanumbers1 | {0.08695652,0.08695652}
stavalues1  | {v1,v2}
stavalues2  |
{v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}

The row count was estimated incorrectly when this table was hash-joined
with another large table (about 2 million rows).

Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
time=1.943..1528.983 rows=3902 loops=1)


That's interesting. I wonder how come the estimate gets this bad simply
by skipping values with just a single row in the sample, which means
we know the per-value selectivity pretty well.

I guess the explanation has to be something strange happening when
estimating the join condition selectivity, where we combine MCVs from
both sides of the join (which has to be happening here, otherwise it
would not matter what gets to the MCV).

It'd be interesting to know what's in the other MCV, and what are the
other statistics for the attributes (ndistinct etc.).

Or even better, a reproducer SQL script that builds two tables and then
joins them.

The other table is severely skewed: most of its rows have no match in the small table. That special case is what makes the cost estimate so far off.
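
Something along these lines should reproduce the shape of the problem
(table and column names are made up, and the data may need tweaking to
hit exactly the same estimates):

CREATE TABLE small_tbl (id int);
-- 23 rows, 21 distinct values: 1 and 2 appear twice, the rest once
INSERT INTO small_tbl
    SELECT i FROM generate_series(1, 21) i
    UNION ALL SELECT 1
    UNION ALL SELECT 2;

CREATE TABLE big_tbl (id int);
-- ~2 million rows, heavily skewed: almost all rows get a value (0)
-- that has no match in small_tbl, only a few thousand rows match
INSERT INTO big_tbl
    SELECT CASE WHEN i % 500 = 0 THEN (i % 21) + 1 ELSE 0 END
    FROM generate_series(1, 2000000) i;

ANALYZE small_tbl;
ANALYZE big_tbl;

EXPLAIN ANALYZE
SELECT count(*) FROM big_tbl b JOIN small_tbl s ON s.id = b.id;

pg_stats for the two join columns (n_distinct, most_common_vals,
most_common_freqs, histogram_bounds) then shows what ends up in each
MCV and the ndistinct estimates you asked about.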

The reason is that the small table's MCV excludes values that occur in
only one row. Putting them into the MCV gives the correct estimate.

Using the conservative condition samplerows <= attstattarget doesn't
solve the problem completely, but it does handle this case.

After the modification we get statistics without a histogram:
stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
stavalues1  | {v1,v2, ... }

And we have the right estimates:
Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
time=1.447..1268.385 rows=3902 loops=1)


I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)?

That would need to happen later, because we don't have the ndistinct
estimate yet at this point - we'd have to do the loop a bit later (or
likely twice).

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).
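
(If I'm reading compute_scalar_stats correctly, nmultiple is used to
derive f1, the number of sample values seen exactly once, and f1 feeds
the Haas-Stokes estimator n*d / (n - f1 + f1*n/N), where n is the
sample size, N the table size and d the number of distinct values in
the sample. So whatever we do here has to keep that counting intact.)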

It's not just small tables. If a column's values are nearly unique, the
same problem shows up, because we exclude values that occur only once in
the sample; samplerows <= num_mcv only covers one scenario.
Perhaps we should discard the (dups cnt > 1) restriction altogether?
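
For instance (again only a sketch, with made-up names), a nearly unique
column like this one will most likely end up with an empty or nearly
empty MCV list, because almost no value shows up more than once in the
sample:

CREATE TABLE nearly_unique (id int);
-- every value occurs once, except 1..5 which occur twice
INSERT INTO nearly_unique SELECT i FROM generate_series(1, 100000) i;
INSERT INTO nearly_unique SELECT i FROM generate_series(1, 5) i;
ANALYZE nearly_unique;

SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'nearly_unique' AND attname = 'id';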


regards



