Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

Quan Zongliang Fri, 16 Jun 2023 02:25:39 -0700


We have a small table with only 23 rows and 21 distinct values.


The resulting MCV list and histogram are as follows:
stanumbers1 | {0.08695652,0.08695652}
stavalues1  | {v1,v2}

stavalues2  | {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
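
As a quick check on these numbers, here is a minimal sketch in C. It assumes the sample covers all 23 rows, which is expected for a table this small but is not stated above. The helper names are hypothetical and are not part of analyze.c.

```c
#include <assert.h>

/*
 * Fraction of sampled rows that hold one particular value occurring
 * `count` times. Hypothetical helper for illustration only.
 */
double
value_frequency(int count, int samplerows)
{
    assert(samplerows > 0 && count >= 0 && count <= samplerows);
    return (double) count / (double) samplerows;
}

/*
 * Rows left over for values outside the MCV list, when every MCV entry
 * covers rows_per_mcv rows. Hypothetical helper for illustration only.
 */
int
rows_outside_mcv(int samplerows, int mcv_entries, int rows_per_mcv)
{
    return samplerows - mcv_entries * rows_per_mcv;
}
```

With 23 rows, value_frequency(2, 23) is about 0.0869565, matching the two stanumbers1 entries, so v1 and v2 each occur twice. rows_outside_mcv(23, 2, 2) gives 19, which matches the 19 histogram values v3..v21 occurring once each, at a frequency of 1/23 (about 0.0434783) apiece.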

An incorrect number of rows was estimated when a HashJoin was done with a large table (about 2 million rows).

Hash Join (cost=1.52..92414.61 rows=2035023 width=0) (actual time=1.943..1528.983 rows=3902 loops=1)

The reason is that the MCV list of the small table excludes values that occur in only one row. Putting them in the MCV list in the statistics gives the correct result.

Using the conservative condition samplerows <= attstattarget doesn't completely solve this problem, but it does solve this case.
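
To illustrate the effect of the condition, here is a simplified model in C of the duplicate-counting step. It is only a sketch, not the real compute_scalar_stats() code: the function names and the integer encoding of v1..v21 are hypothetical, and the real code also replaces lower-count entries in its tracking list, which is omitted here. The value num_mcv = 100 used in the note below assumes the default statistics target.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill out[0..22] with a sorted sample shaped like the table above:
 * values 1 and 2 (standing in for v1, v2) appear twice, 3..21 once.
 */
void
fill_example_sample(int out[23])
{
    int n = 0;

    for (int v = 1; v <= 21; v++)
    {
        out[n++] = v;
        if (v <= 2)
            out[n++] = v;       /* second copy of v1 and v2 */
    }
}

/*
 * Count the distinct values that would be tracked as MCV candidates.
 * patched == 0 models the test "dups_cnt > 1";
 * patched != 0 models "dups_cnt > 1 || samplerows <= num_mcv".
 */
int
count_mcv_candidates(const int *sorted, size_t samplerows,
                     int num_mcv, int patched)
{
    int    candidates = 0;
    size_t i = 0;

    assert(sorted != NULL || samplerows == 0);

    while (i < samplerows)
    {
        size_t dups_cnt = 1;

        /* Walk to the end of the run of equal values. */
        while (i + dups_cnt < samplerows && sorted[i + dups_cnt] == sorted[i])
            dups_cnt++;

        if (dups_cnt > 1 || (patched && samplerows <= (size_t) num_mcv))
        {
            if (candidates < num_mcv)
                candidates++;
        }
        i += dups_cnt;
    }
    return candidates;
}
```

For the 23-row sample with num_mcv = 100, the unpatched test tracks only 2 values (v1 and v2), while the patched test tracks all 21, so every value can get its own MCV entry and no histogram is needed.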


After the modification, we get statistics without a histogram:
stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
stavalues1  | {v1,v2, ... }

And we have the right estimates:

Hash Join (cost=1.52..72100.69 rows=3631 width=0) (actual time=1.447..1268.385 rows=3902 loops=1)



Regards,

--
Quan Zongliang
Beijing Vastdata Co., LTD

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 52ef462dba..08ea4243f5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2518,7 +2518,7 @@ compute_scalar_stats(VacAttrStatsP stats,
                        {
                                /* Reached end of duplicates of this value */
                                ndistinct++;
-                               if (dups_cnt > 1)
+                               if (dups_cnt > 1 || samplerows <= num_mcv)
                                {
                                        nmultiple++;
                                        if (track_cnt < num_mcv ||
