Please find the patch attached for your review.

Thanks & Regards,

Ravi Revathy

Member Technical Staff

ZOHO Corporation

---- On Wed, 27 Nov 2024 18:41:13 +0530 Ravi <revath...@zohocorp.com> wrote ---



Hi Developers,

     Currently, PostgreSQL relies on table statistics, extracted within the
examine_simple_variable function, to estimate join selectivity. However, for
subqueries that contain a GROUP BY clause, even one with a single grouping
column whose output rows are necessarily distinct, the planner falls back to
the default assumption of 200 distinct rows. This leads to inaccurate
cardinality estimates and, potentially, suboptimal join plans.
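
For context, my reading of the current behaviour (paraphrased from
src/backend/utils/adt/selfuncs.c, so please treat the snippets below as
approximate rather than a quote of the sources or of the attached patch) is
that examine_simple_variable() punts as soon as the subquery has a GROUP BY,
so no statistics tuple and no uniqueness flag get set, and
get_variable_numdistinct() eventually lands on the DEFAULT_NUM_DISTINCT
fallback:

/* examine_simple_variable(): grouped subqueries are punted, roughly: */
if (subquery->setOperations ||
    subquery->groupClause ||
    subquery->groupingSets)
    return;                     /* no stats; vardata->isunique stays false */

/*
 * get_variable_numdistinct(): with no statistics, no uniqueness information,
 * and a large (or unknown) input row count, the estimate ends up as the
 * default:
 */
*isdefault = true;
return DEFAULT_NUM_DISTINCT;    /* 200 */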



Problem Example:



Consider the following query:



explain select * from t1 left join (select a, max(b) from t2 group by a) t2 on 
t1.a = t2.a;

The resulting plan predicts a high cardinality for the join, and places the 
larger dataset on the hash side:

                                     QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=943037.92..955323.45 rows=6963818 width=16)
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1  (cost=0.00..289.00 rows=20000 width=8)
   ->  Hash  (cost=893538.50..893538.50 rows=3017074 width=8)
         ->  HashAggregate  (cost=777429.49..893538.50 rows=3017074 width=8)
               Group Key: t2.a
               Planned Partitions: 64
               ->  Seq Scan on t2  (cost=0.00..158673.98 rows=11000098 width=8)
(8 rows)

Here, the join cardinality is overestimated, and t2, the larger input, is
placed on the hash side even though t1 has far fewer rows.




Proposed Solution:

In subqueries with a GROUP BY clause that has a single grouping column, it is 
reasonable to assume the result set contains unique values for that column.

Under this assumption, we can treat the output of the aggregate node as unique
and, instead of falling back to the default distinct-row count (200), derive
the estimate from the HashAggregate node's row count.
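
Roughly, the idea looks like the following inside examine_simple_variable()
(a simplified sketch to illustrate the approach; the attached patch is the
authoritative version and may differ in detail):

/* Keep punting on the genuinely hard cases. */
if (subquery->setOperations || subquery->groupingSets)
    return;

if (subquery->groupClause != NIL)
{
    SortGroupClause *sgc;
    TargetEntry *ste;

    /* Only handle a single-column GROUP BY. */
    if (list_length(subquery->groupClause) != 1)
        return;

    sgc = (SortGroupClause *) linitial(subquery->groupClause);
    ste = get_tle_by_resno(subquery->targetList, var->varattno);

    /* The Var must be the grouping column itself, not an aggregated value. */
    if (ste == NULL || ste->ressortgroupref != sgc->tleSortGroupRef)
        return;

    /*
     * The grouping column is distinct in the subquery's output, so mark it
     * unique; get_variable_numdistinct() can then scale the estimate with
     * the aggregate's estimated row count instead of DEFAULT_NUM_DISTINCT.
     */
    vardata->isunique = true;
}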



Execution plan with the patch applied:

                                  QUERY PLAN
--------------------------------------------------------------------------
 Hash Join  (cost=777968.49..935762.27 rows=20000 width=16)
   Hash Cond: (t2.a = t1.a)
   ->  HashAggregate  (cost=777429.49..893538.50 rows=3017074 width=8)
         Group Key: t2.a
         Planned Partitions: 64
         ->  Seq Scan on t2  (cost=0.00..158673.98 rows=11000098 width=8)
   ->  Hash  (cost=289.00..289.00 rows=20000 width=8)
         ->  Seq Scan on t1  (cost=0.00..289.00 rows=20000 width=8)
(8 rows)

Could you confirm whether my assumption, that the distinct-row property of a
GROUP BY with a single grouping column can be leveraged to improve join
cardinality estimation, is valid? If not, I would appreciate suggestions or
corrections regarding this approach.



Thanks & Regards,

Ravi Revathy

Member Technical Staff

ZOHO Corporation

Attachment: join_cardinality.patch