Hi, 

      Thanks everyone for your suggestions. I would like to add the EXPLAIN
ANALYZE output of both plans so that we have the broader picture.



I have a work_mem of 1000 MB.
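
For anyone reproducing this, the setting was simply raised for the session,
along these lines (the value is the only thing specific to my setup):

    SHOW work_mem;            -- inspect the current value
    SET work_mem = '1000MB';  -- per-session override; resets when the session ends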



The plan we get regularly, with both tables analyzed:

tpch=# explain analyze select b from tab2 left join tab1 on a = b;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=945515.68..1071064.34 rows=78264 width=4) (actual time=9439.410..20445.620 rows=78264 loops=1)
   Hash Cond: (tab2.b = tab1.a)
   ->  Seq Scan on tab2  (cost=0.00..1129.64 rows=78264 width=4) (actual time=0.006..5.116 rows=78264 loops=1)
   ->  Hash  (cost=442374.30..442374.30 rows=30667630 width=4) (actual time=9133.593..9133.593 rows=30667722 loops=1)
         Buckets: 33554432  Batches: 2  Memory Usage: 801126kB
         ->  Seq Scan on tab1  (cost=0.00..442374.30 rows=30667630 width=4) (actual time=0.030..3584.652 rows=30667722 loops=1)
 Planning time: 0.055 ms
 Execution time: 20472.603 ms
(8 rows)








I reproduced the other plan by not analyzing the smaller table:

tpch=# explain analyze select b from tab2 left join tab1 on a = b;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=2102.88..905274.97 rows=78039 width=4) (actual time=15.331..7590.406 rows=78264 loops=1)
   Hash Cond: (tab1.a = tab2.b)
   ->  Seq Scan on tab1  (cost=0.00..442375.48 rows=30667748 width=4) (actual time=0.046..2697.480 rows=30667722 loops=1)
   ->  Hash  (cost=1127.39..1127.39 rows=78039 width=4) (actual time=15.133..15.133 rows=78264 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 3776kB
         ->  Seq Scan on tab2  (cost=0.00..1127.39 rows=78039 width=4) (actual time=0.009..5.516 rows=78264 loops=1)
 Planning time: 0.053 ms
 Execution time: 7592.688 ms
(8 rows)



 
The plan the planner actually chooses (the first one, hashing the larger
table) turns out to be slower: ~20.5 s versus ~7.6 s. In the smaller table
(tab2) each row is duplicated exactly 8 times, and all the rows in the larger
table (tab1) are distinct. What may be the exact reason, and can we fix this?
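
One crude way to compare plan alternatives without un-analyzing a table is to
toggle planner GUCs inside a transaction. Note this only disables hash joins
outright; it does not choose which side gets hashed, so it is a rough
comparison at best:

    BEGIN;
    SET LOCAL enable_hashjoin = off;  -- planner must pick a merge or nested-loop join
    EXPLAIN ANALYZE SELECT b FROM tab2 LEFT JOIN tab1 ON a = b;
    ROLLBACK;                         -- SET LOCAL is discarded with the transaction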
 
P.S. I have also attached an SQL file to reproduce this.
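
In case the attachment does not come through, the setup is roughly the
following. This is only a sketch reconstructed from the row counts in the
plans above; the attached join_plan.sql is authoritative:

    CREATE TABLE tab1 (a int);
    CREATE TABLE tab2 (b int);

    -- large table: ~30.6 million distinct values
    INSERT INTO tab1 SELECT g FROM generate_series(1, 30667722) g;

    -- small table: 78264 rows, each of 9783 values repeated exactly 8 times
    INSERT INTO tab2 SELECT g % 9783 FROM generate_series(1, 78264) g;

    ANALYZE tab1;
    -- ANALYZE tab2;   -- leaving this out reproduces the second (faster) plan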



---- On Tue, 13 Mar 2018 12:42:12 +0530 Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote ----

On Mon, Mar 12, 2018 at 10:02 PM, Narendra Pradeep U U <narendra.prad...@zohocorp.com> wrote:

> Hi,
>
> Recently I came across a case where the planner chooses the larger table on
> the hash side. I am not sure whether it is intended behavior or we are
> missing something.
>
> I have two tables (a and b), each with a single column. One table, 'a', is
> large, with around 30 million distinct rows, and the other table, 'b', has
> merely 70,000 rows with one-seventh as many (10,000) distinct rows. I have
> analyzed both tables. But while joining the two tables I get the larger
> table on the hash side.
>
> tpch=# explain select b from b left join a on a = b;
>                                 QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=824863.75..950104.42 rows=78264 width=4)
>    Hash Cond: (b.b = a.a)
>    ->  Foreign Scan on b  (cost=0.00..821.64 rows=78264 width=4)
>          CStore File: /home/likewise-open/pg96/data/cstore_fdw/1818708/1849879
>          CStore File Size: 314587
>    ->  Hash  (cost=321721.22..321721.22 rows=30667722 width=4)
>          ->  Foreign Scan on a  (cost=0.00..321721.22 rows=30667722 width=4)
>                CStore File: /home/likewise-open/pg96/data/cstore_fdw/1818708/1849876
>                CStore File Size: 123236206
> (9 rows)
>
> I would like to know the reason for choosing this plan, and is there an
> easy fix to prevent such plans (especially ones like this that choose a
> larger hash table)?

 

A plan with the larger table being hashed isn't necessarily a badly
performing one. During partition-wise join analysis I have seen plans with
the larger table being hashed perform better than plans with the smaller
table being hashed. But I have seen the other way around as well. Although,
I don't know an easy way to force which side of the join gets hashed; I
tried that under the debugger. In your case, if you run EXPLAIN ANALYZE on
this query and produce outputs of the two plans, one with the larger table
being hashed and one with the smaller one being hashed, you will see which
of them performs better.

 

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

 






Attachment: join_plan.sql