Alexander Behm created IMPALA-6575:
--------------------------------------

             Summary: Avoid double-counting of predicates in join cardinality 
estimation
                 Key: IMPALA-6575
                 URL: https://issues.apache.org/jira/browse/IMPALA-6575
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, 
Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
            Reporter: Alexander Behm


The cardinality of an inner join may be significantly underestimated if (1) an 
equivalent predicate exists on both join inputs, (2) the join condition 
involves the same column as that predicate, and (3) Impala believes the join to 
be FK/PK.

The reason for this underestimation is that the planner double-counts the 
selectivity of predicates on the join input:
* First, the selectivity reduces the cardinality of the join input
* Second, since the join is FK/PK, the build-side selectivity is applied to the 
join cardinality
This second adjustment is not correct in this specific situation because the 
predicate selectivity has already been applied to the probe-side join input.

Example:
{code}
explain select count(*) from functional.alltypes a join functional.alltypes b 
on a.id = b.id and a.id < 10 and b.id < 10;
+------------------------------------------------------------------------------------------------+
| Explain String                                                                
                 |
+------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=4.00MB                              
                 |
| Per-Host Resource Estimates: Memory=279.94MB                                  
                 |
| Codegen disabled by planner                                                   
                 |
|                                                                               
                 |
| F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                         
                 |
| |  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B                
                 |
| PLAN-ROOT SINK                                                                
                 |
| |  mem-estimate=0B mem-reservation=0B                                         
                 |
| |                                                                             
                 |
| 07:AGGREGATE [FINALIZE]                                                       
                 |
| |  output: count:merge(*)                                                     
                 |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                
                 |
| |  tuple-ids=2 row-size=8B cardinality=1                                      
                 |
| |                                                                             
                 |
| 06:EXCHANGE [UNPARTITIONED]                                                   
                 |
| |  mem-estimate=0B mem-reservation=0B                                         
                 |
| |  tuple-ids=2 row-size=8B cardinality=1                                      
                 |
| |                                                                             
                 |
| F02:PLAN FRAGMENT [HASH(a.id)] hosts=3 instances=3                            
                 |
| Per-Host Resources: mem-estimate=12.94MB mem-reservation=2.94MB 
runtime-filters-memory=1.00MB  |
| 03:AGGREGATE                                                                  
                 |
| |  output: count(*)                                                           
                 |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                
                 |
| |  tuple-ids=2 row-size=8B cardinality=1                                      
                 |
| |                                                                             
                 |
| 02:HASH JOIN [INNER JOIN, PARTITIONED]                                        
                 |
| |  hash predicates: a.id = b.id                                               
                 |
| |  fk/pk conjuncts: a.id = b.id                                               
                 |
| |  runtime filters: RF000[bloom] <- b.id                                      
                 |
| |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB            
                 |
| |  tuple-ids=0,1 row-size=8B cardinality=73       <--- should be 730          
                                   |
| |                                                                             
                 |
| |--05:EXCHANGE [HASH(b.id)]                                                   
                 |
| |  |  mem-estimate=0B mem-reservation=0B                                      
                 |
| |  |  tuple-ids=1 row-size=4B cardinality=730                                 
                 |
| |  |                                                                          
                 |
| |  F01:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                             
                 |
| |  Per-Host Resources: mem-estimate=128.00MB mem-reservation=32.00KB          
                 |
| |  01:SCAN HDFS [functional.alltypes b, RANDOM]                               
                 |
| |     partitions=24/24 files=24 size=478.45KB                                 
                 |
| |     predicates: b.id < 10                                                   
                 |
| |     stored statistics:                                                      
                 |
| |       table: rows=7300 size=478.45KB                                        
                 |
| |       partitions: 24/24 rows=7300                                           
                 |
| |       columns: all                                                          
                 |
| |     extrapolated-rows=disabled                                              
                 |
| |     parquet dictionary predicates: b.id < 10                                
                 |
| |     mem-estimate=128.00MB mem-reservation=32.00KB                           
                 |
| |     tuple-ids=1 row-size=4B cardinality=730                                 
                 |
| |                                                                             
                 |
| 04:EXCHANGE [HASH(a.id)]                                                      
                 |
| |  mem-estimate=0B mem-reservation=0B                                         
                 |
| |  tuple-ids=0 row-size=4B cardinality=730                                    
                 |
| |                                                                             
                 |
| F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                
                 |
| Per-Host Resources: mem-estimate=129.00MB mem-reservation=1.03MB 
runtime-filters-memory=1.00MB |
| 00:SCAN HDFS [functional.alltypes a, RANDOM]                                  
                 |
|    partitions=24/24 files=24 size=478.45KB                                    
                 |
|    predicates: a.id < 10                                                      
                 |
|    runtime filters: RF000[bloom] -> a.id                                      
                 |
|    stored statistics:                                                         
                 |
|      table: rows=7300 size=478.45KB                                           
                 |
|      partitions: 24/24 rows=7300                                              
                 |
|      columns: all                                                             
                 |
|    extrapolated-rows=disabled                                                 
                 |
|    parquet dictionary predicates: a.id < 10                                   
                 |
|    mem-estimate=128.00MB mem-reservation=32.00KB                              
                 |
|    tuple-ids=0 row-size=4B cardinality=730                                    
                 |
+------------------------------------------------------------------------------------------------+
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to