[jira] [Resolved] (HIVE-7834) Use min, max and NDV from the stats to better estimate many to many vs one to many inner joins

Mostafa Mokhtar (JIRA) Tue, 09 Sep 2014 07:27:57 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-7834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mostafa Mokhtar resolved HIVE-7834.
-----------------------------------
    Resolution: Duplicate

> Use min, max and NDV from the stats to better estimate many to many vs one to 
> many inner joins
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7834
>                 URL: https://issues.apache.org/jira/browse/HIVE-7834
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 0.13.1
>            Reporter: Mostafa Mokhtar
>             Fix For: 0.14.0
>
>
> I noticed that the estimate number of rows in Map joins is higher after the 
> join than before the join that is with column stats fetch ON or OFF.
> TPC-DS Q55 was a good example for that, the issue is that the current 
> statistics provide us enough information that we can estimate with strong 
> confidence that the joins are one to many and not many to many.
> Joining store_sales x item on ss_item_sk = i_item_sk, we know that the NDV, 
> min and max values for both join columns match while the row counts are 
> different this pattern indicates a PK/FK relationship between store_sales and 
> item.
> Yet when a filter is applied on item and reduces the number of rows from 462K 
> to 7K we estimate a many to many join between the filtered item and 
> store_sales and as a result the estimate number of rows coming out of the 
> join is off by several orders of magnitude.
> Available information from the stats
> {code}
> Table         Join column     NDV from describe               NDV actual      
> min             max
> item          i_item_sk       439,501                         462,000         
> 1               462,000
> date_dim      d_date_sk       65,332                          73,049          
> 2,415,022       2,488,070
> store_sales   ss_item_sk      439,501                         462,000         
> 1               462,000
> store_sales   ss_sold_date_sk 2,226                           1,823           
> 2,450,816       2,452,642
> {code}
> Same thing applies to store_sales and date_dim but with a caveat that the NDV 
> , min and max values don't match where date_dim has a bigger domain and 
> accordingly a higher NDV count.
> For joining store_sales and item on on ss_item_sk = i_item_sk since both 
> columns have the same NDV, min and max values we can safely conclude that 
> selectivity on item will translate to similar selectivity on store_sales.
> This is not the case for joining store_sales and date_dim on ss_sold_date_sk 
> = d_date_sk since the domain of d_date_sk is much bigger than that of 
> ss_sold_date_sk, differences in domain need to be taken into account when 
> inferring selectivity onto store_sales.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (HIVE-7834) Use min, max and NDV from the stats to better estimate many to many vs one to many inner joins

Reply via email to