Stamatis Zampetakis created HIVE-23684:

             Summary: Large underestimation in NDV stats when input and join 
cardinality ratio is big
                 Key: HIVE-23684
             Project: Hive
          Issue Type: Bug
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis

Large underestimations of NDV values may occur after a join operation since the 
current logic will decrease the original NDV values proportionally.

 compares the number of rows of each relation before the join with the number 
of rows after the join and extracts a ratio for each side. Based on this ratio 
it adapts (reduces) the NDV accordingly.

Consider for instance the following query:
select inv_warehouse_sk
     , inv_item_sk
     , stddev_samp(inv_quantity_on_hand) stdev
     , avg(inv_quantity_on_hand)         mean
from inventory
   , date_dim
where inv_date_sk = d_date_sk
  and d_year = 1999
  and d_moy = 2
group by inv_warehouse_sk, inv_item_sk;
For the sake of the discussion, I outline below some relevant stats (from 
 T(inventory) = 1627857000
 T(date_dim) = 73049
 T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
 V(inventory, inv_date_sk) = 261
 V(inventory, inv_item_sk) = 420000
 V(inventory, inv_warehouse_sk) = 27
 V(date_dim, inv, d_date_sk) = 73049

For instance, in this query the join between inventory and date_dim has ~24M 
rows while inventory has ~1.5B so the NDV of the columns coming from inventory 
are reduced by a factor of ~100 so we end up with V(JOIN, inv_item_sk) = ~6K 
while the real one is 231000.

This message was sent by Atlassian Jira

Reply via email to