[ https://issues.apache.org/jira/browse/HIVE-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146874#comment-14146874 ]
Mostafa Mokhtar commented on HIVE-8031: --------------------------------------- Issue is that NDV is not scaled with row count as a results the CE of the join ends up being too low. Another issue is that the join of date_dim x store_sales doesn't go through the PK/FK path HiveRelMdUniqueKeys.getUniqueKeys(ProjectRelBase, boolean) is of type HepRelVertex not HiveTableScanRel. > CBO should use per column join selectivity not NDV when applying exponential > backoff. > ------------------------------------------------------------------------------------- > > Key: HIVE-8031 > URL: https://issues.apache.org/jira/browse/HIVE-8031 > Project: Hive > Issue Type: Bug > Components: CBO > Affects Versions: 0.14.0, 0.13.1 > Reporter: Mostafa Mokhtar > Assignee: Harish Butani > Fix For: 0.14.0 > > > Currently CBO uses NDV not join selectivity in computeInnerJoinSelectivity > which results in in-accurate estimate number of rows. > I looked at the plan for TPC-DS Q17 after the latest set of changes and I am > concerned that the estimate of rows for the join of store_sales and > store_returns is so low, as you can see the estimate is 8461 rows for joining > 1.2795706667449066E8 with 1.2922108035889767E7. > {code} > HiveJoinRel(condition=[AND(=($130, $3), =($129, $15))], > joinType=[inner]): rowcount = 1079.1345153548855, cumulative cost = > {8.271845957931738E10 rows, 0.0 cpu, 0.0 io}, id = 517 > HiveJoinRel(condition=[=($0, $38)], joinType=[inner]): > rowcount = 6.669190301841249E7, cumulative cost = {4.300510912631623E10 rows, > 0.0 cpu, 0.0 io}, id = 402 > HiveTableScanRel(table=[[catalog_sales]]): rowcount = > 4.3005109025E10, cumulative cost = {0}, id = 2 > HiveFilterRel(condition=[in($15, '2000Q1', '2000Q2', > '2000Q3')]): rowcount = 101.31622746185853, cumulative cost = {0.0 rows, 0.0 > cpu, 0.0 io}, id = 181 > HiveTableScanRel(table=[[d3]]): rowcount = 73049.0, > cumulative cost = {0}, id = 3 > HiveJoinRel(condition=[AND(AND(=($3, $61), =($2, $60)), > =($9, $67))], joinType=[inner]): rowcount = 8461.27236667537, cumulative cost > = {8.26517592150266E10 rows, 0.0 cpu, 0.0 io}, id = 515 > HiveJoinRel(condition=[=($27, $0)], joinType=[inner]): > rowcount = 1.2795706667449066E8, cumulative cost = {8.251088004031622E10 > rows, 0.0 cpu, 0.0 io}, id = 417 > HiveTableScanRel(table=[[store_sales]]): rowcount = > 8.2510879939E10, cumulative cost = {0}, id = 5 > HiveFilterRel(condition=[=($15, '2000Q1')]): rowcount = > 101.31622746185853, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 173 > HiveTableScanRel(table=[[d1]]): rowcount = 73049.0, > cumulative cost = {0}, id = 0 > HiveJoinRel(condition=[=($0, $24)], joinType=[inner]): > rowcount = 1.2922108035889767E7, cumulative cost = {8.332595810316228E9 rows, > 0.0 cpu, 0.0 io}, id = 424 > HiveTableScanRel(table=[[store_returns]]): rowcount = > 8.332595709E9, cumulative cost = {0}, id = 7 > HiveFilterRel(condition=[in($15, '2000Q1', '2000Q2', > '2000Q3')]): rowcount = 101.31622746185853, cumulative cost = {0.0 rows, 0.0 > cpu, 0.0 io}, id = 177 > HiveTableScanRel(table=[[d2]]): rowcount = 73049.0, > cumulative cost = {0}, id = 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)