[ 
https://issues.apache.org/jira/browse/HIVE-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated HIVE-8031:
----------------------------------
    Description: 
Currently CBO uses NDV not join selectivity in computeInnerJoinSelectivity 
which results in in-accurate estimate number of rows.

I looked at the plan for TPC-DS Q17 after the latest set of changes and I am 
concerned that the estimate of rows for the join of store_sales and 
store_returns is so low, as you can see the estimate is 8461 rows for joining 
1.2795706667449066E8 with 1.2922108035889767E7.

{code}
    HiveJoinRel(condition=[AND(=($130, $3), =($129, $15))], joinType=[inner]): 
rowcount = 1079.1345153548855, cumulative cost = {8.271845957931738E10 rows, 
0.0 cpu, 0.0 io}, id = 517
                  HiveJoinRel(condition=[=($0, $38)], joinType=[inner]): 
rowcount = 6.669190301841249E7, cumulative cost = {4.300510912631623E10 rows, 
0.0 cpu, 0.0 io}, id = 402
                    HiveTableScanRel(table=[[catalog_sales]]): rowcount = 
4.3005109025E10, cumulative cost = {0}, id = 2
                    HiveFilterRel(condition=[in($15, '2000Q1', '2000Q2', 
'2000Q3')]): rowcount = 101.31622746185853, cumulative cost = {0.0 rows, 0.0 
cpu, 0.0 io}, id = 181
                      HiveTableScanRel(table=[[d3]]): rowcount = 73049.0, 
cumulative cost = {0}, id = 3
                  HiveJoinRel(condition=[AND(AND(=($3, $61), =($2, $60)), =($9, 
$67))], joinType=[inner]): rowcount = 8461.27236667537, cumulative cost = 
{8.26517592150266E10 rows, 0.0 cpu, 0.0 io}, id = 515
                    HiveJoinRel(condition=[=($27, $0)], joinType=[inner]): 
rowcount = 1.2795706667449066E8, cumulative cost = {8.251088004031622E10 rows, 
0.0 cpu, 0.0 io}, id = 417
                      HiveTableScanRel(table=[[store_sales]]): rowcount = 
8.2510879939E10, cumulative cost = {0}, id = 5
                      HiveFilterRel(condition=[=($15, '2000Q1')]): rowcount = 
101.31622746185853, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 173
                        HiveTableScanRel(table=[[d1]]): rowcount = 73049.0, 
cumulative cost = {0}, id = 0
                    HiveJoinRel(condition=[=($0, $24)], joinType=[inner]): 
rowcount = 1.2922108035889767E7, cumulative cost = {8.332595810316228E9 rows, 
0.0 cpu, 0.0 io}, id = 424
                      HiveTableScanRel(table=[[store_returns]]): rowcount = 
8.332595709E9, cumulative cost = {0}, id = 7
                      HiveFilterRel(condition=[in($15, '2000Q1', '2000Q2', 
'2000Q3')]): rowcount = 101.31622746185853, cumulative cost = {0.0 rows, 0.0 
cpu, 0.0 io}, id = 177
                        HiveTableScanRel(table=[[d2]]): rowcount = 73049.0, 
cumulative cost = {0}, id = 1
{code}

> CBO should use per column join selectivity not NDV when applying exponential 
> backoff.
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-8031
>                 URL: https://issues.apache.org/jira/browse/HIVE-8031
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 0.14.0, 0.13.1
>            Reporter: Mostafa Mokhtar
>            Assignee: Harish Butani
>             Fix For: 0.14.0
>
>
> Currently CBO uses NDV not join selectivity in computeInnerJoinSelectivity 
> which results in in-accurate estimate number of rows.
> I looked at the plan for TPC-DS Q17 after the latest set of changes and I am 
> concerned that the estimate of rows for the join of store_sales and 
> store_returns is so low, as you can see the estimate is 8461 rows for joining 
> 1.2795706667449066E8 with 1.2922108035889767E7.
> {code}
>     HiveJoinRel(condition=[AND(=($130, $3), =($129, $15))], 
> joinType=[inner]): rowcount = 1079.1345153548855, cumulative cost = 
> {8.271845957931738E10 rows, 0.0 cpu, 0.0 io}, id = 517
>                   HiveJoinRel(condition=[=($0, $38)], joinType=[inner]): 
> rowcount = 6.669190301841249E7, cumulative cost = {4.300510912631623E10 rows, 
> 0.0 cpu, 0.0 io}, id = 402
>                     HiveTableScanRel(table=[[catalog_sales]]): rowcount = 
> 4.3005109025E10, cumulative cost = {0}, id = 2
>                     HiveFilterRel(condition=[in($15, '2000Q1', '2000Q2', 
> '2000Q3')]): rowcount = 101.31622746185853, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 181
>                       HiveTableScanRel(table=[[d3]]): rowcount = 73049.0, 
> cumulative cost = {0}, id = 3
>                   HiveJoinRel(condition=[AND(AND(=($3, $61), =($2, $60)), 
> =($9, $67))], joinType=[inner]): rowcount = 8461.27236667537, cumulative cost 
> = {8.26517592150266E10 rows, 0.0 cpu, 0.0 io}, id = 515
>                     HiveJoinRel(condition=[=($27, $0)], joinType=[inner]): 
> rowcount = 1.2795706667449066E8, cumulative cost = {8.251088004031622E10 
> rows, 0.0 cpu, 0.0 io}, id = 417
>                       HiveTableScanRel(table=[[store_sales]]): rowcount = 
> 8.2510879939E10, cumulative cost = {0}, id = 5
>                       HiveFilterRel(condition=[=($15, '2000Q1')]): rowcount = 
> 101.31622746185853, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 173
>                         HiveTableScanRel(table=[[d1]]): rowcount = 73049.0, 
> cumulative cost = {0}, id = 0
>                     HiveJoinRel(condition=[=($0, $24)], joinType=[inner]): 
> rowcount = 1.2922108035889767E7, cumulative cost = {8.332595810316228E9 rows, 
> 0.0 cpu, 0.0 io}, id = 424
>                       HiveTableScanRel(table=[[store_returns]]): rowcount = 
> 8.332595709E9, cumulative cost = {0}, id = 7
>                       HiveFilterRel(condition=[in($15, '2000Q1', '2000Q2', 
> '2000Q3')]): rowcount = 101.31622746185853, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 177
>                         HiveTableScanRel(table=[[d2]]): rowcount = 73049.0, 
> cumulative cost = {0}, id = 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to