[ 
https://issues.apache.org/jira/browse/IMPALA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-8037:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Planner makes poor estimations for non-equality predicates
> ----------------------------------------------------------
>
>                 Key: IMPALA-8037
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8037
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Consider the TPC-H queries. Here is a part of one of them:
> {code:sql}
> select
>   count(*)
> from tpch.customer c,
>   tpch.orders o
> where
>   c.c_mktsegment = 'BUILDING'
>   and c.c_custkey = o.o_custkey
>   and o_orderdate < '1995-03-15'
> {code}
> This produces the following cardinality estimates (using the revised, 
> standard join cardinality estimator):
> {noformat}
> | 02:HASH JOIN [INNER JOIN, BROADCAST]                                      |
> | |  hash predicates: o.o_custkey = c.c_custkey                             |
> | |  runtime filters: RF000 <- c.c_custkey                                  |
> | |  row-size=59B cardinality=30.00K                                        |
> | |                                                                         |
> | |--04:EXCHANGE [BROADCAST]                                                |
> | |  |                                                                      |
> | |  00:SCAN HDFS [tpch.customer c]                                         |
> | |     partitions=1/1 files=1 size=23.08MB row-size=29B cardinality=30.00K |
> | |     predicates: c.c_mktsegment = 'BUILDING'                             |
> | |                                                                         |
> | 01:SCAN HDFS [tpch.orders o]                                              |
> |    partitions=1/1 files=1 size=162.56MB row-size=30B cardinality=150.00K  |
> |    predicates: o_orderdate < '1995-03-15'                                 |
> |    runtime filters: RF000 -> o.o_custkey                                  |
> {noformat}
> Run the query, the actual count is 147K vs. the estimate of 30K. The culprit 
> here is the HDFS Scan node with an estimated cardinality of 150K. Run just 
> this query:
> {code:sql}
> select count(*)
> from tpch.orders o
> where
>   o_orderdate < '1995-03-15'
> {code}
> With this plan:
> {noformat}
> | 00:SCAN HDFS [tpch.orders o]                                             |
> |    partitions=1/1 files=1 size=162.56MB row-size=22B cardinality=150.00K |
> |    predicates: o_orderdate < '1995-03-15'                                |
> {noformat}
> With with an actual result of 727K. Had this been used in the previous plan 
> instead of 30K, the join cardinality would have been about four times larger, 
> or 120K, which is far closer to the actual total.
> So, the problem is the mis-estimation of the inequality, which are very hard 
> to estimate. The actual selectivity, found by running a {{count\(*)}} on the 
> table, is 727K / 1.5M =~ 0.5. But, Impala estimated 30K / 1.5M = 0.02. Most 
> systems estimate around 0.45 or so.
> For example, using a default estimate of 1/3 for inequality provides the 
> following plan, which is much better:
> {noformat}
> 03:HASH JOIN [INNER JOIN]
> |  hash predicates: c.c_custkey = o.o_custkey
> |  runtime filters: RF002 <- o.o_custkey
> |  row-size=71B cardinality=120.00K
> |
> |--01:SCAN HDFS [tpch.orders o]
> |     partitions=1/1 files=1 size=162.56MB row-size=42B cardinality=600.00K
> |     predicates: o_orderdate < '1995-03-15'
> |     runtime filters: RF000 -> o.o_orderkey
> {noformat}
> Makes one wonder if the data set was designed with the industry-standard 
> defaults in mind...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to