[
https://issues.apache.org/jira/browse/HIVE-19097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470661#comment-16470661
]
Jesus Camacho Rodriguez commented on HIVE-19097:
------------------------------------------------
[~kgyrtkirk], could you create a different issue for branch-3 linked to this
one then? And an RB with the solution that you are proposing to push to
branch-2 so we can review it and push it?
Concerning your solution for branch-3, I like the fact that it is integrated
within existing rule, but I think it may not be the most efficient solution.
For instance, if you integrate the opener within the
HivePointLookupOptimizerRule, this decomposition/composition will be done
multiple times during that stage in the planning phase, which will incur some
overhead. In addition, we already have in that stage (reduce expressions) rules
that can handle the OR simplification. Thus, I was thinking whether we should
take a similar approach as Calcite, and break the IN clause when we are
generating the Calcite tree? This is a one time process, and we already have
code that does something very similar but for IN clause with a single element:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/RexNodeConverter.java#L342
Creating an OR clause over there should make the trick.
What do you think?
> related equals and in operators may cause inaccurate stats estimations
> ----------------------------------------------------------------------
>
> Key: HIVE-19097
> URL: https://issues.apache.org/jira/browse/HIVE-19097
> Project: Hive
> Issue Type: Bug
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Priority: Major
> Attachments: HIVE-19097.01.patch, HIVE-19097.partial.patch
>
>
> tpcds#74 is optimized in a way that for date_dim the condition contains IN
> and = for the same column
> {code:java}
> | Map Operator Tree: |
> | TableScan |
> | alias: date_dim |
> | filterExpr: (((d_year) IN (2001, 2002) and (d_year =
> 2002) and d_date_sk is not null) or ((d_year) IN (2001, 2002) and (d_year =
> 2001) and d_date_sk is not null)) (type: boolean) |
> | Statistics: Num rows: 73049 Data size: 876588 Basic
> stats: COMPLETE Column stats: COMPLETE |
> | Filter Operator |
> | predicate: ((d_year) IN (2001, 2002) and (d_year =
> 2002) and d_date_sk is not null) (type: boolean) |
> | Statistics: Num rows: 4 Data size: 48 Basic stats:
> COMPLETE Column stats: COMPLETE |
> {code}
> the "real" row count will be 365
> for separate {{IN}} and {{=}} the estimation is very good; but if both are
> present it becomes (very) underestimated.
> {code:java}
> set hive.query.results.cache.enabled=false;
> drop table if exists t1;
> drop table if exists t8;
> create table t1 (a integer,b integer);
> create table t8 like t1;
> insert into t1 values (1,1),(2,2),(3,3),(4,4),(5,5);
> insert into t8
> select * from t1 union all select * from t1 union all select * from t1 union
> all select * from t1 union all
> select * from t1 union all select * from t1 union all select * from t1 union
> all select * from t1
> ;
> analyze table t1 compute statistics for columns;
> analyze table t8 compute statistics for columns;
> explain analyze select sum(a) from t8 where b in (2,3) group by b;
> explain analyze select sum(a) from t8 where b=2 group by b;
> explain analyze select sum(a) from t1 where b in (2,3) and b=2 group by b;
> explain analyze select sum(a) from t8 where b in (2,3) and b=2 group by b;
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)