[ 
https://issues.apache.org/jira/browse/HIVE-19097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425372#comment-16425372
 ] 

Zoltan Haindrich commented on HIVE-19097:
-----------------------------------------

Calcite seems to be doing the conversion during sql parsing [sql 
parsing|https://github.com/apache/calcite/blob/6b35306e7818cb3e67e27b9edbb00b1ed8338cdd/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1450]
 ; 
it has a configurable limit (which defaults to  
[20|https://github.com/apache/calcite/blob/6b35306e7818cb3e67e27b9edbb00b1ed8338cdd/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L223])

I don't feel it right that the parser decides whenever there will be a table 
scan for the in values or not and not the planner.
Adding the simplification code to calcite would doesn't make much difference 
since most of the time its already being done....and it seems to me that 
calcite "thinks" about the IN operator more like as the subquery operator - 
because the transformation of the small INs are done so early.

note: as the result of these things calcite uses IN in a form 
IN($1,$2,\{subquery\}) if it creates one - so calcite never uses more than 1 
set element for the in operator..

Hive passes IN-s by using  
[HiveIn|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveIn.java#L27].
It would be also possible to add a calcite rule to hive; which would do the IN 
to OR rewrite at the hive side.

> related equals and in operators may cause inaccurate stats estimations
> ----------------------------------------------------------------------
>
>                 Key: HIVE-19097
>                 URL: https://issues.apache.org/jira/browse/HIVE-19097
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>         Attachments: HIVE-19097.01.patch
>
>
> tpcds#74 is optimized in a way that for date_dim the condition contains IN 
> and = for the same column
> {code:java}
> |             Map Operator Tree:                     |
> |                 TableScan                          |
> |                   alias: date_dim                  |
> |                   filterExpr: (((d_year) IN (2001, 2002) and (d_year = 
> 2002) and d_date_sk is not null) or ((d_year) IN (2001, 2002) and (d_year = 
> 2001) and d_date_sk is not null)) (type: boolean) |
> |                   Statistics: Num rows: 73049 Data size: 876588 Basic 
> stats: COMPLETE Column stats: COMPLETE |
> |                   Filter Operator                  |
> |                     predicate: ((d_year) IN (2001, 2002) and (d_year = 
> 2002) and d_date_sk is not null) (type: boolean) |
> |                     Statistics: Num rows: 4 Data size: 48 Basic stats: 
> COMPLETE Column stats: COMPLETE |
> {code}
> the "real" row count will be 365
> for separate {{IN}} and {{=}} the estimation is very good; but if both are 
> present it becomes (very) underestimated.
> {code:java}
> set hive.query.results.cache.enabled=false;
> drop table if exists t1;
> drop table if exists t8;
> create table t1 (a integer,b integer);
> create table t8 like t1;
> insert into t1 values (1,1),(2,2),(3,3),(4,4),(5,5);
> insert into t8
> select * from t1 union all select * from t1 union all select * from t1 union 
> all select * from t1 union all
> select * from t1 union all select * from t1 union all select * from t1 union 
> all select * from t1
> ;
> analyze table t1 compute statistics for columns;
> analyze table t8 compute statistics for columns;
> explain analyze select sum(a) from t8 where b in (2,3) group by b;
> explain analyze select sum(a) from t8 where b=2 group by b;
> explain analyze select sum(a) from t1 where b in (2,3) and b=2 group by b;
> explain analyze select sum(a) from t8 where b in (2,3) and b=2 group by b;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to