[jira] [Commented] (HIVE-27084) Iceberg: Stats are not populated correctly during query compilation

okumin (Jira) Fri, 06 Oct 2023 00:19:04 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-27084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772464#comment-17772464
 ]


okumin commented on HIVE-27084:
-------------------------------

[~rajesh.balamohan] It could not be a problem of Iceberg and it could be an 
expected behavior.

First, the first query scans two columns, `ss_sold_date_sk` and 
`ss_wholesale_cost`. The second query scans a single column, `ss_sold_date_sk`. 
I guess `ss_wholesale_cost` is not a partition column. So, the first query has 
to scan more data than the second one.

{code:sql}
select count(*) from store_sales where ss_sold_date_sk=2450822 and 
ss_wholesale_cost > 0.0;
select count(*) from store_sales where ss_sold_date_sk=2450822
{code}

Then, I guess `ss_sold_date_sk` is int or bigint and `ss_wholesale_cost` is 
decimal. On my machine, the derived size of bigint is 8 while that of decimal 
is 112. It can potentially make a big difference.

> Iceberg: Stats are not populated correctly during query compilation
> -------------------------------------------------------------------
>
>                 Key: HIVE-27084
>                 URL: https://issues.apache.org/jira/browse/HIVE-27084
>             Project: Hive
>          Issue Type: Improvement
>          Components: Iceberg integration
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance
>
> - Table stats are not properly used/computed during query compilation phase.
>  - Here is an example. Check out the query with the filter which give more 
> data than the regular query
> This is just an example, real world queries can have bad query plans due to 
> this
> {{10470974584 with filter, vs 303658262936 without filter}}
> {noformat}
> explain select count(*) from store_sales where ss_sold_date_sk=2450822 and 
> ss_wholesale_cost > 0.0
> Explain
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
>       Edges:
>         Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
>       DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: ((ss_sold_date_sk = 2450822) and 
> (ss_wholesale_cost > 0)) (type: boolean)
>                   Statistics: Num rows: 2755519629 Data size: 303658262936 
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: ((ss_sold_date_sk = 2450822) and 
> (ss_wholesale_cost > 0)) (type: boolean)
>                     Statistics: Num rows: 5 Data size: 550 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 5 Data size: 550 Basic stats: 
> COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count()
>                         minReductionHashAggr: 0.99
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 124 Basic stats: 
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           null sort order:
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 124 Basic stats: 
> COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs (cache only)
>         Reducer 2
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE 
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 124 Basic stats: 
> COMPLETE Column stats: NONE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> 58 rows selected (0.73 seconds)
> explain select count(*) from store_sales where ss_sold_date_sk=2450822
> INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
> INFO  : Completed executing 
> command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); 
> Time taken: 0.061 seconds
> INFO  : OK
> Explain
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
>       Edges:
>         Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
>       DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: (ss_sold_date_sk = 2450822) (type: boolean)
>                   Statistics: Num rows: 2755519629 Data size: 10470974584 
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (ss_sold_date_sk = 2450822) (type: boolean)
>                     Statistics: Num rows: 5 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       Statistics: Num rows: 5 Data size: 18 Basic stats: 
> COMPLETE Column stats: NONE
>                       Group By Operator
>                         aggregations: count()
>                         minReductionHashAggr: 0.99
>                         mode: hash
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 1 Data size: 12 Basic stats: 
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           null sort order:
>                           sort order:
>                           Statistics: Num rows: 1 Data size: 12 Basic stats: 
> COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: bigint)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs (cache only)
>         Reducer 2
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE 
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE 
> Column stats: NONE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> 58 rows selected (0.529 seconds)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-27084) Iceberg: Stats are not populated correctly during query compilation

Reply via email to