[
https://issues.apache.org/jira/browse/HIVE-27084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772464#comment-17772464
]
okumin commented on HIVE-27084:
-------------------------------
[~rajesh.balamohan] It could not be a problem of Iceberg and it could be an
expected behavior.
First, the first query scans two columns, `ss_sold_date_sk` and
`ss_wholesale_cost`. The second query scans a single column, `ss_sold_date_sk`.
I guess `ss_wholesale_cost` is not a partition column. So, the first query has
to scan more data than the second one.
{code:sql}
select count(*) from store_sales where ss_sold_date_sk=2450822 and
ss_wholesale_cost > 0.0;
select count(*) from store_sales where ss_sold_date_sk=2450822
{code}
Then, I guess `ss_sold_date_sk` is int or bigint and `ss_wholesale_cost` is
decimal. On my machine, the derived size of bigint is 8 while that of decimal
is 112. It can potentially make a big difference.
> Iceberg: Stats are not populated correctly during query compilation
> -------------------------------------------------------------------
>
> Key: HIVE-27084
> URL: https://issues.apache.org/jira/browse/HIVE-27084
> Project: Hive
> Issue Type: Improvement
> Components: Iceberg integration
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance
>
> - Table stats are not properly used/computed during query compilation phase.
> - Here is an example. Check out the query with the filter which give more
> data than the regular query
> This is just an example, real world queries can have bad query plans due to
> this
> {{10470974584 with filter, vs 303658262936 without filter}}
> {noformat}
> explain select count(*) from store_sales where ss_sold_date_sk=2450822 and
> ss_wholesale_cost > 0.0
> Explain
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
> Edges:
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ((ss_sold_date_sk = 2450822) and
> (ss_wholesale_cost > 0)) (type: boolean)
> Statistics: Num rows: 2755519629 Data size: 303658262936
> Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: ((ss_sold_date_sk = 2450822) and
> (ss_wholesale_cost > 0)) (type: boolean)
> Statistics: Num rows: 5 Data size: 550 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> Statistics: Num rows: 5 Data size: 550 Basic stats:
> COMPLETE Column stats: NONE
> Group By Operator
> aggregations: count()
> minReductionHashAggr: 0.99
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 124 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> null sort order:
> sort order:
> Statistics: Num rows: 1 Data size: 124 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: all inputs (cache only)
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 124 Basic stats:
> COMPLETE Column stats: NONE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> 58 rows selected (0.73 seconds)
> explain select count(*) from store_sales where ss_sold_date_sk=2450822
> INFO : Starting task [Stage-3:EXPLAIN] in serial mode
> INFO : Completed executing
> command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7);
> Time taken: 0.061 seconds
> INFO : OK
> Explain
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
> Edges:
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: (ss_sold_date_sk = 2450822) (type: boolean)
> Statistics: Num rows: 2755519629 Data size: 10470974584
> Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (ss_sold_date_sk = 2450822) (type: boolean)
> Statistics: Num rows: 5 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> Statistics: Num rows: 5 Data size: 18 Basic stats:
> COMPLETE Column stats: NONE
> Group By Operator
> aggregations: count()
> minReductionHashAggr: 0.99
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 12 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> null sort order:
> sort order:
> Statistics: Num rows: 1 Data size: 12 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: all inputs (cache only)
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE
> Column stats: NONE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> 58 rows selected (0.529 seconds)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)