Rajesh Balamohan created HIVE-27084:
---------------------------------------
Summary: Iceberg: Stats are not populated correctly during query
compilation
Key: HIVE-27084
URL: https://issues.apache.org/jira/browse/HIVE-27084
Project: Hive
Issue Type: Improvement
Components: Iceberg integration
Reporter: Rajesh Balamohan
- Table stats are not properly used/computed during query compilation phase.
- Here is an example. Check out the query with the filter which give more data
than the regular query
This is just an example, real world queries can have bad query plans due to this
{{10470974584 with filter, vs 303658262936 without filter}}
{noformat}
explain select count(*) from store_sales where ss_sold_date_sk=2450822 and
ss_wholesale_cost > 0.0
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
Edges:
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: ((ss_sold_date_sk = 2450822) and
(ss_wholesale_cost > 0)) (type: boolean)
Statistics: Num rows: 2755519629 Data size: 303658262936
Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((ss_sold_date_sk = 2450822) and
(ss_wholesale_cost > 0)) (type: boolean)
Statistics: Num rows: 5 Data size: 550 Basic stats:
COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 5 Data size: 550 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 124 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
null sort order:
sort order:
Statistics: Num rows: 1 Data size: 124 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Execution mode: vectorized, llap
LLAP IO: all inputs (cache only)
Reducer 2
Execution mode: vectorized, llap
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
58 rows selected (0.73 seconds)
explain select count(*) from store_sales where ss_sold_date_sk=2450822
INFO : Starting task [Stage-3:EXPLAIN] in serial mode
INFO : Completed executing
command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); Time
taken: 0.061 seconds
INFO : OK
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
Edges:
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: (ss_sold_date_sk = 2450822) (type: boolean)
Statistics: Num rows: 2755519629 Data size: 10470974584 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (ss_sold_date_sk = 2450822) (type: boolean)
Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE
Column stats: NONE
Select Operator
Statistics: Num rows: 5 Data size: 18 Basic stats:
COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 12 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
null sort order:
sort order:
Statistics: Num rows: 1 Data size: 12 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Execution mode: vectorized, llap
LLAP IO: all inputs (cache only)
Reducer 2
Execution mode: vectorized, llap
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
58 rows selected (0.529 seconds)
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)