Rajesh Balamohan created HIVE-27099:
---------------------------------------
Summary: Iceberg: select count(*) from table queries all data
Key: HIVE-27099
URL: https://issues.apache.org/jira/browse/HIVE-27099
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan
select count is scanning all data. Though it has complete basic stats, it
launched tez job which wasn't needed. Second issue is, it ended up scanning
ENTIRE 148 GB dataset which is completely not required. It should have got the
data from parq files itself. Ideal situation is getting entire records from
manifest itself.
Data is stored in parquet format in external tables. This may be broken for
parquet, as for ORC it is able to read less data (footer info).
1. Consider fixing count( * ) for parq
2. Check if it is possible to read stats from iceberg manifests after #1.
{noformat}
explain select count(*) from store_sales;
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
Edges:
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
Statistics: Num rows: 2879966589 Data size: 195666988943
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
Statistics: Num rows: 2879966589 Data size: 195666988943
Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count()
minReductionHashAggr: 0.5
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: COMPLETE
Reduce Output Operator
null sort order:
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint)
Execution mode: vectorized
Reducer 2
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: COMPLETE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
53 rows selected (1.454 seconds)
0: jdbc:hive2://ve0:218> select count(*) from store_sales;
INFO : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId:
hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
INFO : Session is already open
INFO : Dag name: select count(*) from store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id
application_1676286357243_0061)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING
FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 767 767 0 0
0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0
0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 54.94 s
----------------------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 54.85 seconds
INFO :
INFO : Query Execution Summary
INFO :
----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO :
----------------------------------------------------------------------------------------------
INFO : Compile Query 1.42s
INFO : Prepare Plan 0.18s
INFO : Get Query Coordinator (AM) 0.01s
INFO : Submit Plan 0.37s
INFO : Start DAG 0.08s
INFO : Run DAG 54.84s
INFO :
----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO :
----------------------------------------------------------------------------------------------
INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
INPUT_RECORDS OUTPUT_RECORDS
INFO :
----------------------------------------------------------------------------------------------
INFO : Map 1 41326.00 4,280,860 28,890
2,879,966,589 1,332
INFO : Reducer 2 26123.00 4,750 34
767 0
INFO :
----------------------------------------------------------------------------------------------
INFO :
INFO : org.apache.tez.common.counters.DAGCounter:
INFO : NUM_SUCCEEDED_TASKS: 768
INFO : TOTAL_LAUNCHED_TASKS: 768
INFO : DATA_LOCAL_TASKS: 677
INFO : RACK_LOCAL_TASKS: 90
INFO : AM_CPU_MILLISECONDS: 101240
INFO : AM_GC_TIME_MILLIS: 156
INFO : File System Counters:
INFO : FILE_BYTES_READ: 2820
INFO : FILE_BYTES_WRITTEN: 46020
INFO : HDFS_BYTES_WRITTEN: 110
INFO : HDFS_READ_OPS: 2
INFO : HDFS_WRITE_OPS: 2
INFO : HDFS_OP_CREATE: 1
INFO : HDFS_OP_GET_FILE_STATUS: 2
INFO : HDFS_OP_RENAME: 1
INFO : OFS_BYTES_READ: 148270746936
INFO : OFS_READ_OPS: 11058
...
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)