[ https://issues.apache.org/jira/browse/HIVE-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan reopened HIVE-26185: ------------------------------------- > Need support for metadataonly operations with iceberg (e.g select distinct on > partition column) > ----------------------------------------------------------------------------------------------- > > Key: HIVE-26185 > URL: https://issues.apache.org/jira/browse/HIVE-26185 > Project: Hive > Issue Type: Bug > Components: HiveServer2 > Reporter: Rajesh Balamohan > Priority: Major > Labels: performance > > {noformat} > select distinct ss_sold_date_sk from store_sales > {noformat} > This query scans 1800+ rows in hive acid. But takes ages to process with > NullScanOptimiser during compilation phase > (https://issues.apache.org/jira/browse/HIVE-24262) > {noformat} > Hive ACID > INFO : Executing > command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): > select distinct ss_sold_date_sk from store_sales > INFO : Compute 'ndembla-test2' is active. > INFO : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14 > INFO : Total jobs = 1 > INFO : Launching Job 1 out of 1 > INFO : Starting task [Stage-1:MAPRED] in serial mode > INFO : Subscribed to counters: [] for queryId: > hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14 > INFO : Tez session hasn't been created yet. Opening session > INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1) > INFO : Status: Running (Executing on YARN cluster with App id > application_1651102345385_0000) > INFO : Status: DAG finished successfully in 1.81 seconds > INFO : DAG ID: dag_1651102345385_0000_5 > INFO : > INFO : Query Execution Summary > INFO : > ---------------------------------------------------------------------------------------------- > INFO : OPERATION DURATION > INFO : > ---------------------------------------------------------------------------------------------- > INFO : Compile Query 55.47s > INFO : Prepare Plan 2.32s > INFO : Get Query Coordinator (AM) 0.13s > INFO : Submit Plan 0.03s > INFO : Start DAG 0.09s > INFO : Run DAG 1.80s > INFO : > ---------------------------------------------------------------------------------------------- > INFO : > INFO : Task Execution Summary > INFO : > ---------------------------------------------------------------------------------------------- > INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) > INPUT_RECORDS OUTPUT_RECORDS > INFO : > ---------------------------------------------------------------------------------------------- > INFO : Map 1 1009.00 0 0 > 1,824 1,824 > INFO : Reducer 2 0.00 0 0 > 1,824 0 > INFO : > ---------------------------------------------------------------------------------------------- > INFO : > {noformat} > However, same query scans *2.8 Billion records.* in iceberg format. This can > be fixed. > {noformat} > INFO : Executing > command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): > select distinct ss_sold_date_sk from store_sales > INFO : Compute 'ndembla-test2' is active. > INFO : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72 > INFO : Total jobs = 1 > INFO : Launching Job 1 out of 1 > INFO : Starting task [Stage-1:MAPRED] in serial mode > INFO : Subscribed to counters: [] for queryId: > hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72 > INFO : Tez session hasn't been created yet. Opening session > INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1) > INFO : Status: Running (Executing on YARN cluster with App id > application_1651102345385_0000) > ---------------------------------------------------------------------------------------------- > VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING > FAILED KILLED > ---------------------------------------------------------------------------------------------- > Map 1 .......... llap SUCCEEDED 7141 7141 0 0 > 0 0 > Reducer 2 ...... llap SUCCEEDED 2 2 0 0 > 0 0 > ---------------------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 18.48 s > ---------------------------------------------------------------------------------------------- > INFO : Status: DAG finished successfully in 17.97 seconds > INFO : DAG ID: dag_1651102345385_0000_4 > INFO : > INFO : Query Execution Summary > INFO : > ---------------------------------------------------------------------------------------------- > INFO : OPERATION DURATION > INFO : > ---------------------------------------------------------------------------------------------- > INFO : Compile Query 1.81s > INFO : Prepare Plan 0.04s > INFO : Get Query Coordinator (AM) 0.14s > INFO : Submit Plan 0.02s > INFO : Start DAG 0.03s > INFO : Run DAG 17.97s > INFO : > ---------------------------------------------------------------------------------------------- > INFO : > INFO : Task Execution Summary > INFO : > ---------------------------------------------------------------------------------------------- > INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) > INPUT_RECORDS OUTPUT_RECORDS > INFO : > ---------------------------------------------------------------------------------------------- > INFO : Map 1 15142.00 0 0 > 28,800,426,268 8,611 > INFO : Reducer 2 956.00 0 0 > 8,611 0 > INFO : > ---------------------------------------------------------------------------------------------- > INFO : > {noformat} > It will be beneficial to make use of MetadataOnlyOptimizer for iceberg tables > as well. -- This message was sent by Atlassian Jira (v8.20.7#820007)