[
https://issues.apache.org/jira/browse/HIVE-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048197#comment-18048197
]
Stamatis Zampetakis commented on HIVE-29322:
--------------------------------------------
Thanks for sharing the doc [~Indhumathi27] ! In the future it may be better to
reply directly in the Jira ticket since it makes the discussion more
self-contained and easier to follow.
According to the document Map 1 outputs *only* 100 rows because we have a
*single* mapper and a *single* reducer. I assume that we had more mappers the
OUTPUT_RECORDS would be {{100 X NUM_MAPPERS}} is that correct?
In order to better understand why TNK_7 is not pruning anything I would like to
know a bit more about the synthetic dataset. In particular since the Top N Key
Operator is applied on column h it is important to share some insights about
the data of this column. What's the number of distinct values for column h? How
and why it was chosen? Would the performance of the query change if the value
distribution of the column was different?
> Performance Degraded with Top N Key Pushdown for Order by Limit Queries
> -----------------------------------------------------------------------
>
> Key: HIVE-29322
> URL: https://issues.apache.org/jira/browse/HIVE-29322
> Project: Hive
> Issue Type: Bug
> Reporter: Indhumathi Muthumurugesh
> Assignee: Indhumathi Muthumurugesh
> Priority: Major
> Labels: pull-request-available
>
> Env: Used latest docker image: docker pull apache/hive:4.2.0
> Table Schema Used:
> create table seg1 (a string,b string,c string,d string,e string,f string,g
> string,h int,i string,j string,k string,l string,m int) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',' ;
> create table seg1_orc (a string,b string,c string,d string,e string,f
> string,g string,h int,i string,j string,k string,l string,m int) stored as
> orc;
>
> -- Generated and Inserted 16,777,216 records
>
> -- Query
> select * from seg1_orc order by h limit 100;
>
> Plan:
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> | Fetch Operator |
> | limit:100 |
> | Stage-1 |
> | Reducer 2 vectorized |
> | File Output Operator [FS_12] |
> | Limit [LIM_11] (rows=100 width=1101) |
> | Number of rows:100 |
> | Select Operator [SEL_10] (rows=16777216 width=1101) |
> |
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
> |
> | <-Map 1 [SIMPLE_EDGE] vectorized |
> | SHUFFLE [RS_9] |
> | Select Operator [SEL_8] (rows=16777216 width=1101) |
> |
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
> |
> | Top N Key Operator [TNK_7] (rows=16777216 width=1101) |
> | keys:h,top n:100 |
> | TableScan [TS_0] (rows=16777216 width=1101) |
> |
> default@seg1_orc,seg1_orc,Tbl:COMPLETE,Col:COMPLETE,Output:["a","b","c","d","e","f","g","h","i","j","k","l","m"]
> |
> | |
> +----------------------------------------------------+ {code}
> Time Taken: *100 rows selected (40.806 seconds)*
> Set hive.optimize.topnkey=false;
> Time Taken:
> *100 rows selected (7.802 seconds)*
>
> *Issue:*
> It is seen that Top n key is processing all rows in the Map phase.
> Set hive.optimize.topnkey=false;
> {code:java}
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
> INPUT_RECORDS OUTPUT_RECORDS
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : Map 1 7061.00 0 0
> 16,777,216 100
> INFO : Reducer 2 0.00 0 0
> 100 0
> INFO :
> ----------------------------------------------------------------------------------------------
> {code}
> Set hive.optimize.topnkey=true;
> {code:java}
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
> INPUT_RECORDS OUTPUT_RECORDS
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : Map 1 31805.00 0 0
> 16,777,216 16,777,216
> INFO : Reducer 2 8156.00 0 0
> 16,777,216 0 {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)