[ 
https://issues.apache.org/jira/browse/HIVE-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048197#comment-18048197
 ] 

Stamatis Zampetakis commented on HIVE-29322:
--------------------------------------------

Thanks for sharing the doc [~Indhumathi27] ! In the future it may be better to 
reply directly in the Jira ticket since it makes the discussion more 
self-contained and easier to follow.

According to the document Map 1 outputs *only* 100 rows because we have a 
*single* mapper and a *single* reducer. I assume that we had more mappers the 
OUTPUT_RECORDS would be {{100 X NUM_MAPPERS}} is that correct?

In order to better understand why TNK_7 is not pruning anything I would like to 
know a bit more about the synthetic dataset. In particular since the Top N Key 
Operator is applied on column h it is important to share some insights about 
the data of this column. What's the number of distinct values for column h? How 
and why it was chosen? Would the performance of the query change if the value 
distribution of the column was different?

> Performance Degraded with Top N Key Pushdown for Order by Limit Queries
> -----------------------------------------------------------------------
>
>                 Key: HIVE-29322
>                 URL: https://issues.apache.org/jira/browse/HIVE-29322
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Indhumathi Muthumurugesh
>            Assignee: Indhumathi Muthumurugesh
>            Priority: Major
>              Labels: pull-request-available
>
> Env: Used latest docker image: docker pull apache/hive:4.2.0
> Table Schema Used:
> create table seg1 (a string,b string,c string,d string,e string,f string,g 
> string,h int,i string,j string,k string,l string,m int) ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY ',' ;
> create table seg1_orc (a string,b string,c string,d string,e string,f 
> string,g string,h int,i string,j string,k string,l string,m int) stored as 
> orc;
>  
> -- Generated and Inserted  16,777,216 records
>  
> -- Query
> select * from seg1_orc order by h limit 100;
>  
> Plan:
> {code:java}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:100                                      |
> |     Stage-1                                        |
> |       Reducer 2 vectorized                         |
> |       File Output Operator [FS_12]                 |
> |         Limit [LIM_11] (rows=100 width=1101)       |
> |           Number of rows:100                       |
> |           Select Operator [SEL_10] (rows=16777216 width=1101) |
> |             
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
>  |
> |           <-Map 1 [SIMPLE_EDGE] vectorized         |
> |             SHUFFLE [RS_9]                         |
> |               Select Operator [SEL_8] (rows=16777216 width=1101) |
> |                 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
>  |
> |                 Top N Key Operator [TNK_7] (rows=16777216 width=1101) |
> |                   keys:h,top n:100                 |
> |                   TableScan [TS_0] (rows=16777216 width=1101) |
> |                     
> default@seg1_orc,seg1_orc,Tbl:COMPLETE,Col:COMPLETE,Output:["a","b","c","d","e","f","g","h","i","j","k","l","m"]
>  |
> |                                                    |
> +----------------------------------------------------+ {code}
> Time Taken:  *100 rows selected (40.806 seconds)*
> Set hive.optimize.topnkey=false;
> Time Taken:  
> *100 rows selected (7.802 seconds)*
>  
> *Issue:*
> It is seen that Top n key is processing all rows in the Map phase.
> Set hive.optimize.topnkey=false;
> {code:java}
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
> INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :      Map 1           7061.00              0              0      
> 16,777,216              100
> INFO  :  Reducer 2              0.00              0              0            
>  100                0
> INFO  : 
> ----------------------------------------------------------------------------------------------
>  {code}
> Set hive.optimize.topnkey=true;
> {code:java}
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
> INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :      Map 1          31805.00              0              0      
> 16,777,216       16,777,216
> INFO  :  Reducer 2           8156.00              0              0      
> 16,777,216                0 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to