[ 
https://issues.apache.org/jira/browse/HIVE-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048239#comment-18048239
 ] 

Stamatis Zampetakis commented on HIVE-29322:
--------------------------------------------

I am bit surprised by the results of the experiment, especially by case (2) 
where values in column h follow a uniform distribution. I would expect that a 
significant portion of those is filtered by the Top N Key Operator and never 
reach the reducer. 

If NDV(h) = COUNT(seg1) then:
* Best case: values in column h monotonically increasing so TNK prunes all rows 
* Worst case: values in column h monotonically decreasing so TNK prunes nothing

Moreover, it appears that case (3) cannot be isolated and the statement that 
TNK is efficient or not depends on the direction of the ordering (ASC/DESC), 
distribution, and NDVs.

> Performance Degraded with Top N Key Pushdown for Order by Limit Queries
> -----------------------------------------------------------------------
>
>                 Key: HIVE-29322
>                 URL: https://issues.apache.org/jira/browse/HIVE-29322
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Indhumathi Muthumurugesh
>            Assignee: Indhumathi Muthumurugesh
>            Priority: Major
>              Labels: pull-request-available
>
> Env: Used latest docker image: docker pull apache/hive:4.2.0
> Table Schema Used:
> create table seg1 (a string,b string,c string,d string,e string,f string,g 
> string,h int,i string,j string,k string,l string,m int) ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY ',' ;
> create table seg1_orc (a string,b string,c string,d string,e string,f 
> string,g string,h int,i string,j string,k string,l string,m int) stored as 
> orc;
>  
> -- Generated and Inserted  16,777,216 records
>  
> -- Query
> select * from seg1_orc order by h limit 100;
>  
> Plan:
> {code:java}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:100                                      |
> |     Stage-1                                        |
> |       Reducer 2 vectorized                         |
> |       File Output Operator [FS_12]                 |
> |         Limit [LIM_11] (rows=100 width=1101)       |
> |           Number of rows:100                       |
> |           Select Operator [SEL_10] (rows=16777216 width=1101) |
> |             
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
>  |
> |           <-Map 1 [SIMPLE_EDGE] vectorized         |
> |             SHUFFLE [RS_9]                         |
> |               Select Operator [SEL_8] (rows=16777216 width=1101) |
> |                 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
>  |
> |                 Top N Key Operator [TNK_7] (rows=16777216 width=1101) |
> |                   keys:h,top n:100                 |
> |                   TableScan [TS_0] (rows=16777216 width=1101) |
> |                     
> default@seg1_orc,seg1_orc,Tbl:COMPLETE,Col:COMPLETE,Output:["a","b","c","d","e","f","g","h","i","j","k","l","m"]
>  |
> |                                                    |
> +----------------------------------------------------+ {code}
> Time Taken:  *100 rows selected (40.806 seconds)*
> Set hive.optimize.topnkey=false;
> Time Taken:  
> *100 rows selected (7.802 seconds)*
>  
> *Issue:*
> It is seen that Top n key is processing all rows in the Map phase.
> Set hive.optimize.topnkey=false;
> {code:java}
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
> INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :      Map 1           7061.00              0              0      
> 16,777,216              100
> INFO  :  Reducer 2              0.00              0              0            
>  100                0
> INFO  : 
> ----------------------------------------------------------------------------------------------
>  {code}
> Set hive.optimize.topnkey=true;
> {code:java}
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
> INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :      Map 1          31805.00              0              0      
> 16,777,216       16,777,216
> INFO  :  Reducer 2           8156.00              0              0      
> 16,777,216                0 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to