[
https://issues.apache.org/jira/browse/HIVE-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048239#comment-18048239
]
Stamatis Zampetakis commented on HIVE-29322:
--------------------------------------------
I am bit surprised by the results of the experiment, especially by case (2)
where values in column h follow a uniform distribution. I would expect that a
significant portion of those is filtered by the Top N Key Operator and never
reach the reducer.
If NDV(h) = COUNT(seg1) then:
* Best case: values in column h monotonically increasing so TNK prunes all rows
* Worst case: values in column h monotonically decreasing so TNK prunes nothing
Moreover, it appears that case (3) cannot be isolated and the statement that
TNK is efficient or not depends on the direction of the ordering (ASC/DESC),
distribution, and NDVs.
> Performance Degraded with Top N Key Pushdown for Order by Limit Queries
> -----------------------------------------------------------------------
>
> Key: HIVE-29322
> URL: https://issues.apache.org/jira/browse/HIVE-29322
> Project: Hive
> Issue Type: Bug
> Reporter: Indhumathi Muthumurugesh
> Assignee: Indhumathi Muthumurugesh
> Priority: Major
> Labels: pull-request-available
>
> Env: Used latest docker image: docker pull apache/hive:4.2.0
> Table Schema Used:
> create table seg1 (a string,b string,c string,d string,e string,f string,g
> string,h int,i string,j string,k string,l string,m int) ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',' ;
> create table seg1_orc (a string,b string,c string,d string,e string,f
> string,g string,h int,i string,j string,k string,l string,m int) stored as
> orc;
>
> -- Generated and Inserted 16,777,216 records
>
> -- Query
> select * from seg1_orc order by h limit 100;
>
> Plan:
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> | Fetch Operator |
> | limit:100 |
> | Stage-1 |
> | Reducer 2 vectorized |
> | File Output Operator [FS_12] |
> | Limit [LIM_11] (rows=100 width=1101) |
> | Number of rows:100 |
> | Select Operator [SEL_10] (rows=16777216 width=1101) |
> |
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
> |
> | <-Map 1 [SIMPLE_EDGE] vectorized |
> | SHUFFLE [RS_9] |
> | Select Operator [SEL_8] (rows=16777216 width=1101) |
> |
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
> |
> | Top N Key Operator [TNK_7] (rows=16777216 width=1101) |
> | keys:h,top n:100 |
> | TableScan [TS_0] (rows=16777216 width=1101) |
> |
> default@seg1_orc,seg1_orc,Tbl:COMPLETE,Col:COMPLETE,Output:["a","b","c","d","e","f","g","h","i","j","k","l","m"]
> |
> | |
> +----------------------------------------------------+ {code}
> Time Taken: *100 rows selected (40.806 seconds)*
> Set hive.optimize.topnkey=false;
> Time Taken:
> *100 rows selected (7.802 seconds)*
>
> *Issue:*
> It is seen that Top n key is processing all rows in the Map phase.
> Set hive.optimize.topnkey=false;
> {code:java}
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
> INPUT_RECORDS OUTPUT_RECORDS
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : Map 1 7061.00 0 0
> 16,777,216 100
> INFO : Reducer 2 0.00 0 0
> 100 0
> INFO :
> ----------------------------------------------------------------------------------------------
> {code}
> Set hive.optimize.topnkey=true;
> {code:java}
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
> INPUT_RECORDS OUTPUT_RECORDS
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : Map 1 31805.00 0 0
> 16,777,216 16,777,216
> INFO : Reducer 2 8156.00 0 0
> 16,777,216 0 {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)