Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

via GitHub Thu, 15 Jan 2026 19:59:01 -0800


Indhumathi27 commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2696786118



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   > It would be great if you can run some experiments where the numbers are closer.
   
   Below are the numbers with more data.
   Total rows: 51200000
   Dataset: Mixed Partition data
   Table schema and query: 
[ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt)
   
   With TopNkey enabled:
   <img width="967" height="491" alt="Screenshot 2026-01-16 at 9 14 54 AM" src="https://github.com/user-attachments/assets/77ca5a34-a604-4dc4-8a49-4c540d07a391" />
   
   
   With TopNkey disabled:
   <img width="967" height="491" alt="Screenshot 2026-01-16 at 9 13 50 AM" src="https://github.com/user-attachments/assets/bd1a9ecc-0b28-47f0-9de3-4b2a121f5113" />
   
   
   > I don't fully understand the statement about the shuffle and reducer stages. Can you elaborate a bit more?
   
   For PTF (windowing) queries, the shuffle and reducer stages are required
   anyway: rows must be grouped by the PARTITION BY key before window functions
   can be evaluated. TopNKey operates per partition, so even when it forwards
   most or all rows, it neither adds a global shuffle nor changes the reducer
   fan-in.
   
   In contrast, for ORDER BY … LIMIT queries, TopNKey maintains a single global
   Top-N heap and can disable ReduceSink-level Top-N pruning. When the input
   data is unsorted, all rows are then shuffled globally, which leads to severe
   performance degradation.
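   
   To show why input order matters for a single streaming Top-N key filter,
   here is a minimal Java sketch. It is not Hive's TopNKeyOperator code: the
   class `TopNKeyFilterSketch` and the method `countForwarded` are hypothetical
   names, and the model assumed here is simply "forward a row if its key is
   not larger than the N-th smallest key seen so far".

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Illustrative streaming top-N key filter; an assumption-based model, not Hive's implementation. */
class TopNKeyFilterSketch {

    /**
     * Counts how many rows a single ascending top-N key filter forwards downstream.
     * A row is forwarded if its key is not larger than the N-th smallest key seen so far.
     */
    static int countForwarded(int[] keys, int n) {
        if (n <= 0) {
            return 0;
        }
        // Max-heap holding the N smallest keys seen so far; peek() is the current cutoff.
        PriorityQueue<Integer> topN = new PriorityQueue<>(Comparator.reverseOrder());
        int forwarded = 0;
        for (int key : keys) {
            if (topN.size() < n) {
                // Heap not full yet: every row must be forwarded.
                topN.add(key);
                forwarded++;
            } else if (key <= topN.peek()) {
                // Key is within the current top N: forward it and tighten the cutoff if smaller.
                if (key < topN.peek()) {
                    topN.poll();
                    topN.add(key);
                }
                forwarded++;
            }
            // Otherwise the key cannot be in the top N, so the row is dropped.
        }
        return forwarded;
    }

    public static void main(String[] args) {
        int[] ascending = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] descending = {8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println("ascending:  " + countForwarded(ascending, 2));  // prints 2
        System.out.println("descending: " + countForwarded(descending, 2)); // prints 8
    }
}
```

   With N = 2, the favourably ordered input forwards only 2 of 8 rows, while
   the adversarially ordered input forwards all 8. In this simplified model
   the filter prunes nothing in the worst case, which matches the situation
   described above where most or all rows still reach the shuffle.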



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


