Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

via GitHub Wed, 14 Jan 2026 03:34:40 -0800


zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2689820106



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   Many thanks for the additional experiments for the PTF case. 
   
   > But query performance degradation is not observed for PTF operator case, comparing with disabling topnKeyoperator
   
   It's difficult to draw safe conclusions from the comparison between the ORDER BY and windowing experiments because the dataset size and the effective limit differ significantly.
   
   Dataset size:
   * for ORDER BY the dataset has ~10M rows
   * for windowing the dataset has 50K rows
   
   Effective limit/top-n filter:
   * for ORDER BY the limit is 100
   * for windowing the limit is ~6K
   
   It would be great if you could run some experiments where the numbers are closer.
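
To illustrate why dataset size and limit matter here, below is a toy sketch of top-n key filtering. It is not Hive's `TopNKeyOperator`; the class `TopNFilterSketch` and the method `countForwarded` are hypothetical names used only for illustration, and the keys are random longs. It counts how many rows a simple ascending top-n filter would let through, so the same input can be compared under limits of different sizes.

```java
import java.util.Collections;
import java.util.PriorityQueue;
import java.util.Random;

// Toy model only: NOT Hive code. Names are hypothetical.
public class TopNFilterSketch {

  /** Counts the keys a simple ascending top-n filter forwards downstream. */
  public static long countForwarded(long[] keys, int n) {
    if (n <= 0) {
      return 0;
    }
    // Max-heap holding the n smallest keys seen so far.
    PriorityQueue<Long> topN = new PriorityQueue<>(Collections.reverseOrder());
    long forwarded = 0;
    for (long key : keys) {
      if (topN.size() < n) {
        // Filter not yet full: every key is forwarded.
        topN.add(key);
        forwarded++;
      } else if (key < topN.peek()) {
        // Key beats the current n-th smallest: replace it and forward.
        topN.poll();
        topN.add(key);
        forwarded++;
      }
      // Otherwise the key cannot be in the final top n and is dropped.
    }
    return forwarded;
  }

  public static void main(String[] args) {
    // Assumed setup: 50K random keys, compared under two different limits.
    long[] keys = new Random(42).longs(50_000).toArray();
    for (int n : new int[] {100, 6_000}) {
      long fwd = countForwarded(keys, n);
      System.out.printf("rows=%d n=%d forwarded=%d%n", keys.length, n, fwd);
    }
  }
}
```

The fraction of forwarded rows depends on the limit relative to the row count and on the input order, so runs with very different sizes and limits are hard to compare directly.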
    
   > Disabling TopNkey for Windowing queries don't show a drastic difference in performance
   
   I still believe that this depends on the use case. The example that you crafted for ORDER BY clearly showed the downsides of the Top-N operator. The answers/benchmarks above imply that this will never happen for windowing functions, but I don't understand why.
   
   > because the shuffle and reducer stages are already needed for window partitioning.
   
   I don't fully understand the statement about the shuffle and reducer stages. 
Can you elaborate a bit more?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

