Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

via GitHub Wed, 07 Jan 2026 03:59:08 -0800


zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2667951917



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1300,9 +1300,17 @@ private static void 
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
       return;
     }
 
+    // Restrict TopNKey matching to GROUP BY / JOIN ReduceSinks, except for 
PTF queries.
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+    String topNKeyRegexPattern = hasPTFReduceSink(procCtx) ? reduceSinkOp :
+            ".*(" + GroupByOperator.getOperatorName() + "|" +

Review Comment:
   Is it necessary to have the match any prefix `.*` at the beginning of the 
pattern?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:
##########
@@ -198,6 +198,11 @@ public Object process(Node nd, Stack<Node> stack,
           }
           currentOp = currentOp.getChildOperators().get(0);
         }
+        // Skip RS copy for plans that are already TopNKey-optimized to avoid 
re-propagating
+        // TopNKey configuration into the same ReduceSink.
+        if (((LimitPushdownContext)procCtx).disallowRSCopy && 
foundGroupByOperator) {

Review Comment:
   1. Can you provide an example of query/plan/qfile that necessitates this 
conditional? 
   2. Why do we need this conditional now and we didn't have it before?



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void 
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext 
procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   Why do we need this? It is usually better if we can keep the 
optimization/transformation rules independent of the SQL syntax.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void 
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext 
procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();
+
+    if (hasPTFReduceSink(procCtx) || !hasOrderOrLimit) {
+      return reduceSinkOp;
+    }
+
+    return "("
+            + GroupByOperator.getOperatorName() + "|"
+            + PTFOperator.getOperatorName() + "|"
+            + JoinOperator.getOperatorName() + "|"
+            + MapJoinOperator.getOperatorName() + "|"
+            + LateralViewJoinOperator.getOperatorName() + "|"
+            + CommonMergeJoinOperator.getOperatorName()
+            + ")"
+            + "(%("
+            + SelectOperator.getOperatorName() + "|"
+            + FilterOperator.getOperatorName()

Review Comment:
   Do we really care what operators are between the join/gby and the RS?



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1330,15 @@ private static void 
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  private static boolean hasPTFReduceSink(OptimizeTezProcContext ctx) {
+    for (ReduceSinkOperator rs : ctx.visitedReduceSinks) {
+      if (rs.getConf().isPTFReduceSink()) {
+        return true;
+      }
+    }
+    return false;
+  }
+

Review Comment:
   Not sure if this achieves the desired outcome. Basically, if there is a PTF 
RS anywhere in the plan we will apply the rule on every RS (no matter if it is 
PTF or not). 
   
   Moreover, by relying on `ctx.visitedReduceSinks` we make the 
TopNKeyOptimization highly dependent on stats dependent optimization which is 
not great.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:
##########
@@ -250,8 +255,12 @@ private static class LimitPushdownContext implements 
NodeProcessorCtx {
 
     private final float threshold;
 
+    private final boolean disallowRSCopy;
+
     public LimitPushdownContext(HiveConf conf) throws SemanticException {
       threshold = 
conf.getFloatVar(HiveConf.ConfVars.HIVE_LIMIT_PUSHDOWN_MEMORY_USAGE);
+      disallowRSCopy = 
conf.getBoolVar(HiveConf.ConfVars.HIVE_OPTIMIZE_TOPNKEY) &&
+              !conf.getBoolVar(HiveConf.ConfVars.HIVE_MAPSIDE_AGGREGATE);

Review Comment:
   Can you please explain how and why is this related to the TopNKey 
optimization and the map-side aggregation? The connection is not obvious for me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

Reply via email to