Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

via GitHub Sat, 10 Jan 2026 11:24:41 -0800


Indhumathi27 commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2678912718



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void 
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext 
procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   I have debugged Windowing queries case and below is the observation.
   
   Testcase (unsorted data):
   `CREATE TABLE topnkey_windowing (tw_code string, tw_value double);
   
   INSERT INTO topnkey_windowing VALUES (NULL, NULL),(NULL, 109),('A', 
109),('A', 104),('A', 110),('A', 120),('A', 103),('A', 109),('B', 105),('B', 
106),('B', 106),('B', NULL),('B', 106),('A', 109);
   
   SELECT tw_code, ranking
   FROM (
     SELECT tw_code AS tw_code,
       rank() OVER (PARTITION BY tw_code ORDER BY tw_value) AS ranking
     FROM topnkey_windowing) tmp1
     WHERE ranking < 2;
   `
   With TopNkey enabled, 
   Map phase: Input records: 14 Output Records: 9
   With TopNkey disabled, 
   Map phase: Input records: 14 Output Records: 8
   Time Taken : both almost same.
   
   1. In PTF queries, TopNKey creates a separate TopNKeyFilter for every 
distinct PARTITION BY key and maintains an in-memory Top-N heap per partition.
   2. Each incoming row performs partition-key hashing, map lookup, and heap 
comparison to decide whether it belongs to that partition’s Top-N.
   3. Rows are eliminated when their sort key compares worse than the current 
Top-N boundary, so they are not inserted into the partition’s ordered Top-N set.
   4. For ORDER BY … LIMIT queries, TopNKey maintains only a single global 
Top-N heap per reducer.
   
   I have tested with low-cardinality, monotonic-ordered windowing dataset and 
high-cardinality, multi-row-per-partition PTF test dataset. In this case, 
behaviour is similar to ORDER by.. Limit queries, where all the rows are 
forwarded. But query performance degradation is not observed for PTF operator 
case, comparing with disabling topnKeyoperator. 
   One such example: 
   
[ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt)
   
   From this experiments, with TopNkey enabled / disabled, performance is 
almost similar for Windowing queries.
   Disabling TopNkey for Windowing queries don't show a drastic difference in 
performance. Forwarding all rows in PTF TopNKey does not cause the catastrophic 
shuffle explosion seen in ORDER BY … LIMIT queries, because the shuffle and 
reducer stages are already needed for window partitioning.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

Reply via email to