Indhumathi27 commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2696786118
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
ogw.startWalking(topNodes, null);
}
+ /*
+ * Build the ReduceSink matching pattern used by TopNKey optimization.
+ *
+ * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+ * applying TopNKey results in a performance regression. ReduceSink
+ * operators created only for ordering must therefore be excluded from
+ * TopNKey.
+ *
+ * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+ * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+ * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+ * between.
+ */
+ private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+ String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+ boolean hasOrderOrLimit =
+ procCtx.parseContext.getQueryProperties().hasLimit() ||
+ procCtx.parseContext.getQueryProperties().hasOrderBy();
Review Comment:
> It would be great if you can run some experiments where the numbers are
closer.
Below are the numbers with more data.
Total rows: 51200000
Dataset: Mixed Partition data
Table schema and query:
[ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt)
With TopNKey enabled:
<img width="967" height="491" alt="Screenshot 2026-01-16 at 9 14 54 AM"
src="https://github.com/user-attachments/assets/77ca5a34-a604-4dc4-8a49-4c540d07a391"
/>
With TopNKey disabled:
<img width="967" height="491" alt="Screenshot 2026-01-16 at 9 13 50 AM"
src="https://github.com/user-attachments/assets/bd1a9ecc-0b28-47f0-9de3-4b2a121f5113"
/>
> I don't fully understand the statement about the shuffle and reducer
stages. Can you elaborate a bit more?
For PTF (windowing) queries, the shuffle and reducer stages are required to
group rows by the PARTITION BY key before window functions can be evaluated.
TopNKey operates per partition and, even when it forwards most or all rows, it
does not introduce additional global shuffle or change the reducer fan-in.
In contrast, for ORDER BY … LIMIT queries, TopNKey maintains a single global
Top-N heap and can disable ReduceSink-level Top-N pruning; when input data is
unsorted, this causes all rows to be shuffled globally, leading to severe
performance degradation.
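
As a rough illustration of the difference, here is a toy model. It is not Hive's TopNKey implementation: `TopNKeySketch`, `TopNFilter`, `forwardedGlobal` and `forwardedPerPartition` are hypothetical names made up for this sketch, and keys are plain integers ranked in ascending order. It only counts how many rows a streaming top-N filter would forward when there is one global filter versus one filter per partition key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class TopNKeySketch {

    /** Toy streaming filter: forwards a key only while it ranks among the n smallest distinct keys seen so far. */
    static final class TopNFilter {
        private final int n;
        private final TreeSet<Integer> best = new TreeSet<>();

        TopNFilter(int n) {
            this.n = n;
        }

        boolean canForward(int key) {
            if (n <= 0) {
                return false;                 // nothing can ever be in the top 0
            }
            if (best.contains(key)) {
                return true;                  // key is already one of the current top n
            }
            if (best.size() < n) {
                best.add(key);                // still filling up the top n
                return true;
            }
            if (key < best.last()) {
                best.pollLast();              // evict the current worst key
                best.add(key);
                return true;
            }
            return false;                     // key cannot be in the final top n
        }
    }

    /** One global filter for the whole stream (the ORDER BY ... LIMIT shape in this toy model). */
    static int forwardedGlobal(int[] keys, int n) {
        TopNFilter filter = new TopNFilter(n);
        int forwarded = 0;
        for (int key : keys) {
            if (filter.canForward(key)) {
                forwarded++;
            }
        }
        return forwarded;
    }

    /** One filter per partition key (the PTF shape in this toy model); each row is {partition, key}. */
    static int forwardedPerPartition(int[][] rows, int n) {
        Map<Integer, TopNFilter> filters = new HashMap<>();
        int forwarded = 0;
        for (int[] row : rows) {
            TopNFilter filter = filters.computeIfAbsent(row[0], p -> new TopNFilter(n));
            if (filter.canForward(row[1])) {
                forwarded++;
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        int[] ascending = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] descending = {10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println(forwardedGlobal(ascending, 3));   // 3
        System.out.println(forwardedGlobal(descending, 3));  // 10
    }
}
```

With favourably ordered input the global filter forwards only the first 3 rows, but with adversarially ordered input (here, descending keys for an ascending top-3) every row passes the filter, so in this model it prunes nothing and all rows would still reach the shuffle.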
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]