Indhumathi27 commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2678912718
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
ogw.startWalking(topNodes, null);
}
+ /*
+ * Build the ReduceSink matching pattern used by TopNKey optimization.
+ *
+ * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+ * applying TopNKey results in a performance regression. ReduceSink
+ * operators created only for ordering must therefore be excluded from
+ * TopNKey.
+ *
+ * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+ * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+ * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+ * between.
+ */
+ private static String buildTopNKeyRegexPattern(OptimizeTezProcContext
procCtx) {
+ String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+ boolean hasOrderOrLimit =
+ procCtx.parseContext.getQueryProperties().hasLimit() ||
+ procCtx.parseContext.getQueryProperties().hasOrderBy();
Review Comment:
I have debugged Windowing queries case and below is the observation.
Testcase (unsorted data):
`CREATE TABLE topnkey_windowing (tw_code string, tw_value double);
INSERT INTO topnkey_windowing VALUES (NULL, NULL),(NULL, 109),('A',
109),('A', 104),('A', 110),('A', 120),('A', 103),('A', 109),('B', 105),('B',
106),('B', 106),('B', NULL),('B', 106),('A', 109);
SELECT tw_code, ranking
FROM (
SELECT tw_code AS tw_code,
rank() OVER (PARTITION BY tw_code ORDER BY tw_value) AS ranking
FROM topnkey_windowing) tmp1
WHERE ranking < 2;
`
With TopNkey enabled,
Map phase: Input records: 14 Output Records: 9
With TopNkey disabled,
Map phase: Input records: 14 Output Records: 8
Time Taken : both almost same.
1. In PTF queries, TopNKey creates a separate TopNKeyFilter for every
distinct PARTITION BY key and maintains an in-memory Top-N heap per partition.
2. Each incoming row performs partition-key hashing, map lookup, and heap
comparison to decide whether it belongs to that partition’s Top-N.
3. Rows are eliminated when their sort key compares worse than the current
Top-N boundary, so they are not inserted into the partition’s ordered Top-N set.
4. For ORDER BY … LIMIT queries, TopNKey maintains only a single global
Top-N heap per reducer.
I have tested with low-cardinality, monotonic-ordered windowing dataset and
high-cardinality, multi-row-per-partition PTF test dataset. In this case,
behaviour is similar to ORDER by.. Limit queries, where all the rows are
forwarded. But query performance degradation is not observed for PTF operator
case, comparing with disabling topnKeyoperator.
One such example:
[ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt)
From this experiments, with TopNkey enabled / disabled, performance is
almost similar for Windowing queries.
Disabling TopNkey for Windowing queries don't show a drastic difference in
performance. Forwarding all rows in PTF TopNKey does not cause the catastrophic
shuffle explosion seen in ORDER BY … LIMIT queries, because the shuffle and
reducer stages are already needed for window partitioning.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]