zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2689820106
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void
runTopNKeyOptimization(OptimizeTezProcContext procCtx)
ogw.startWalking(topNodes, null);
}
+ /*
+ * Build the ReduceSink matching pattern used by TopNKey optimization.
+ *
+ * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+ * applying TopNKey results in a performance regression. ReduceSink
+ * operators created only for ordering must therefore be excluded from
+ * TopNKey.
+ *
+ * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+ * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+ * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+ * between.
+ */
+ private static String buildTopNKeyRegexPattern(OptimizeTezProcContext
procCtx) {
+ String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+ boolean hasOrderOrLimit =
+ procCtx.parseContext.getQueryProperties().hasLimit() ||
+ procCtx.parseContext.getQueryProperties().hasOrderBy();
Review Comment:
Many thanks for the additional experiments for the PTF case.
> But query performance degradation is not observed for PTF operator case,
comparing with disabling topnKeyoperator
It's difficult to extract safe conclusions from the comparison between ORDER
BY and windowing experiments cause dataset size and effective limit differ
significantly.
Dataset size:
* for ORDER BY the dataset has ~10M rows
* for windowing the dataset has 50K rows
Effective limit/top-n filter:
* for ORDER BY the limit is 100
* for windowing the limit is ~6K
It would be great if you can run some experiments where the numbers are
closer.
> Disabling TopNkey for Windowing queries don't show a drastic difference in
performance
I still believe that this depends on the use-case. The example that you
crafted for ORDER BY was clearly showing the downsides of the Top-N operator.
The answer/benchmarks above imply that this will never happen for windowing
functions but I don't understand why.
> because the shuffle and reducer stages are already needed for window
partitioning.
I don't fully understand the statement about the shuffle and reducer stages.
Can you elaborate a bit more?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]