zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2704788943
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
ogw.startWalking(topNodes, null);
}
+ /*
+ * Build the ReduceSink matching pattern used by TopNKey optimization.
+ *
+ * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+ * applying TopNKey results in a performance regression. ReduceSink
+ * operators created only for ordering must therefore be excluded from
+ * TopNKey.
+ *
+ * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+ * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+ * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+ * between.
+ */
+ private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+ String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+ boolean hasOrderOrLimit =
+ procCtx.parseContext.getQueryProperties().hasLimit() ||
+ procCtx.parseContext.getQueryProperties().hasOrderBy();
Review Comment:
Based on the screenshots of the dataset with 51M rows we have:
### TopNKey enabled
#### Map 1:
* INPUT_RECORDS: 51,193,885
* OUTPUT_RECORDS: 72,272,499
#### Reducer 2:
* INPUT_RECORDS: 72,272,499
### TopNKey disabled
#### Map 1:
* INPUT_RECORDS: 51,193,885
* OUTPUT_RECORDS: 6,144,000
#### Reducer 2:
* INPUT_RECORDS: 6,144,000
There is a significant difference in the number of OUTPUT_RECORDS from Map 1
depending on whether TopNKey is enabled or disabled (72,272,499 vs. 6,144,000,
roughly 12x). The same goes for the INPUT_RECORDS to Reducer 2. Is this
difference negligible in terms of performance?
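To put the counters side by side, here is a minimal sketch of the arithmetic. The class and method names are hypothetical and not part of Hive; the numbers are copied from the screenshots quoted above.

```java
// Hypothetical helper, not part of Hive: compares the Tez record counters
// reported in the screenshots for Map 1 with TopNKey enabled and disabled.
class ShuffleVolume {

    // Ratio of two record counters, e.g. OUTPUT_RECORDS / INPUT_RECORDS.
    static double ratio(long numerator, long denominator) {
        if (denominator == 0) {
            throw new IllegalArgumentException("denominator must be non-zero");
        }
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long mapInput = 51_193_885L;          // Map 1 INPUT_RECORDS (both runs)
        long mapOutputEnabled = 72_272_499L;  // Map 1 OUTPUT_RECORDS, TopNKey enabled
        long mapOutputDisabled = 6_144_000L;  // Map 1 OUTPUT_RECORDS, TopNKey disabled

        System.out.printf("output/input, TopNKey enabled:  %.2f%n",
                ratio(mapOutputEnabled, mapInput));
        System.out.printf("output/input, TopNKey disabled: %.2f%n",
                ratio(mapOutputDisabled, mapInput));
        System.out.printf("shuffled records, enabled/disabled: %.2f%n",
                ratio(mapOutputEnabled, mapOutputDisabled));
    }
}
```

With TopNKey enabled, Map 1 emits more records than it reads, and Reducer 2 receives about an order of magnitude more records than in the disabled run.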
> TopNKey operates per partition and, even when it forwards most or all
> rows, it does not introduce additional global shuffle or change the reducer
> fan-in.
I don't understand what exactly we mean by **global** shuffle. I would also
like some clarification about the "reducer fan-in". Since the number of input
records to Reducer 2 differs depending on whether TopNKey is enabled or
disabled, what exactly do you mean when you say that the fan-in is not
affected?
PS. The content of the
[ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt)
file is identical to that of the previous run. The screenshots show inputs
with 51M rows, but the attachment does not match them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]