okumin commented on code in PR #4432:
URL: https://github.com/apache/hive/pull/4432#discussion_r1238619245


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java:
##########
@@ -130,8 +130,8 @@ public static ReduceWork createReduceWork(
       maxPartition = (maxPartition > maxReducers) ? maxReducers : maxPartition;
 
       // reduce only if the parameters are significant
-      if (minPartition < maxPartition &&
-          nReducers * minPartitionFactor >= 1.0) {
+      final int parallelismThreshold = 
context.conf.getIntVar(HiveConf.ConfVars.TEZ_AUTO_REDUCER_PARALLELISM_THRESHOLD);
+      if (minPartition < maxPartition && maxPartition >= parallelismThreshold) 
{

Review Comment:
   @ayushtkn I'd like to discuss what option is the best for us before 
implementing it. These are my current ideas. Do you have any preference or 
another idea?
   
   ### Option 1: Fallback to the original behavior by default
   
   This can keep the original behavior. New users could be confused.
   
   ```
   // default = -1
   final int parallelismThreshold = 
context.conf.getIntVar(HiveConf.ConfVars.TEZ_AUTO_REDUCER_PARALLELISM_THRESHOLD);
   final boolean enablesAutoReduce;
   if (parallelismThreshold == -1) {
     enablesAutoReduce = nReducers * minPartitionFactor >= 1.0;
   } else {
     enablesAutoReduce = maxPartition >= parallelismThreshold;
   }
   if (minPartition < maxPartition && enablesAutoReduce) {
   ```
   
   ### Option 2: Make the threshold of `nReducers * minPartitionFactor` 
configurable
   
   `Hive on Tez disables auto reducer parallelism when the estimated number of 
reducers * hive.tez.min.partition.factor is smaller than this value`. This is 
consistent with the original formula. This may not be straightforward but it 
can meet needs by setting like `0.2`.
   
   ```
   // default = 1.0
   final float minimumThreshold = 
context.conf.getFloatValue(HiveConf.ConfVars.TEZ_AUTO_REDUCER_PARALLELISM_THRESHOLD);
   if (minPartition < maxPartition &&
     nReducers * minPartitionFactor >= minimumThreshold) {
   ```
   
   ### Option 3: Introduce the best formula and change the behavior
   
   This kind of change may change performance but will not change the query 
result. If the performance degradation is unlikely to happen, we might accept 
it in the major version up. I understand this option is not polite. I just 
listed it as one option.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to