izchen commented on pull request #29028:
URL: https://github.com/apache/spark/pull/29028#issuecomment-657111204
> I'd rather not expose yet another config for this. Are there any
heuristics that can select this more intelligently?
Thank you very much for code review.
The idea of heuristic selection is like:
`mergeInDriver = intermediateResultsTotalSize <
getConf("spark.driver.maxResultSize")`
The intermediate result is the local TopK result of each RDD partition. We
can know that the total number of items is _partitionNum * k_. However, due to
uncertain length objects such as string or user-defined classes, the total size
cannot be estimated before all partitions are executed.
In most user applications, `mergeInDriver = true`, then the localTopK merge
begins after the second RDD partition result is returned, and the merge process
and the RDD partition compute process are parallel. This can obtain the
shortest result return waiting time. Collecting the information needed for
heuristic selection will have a negative impact on the performance of such
applications.
`mergeInDriver = true` may cause program errors in two types of applications:
The result data contains large objects, such as large strings or large
user-defined classes.
Driver applications with high memory pressure, such as multi-user
spark/sql/hive-thriftserver.
There may not be a good way to not expose this config.
Maybe I can add a user-friendly log about this config.
something like:
```
try {
mapRDDs.reduce { (queue1, queue2) =>
...
}
} catch {
case e: SparkException if e.getMessage.contains(MAX_RESULT_SIZE.key) =>
logError(s"Total size of serialized intermediate results is bigger than
" +
s"${MAX_RESULT_SIZE.key}
(${Utils.bytesToString(conf.get(MAX_RESULT_SIZE))}). " +
s"If the final result size is less than this value, you can set the "
+
s"config ${RDD_TAKE_ORDERED_MERGE_IN_DRIVER.key} to false to complete
" +
s"the intermediate result merge in executor. But this config will " +
s"increase the return time of the takeOrdered function.")
throw e
}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]