izchen edited a comment on pull request #29028: URL: https://github.com/apache/spark/pull/29028#issuecomment-657111204
> I'd rather not expose yet another config for this. Are there any heuristics that can select this more intelligently?

Thank you very much for the code review.

The idea behind heuristic selection would be something like:

`mergeInDriver = intermediateResultsTotalSize < getConf("spark.driver.maxResultSize")`

The intermediate results are the local top-K results of each RDD partition. We know that the total number of items is at most _partitionNum * k_. However, because of variable-length objects such as strings or user-defined classes, the total size cannot be estimated until all partitions have been executed.

In most user applications `mergeInDriver = true` is the right choice: the local top-K merge begins as soon as the second RDD partition result is returned, and the merge runs in parallel with the computation of the remaining partitions. This gives the shortest wait for the result. Collecting the information needed for heuristic selection would hurt the performance of such applications.

`mergeInDriver = true` may cause errors in two types of applications:

- Applications whose result data contains large objects, such as large strings or large user-defined classes.
- Driver applications with high memory pressure, such as the multi-user spark/sql/hive-thriftserver.

There may not be a good way to avoid exposing this config. Maybe I can add a user-friendly log message about it, something like:

```
try {
  mapRDDs.reduce { (queue1, queue2) =>
    ...
  }
} catch {
  // Only intercept the failure raised when results exceed spark.driver.maxResultSize.
  case e: SparkException if e.getMessage.contains(s"bigger than ${MAX_RESULT_SIZE.key}") =>
    logError(s"Total size of serialized intermediate results is bigger than " +
      s"${MAX_RESULT_SIZE.key} (${Utils.bytesToString(conf.get(MAX_RESULT_SIZE))}). " +
      s"If the final result size is less than this value, you can set the " +
      s"config ${RDD_TAKE_ORDERED_MERGE_IN_DRIVER.key} to false to complete " +
      s"the intermediate result merge in executor. But this config will " +
      s"increase the return time of the takeOrdered function.")
    // Rethrow so the caller still sees the original failure.
    throw e
}
```
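
To make the sizing argument concrete, here is a minimal sketch in plain Scala with no Spark dependency. Every name in it (`TopKMergeSketch`, `localTopK`, `merge`, `estimatedBytes`, the 1024-byte limit) is hypothetical and only stands in for the real implementation. It shows that the item count is bounded by _partitionNum * k_ up front, while the byte size is only known once every partition's local result exists.

```scala
// Illustrative only: partitions are simulated as in-memory sequences.
object TopKMergeSketch {
  // Smallest k elements of one partition (the "local top-K" result).
  def localTopK(partition: Seq[String], k: Int): Seq[String] =
    partition.sorted.take(k)

  // Merge two local results and keep only the k smallest elements.
  def merge(a: Seq[String], b: Seq[String], k: Int): Seq[String] =
    (a ++ b).sorted.take(k)

  // Rough size estimate from UTF-8 bytes; real serialized size would differ.
  def estimatedBytes(results: Seq[Seq[String]]): Long =
    results.flatten.map(_.getBytes("UTF-8").length.toLong).sum

  def main(args: Array[String]): Unit = {
    val k = 2
    val partitions = Seq(
      Seq("pear", "apple", "fig"),
      Seq("kiwi", "banana"),
      Seq("cherry"))

    // Known before any partition runs: at most partitionNum * k items.
    val itemUpperBound = partitions.size * k

    // Known only after every partition has produced its local result.
    val locals = partitions.map(localTopK(_, k))
    val totalBytes = estimatedBytes(locals)

    // Assumed stand-in for spark.driver.maxResultSize, in bytes.
    val maxResultSize = 1024L
    val mergeInDriver = totalBytes < maxResultSize

    val result = locals.reduce(merge(_, _, k))
    println(s"itemUpperBound=$itemUpperBound totalBytes=$totalBytes " +
      s"mergeInDriver=$mergeInDriver result=$result")
  }
}
```

The sketch computes all local results before deciding, which is exactly the cost described above: a real heuristic would have to wait for every partition instead of merging as results arrive.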