[GitHub] [spark] izchen commented on pull request #29028: [SPARK-32212][CORE]RDD.takeOrdered can choose to merge intermediate r…

GitBox Sat, 11 Jul 2020 11:53:07 -0700


izchen commented on pull request #29028:
URL: https://github.com/apache/spark/pull/29028#issuecomment-657111204



   > I'd rather not expose yet another config for this. Are there any 
heuristics that can select this more intelligently?
   
   Thank you very much for code review.
   
   The idea of heuristic selection is like:
   `mergeInDriver = intermediateResultsTotalSize < 
getConf("spark.driver.maxResultSize")`
   
   The intermediate result is the local TopK result of each RDD partition. We 
can know that the total number of items is _partitionNum * k_. However, due to 
uncertain length objects such as string or user-defined classes, the total size 
cannot be estimated before all partitions are executed.
   
   In most user applications, `mergeInDriver = true`, then the localTopK merge 
begins after the second RDD partition result is returned, and the merge process 
and the RDD partition compute process are parallel. This can obtain the 
shortest result return waiting time. Collecting the information needed for 
heuristic selection will have a negative impact on the performance of such 
applications.
   
   `mergeInDriver = true` may cause program errors in two types of applications:
   The result data contains large objects, such as large strings or large 
user-defined classes.
   Driver applications with high memory pressure, such as multi-user 
spark/sql/hive-thriftserver.
   
   There may not be a good way to not expose this config.
   
   Maybe I can add a user-friendly log about this config.
   something like:
   ```
   try {
       mapRDDs.reduce { (queue1, queue2) =>
        ...
       }
     } catch {
      case e: SparkException if e.getMessage.contains(MAX_RESULT_SIZE.key) =>
        logError(s"Total size of serialized intermediate results is bigger than 
" +
          s"${MAX_RESULT_SIZE.key} 
(${Utils.bytesToString(conf.get(MAX_RESULT_SIZE))}). " +
          s"If the final result size is less than this value, you can set the " 
+
          s"config ${RDD_TAKE_ORDERED_MERGE_IN_DRIVER.key} to false to complete 
" +
          s"the intermediate result merge in executor. But this config will " +
          s"increase the return time of the takeOrdered function.")
        throw e
     }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] izchen commented on pull request #29028: [SPARK-32212][CORE]RDD.takeOrdered can choose to merge intermediate r…

Reply via email to