GitHub user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19633#discussion_r150746876
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -424,11 +424,19 @@ case class FileSourceScanExec(
         val defaultMaxSplitBytes =
           fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
         val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    -    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    -    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    -    val bytesPerCore = totalBytes / defaultParallelism
     
    -    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    +    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
    +    val maxSplitBytes =
    +      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
    +        defaultMaxSplitBytes
    --- End diff --
    
    What if `spark.dynamicAllocation.maxExecutors` is not configured? It seems we cannot rely on this configuration, since users may not always set it.
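    As a side note, one reason this setting can't be assumed: when unset, `spark.dynamicAllocation.maxExecutors` falls back to `Int.MaxValue`, i.e. effectively unbounded. A minimal sketch, just reading the conf with its documented default:
    
    ```scala
    import org.apache.spark.SparkConf
    
    val conf = new SparkConf()
    // spark.dynamicAllocation.maxExecutors defaults to Int.MaxValue when the
    // user has not set it, so any cap derived from it is a no-op by default.
    val maxExecutors =
      conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
    ```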
    
    My concern is the cost of ramping up new executors. By splitting partitions into smaller ones, Spark will ramp up more executors to execute the small tasks; when the cost of ramping up new executors is larger than the cost of executing the tasks themselves, this is no longer a useful heuristic. Previously the heuristic was valid because all the executors were already available.
    
    For small data (calculated `bytesPerCore` < `defaultMaxSplitBytes`, i.e. less than 128M), I think using the available resources to schedule tasks would be enough, since the tasks are not so big. For big data (calculated `bytesPerCore` > `defaultMaxSplitBytes`, i.e. larger than 128M), I think 128M might be the proper value at which to issue new executors and tasks. So IMHO the current solution seems sufficient for the dynamic allocation scenario. Please correct me if I'm wrong.
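    To make the two regimes concrete, here is a minimal sketch of the current heuristic (the parallelism and input sizes are made-up; 128M and 4M are the Spark defaults for `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`):
    
    ```scala
    val defaultMaxSplitBytes = 128L * 1024 * 1024 // 128M
    val openCostInBytes      = 4L * 1024 * 1024   // 4M
    val defaultParallelism   = 8                  // hypothetical cluster size
    
    def maxSplitBytes(totalBytes: Long): Long = {
      val bytesPerCore = totalBytes / defaultParallelism
      Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    }
    
    // Small data: bytesPerCore (32M) < 128M, so splits shrink to fit the cores.
    maxSplitBytes(256L * 1024 * 1024)       // => 32M
    // Big data: bytesPerCore (1.25G) > 128M, so splits are capped at 128M anyway.
    maxSplitBytes(10L * 1024 * 1024 * 1024) // => 128M
    ```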


---
