Github user jerryshao commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r151308496
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
---
@@ -424,11 +424,19 @@ case class FileSourceScanExec(
     val defaultMaxSplitBytes =
       fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
     val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
-    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
-    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
-    val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
+    val maxSplitBytes =
+      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
+        defaultMaxSplitBytes
--- End diff --
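For reference, a minimal standalone sketch of the split-size logic in this hunk. The constants mirror the defaults of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB); the plain boolean flag stands in for `Utils.isDynamicAllocationEnabled`, and the file sizes and core count are made up for illustration:

```scala
object SplitSizeSketch {
  // Defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.
  val defaultMaxSplitBytes: Long = 128L * 1024 * 1024
  val openCostInBytes: Long = 4L * 1024 * 1024

  def maxSplitBytes(
      fileLengths: Seq[Long],
      defaultParallelism: Int,
      dynamicAllocationEnabled: Boolean): Long = {
    if (dynamicAllocationEnabled) {
      // Behaviour proposed in the diff: ignore bytesPerCore entirely.
      defaultMaxSplitBytes
    } else {
      // Current behaviour: shrink splits so that every core gets some work.
      val totalBytes = fileLengths.map(_ + openCostInBytes).sum
      val bytesPerCore = totalBytes / defaultParallelism
      Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    }
  }

  def main(args: Array[String]): Unit = {
    val files = Seq.fill(10)(64L * 1024 * 1024) // ten 64 MB files, ~640 MB in total
    // With 200 cores, bytesPerCore is ~3.4 MB, so the split size collapses to the 4 MB floor.
    println(maxSplitBytes(files, defaultParallelism = 200, dynamicAllocationEnabled = false))
    // With the proposed change, the split size stays at the 128 MB default.
    println(maxSplitBytes(files, defaultParallelism = 200, dynamicAllocationEnabled = true))
  }
}
```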
> For small data, sometimes users care about the number of output files generated as well.

Can you please elaborate: do you want more output files or fewer? If fewer, I think I already covered that in the comment above. Your patch already ignores `bytesPerCore` when dynamic allocation is enabled, so are you suggesting something different?
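To make the output-file question concrete, a rough back-of-the-envelope sketch. It ignores the packing of small files into partitions and assumes the scan result is written out without a shuffle, so each read split roughly maps to one output file; the file sizes are illustrative:

```scala
// Approximate number of read splits for a given maxSplitBytes
// (each file contributes at least one split).
def approxNumSplits(fileLengths: Seq[Long], maxSplitBytes: Long): Long =
  fileLengths.map(len => Math.max(1L, (len + maxSplitBytes - 1) / maxSplitBytes)).sum

val files = Seq.fill(10)(64L * 1024 * 1024)   // ten 64 MB files
approxNumSplits(files, 4L * 1024 * 1024)      // ~160 splits at the 4 MB floor
approxNumSplits(files, 128L * 1024 * 1024)    // 10 splits at the 128 MB default
```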
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]