GitHub user vgankidi commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r151236188
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -424,11 +424,19 @@ case class FileSourceScanExec(
     val defaultMaxSplitBytes =
       fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
     val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
-    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
-    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
-    val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
+    val maxSplitBytes =
+      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
+        defaultMaxSplitBytes
--- End diff --
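For reference, the heuristic this diff removes sizes splits as `min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore))`. A minimal self-contained sketch with illustrative numbers (the file count, sizes, and parallelism below are assumptions, not values from this PR) shows how small inputs on a highly parallel cluster end up with many small splits:

```scala
// Illustrative values only (assumptions, not taken from this PR):
// 10 input files of 128 MB each, default conf values, 200 cores available.
val defaultMaxSplitBytes = 128L * 1024 * 1024  // spark.sql.files.maxPartitionBytes default
val openCostInBytes      = 4L * 1024 * 1024    // spark.sql.files.openCostInBytes default
val defaultParallelism   = 200L

val totalBytes   = 10 * (128L * 1024 * 1024 + openCostInBytes)
val bytesPerCore = totalBytes / defaultParallelism  // ~6.6 MB

// Split size is bounded below by the open cost and above by the conf default.
val maxSplitBytes =
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))  // ~6.6 MB
```

Under dynamic allocation, `defaultParallelism` reflects whichever executors happen to be registered at planning time, so the same query can get very different split sizes from run to run, which is the instability the diff's SPARK-22411 comment refers to.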
@jerryshao That's true. We cannot rely on users to set it.
For small data, users sometimes care about the number of output files
generated as well. To give users that flexibility, how about adding a flag
that, when set, ignores `bytesPerCore` regardless of whether dynamic
allocation is enabled? That way the default behavior stays as-is, and users
with specific needs can opt in via the flag (sketched below).
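Something along these lines could work; this is only a sketch against the code in the diff above, and the conf name `spark.sql.files.ignoreBytesPerCore` is a placeholder I'm making up, not an existing Spark conf:

```scala
// Hypothetical opt-out flag; "spark.sql.files.ignoreBytesPerCore" is a
// placeholder name, not a real Spark configuration.
val ignoreBytesPerCore = fsRelation.sparkSession.sessionState.conf
  .getConfString("spark.sql.files.ignoreBytesPerCore", "false").toBoolean

val maxSplitBytes =
  if (ignoreBytesPerCore) {
    // Users who care about output file counts keep full-size splits,
    // independent of whether dynamic allocation is enabled.
    defaultMaxSplitBytes
  } else {
    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    val bytesPerCore = totalBytes / defaultParallelism
    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  }
```

Defaulting the flag to `false` keeps today's behavior for everyone who doesn't set it.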
---