Github user jerryshao commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r151308496
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
---
@@ -424,11 +424,19 @@ case class FileSourceScanExec(
     val defaultMaxSplitBytes =
       fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
     val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
-    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
-    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
-    val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
+    val maxSplitBytes =
+      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
+        defaultMaxSplitBytes
--- End diff --
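For reference, a minimal standalone sketch of the split-size logic in this hunk. The constants mirror the defaults of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB); the plain boolean flag stands in for `Utils.isDynamicAllocationEnabled`, and the file sizes and core count are made up for illustration:

```scala
object SplitSizeSketch {
  // Defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.
  val defaultMaxSplitBytes: Long = 128L * 1024 * 1024
  val openCostInBytes: Long = 4L * 1024 * 1024

  def maxSplitBytes(
      fileLengths: Seq[Long],
      defaultParallelism: Int,
      dynamicAllocationEnabled: Boolean): Long = {
    if (dynamicAllocationEnabled) {
      // Behaviour proposed in the diff: ignore bytesPerCore entirely.
      defaultMaxSplitBytes
    } else {
      // Current behaviour: shrink splits so that every core gets some work.
      val totalBytes = fileLengths.map(_ + openCostInBytes).sum
      val bytesPerCore = totalBytes / defaultParallelism
      Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    }
  }

  def main(args: Array[String]): Unit = {
    val files = Seq.fill(10)(64L * 1024 * 1024) // ten 64 MB files, ~640 MB in total
    // With 200 cores, bytesPerCore is ~3.4 MB, so the split size collapses to the 4 MB floor.
    println(maxSplitBytes(files, defaultParallelism = 200, dynamicAllocationEnabled = false))
    // With the proposed change, the split size stays at the 128 MB default.
    println(maxSplitBytes(files, defaultParallelism = 200, dynamicAllocationEnabled = true))
  }
}
```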
> For small data, sometimes users care about the number of output files generated as well.

Can you please elaborate: do you want more output files or fewer? If fewer, I think I already covered that in the comment above. Your patch already ignores `bytesPerCore` when dynamic allocation is enabled, so are you suggesting something different?
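To make the output-file question concrete, a rough back-of-the-envelope sketch. It ignores the packing of small files into partitions and assumes the scan result is written out without a shuffle, so each read split roughly maps to one output file; the file sizes are illustrative:

```scala
// Approximate number of read splits for a given maxSplitBytes
// (each file contributes at least one split).
def approxNumSplits(fileLengths: Seq[Long], maxSplitBytes: Long): Long =
  fileLengths.map(len => Math.max(1L, (len + maxSplitBytes - 1) / maxSplitBytes)).sum

val files = Seq.fill(10)(64L * 1024 * 1024)   // ten 64 MB files
approxNumSplits(files, 4L * 1024 * 1024)      // ~160 splits at the 4 MB floor
approxNumSplits(files, 128L * 1024 * 1024)    // 10 splits at the 128 MB default
```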
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]