GitHub user vgankidi commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r151236188
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -424,11 +424,19 @@ case class FileSourceScanExec(
     val defaultMaxSplitBytes =
       fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
     val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
-    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
-    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
-    val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
+    val maxSplitBytes =
+      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
+        defaultMaxSplitBytes
--- End diff --
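For reference, the heuristic this diff removes sizes splits as `min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore))`. A minimal self-contained sketch with illustrative numbers (the file count, sizes, and parallelism below are assumptions, not values from this PR) shows how small inputs on a highly parallel cluster end up with many small splits:

```scala
// Illustrative values only (assumptions, not taken from this PR):
// 10 input files of 128 MB each, default conf values, 200 cores available.
val defaultMaxSplitBytes = 128L * 1024 * 1024  // spark.sql.files.maxPartitionBytes default
val openCostInBytes      = 4L * 1024 * 1024    // spark.sql.files.openCostInBytes default
val defaultParallelism   = 200L

val totalBytes   = 10 * (128L * 1024 * 1024 + openCostInBytes)
val bytesPerCore = totalBytes / defaultParallelism  // ~6.6 MB

// Split size is bounded below by the open cost and above by the conf default.
val maxSplitBytes =
  math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))  // ~6.6 MB
```

Under dynamic allocation, `defaultParallelism` reflects whichever executors happen to be registered at planning time, so the same query can get very different split sizes from run to run, which is the instability the diff's SPARK-22411 comment refers to.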
@jerryshao That's true. We cannot rely on users to set it.
For small data, users sometimes care about the number of output files
generated as well. To give users that flexibility, how about adding a flag
that, when set, ignores `bytesPerCore` regardless of whether dynamic
allocation is enabled? That way the default behavior stays as-is, and users
with specific needs can opt in via the flag (sketched below).
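Something along these lines could work; this is only a sketch against the code in the diff above, and the conf name `spark.sql.files.ignoreBytesPerCore` is a placeholder I'm making up, not an existing Spark conf:

```scala
// Hypothetical opt-out flag; "spark.sql.files.ignoreBytesPerCore" is a
// placeholder name, not a real Spark configuration.
val ignoreBytesPerCore = fsRelation.sparkSession.sessionState.conf
  .getConfString("spark.sql.files.ignoreBytesPerCore", "false").toBoolean

val maxSplitBytes =
  if (ignoreBytesPerCore) {
    // Users who care about output file counts keep full-size splits,
    // independent of whether dynamic allocation is enabled.
    defaultMaxSplitBytes
  } else {
    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    val bytesPerCore = totalBytes / defaultParallelism
    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  }
```

Defaulting the flag to `false` keeps today's behavior for everyone who doesn't set it.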
---