Github user jerryshao commented on a diff in the pull request:
https://github.com/apache/spark/pull/19633#discussion_r150746876
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -424,11 +424,19 @@ case class FileSourceScanExec(
     val defaultMaxSplitBytes =
       fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
     val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
-    val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
-    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
-    val bytesPerCore = totalBytes / defaultParallelism
-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    // Ignore bytesPerCore when dynamic allocation is enabled. See SPARK-22411
+    val maxSplitBytes =
+      if (Utils.isDynamicAllocationEnabled(fsRelation.sparkSession.sparkContext.getConf)) {
+        defaultMaxSplitBytes
--- End diff ---
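For readers skimming the hunk, here is a standalone sketch of the heuristic the removed lines implement (a paraphrase for illustration, not the Spark source; the 128 MB and 4 MB defaults correspond to the stock values of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`):

```scala
// Illustrative sketch of the pre-patch split-size heuristic.
def maxSplitBytesSketch(
    fileSizes: Seq[Long],      // file lengths in bytes
    defaultParallelism: Int,   // sparkContext.defaultParallelism
    defaultMaxSplitBytes: Long = 128L * 1024 * 1024,
    openCostInBytes: Long = 4L * 1024 * 1024): Long = {
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  // The patch above skips this bytesPerCore term when dynamic allocation is
  // enabled and simply returns defaultMaxSplitBytes.
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
}
```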
What if `spark.dynamicAllocation.maxExecutors` is not configured? It seems we cannot rely on this configuration, since users may not always set it.
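To make that concrete, here is a minimal sketch assuming a hypothetical variant that derives parallelism from the dynamic-allocation cap (this variant is not the hunk above); when `spark.dynamicAllocation.maxExecutors` is left unset, its effective default is `Int.MaxValue`, so a bytes-per-core estimate built on it collapses to almost nothing:

```scala
import org.apache.spark.SparkConf

// Hypothetical variant: use the dynamic-allocation cap as the parallelism bound.
val conf = new SparkConf()
val maxExecutors =
  conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue) // unbounded if unset
val coresPerExecutor = conf.getInt("spark.executor.cores", 1)
val totalBytes = 10L * 1024 * 1024 * 1024 // e.g. 10 GB of input
// With maxExecutors == Int.MaxValue, bytesPerCore is only a handful of bytes, so
// the heuristic degenerates to openCostInBytes (4 MB) and produces many tiny tasks.
val bytesPerCore = totalBytes / (maxExecutors.toLong * coresPerExecutor)
```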
My concern is the cost of ramping up new executors: by splitting partitions into small ones, Spark will ramp up more executors to execute the small tasks, and once the cost of ramping up new executors exceeds the cost of executing the tasks, this is no longer a sound heuristic. Previously the heuristic was valid because all the executors were already available.
For small data (calculated `bytesPerCore` < `defaultMaxSplitBytes`, i.e. less than 128 MB), I think using the available resources to schedule tasks would be enough, since the tasks are not that big. For big data (calculated `bytesPerCore` > `defaultMaxSplitBytes`, i.e. larger than 128 MB), I think 128 MB might be the proper value at which to issue new executors and tasks (see the worked numbers below). So IMHO the current solution seems sufficient for the dynamic allocation scenario. Please correct me if I'm wrong.
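For concreteness, a quick back-of-the-envelope in Scala for the two regimes (the file counts, sizes, and core counts are illustrative, not from the PR):

```scala
val mb = 1024L * 1024
val defaultMaxSplitBytes = 128 * mb // spark.sql.files.maxPartitionBytes default
val openCostInBytes = 4 * mb        // spark.sql.files.openCostInBytes default

// Small data: 40 files of 10 MB each on 16 cores.
val smallBytesPerCore = 40 * (10 * mb + openCostInBytes) / 16 // 35 MB
val smallSplit =
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, smallBytesPerCore)) // 35 MB

// Big data: 4000 files of 100 MB each on 16 cores.
val bigBytesPerCore = 4000 * (100 * mb + openCostInBytes) / 16 // 26000 MB
val bigSplit =
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bigBytesPerCore)) // capped at 128 MB
```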
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]