[
https://issues.apache.org/jira/browse/SPARK-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-22411:
------------------------------
Fix Version/s: (was: 2.3.0)
> Heuristic to combine splits in DataSourceScanExec isn't accurate when dynamic
> allocation is enabled
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-22411
> URL: https://issues.apache.org/jira/browse/SPARK-22411
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Vinitha Reddy Gankidi
> Assignee: Vinitha Reddy Gankidi
>
> The heuristic to calculate the maxSplitSize in DataSourceScanExec is as
> follows:
> https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
> Default parallelism in this case is the number of total cores of all the
> registered executors for this application. This works well with static
> allocation but with dynamic allocation enabled, this value is usually one
> (with default config of min and initial executors as zero) at the time of
> split calculation. This heuristic was introduced in SPARK-14582.
> When Dynamic allocation it is confusing to tune the split size with this
> heuristic. It is better to ignore bytesPerCore and use the values of
> 'spark.sql.files.maxPartitionBytes' as the max split size.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]