[
https://issues.apache.org/jira/browse/SPARK-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinitha Reddy Gankidi updated SPARK-22411:
------------------------------------------
Description:
The heuristic to calculate the maxSplitSize in DataSourceScanExec is as follows:
https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
bytesPerCore is calculated based on the default parallelism. Default
parallelism in this case is the number of total cores of all the registered
executors for this application. The heuristic works well with static allocation
but with dynamic allocation enabled, defaultParallelism is usually one at the
time of split calculation (with the default config of zero minimum and initial
executors). This heuristic was introduced in SPARK-14582.
With dynamic allocation, it is confusing to tune the split size with this
heuristic. It is better to ignore bytesPerCore and use the value of
'spark.sql.files.maxPartitionBytes' as the max split size.
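For reference, the heuristic can be sketched as follows. This is a minimal
Python translation of the Scala logic for illustration only (not the actual
implementation); the default values mirror 'spark.sql.files.maxPartitionBytes'
(128 MB) and 'spark.sql.files.openCostInBytes' (4 MB):

```python
# Sketch of the maxSplitBytes heuristic in DataSourceScanExec (SPARK-14582).
# Hypothetical helper for illustration; parameter names mirror the Scala code.
def max_split_bytes(total_bytes, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024):    # spark.sql.files.openCostInBytes
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Static allocation, 1 GB input over 400 cores: bytesPerCore is small,
# so the split size shrinks to openCostInBytes and work is spread out.
static = max_split_bytes(1 * 1024**3, 400)   # 4 MB splits, ~256 splits

# Dynamic allocation at split-calculation time: defaultParallelism is
# usually one, bytesPerCore is inflated to the whole input, and the
# split size always hits the maxPartitionBytes cap instead of shrinking.
dynamic = max_split_bytes(1 * 1024**3, 1)    # 128 MB splits, only 8 splits
```

The example shows why the heuristic misbehaves here: with defaultParallelism
of one, bytesPerCore never reduces the split size, so far fewer (and larger)
splits are produced than with static allocation.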
was:
The heuristic to calculate the maxSplitSize in DataSourceScanExec is as follows:
https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
Default parallelism in this case is the number of total cores of all the
registered executors for this application. This works well with static
allocation but with dynamic allocation enabled, this value is usually one (with
default config of min and initial executors as zero) at the time of split
calculation. This heuristic was introduced in SPARK-14582.
With dynamic allocation, it is confusing to tune the split size with this
heuristic. It is better to ignore bytesPerCore and use the value of
'spark.sql.files.maxPartitionBytes' as the max split size.
> Heuristic to combine splits in DataSourceScanExec isn't accurate when dynamic
> allocation is enabled
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-22411
> URL: https://issues.apache.org/jira/browse/SPARK-22411
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Vinitha Reddy Gankidi
> Assignee: Vinitha Reddy Gankidi
>
> The heuristic to calculate the maxSplitSize in DataSourceScanExec is as
> follows:
> https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
> bytesPerCore is calculated based on the default parallelism. Default
> parallelism in this case is the number of total cores of all the registered
> executors for this application. The heuristic works well with static
> allocation but with dynamic allocation enabled, defaultParallelism is usually
> one (with default config of min and initial executors as zero) at the time of
> split calculation. This heuristic was introduced in SPARK-14582.
> With dynamic allocation, it is confusing to tune the split size with this
> heuristic. It is better to ignore bytesPerCore and use the value of
> 'spark.sql.files.maxPartitionBytes' as the max split size.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]