[
https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lantao Jin updated SPARK-27635:
-------------------------------
Description:
The scenario is submitting multiple jobs concurrently with Spark dynamic
allocation enabled. The issue occurs when determining the number of RDD
partitions: when more CPU cores are available, Spark tries to split the RDD
into more pieces. But because the file is stored in Parquet format, the
Parquet row group is the basic unit for reading data, so splitting the RDD
into many pieces smaller than a row group makes no sense.
Jobs launch too many partitions and never complete.
!Screen Shot 2019-05-05 at 5.45.15 PM.png!
Setting the default parallelism to a fixed number (for example, 200) works around this.
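The interaction described above can be sketched from Spark's file-scan split-size formula (FilePartition.maxSplitBytes in Spark 2.x): the target split size shrinks as default parallelism grows, so with enough dynamically allocated cores it drops far below the Parquet row-group size. A minimal illustrative sketch, assuming the 128 MB / 4 MB defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes; the function name and simplified byte accounting here are illustrative, not Spark's exact code:

```python
# Approximation of Spark's target split size for file-based scans.
# Defaults mirror spark.sql.files.maxPartitionBytes (128 MiB) and
# spark.sql.files.openCostInBytes (4 MiB); exact Spark code also folds
# the open cost into totalBytes, which is omitted here for clarity.

DEFAULT_MAX_SPLIT_BYTES = 128 * 1024 * 1024
OPEN_COST_IN_BYTES = 4 * 1024 * 1024

def max_split_bytes(total_bytes: int, default_parallelism: int) -> int:
    """Target split size: shrinks as more cores become available."""
    bytes_per_core = total_bytes // default_parallelism
    return min(DEFAULT_MAX_SPLIT_BYTES, max(OPEN_COST_IN_BYTES, bytes_per_core))

# A 1 GiB Parquet file with 128 MiB row groups:
total_bytes = 1 << 30
row_group = 128 * 1024 * 1024

# With few cores, the split size stays at row-group scale.
few_cores = max_split_bytes(total_bytes, 8)

# With many dynamically allocated cores, the split size collapses to the
# 4 MiB floor, so many tiny partitions all map onto the same row group.
many_cores = max_split_bytes(total_bytes, 2000)
```

Pinning spark.default.parallelism (e.g. `--conf spark.default.parallelism=200`) caps the `default_parallelism` term above, which keeps `bytes_per_core`, and hence the split size, at or above the row-group scale.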
was:
The scenario is submitting multiple jobs concurrently with Spark dynamic
allocation enabled. The issue occurs when determining the number of RDD
partitions: when more CPU cores are available, Spark tries to split the RDD
into more pieces. But because the file is stored in Parquet format, the
Parquet row group is the basic unit for reading data, so splitting the RDD
into many pieces smaller than a row group makes no sense.
Jobs launch too many partitions and never complete.
Setting the default parallelism to a fixed number (for example, 200) works around this.
> Prevent from splitting too many partitions smaller than row group size in
> Parquet file format
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-27635
> URL: https://issues.apache.org/jira/browse/SPARK-27635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.2, 3.0.0
> Reporter: Lantao Jin
> Priority: Major
> Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
>
> The scenario is submitting multiple jobs concurrently with Spark dynamic
> allocation enabled. The issue occurs when determining the number of RDD
> partitions: when more CPU cores are available, Spark tries to split the RDD
> into more pieces. But because the file is stored in Parquet format, the
> Parquet row group is the basic unit for reading data, so splitting the RDD
> into many pieces smaller than a row group makes no sense.
> Jobs launch too many partitions and never complete.
> !Screen Shot 2019-05-05 at 5.45.15 PM.png!
> Setting the default parallelism to a fixed number (for example, 200) works around this.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]