[
https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lantao Jin updated SPARK-27635:
-------------------------------
Attachment: Screen Shot 2019-05-05 at 5.45.15 PM.png
> Prevent splitting into too many partitions smaller than the Parquet row group size
> ----------------------------------------------------------------------------------
>
> Key: SPARK-27635
> URL: https://issues.apache.org/jira/browse/SPARK-27635
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.2, 3.0.0
> Reporter: Lantao Jin
> Priority: Major
> Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
>
> The scenario: multiple jobs are submitted concurrently with Spark dynamic
> allocation enabled. The issue arises when determining the number of RDD
> partitions. When more CPU cores are available, Spark tries to split the RDD
> into more pieces. But since the files are stored in Parquet format, the
> Parquet row group is the basic unit for reading data, so splitting the RDD
> into pieces smaller than a row group doesn't make sense.
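>
> As a rough illustration of where the split size comes from, here is a minimal
> sketch modeled on the heuristic in FileSourceScanExec.createNonBucketedReadRDD.
> The default values mirror spark.sql.files.maxPartitionBytes and
> spark.sql.files.openCostInBytes; the byte counts and parallelism values in
> main are made-up examples:
> {code:scala}
> object SplitSizeSketch {
>   // Sketch of the split-size heuristic: splits shrink as defaultParallelism
>   // grows, which is exactly what dynamic allocation inflates.
>   def maxSplitBytes(totalBytes: Long,
>                     defaultParallelism: Int,
>                     defaultMaxSplitBytes: Long = 128L * 1024 * 1024, // spark.sql.files.maxPartitionBytes
>                     openCostInBytes: Long = 4L * 1024 * 1024         // spark.sql.files.openCostInBytes
>                    ): Long = {
>     val bytesPerCore = totalBytes / defaultParallelism
>     math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
>   }
>
>   def main(args: Array[String]): Unit = {
>     val totalBytes = 10L * 1024 * 1024 * 1024 // 10 GB of Parquet input
>     // Few cores: splits stay at the 128 MB cap, matching a typical row group.
>     println(maxSplitBytes(totalBytes, defaultParallelism = 80))   // 134217728
>     // Inflated parallelism: splits collapse to the 4 MB open-cost floor,
>     // far below the row group size, so many tiny partitions are scheduled.
>     println(maxSplitBytes(totalBytes, defaultParallelism = 5000)) // 4194304
>   }
> }
> {code}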
> The result: jobs launch far too many tiny partitions and never complete.
> Setting the default parallelism to a fixed number (for example, 200) works
> around this.
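>
> A minimal workaround sketch along those lines (the app name and the input
> path are placeholders, and 200 is only an example value):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // Pin the parallelism the split-size heuristic divides by, instead of
> // letting dynamic allocation inflate it with the executor count.
> val spark = SparkSession.builder()
>   .appName("fixed-parallelism")
>   .config("spark.default.parallelism", "200")
>   .getOrCreate()
>
> spark.read.parquet("/path/to/parquet/table").count()
> {code}
> Raising spark.sql.files.openCostInBytes toward the row group size would have
> a similar effect, since it floors the computed split size.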
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)