[ https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lantao Jin updated SPARK-27635:
-------------------------------
    Attachment: Screen Shot 2019-05-05 at 5.45.15 PM.png

> Prevent from splitting too many partitions smaller than row group size in 
> Parquet file format
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27635
>                 URL: https://issues.apache.org/jira/browse/SPARK-27635
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.2, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>         Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
>
> The scenario is submitting multiple jobs concurrently with Spark dynamic 
> allocation enabled. The issue happens when determining the number of RDD 
> partitions. When more CPU cores are available, Spark tries to split the RDD 
> into more pieces. But since the file is stored in Parquet format, the 
> Parquet row group is the basic unit block for reading data, so splitting 
> the RDD into many pieces smaller than a row group doesn't make sense.
> Jobs launch too many partitions and never complete.
> Setting the default parallelism to a fixed number (for example, 200) fixes this.
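As a sketch of the workaround described above (the property names are standard Spark configuration keys; the values are illustrative, not prescriptive):

```shell
# Cap parallelism so Spark does not split Parquet input into partitions
# smaller than a row group. 134217728 bytes (128 MB) matches Spark's
# default maxPartitionBytes and a common Parquet row group size.
spark-submit \
  --conf spark.default.parallelism=200 \
  --conf spark.sql.files.maxPartitionBytes=134217728 \
  your_job.py
```

The same properties can also be set programmatically on the SparkSession builder or in spark-defaults.conf.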



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
