[ https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-27635.
---------------------------------

> Prevent from splitting too many partitions smaller than row group size in 
> Parquet file format
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27635
>                 URL: https://issues.apache.org/jira/browse/SPARK-27635
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.2, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>         Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
>
> The scenario: multiple jobs are submitted concurrently with Spark dynamic 
> allocation enabled. The issue occurs when Spark determines the number of 
> RDD partitions. When more CPU cores become available, Spark tries to split 
> the RDD into more pieces. But because the files are stored in Parquet 
> format, the Parquet row group is the basic unit for reading data, so 
> splitting the RDD into pieces smaller than a row group gains nothing.
> Jobs launch far too many partitions and never complete.
>  !Screen Shot 2019-05-05 at 5.45.15 PM.png! 
> Setting the default parallelism to a fixed number (for example, 200) works 
> around the issue, as sketched below.
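> A minimal sketch of the workaround, assuming a standard SparkSession setup. 
> Both property names are existing Spark configs; the application name and 
> input path are placeholders:
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("parquet-read-example")  // placeholder name
>   // Cap the parallelism Spark feeds into its file-split sizing; otherwise
>   // the bytes-per-core estimate shrinks as dynamic allocation adds cores.
>   .config("spark.default.parallelism", "200")
>   // Keep each file split at least one Parquet row group in size
>   // (128 MB matches the Parquet default row group size).
>   .config("spark.sql.files.maxPartitionBytes", (128 * 1024 * 1024).toString)
>   .getOrCreate()
>
> val df = spark.read.parquet("/path/to/data")  // placeholder path
> {code}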


