[ https://issues.apache.org/jira/browse/SPARK-27635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun closed SPARK-27635.
---------------------------------

> Prevent from splitting too many partitions smaller than row group size in Parquet file format
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27635
>                 URL: https://issues.apache.org/jira/browse/SPARK-27635
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.2, 3.0.0
>            Reporter: Lantao Jin
>            Priority: Major
>         Attachments: Screen Shot 2019-05-05 at 5.45.15 PM.png
>
> The scenario: multiple jobs are submitted concurrently with Spark dynamic allocation enabled. The issue arises when determining the number of RDD partitions. When more CPU cores become available, Spark tries to split the RDD into more pieces. But since the files are stored in the Parquet format, Parquet's row group is the basic unit for reading data, so splitting the RDD into pieces smaller than a row group makes no sense: jobs launch too many partitions and never complete.
> !Screen Shot 2019-05-05 at 5.45.15 PM.png!
> Setting the default parallelism to a fixed number (for example, 200) works around the issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
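The behavior described in the report can be illustrated with a minimal plain-Python sketch, modeled on the split-size formula in Spark's file source (`FilePartition.maxSplitBytes`); the constants are the Spark defaults for `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, and the data sizes below are made-up numbers for illustration, not from the issue:

```python
# Sketch of how Spark sizes file-based splits. With dynamic allocation,
# defaultParallelism grows with the number of executor cores, which
# shrinks bytes_per_core and therefore the split size -- well below the
# 128 MB Parquet row-group default, as this issue reports.

DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # spark.sql.files.maxPartitionBytes
OPEN_COST_IN_BYTES = 4 * 1024 * 1024             # spark.sql.files.openCostInBytes

def max_split_bytes(total_bytes: int, default_parallelism: int) -> int:
    """Target split size for file-based partitions (Spark's formula)."""
    bytes_per_core = total_bytes // default_parallelism
    return min(DEFAULT_MAX_PARTITION_BYTES, max(OPEN_COST_IN_BYTES, bytes_per_core))

total = 10 * 1024**3  # a hypothetical 10 GiB Parquet table

# Dynamic allocation scaled out to 4000 cores: splits collapse to the
# 4 MB open cost, far below the row-group size, so each row group is
# read by many tiny partitions.
assert max_split_bytes(total, default_parallelism=4000) == OPEN_COST_IN_BYTES

# Pinning the default parallelism to 200 (the workaround above) grows
# the splits back to total/200, roughly 51 MiB here, much closer to the
# row-group size.
assert max_split_bytes(total, default_parallelism=200) == total // 200
```

This is only a model of the sizing arithmetic, not of Spark's scheduler; the point is that the split size is derived from the live core count rather than from the Parquet row-group boundaries, which is why a fixed `spark.default.parallelism` masks the problem.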