Lantao Jin created SPARK-27635:
----------------------------------

             Summary: Prevent from splitting too many partitions smaller than 
row group size in Parquet file format
                 Key: SPARK-27635
                 URL: https://issues.apache.org/jira/browse/SPARK-27635
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.2, 3.0.0
            Reporter: Lantao Jin


The scenario: multiple jobs are submitted concurrently with Spark dynamic 
allocation enabled. The issue arises when Spark determines the number of RDD 
partitions. When more CPU cores become available, Spark tries to split the 
RDD into more pieces. But because the files are stored in Parquet format, the 
Parquet row group is the actual basic unit for reading data, so splitting the 
RDD into many pieces smaller than a row group doesn't make sense.
Jobs launch too many partitions and never complete.
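To see why dynamic allocation shrinks the splits, here is a small Python sketch of roughly how Spark sizes file splits (it mirrors the maxSplitBytes logic in Spark's file-source scan code; the function name and the concrete byte counts below are illustrative, not Spark's API):

```python
# Sketch of Spark's file-split sizing (illustrative, not Spark's API).
# Defaults correspond to spark.sql.files.maxPartitionBytes (128 MiB)
# and spark.sql.files.openCostInBytes (4 MiB).
def max_split_bytes(total_bytes, num_files, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024):
    # Each file is padded by its open cost, then the total is spread
    # over the available parallelism (cores).
    padded = total_bytes + num_files * open_cost_in_bytes
    bytes_per_core = padded // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Hypothetical 100 GiB of Parquet data, one file:
# with 2000 cores from dynamic allocation, each split lands well below
# the 128 MiB Parquet row group; pinning parallelism to 200 keeps the
# split at the 128 MiB ceiling.
many_cores = max_split_bytes(100 * (1 << 30), 1, 2000)
few_cores = max_split_bytes(100 * (1 << 30), 1, 200)
print(many_cores, few_cores)
```

The more cores dynamic allocation grants, the smaller `bytes_per_core` gets, so splits shrink toward the open-cost floor even though a Parquet reader still has to pull in whole row groups.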

Setting the default parallelism to a fixed number (for example, 200) fixes this.
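As a workaround along the lines suggested above, the parallelism can be pinned at submit time (the 200 here is the example value from this report, not a recommended setting for every cluster):

```shell
# Pin parallelism so dynamic allocation's extra cores do not shrink
# file splits below the Parquet row group size.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.default.parallelism=200 \
  your_job.jar
```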




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
