LantaoJin opened a new pull request #24527: [SPARK-27635][SQL] Prevent from 
splitting too many partitions smaller than row group size in Parquet file format
URL: https://github.com/apache/spark/pull/24527
 
 
   ## What changes were proposed in this pull request?
   
   The scenario is submitting multiple jobs concurrently with Spark dynamic allocation enabled. The issue happens when determining the number of RDD partitions. When more CPU cores are available, Spark tries to split the scan into more pieces. But since the files are stored in Parquet format, the Parquet row group is the basic unit for reading data, so splitting the RDD into pieces much smaller than a row group doesn't make sense (see the split-size sketch after the screenshot below).
   Jobs launch too many partitions and never complete.
   ![Screen Shot 2019-05-05 at 5 45 15 
PM](https://user-images.githubusercontent.com/1853780/57192037-b709ea00-6f5e-11e9-867e-cacfa3aab86a.png)
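
   For context, here is a minimal sketch of the split-size heuristic used by Spark's file-based scans (a paraphrase of the logic around `FilePartition.maxSplitBytes` / `FileSourceScanExec`; the method wrapper and parameter names here are illustrative, the configuration keys are the standard ones):

   ```scala
   // Illustrative sketch of Spark's split-size heuristic (not the exact source).
   // Under dynamic allocation, defaultParallelism grows with the number of cores,
   // so bytesPerCore shrinks and the resulting split size can drop far below a
   // Parquet row group (typically ~128 MB), producing many tiny partitions.
   def maxSplitBytes(
       defaultMaxSplitBytes: Long, // spark.sql.files.maxPartitionBytes (default 128 MB)
       openCostInBytes: Long,      // spark.sql.files.openCostInBytes (default 4 MB)
       defaultParallelism: Int,    // grows with available cores under dynamic allocation
       fileSizes: Seq[Long]): Long = {
     val totalBytes = fileSizes.map(_ + openCostInBytes).sum
     val bytesPerCore = totalBytes / defaultParallelism
     Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
   }
   ```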
   
   Forcing the default parallelism to a fixed number (for example, 200) can work around the issue; a configuration sketch follows.
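
   A hedged sketch of that workaround, assuming it is applied when the session is built (the application name is made up; the configuration keys are standard Spark settings):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Workaround sketch: pin the default parallelism so the split-size heuristic
   // does not shrink partitions as dynamic allocation adds executors.
   val spark = SparkSession.builder()
     .appName("parquet-scan-workaround") // hypothetical app name
     .config("spark.default.parallelism", "200")
     // Optionally keep each split at roughly the Parquet row group size (128 MB):
     .config("spark.sql.files.maxPartitionBytes", "134217728")
     .getOrCreate()
   ```

   The same settings can also be passed at submit time, e.g. `--conf spark.default.parallelism=200`.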
   
   
   ## How was this patch tested?
   
   Existing UTs
   
