wangyum opened a new pull request, #41545:
URL: https://github.com/apache/spark/pull/41545

   ### What changes were proposed in this pull request?
   
   This PR adds a new SQL config: `spark.sql.files.maxDesiredPartitionNum`. Users can set it to cap the number of partitions generated by file scans. Too many partitions increase various driver-side overheads and can cause the shuffle service to OOM.
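
   For example, the cap can be set per session (a minimal sketch of setting the config; both forms below use the standard SparkSession API):

   ```scala
   // Cap the number of partitions generated by file scans in this session.
   spark.conf.set("spark.sql.files.maxDesiredPartitionNum", "20000")

   // Equivalent SQL form:
   spark.sql("SET spark.sql.files.maxDesiredPartitionNum=20000")
   ```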
   
   The following is a GC log excerpt from the shuffle service:
   ```
   2023-06-08T01:41:01.871-0700: 7303.965: [Full GC (Allocation Failure) 2023-06-08T01:41:01.871-0700: 7303.965: [CMS: 4194304K->4194304K(4194304K), 7.4010107 secs]2023-06-08T01:41:09.272-0700: 7311.366: [Class Histogram (after full gc):
    num     #instances         #bytes  class name
   ----------------------------------------------
      1:       7110660     2334927400  [C
      2:      19465810      467514416  [I
      3:       6754570      270182800  org.apache.spark.network.protocol.EnhancedChunkFetchRequest
      4:       6661155      266446200  org.sparkproject.io.netty.channel.DefaultChannelPromise
      5:       6639056      265562240  org.apache.spark.network.buffer.FileSegmentManagedBuffer
      6:       6639055      265562200  org.apache.spark.network.protocol.RequestTraceInfo
      7:       6663764      213240448  org.sparkproject.io.netty.util.Recycler$DefaultHandle
      8:       6659382      213100224  org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$WriteTask
      9:       6659218      213094976  org.apache.spark.network.server.ChunkFetchRequestHandler$$Lambda$156/886274988
     10:       6640444      212494208  java.io.File
   ...
   ```
   
   ### Why are the changes needed?
   
   1. To avoid generating too many partitions when scanning a very large partitioned and bucketed table, since bucket scan is not always used after [SPARK-32859](https://issues.apache.org/jira/browse/SPARK-32859).
   2. To avoid generating too many partitions when there are lots of small files (a sketch of the capping idea follows this list).
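
   A minimal sketch of the idea, assuming the cap works by enlarging the target split size whenever the estimated partition count exceeds it (the helper below is illustrative, not the actual Spark internals):

   ```scala
   // Hypothetical helper: given the total bytes to scan and the default target
   // split size (e.g. derived from spark.sql.files.maxPartitionBytes), enlarge
   // the split size so that roughly `maxDesiredPartitionNum` partitions result.
   def cappedSplitBytes(
       totalBytes: Long,
       defaultSplitBytes: Long,
       maxDesiredPartitionNum: Option[Long]): Long = {
     val estimatedPartitions = math.ceil(totalBytes.toDouble / defaultSplitBytes).toLong
     maxDesiredPartitionNum match {
       case Some(cap) if estimatedPartitions > cap =>
         // Fewer, larger partitions: grow the split size to hit roughly `cap`.
         math.ceil(totalBytes.toDouble / cap).toLong
       case _ =>
         defaultSplitBytes
     }
   }
   ```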
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit tests and manual testing:
   
   Before this PR | After this PR and `set spark.sql.files.maxDesiredPartitionNum=20000`
   -- | --
   <img src="https://github.com/apache/spark/assets/5399861/4f0b3b54-44c0-4dd7-80f0-42792bc6e22f" width="300"> | <img src="https://github.com/apache/spark/assets/5399861/c916b650-f186-46fc-97ba-27118b4da7e1" width="300">
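
   One way to spot-check the cap manually (a sketch, assuming the config behaves as described; the table path below is hypothetical):

   ```scala
   spark.conf.set("spark.sql.files.maxDesiredPartitionNum", "20000")

   // Hypothetical large partitioned table; substitute a real path.
   val df = spark.read.parquet("/path/to/large/table")

   // The scan should now produce at most roughly 20000 partitions.
   assert(df.rdd.getNumPartitions <= 20000)
   ```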
   

