wangyum opened a new pull request, #41545:
URL: https://github.com/apache/spark/pull/41545
### What changes were proposed in this pull request?
This PR adds a new SQL config: `spark.sql.files.maxDesiredPartitionNum`. Users
can set it to avoid generating too many partitions. Too many partitions
increase driver overhead and can cause the Shuffle service to run out of memory (OOM).
The following is the GC log of the Shuffle service:
```
2023-06-08T01:41:01.871-0700: 7303.965: [Full GC (Allocation Failure) 2023-06-08T01:41:01.871-0700: 7303.965: [CMS: 4194304K->4194304K(4194304K), 7.4010107 secs]2023-06-08T01:41:09.272-0700: 7311.366: [Class Histogram (after full gc):
 num     #instances         #bytes  class name
----------------------------------------------
   1:       7110660     2334927400  [C
   2:      19465810      467514416  [I
   3:       6754570      270182800  org.apache.spark.network.protocol.EnhancedChunkFetchRequest
   4:       6661155      266446200  org.sparkproject.io.netty.channel.DefaultChannelPromise
   5:       6639056      265562240  org.apache.spark.network.buffer.FileSegmentManagedBuffer
   6:       6639055      265562200  org.apache.spark.network.protocol.RequestTraceInfo
   7:       6663764      213240448  org.sparkproject.io.netty.util.Recycler$DefaultHandle
   8:       6659382      213100224  org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$WriteTask
   9:       6659218      213094976  org.apache.spark.network.server.ChunkFetchRequestHandler$$Lambda$156/886274988
  10:       6640444      212494208  java.io.File
...
```
### Why are the changes needed?
1. To avoid generating too many partitions when scanning a very large
partitioned and bucketed table, because a bucketed scan is not always used since
[SPARK-32859](https://issues.apache.org/jira/browse/SPARK-32859).
2. To avoid generating too many partitions if there are lots of small files.
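
To illustrate the small-files case, here is a minimal sketch of how a partition-count cap could work in principle. It is a simplified model, not the code in this PR: `pack_files` and `plan_partitions` are hypothetical names, the greedy packing is an assumption, and `max_split_bytes` and `open_cost_bytes` only play the roles of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`. When the first packing yields more partitions than the cap, the sketch derives a larger split size from the total input size and packs again.

```python
import math


def pack_files(file_sizes, max_split_bytes, open_cost_bytes):
    """Greedily pack files into partitions (hypothetical helper).

    Each file added to a partition is charged its size plus a fixed
    open cost; a new partition starts when the next file would push
    the running total past max_split_bytes.
    """
    partitions = []
    current, current_size = [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


def plan_partitions(file_sizes, max_split_bytes, open_cost_bytes,
                    max_desired_partition_num=None):
    """Pack files, then repack with a larger split size if the cap is exceeded.

    The cap is best effort in this sketch: the second packing targets the
    desired count but does not strictly guarantee it for every input.
    """
    partitions = pack_files(file_sizes, max_split_bytes, open_cost_bytes)
    if (max_desired_partition_num is not None
            and len(partitions) > max_desired_partition_num):
        # Total cost of the scan, including the per-file open cost.
        total_bytes = sum(size + open_cost_bytes for size in file_sizes)
        # Split size that would spread the total over the desired count.
        desired_split_bytes = math.ceil(total_bytes / max_desired_partition_num)
        partitions = pack_files(file_sizes, desired_split_bytes, open_cost_bytes)
    return partitions


if __name__ == "__main__":
    # Many tiny files: the per-file open cost dominates the packing.
    sizes = [1024] * 100_000
    split, open_cost = 128 * 1024 * 1024, 4 * 1024 * 1024
    print(len(plan_partitions(sizes, split, open_cost)))
    print(len(plan_partitions(sizes, split, open_cost,
                              max_desired_partition_num=1000)))
```

In this model the cap trades partition count for partition size: fewer, larger partitions mean fewer tasks and fewer shuffle blocks to track, at the cost of more work per task.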
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and manual testing:
Before this PR | After this PR and `set spark.sql.files.maxDesiredPartitionNum=20000`
-- | --
<img src="https://github.com/apache/spark/assets/5399861/4f0b3b54-44c0-4dd7-80f0-42792bc6e22f" width="300"> | <img src="https://github.com/apache/spark/assets/5399861/c916b650-f186-46fc-97ba-27118b4da7e1" width="300">