dotsering commented on PR #53599: URL: https://github.com/apache/spark/pull/53599#issuecomment-3691769429
> Could we consider setting `spark.sql.files.maxPartitionNum` to address the small file issue?

Thanks @wangyum for your feedback. Here are a few limitations I can think of with a hardcoded `maxPartitionNum` value:

1. **Incompatibility with heterogeneous data sources.** The configuration lacks flexibility because it applies globally to every read in the job. A value that is ideal for one source is often catastrophic for another within the same job. For example, `maxPartitionNum = 1` is perfect for my example of 10,000 tiny 10KB files (it eliminates per-partition overhead), but if that same job also reads a large 1TB table, that table is also forced into a single partition, causing an immediate failure. You cannot optimize the small source without breaking the large one, so you end up fine-tuning the value for every Spark job. That is the main reason I thought of automating the hardcoded value. The sketch below illustrates the global scope of the setting.

2. **Destructive merging of already-optimized data (collateral damage).** The setting acts as a blunt instrument that overrides the safe, size-based partition sizing logic. If a job reads a second data source that is already well partitioned (e.g., 1,000 partitions of 150MB each), setting `maxPartitionNum = 10` forces Spark to ignore those healthy boundaries: it aggressively coalesces the 1,000 partitions down to 10 massive 15GB partitions, destroying parallelism and all but guaranteeing executor OOM errors on data that was originally fine. Here is a link to the [location](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L98C27-L98C37) where `maxPartitionBytes` could have produced those 1,000 150MB partitions, but a too-low `maxPartitionNum` overrides that partition count and reduces the 1,000 partitions to 10. See the arithmetic sketch at the end of this comment.
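To make the global-scope problem concrete, here is a minimal sketch (the app name and paths are hypothetical) showing that a single `spark.sql.files.maxPartitionNum` setting applies to every file scan in the session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maxPartitionNum-scope-demo") // hypothetical app name
  .getOrCreate()

// The cap is session-wide: it applies to every file scan, not per source.
spark.conf.set("spark.sql.files.maxPartitionNum", "1")

// Ideal here: 10,000 x 10KB files collapse into a single partition,
// eliminating per-task scheduling overhead.
val tinyFiles = spark.read.parquet("/data/tiny_files") // hypothetical path

// Catastrophic here: the same cap squeezes a 1TB table into one partition.
val hugeTable = spark.read.parquet("/data/huge_table") // hypothetical path
```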

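And for the second point, a back-of-the-envelope sketch of the override arithmetic, using the numbers from the example above (`desiredSplitBytes` is a simplified stand-in for what the linked `FilePartition` logic computes when the cap is exceeded):

```scala
// Collateral-damage arithmetic from the example above.
val partitionCount = 1000L
val partitionBytes = 150L * 1024 * 1024              // 150MB each, a healthy size
val totalBytes     = partitionCount * partitionBytes // ~150GB in total

// A too-low cap overrides the size-based boundaries:
val maxPartitionNum   = 10L
val desiredSplitBytes = totalBytes / maxPartitionNum // ~15GB per partition

// 15GB partitions dwarf typical executor memory, so OOM is all but
// guaranteed, even though the data was perfectly partitioned to begin with.
```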