dotsering commented on PR #53599: URL: https://github.com/apache/spark/pull/53599#issuecomment-3691769429
> Could we consider setting `spark.sql.files.maxPartitionNum` to address the small file issue?

Thanks @wangyum for your feedback. Here are a few limitations I can think of with a hardcoded `maxPartitionNum` value:

1. **Incompatibility with heterogeneous data sources.** The configuration lacks flexibility because it applies globally to every read in the job. A value that is ideal for one source is often catastrophic for another within the same job. For example, `maxPartitionNum = 1` is perfect for my example of 10,000 tiny 10KB files (it eliminates per-partition overhead), but if that same job also reads a large 1TB table, that table is also forced into a single partition, causing an immediate failure. You cannot optimize the small source without breaking the large one, so you end up fine-tuning the value for every Spark job. That is the main reason I thought of automating the hardcoded value. The sketch below illustrates the global scope of the setting.

2. **Destructive merging of already-optimized data (collateral damage).** The setting acts as a blunt instrument that overrides the safe, size-based partition sizing logic. If a job reads a second data source that is already well partitioned (e.g., 1,000 partitions of 150MB each), setting `maxPartitionNum = 10` forces Spark to ignore those healthy boundaries: it aggressively coalesces the 1,000 partitions down to 10 massive 15GB partitions, destroying parallelism and all but guaranteeing executor OOM errors on data that was originally fine. Here is a link to the [location](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L98C27-L98C37) where `maxPartitionBytes` could have produced those 1,000 150MB partitions, but a too-low `maxPartitionNum` overrides that partition count and reduces the 1,000 partitions to 10. See the arithmetic sketch at the end of this comment.
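To make the global-scope problem concrete, here is a minimal sketch (the app name and paths are hypothetical) showing that a single `spark.sql.files.maxPartitionNum` setting applies to every file scan in the session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("maxPartitionNum-scope-demo") // hypothetical app name
  .getOrCreate()

// The cap is session-wide: it applies to every file scan, not per source.
spark.conf.set("spark.sql.files.maxPartitionNum", "1")

// Ideal here: 10,000 x 10KB files collapse into a single partition,
// eliminating per-task scheduling overhead.
val tinyFiles = spark.read.parquet("/data/tiny_files") // hypothetical path

// Catastrophic here: the same cap squeezes a 1TB table into one partition.
val hugeTable = spark.read.parquet("/data/huge_table") // hypothetical path
```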

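And for the second point, a back-of-the-envelope sketch of the override arithmetic, using the numbers from the example above (`desiredSplitBytes` is a simplified stand-in for what the linked `FilePartition` logic computes when the cap is exceeded):

```scala
// Collateral-damage arithmetic from the example above.
val partitionCount = 1000L
val partitionBytes = 150L * 1024 * 1024              // 150MB each, a healthy size
val totalBytes     = partitionCount * partitionBytes // ~150GB in total

// A too-low cap overrides the size-based boundaries:
val maxPartitionNum   = 10L
val desiredSplitBytes = totalBytes / maxPartitionNum // ~15GB per partition

// 15GB partitions dwarf typical executor memory, so OOM is all but
// guaranteed, even though the data was perfectly partitioned to begin with.
```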