cloud-fan commented on pull request #28778: URL: https://github.com/apache/spark/pull/28778#issuecomment-645167188
After more thought, I think the file partition split logic itself is problematic. Its goal is to make the number of partitions match the total number of cores in the cluster, which doesn't make sense, as the cluster may have only a few free cores. A better approach is to target an expected size for each partition, e.g. 64 MB. This is also what we do when coalescing shuffle partitions in AQE. Can we add such a config? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
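The size-based approach suggested in the comment can be sketched roughly as follows. This is an illustrative sketch, not Spark's actual implementation: the `PartitionBySize.pack` helper and its signature are hypothetical. It greedily packs file splits into read partitions until each partition approaches a target byte size, which is the same idea AQE uses when coalescing small shuffle partitions toward an advisory size.

```scala
// Illustrative sketch (hypothetical helper, not Spark's real code):
// greedily pack file splits into read partitions of roughly `targetBytes`
// each, instead of targeting partitionCount == totalCores.
object PartitionBySize {
  // fileSizes: sizes in bytes of the file splits to read, in scan order.
  // targetBytes: desired partition size, e.g. 64 * 1024 * 1024.
  // Returns the file sizes grouped into partitions.
  def pack(fileSizes: Seq[Long], targetBytes: Long): Seq[Seq[Long]] = {
    val partitions = scala.collection.mutable.ArrayBuffer.empty[Seq[Long]]
    val current = scala.collection.mutable.ArrayBuffer.empty[Long]
    var currentBytes = 0L
    for (size <- fileSizes) {
      // Start a new partition once adding this split would overshoot
      // the target (but never leave a partition empty).
      if (currentBytes > 0 && currentBytes + size > targetBytes) {
        partitions += current.toSeq
        current.clear()
        currentBytes = 0L
      }
      current += size
      currentBytes += size
    }
    if (current.nonEmpty) partitions += current.toSeq
    partitions.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Five splits packed against a 64-byte target (toy numbers).
    val parts = pack(Seq(40L, 30L, 20L, 10L, 64L), targetBytes = 64L)
    // Each resulting partition's total stays at or under the target.
    println(parts.map(_.sum))
  }
}
```

For comparison, Spark's existing `spark.sql.files.maxPartitionBytes` (file scans) and AQE's `spark.sql.adaptive.advisoryPartitionSizeInBytes` (shuffle coalescing) serve a similar size-targeting purpose; the config proposed here would bring the same principle to this split logic.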
