jackylee-ch commented on PR #37601: URL: https://github.com/apache/spark/pull/37601#issuecomment-1222301529
> Hi @jackylee-ch AFAIK, we first split the files from `listFiles` into `partitionedFile`s by `maxSplitBytes`, and then merge these `partitionedFile`s into a `Partition`. I'm not sure how this relates to small files. In theory, even with a large number of small files, merging at `maxSplitBytes` should not affect concurrency (probably)?

A small example: we have 7000 files in a table with a total size of 1TB, and we have started an application with 4500 cores, so the proper config is maxPartitionBytes=240MB and openCostInBytes=4MB. It is hard for users to calculate the proper maxPartitionBytes and openCostInBytes. With this PR, users can easily get the best performance without calculating them. And for long-lived applications, especially those using FAIR scheduling mode, it also becomes easy to control concurrency across different kinds of queries.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
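The arithmetic behind the 240MB figure in the example can be sketched as follows. This is a rough illustration, not Spark's code: the function `max_split_bytes` is a hypothetical Python rendering of how the split size is understood to be derived (total bytes plus a per-file open cost, divided by the minimum partition count, then clamped), and it assumes the 7000 files are equally sized and that the minimum partition count equals the 4500 cores.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB


def max_split_bytes(file_sizes, max_partition_bytes, open_cost_in_bytes, min_partition_num):
    """Hypothetical sketch of the split-size calculation discussed in the comment.

    Each file is charged its length plus a fixed open cost; the total is
    spread over min_partition_num and clamped between the open cost and
    the configured maximum partition size.
    """
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# Assumption: 7000 equally sized files that add up to (almost exactly) 1 TB.
file_sizes = [TB // 7000] * 7000
cores = 4500  # assumed to be the minimum partition count

# With a 128 MB cap, the cap wins and the split size stays at 128 MB.
default_split = max_split_bytes(file_sizes, 128 * MB, 4 * MB, cores)

# With a 240 MB cap, the per-core share of the data determines the split size.
tuned_split = max_split_bytes(file_sizes, 240 * MB, 4 * MB, cores)

print(f"split size with 128 MB cap: {default_split / MB:.1f} MB")
print(f"split size with 240 MB cap: {tuned_split / MB:.1f} MB")
```

Under these assumptions the per-core share comes out at roughly 239 MB, which is why a cap of about 240MB lets each of the 4500 cores take one split, while a 128 MB cap produces smaller splits than the cores could handle in a single pass.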
