jackylee-ch commented on PR #37601:
URL: https://github.com/apache/spark/pull/37601#issuecomment-1222301529

   > Hi @jackylee-ch AFAIK, we first split the files from `listFiles` into 
`partitionedFile`s by `maxSplitBytes`, and then merge these 
`partitionedFile`s into a `Partition`. I'm not sure how this relates to small 
files; in theory, even with a large number of small files, merging at 
`maxSplitBytes` should not affect concurrency (probably)?
   
   A small example: we have 7000 files in a table, with a total size of 1TB, 
and we have started an application with 4500 cores, so the proper config is 
maxPartitionBytes=240MB and openCostInBytes=4MB. It is hard for users to 
calculate the proper maxPartitionBytes and openCostInBytes.
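
   To show where the 240MB figure comes from, here is a rough sketch of the 
split-size arithmetic. It follows my reading of `FilePartition.maxSplitBytes` 
and may differ between Spark versions. The function name `max_split_bytes`, 
the equal file sizes, and the use of the core count as the minimum partition 
number are assumptions made only for this illustration.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes, open_cost_in_bytes,
                    min_partition_num):
    """Approximate the split size used when packing files into partitions.

    Hypothetical helper: each file is padded with the open cost, the total
    is spread over the minimum partition number, and the result is clamped
    between open_cost_in_bytes and max_partition_bytes.
    """
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# The example from this comment: 7000 files, 1TB in total, 4500 cores.
# Equal file sizes are an assumption; the comment gives only the total.
num_files = 7000
total_size = 1024 * 1024 * MB
file_sizes = [total_size // num_files] * num_files
cores = 4500

# With the default maxPartitionBytes (128MB) the split size is capped there.
default_split = max_split_bytes(file_sizes, 128 * MB, 4 * MB, cores)

# With maxPartitionBytes raised to 240MB the per-core share becomes the limit.
tuned_split = max_split_bytes(file_sizes, 240 * MB, 4 * MB, cores)

print(default_split / MB, tuned_split / MB)
```

   Under these assumptions the per-core share is a little under 240MB, so the 
default 128MB cap produces more, smaller splits than there are cores to fill 
in one wave, while a 240MB cap gives roughly one split per core.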
   
   With this PR, users can easily get the best performance without calculating 
them. And for long-lived applications, especially those using FAIR scheduling 
mode, it will also be easy to control concurrency across different kinds of 
queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

