mridulm commented on PR #3279: URL: https://github.com/apache/celeborn/pull/3279#issuecomment-2908117292
A couple of things to watch out for, based on a quick read of the doc:

a) A large number of small, random reads performs very badly at scale (this is why data is kept reduce-oriented, or sorted).
b) As the number of mappers and reducers increases (and with it the per-reducer data in each mapper's output, and vice versa), we have to ensure the overhead of maintaining the mapping stays reasonable.

I would suggest submitting this via a CIP; it would be good to get broader community feedback.

Thanks for working on this!
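
To put rough numbers on both points, here is a back-of-the-envelope sketch. It is only an illustration under stated assumptions: the function name `shuffle_fanout` and the 16-byte index entry (an 8-byte offset plus an 8-byte length per mapper/reducer pair) are hypothetical and do not reflect Celeborn's actual on-disk or in-memory format.

```python
def shuffle_fanout(num_mappers, num_reducers, total_bytes, index_entry_bytes=16):
    """Estimate segment count, average segment size, and mapping overhead.

    Assumes every mapper writes one segment per reducer and that the
    mapping keeps one fixed-size entry per segment (hypothetical layout).
    """
    # One segment per (mapper, reducer) pair: this is also the number of
    # separate reads needed if the data is not kept reduce-oriented.
    segments = num_mappers * num_reducers
    # The average read size shrinks as the fan-out grows.
    avg_segment_bytes = total_bytes / segments
    # Size of the mapping itself, at one entry per segment.
    index_bytes = segments * index_entry_bytes
    return segments, avg_segment_bytes, index_bytes


if __name__ == "__main__":
    TIB = 2 ** 40
    for mappers, reducers in [(1_000, 1_000), (10_000, 10_000)]:
        segs, avg, idx = shuffle_fanout(mappers, reducers, 10 * TIB)
        print(f"{mappers} x {reducers}: {segs} segments, "
              f"avg {avg / 1024:.1f} KiB per read, index {idx / 2**20:.1f} MiB")
```

The segment count grows with the product of mappers and reducers, so at a fixed data volume a 10x increase on both sides means 100x more reads, each 100x smaller, and a 100x larger mapping.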
