CodingCat commented on PR #3569: URL: https://github.com/apache/celeborn/pull/3569#issuecomment-3699042600
> @CodingCat Thanks for the details, I understand the motivation and rationale behind the proposal - and there are existing alternatives like checkpoint'ing, temp materialization, etc. This will be more on the application design/impl side though - and apps will have to do their own tradeoffs (given there are costs involved).
>
> As I mentioned earlier, this proposal itself is inherently unsound, and given additional details provided, I am not in favor of introducing it into Apache Celeborn - my rationale/analysis would be the same if a variant of the proposal was made to Apache Spark for "vanilla" shuffle as well :-)
>
> I am open to being corrected of course if there are other valid usecases and/or requirements I am missing!

@mridulm Hi, I agree we can reduce shuffle cost with something like Spark checkpoint, or by dumping intermediate data to S3 manually. However, these approaches sacrifice performance significantly. For example, in my experience, using S3 to store the intermediate results of a k-way join can slow queries down by almost 10x.

The proposal here is essentially another alternative offered to the user: if your job has a "clean" lineage, you can reduce your shuffle cost by enabling this feature, at the price of a higher recovery cost on failure.

As I said, this is not a feature for broad rollout, only for certain types of jobs. (Actually, in my experience, most jobs will survive with this feature, since in reality not many jobs dump the same RDD to multiple locations, and RDD-reuse jobs can always start quickly to fill in the lineage information needed by this feature. On the other hand, only big shuffle jobs get significant value from this feature, since small shuffle jobs do not play a big role in cluster capacity.)

I’d like to better understand what specific invariant you believe this proposal violates when you call it “inherently unsound”.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
