wangshengjie123 commented on PR #2373: URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2564664619
> Thanks for continuing to work on this @wangshengjie123, really appreciate it! I will try to go over the PR during the holidays.
>
> One general comment though ... It has been a while, so I am not very clear on some of the details about this PR. Having said that, I would suggest minimizing changes to Spark itself in order to add support for this PR, if we can find ways to do so. Some of the changes proposed in the diffs, at first read, look a bit disruptive and are subject to arbitrary failures as Spark evolves: even if technically we could justify their correctness right now, they are subject to being arbitrarily impacted as Spark evolves, as well as within individual deployments. In other words, compared to other patches we have, which are much more pointed, the impact here could be nontrivial.

Thanks @mridulm, I have thought about this question. We have 3 modifications:

1. Change PartialReducerPartitionSpec in ShufflePartitionsUtil. This seems necessary and acceptable.
2. Modify DAGScheduler to forbid retrying the skew read stage. Alternatively, maybe we can have the client not throw FetchFailed.
3. Determine whether the current shuffle has fallen back to the External Shuffle Service. This could be done by recording the Spark external shuffle ids and checking against them.

However, if we want to fully support the stage rerun feature, modifying the DAGScheduler will be necessary in a following PR.
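
To make item 3 concrete, here is a minimal sketch of what "record the external shuffle ids and check" could look like. It is only an illustration and not code from this PR: the class `FallbackShuffleRegistry` and its methods are hypothetical names, and it assumes shuffle ids are plain integers that are registered when a shuffle falls back and removed when the shuffle is unregistered.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Celeborn or Spark): remembers which
 * shuffle ids fell back to the External Shuffle Service so that later
 * code paths can check before applying Celeborn-specific logic.
 */
final class FallbackShuffleRegistry {
    // Concurrent set, since registration and lookups may come from different threads.
    private final Set<Integer> fallbackShuffleIds = ConcurrentHashMap.newKeySet();

    /** Record that the given shuffle fell back to the External Shuffle Service. */
    void markFallback(int shuffleId) {
        fallbackShuffleIds.add(shuffleId);
    }

    /** Returns true if the shuffle was previously recorded as a fallback. */
    boolean isFallback(int shuffleId) {
        return fallbackShuffleIds.contains(shuffleId);
    }

    /** Forget the shuffle once it is unregistered, to avoid unbounded growth. */
    void unregister(int shuffleId) {
        fallbackShuffleIds.remove(shuffleId);
    }
}
```

With something like this, the skew read handling could call `isFallback(shuffleId)` first and skip the Celeborn-specific path for shuffles served by the External Shuffle Service. Where such a registry would live and when entries are added is left open here.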
