wangshengjie123 commented on PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2564664619

   > Thanks for continuing to work on this @wangshengjie123, really appreciate 
it ! I will try to go over the PR during the holidays.
   > 
   > One general comment though ... It has been a while, so I am not very clear 
on some of the details about this PR. Having said that, I would suggest to 
minimize changes to Spark directly in order to add support for this PR - if we 
can find ways to do so. Some of the changes proposed in the diffs, at first 
read, look a bit disruptive - and are subject to arbitrary failures as spark 
evolves: even if technically we could justify their correctness right now, they 
are subject to being arbitrarily impacted as spark evolves - as well as within 
individual deployments. In other words, compared to other patches we have, 
which are much more pointed - the impact here could be nontrivial.
   
   Thanks @mridulm, I have thought about this question. We have 3 modifications:
   1. Change PartialReducerPartitionSpec in ShufflePartitionsUtil. This seems 
necessary and acceptable.
   2. Modify DAGScheduler to forbid retrying the skew read stage. Maybe we can 
instead have the client not throw FetchFailed.
   3. Determine whether the current shuffle has fallen back to the External 
Shuffle Service. This could be done by recording the Spark external shuffle id 
and checking against it.
   
   However, if we want to fully support the stage rerun feature, modifying the 
DAGScheduler will be necessary in a following PR.
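
For item 3, a minimal sketch of the "record and check" idea, assuming a small in-memory registry. The class and method names (`FallbackShuffleTracker`, `recordFallback`, `isFallback`, `unregister`) are hypothetical and are not existing Celeborn or Spark APIs:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers which Spark shuffle ids fell back to the
// External Shuffle Service, so later code can skip skew-read handling for them.
final class FallbackShuffleTracker {
    // Concurrent set, since shuffles may be registered from several threads.
    private final Set<Integer> fallbackShuffleIds = ConcurrentHashMap.newKeySet();

    // Call when a shuffle is registered with the fallback shuffle manager.
    void recordFallback(int shuffleId) {
        fallbackShuffleIds.add(shuffleId);
    }

    // True if this shuffle was previously recorded as a fallback shuffle.
    boolean isFallback(int shuffleId) {
        return fallbackShuffleIds.contains(shuffleId);
    }

    // Call when the shuffle is unregistered, so the set does not grow forever.
    void unregister(int shuffleId) {
        fallbackShuffleIds.remove(shuffleId);
    }
}
```

Where such a registry would live, and how the Spark side would query it, are left open here.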


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
