jerqi commented on PR #1652: URL: https://github.com/apache/incubator-uniffle/pull/1652#issuecomment-2081788299
From an overview, I still have two points that I want to discuss.

1. Fault tolerance and rebalance are different concepts, and we should distinguish them in the code. Although they can reuse some underlying techniques, we should avoid naming any underlying data structure after either fault tolerance or rebalance. For example, if one shuffle server has too high a load, it can trigger a rebalance, but we can't say it's a faulty server.
2. Recording reassignments at the task partition level may carry some risks. The shuffle services from Baidu and Ali don't use a similar design. We should think more about this point.
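
To make point 1 more concrete, here is a rough sketch of what I mean. Every name in it (`ReassignmentSketch`, `ReassignReason`, `ServerReassignment`, `ReassignmentLog`) is hypothetical and not taken from this PR or the existing code. The idea is only that the shared record keeps a neutral name and carries the trigger as data, so an overloaded server is never treated as faulty.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: all names are hypothetical, not from the Uniffle code base. */
final class ReassignmentSketch {

  /** Why a server was replaced. The trigger is data, not part of a type name. */
  enum ReassignReason {
    SERVER_FAULTY,    // fault tolerance: the server failed or is unreachable
    SERVER_HIGH_LOAD  // rebalance: the server is healthy but overloaded
  }

  /** Neutral record shared by both paths; it names neither concept. */
  static final class ServerReassignment {
    final String replacedServerId;
    final String replacementServerId;
    final ReassignReason reason;

    ServerReassignment(String replacedServerId, String replacementServerId,
        ReassignReason reason) {
      this.replacedServerId = replacedServerId;
      this.replacementServerId = replacementServerId;
      this.reason = reason;
    }
  }

  /** Underlying structure reused by fault tolerance and rebalance alike. */
  static final class ReassignmentLog {
    private final List<ServerReassignment> entries = new ArrayList<>();

    void record(String replaced, String replacement, ReassignReason reason) {
      entries.add(new ServerReassignment(replaced, replacement, reason));
    }

    /** Only servers replaced because of a fault are reported as faulty. */
    List<String> faultyServers() {
      List<String> result = new ArrayList<>();
      for (ServerReassignment entry : entries) {
        if (entry.reason == ReassignReason.SERVER_FAULTY) {
          result.add(entry.replacedServerId);
        }
      }
      return result;
    }

    int size() {
      return entries.size();
    }
  }
}
```

With something like this, a rebalance and a failure both go through `record(...)`, but only the failure shows up in `faultyServers()`.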
