jerqi commented on PR #1652: URL: https://github.com/apache/incubator-uniffle/pull/1652#issuecomment-2081830433
> > Fault tolerance and rebalance are different concepts. We should distinguish them in the code. Although we can reuse some underlying techniques, we should still avoid using the names fault tolerance or rebalance in any underlying data structure.
> > For example, if one shuffle server has too high a load, it can trigger a rebalance, but we can't say it's a faulty server.
>
> Load balancing has been removed from this feature. I will develop it in our internal cluster.
>
> > Task partition level reassignment records may carry some risks. Baidu's and Ali's shuffle services don't use a similar design. We should think more about this point.
>
> What's the risk? Please describe it in more detail so we can discuss it further.
>
> From the uniffle cluster dashboard, I can see some tasks failed due to requireBuffer failures, so with this we could avoid task retries and improve stability.

1. Is the current design compatible with the balance feature?
2. The possible risks are:
   1. Memory cost: will it cause us to use too much memory to store the data structure?
   2. If a task fails many times, will that have a bad effect?

Maybe I need to think about this more, too.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
