zuston commented on PR #1652:
URL: 
https://github.com/apache/incubator-uniffle/pull/1652#issuecomment-2081791048

   > Fault tolerance and rebalancing are different concepts, and we should 
distinguish them in the code. Although we can reuse some underlying techniques, 
we should still avoid using the names "fault tolerance" or "rebalance" in any 
underlying data structure.
   For example, if one shuffle server's load is too high, that can trigger a 
rebalance, but we can't call it a faulty server.
   
   Rebalance has been removed from this feature. I will develop it in our 
internal cluster.
   
   > Task-partition-level reassignment records may carry some risks. Neither 
Baidu's nor Ali's shuffle service uses a similar design. We should think more 
about this point.
   
   What's the risk? Please describe it in more detail so we can discuss it further.
   
   From the Uniffle cluster dashboard, I can see that some tasks failed because 
of requireBuffer failures, so with this feature we could avoid task retries and 
improve stability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

