Re: [PR] [#1608][part-5] feat(spark3): always use the available assignment [incubator-uniffle]

via GitHub Sun, 28 Apr 2024 20:26:38 -0700


jerqi commented on PR #1652:
URL: 
https://github.com/apache/incubator-uniffle/pull/1652#issuecomment-2081830433


   > > Fault tolerance and rebalancing are different concepts. We should
   > > distinguish them in the code. Although we can reuse some underlying
   > > techniques, we should still avoid naming any underlying data structure
   > > after fault tolerance or rebalance.
   > > For example, if one shuffle server has too high a load, it can trigger
   > > a rebalance, but we can't say it's a faulty server.
   >
   > Load balancing has been removed from this feature. I will develop it in
   > our internal cluster.
   >
   > > Task-partition-level reassignment records may carry some risks. The
   > > shuffle services from Baidu and Ali don't use a similar design. We
   > > should think more about this point.
   >
   > What is the risk? Please describe it in more detail so we can discuss it
   > further.
   >
   > From the uniffle cluster dashboard, I can see that some tasks failed
   > because of requireBuffer failures, so having this would let us avoid
   > task retries and improve stability.
   1. Is the current design compatible with the balance feature?

   2. The possible risks are:
      1. Memory cost: will the data structure use too much memory?
      2. If a task fails many times, will that have a bad effect?

   Maybe I am overthinking this, too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


