kfaraz commented on PR #18591:
URL: https://github.com/apache/druid/pull/18591#issuecomment-3372021229

   > I mean risky because overlord needs to restore all tasks, previously we 
had some problems (maybe bug) that after switching leaders, overlord failed to 
elect a new leader.
   
   Yes, there might be some bugs around that. Also, the K8s task runner makes 
certain list pod calls, which are pretty heavy and need to be addressed. I 
think @capistrant is doing some work to improve that code flow.
   
   > We try our best not to restart coordinator/overlord in production.
   
   Oh, how frequently do you upgrade your cluster?
   Is changing the task capacity going to be much more frequent than that?
   
   I agree that the K8s task runner is buggy and we should improve upon it.
   But making the task capacity dynamic doesn't seem like the best solution.
   It will open a whole other can of worms and only make this piece more 
complicated.
   
   Instead, we should try to fix the actual problems in the task runner 
that make the Overlord leader switch error-prone.
   
   What are your thoughts, @FrankChen021 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


