kfaraz commented on PR #18591: URL: https://github.com/apache/druid/pull/18591#issuecomment-3372021229
> I mean risky because overlord needs to restore all tasks, previously we had some problems (maybe bug) that after switching leaders, overlord failed to elect a new leader.

Yes, there might be some bugs around that. Also, the K8s task runner makes certain list pod calls, which are pretty heavy and need to be addressed. I think @capistrant is doing some work to improve that code flow.

> We try our best not to restart coordinator/overlord in production.

Oh, how frequently do you upgrade your cluster? Is changing the task capacity going to be much more frequent than that?

I agree that the K8s task runner is buggy and we should improve upon it. But making the task capacity dynamic doesn't seem like the best solution. It will open a whole other can of worms and only make this piece more complicated. Instead, we should try to fix the actual problems in the task runner that make the Overlord leader switch error-prone.

What are your thoughts, @FrankChen021?
