dstandish commented on PR #35800: URL: https://github.com/apache/airflow/pull/35800#issuecomment-1824655540
In case it is important to adopt pods but we need a fix to do it safely, I'm just throwing out some ideas before I disappear for the holiday.

The scheduler can know which other schedulers are live. So perhaps on each loop it could look up the active schedulers. Then, when looking at a pod, it can see the scheduler job ID (I think it's a label, e.g. queued_by or something) and skip adopting pods that belong to active schedulers.

Or, perhaps safer, it could queue up the "adoption candidates" in a list and, once it has finished reading all the pods and finding candidates, look up the active scheduler IDs at that moment and not adopt candidates whose scheduler is still active.

Another idea would be to have the scheduler periodically patch pods with a timestamp; then we could look at that when adopting pods and not adopt one unless it hasn't been touched in a long time.

Last idea: only adopt completed pods that completed more than 5 minutes ago. Maybe this is the simplest.

All of these have performance implications which need to be considered.
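To make two of these ideas concrete, here is a rough sketch of the "collect candidates first, then check active schedulers" variant and the "completed more than 5 minutes ago" variant. It is only an illustration under assumptions: `PodInfo`, both function names, and the `get_active_scheduler_ids` callback are hypothetical stand-ins, not Airflow or Kubernetes APIs, and the scheduler job ID is assumed to have already been read from whatever pod label carries it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical, simplified view of a worker pod."""
    name: str
    scheduler_job_id: str                    # assumed to come from a pod label
    finished_at: Optional[datetime] = None   # None while the pod is still running


def adoptable_by_inactive_owner(
    pods: Iterable[PodInfo],
    my_job_id: str,
    get_active_scheduler_ids: Callable[[], Iterable[str]],
) -> List[PodInfo]:
    """Two-phase check: gather candidates, then look up active schedulers once."""
    # Phase 1: every pod launched by some other scheduler is a candidate.
    candidates = [p for p in pods if p.scheduler_job_id != my_job_id]
    if not candidates:
        return []
    # Phase 2: only now ask which schedulers are alive, so the answer is as
    # fresh as possible relative to the pod listing.
    active = set(get_active_scheduler_ids())
    return [p for p in candidates if p.scheduler_job_id not in active]


def adoptable_by_completed_age(
    pods: Iterable[PodInfo],
    my_job_id: str,
    now: datetime,
    min_age: timedelta = timedelta(minutes=5),
) -> List[PodInfo]:
    """Adopt only foreign pods that finished at least `min_age` ago."""
    return [
        p
        for p in pods
        if p.scheduler_job_id != my_job_id
        and p.finished_at is not None
        and now - p.finished_at >= min_age
    ]


# Usage with made-up data: scheduler "1" is us, "2" is alive, "3" is gone.
now = datetime(2023, 11, 23, 12, 0, tzinfo=timezone.utc)
pods = [
    PodInfo("a", "1"),
    PodInfo("b", "2"),
    PodInfo("c", "3", finished_at=now - timedelta(minutes=10)),
    PodInfo("d", "3", finished_at=now - timedelta(minutes=1)),
]
by_owner = adoptable_by_inactive_owner(pods, "1", lambda: {"1", "2"})
by_age = adoptable_by_completed_age(pods, "1", now)
print([p.name for p in by_owner])  # ['c', 'd']
print([p.name for p in by_age])    # ['c']
```

The first function costs one extra lookup of active schedulers per adoption pass, and only when there are candidates. The second needs no scheduler lookup at all, which is part of why it may be the simplest option, at the price of leaving recently finished pods unadopted for a few minutes.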
