whitleykeith commented on issue #23361: URL: https://github.com/apache/airflow/issues/23361#issuecomment-1170162525
We've made some progress in at least reducing the amount of deadlocks. We're running 2.0.2 on K8s and we've discovered the following: 1. This mainly happens at scale when there's a lot of tasks to schedule 2. This mainly happens during updates to the scheduler or when multiple schedulers start at once We've been able to reduce deadlocks almost entirely simply by adding a startupProbe to the scheduler deployment in K8s and telling K8s to only roll out schedulers one at a time to avoid them starting at the same time. When they started at the same time all of them would deadlock and running tasks would get killed and rescheduled etc. Rolling out one at a time has almost entirely removed deadlocking, and the few times it does happen it's isolated to one scheduler where other schedulers can keep things moving The fact it happens more frequently when starting schedulers at the same time makes me think it might be related to the `file_parsing_sort_mode` I mentioned above. Since the default behavior is `modified_time`, My theory is that all the schedulers are configured to "start" it's scheduler loop at the same place, which would naturally increase the chance of deadlocking -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
