whitleykeith commented on issue #23361:
URL: https://github.com/apache/airflow/issues/23361#issuecomment-1170162525

   We've made some progress in at least reducing the amount of deadlocks. We're 
running 2.0.2 on K8s and we've discovered the following:
   
   1. This mainly happens at scale when there's a lot of tasks to schedule
   2. This mainly happens during updates to the scheduler or when multiple 
schedulers start at once
   
   We've been able to reduce deadlocks almost entirely simply by adding a 
startupProbe to the scheduler deployment in K8s and telling K8s to only roll 
out schedulers one at a time to avoid them starting at the same time. When they 
started at the same time all of them would deadlock and running tasks would get 
killed and rescheduled etc. Rolling out one at a time has almost entirely 
removed deadlocking, and the few times it does happen it's isolated to one 
scheduler where other schedulers can keep things moving
   
   The fact it happens more frequently when starting schedulers at the same 
time makes me think it might be related to the `file_parsing_sort_mode` I 
mentioned above. Since the default behavior is `modified_time`, My theory is 
that all the schedulers are configured to "start" it's scheduler loop at the 
same place, which would naturally increase the chance of deadlocking 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to