stepanof commented on issue #24731:
URL: https://github.com/apache/airflow/issues/24731#issuecomment-1318461607

   @potiuk Hello Jarek.
   I'm using a custom Airflow image based on `apache/airflow:2.4.1-python3.8`.
   I recently built HA clusters for the Postgres database and Redis. Both are 
used by the Airflow cluster (1 webserver, 2 schedulers, 2 workers).
   I have run into a problem with the scheduler and worker: when the virtual 
IP of the Redis or Postgres cluster moves to another node, tasks get stuck in 
the 'queued' or 'scheduled' state.
   I am attaching the logs of a worker that got stuck when the Redis master 
moved to another node. 
   
[airflow_logs_err.txt](https://github.com/apache/airflow/files/10030701/airflow_logs_err.txt)
   Restarting airflow-worker solves the problem.
   
   To work around this, I have added one more service to each Airflow 
instance, called '[autoheal](https://hub.docker.com/r/willfarrell/autoheal/)'. 
It restarts a Docker container when it becomes 'unhealthy'.
   We are using it in production, but it is only a workaround. I think the 
Airflow scheduler and worker should be able to react to such situations 
without any additional services.
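   
   For illustration only, here is a minimal sketch of the kind of health 
check such a restart-on-unhealthy setup depends on. It is an assumption, not 
the check used in the setup described above: the function names and the 
`redis-vip` / `postgres-vip` endpoints are hypothetical. It relies only on the 
Docker convention that a health check command exiting with 0 means healthy 
and exiting with 1 means unhealthy.

```python
import socket
import sys


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def health_exit_code(targets) -> int:
    """Return 0 if every (host, port) target is reachable, otherwise 1."""
    failed = [(host, port) for host, port in targets
              if not tcp_reachable(host, port)]
    for host, port in failed:
        print(f"unreachable: {host}:{port}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical usage as a container health check command:
#   sys.exit(health_exit_code([("redis-vip", 6379), ("postgres-vip", 5432)]))
```

   Note that this probe only tests whether the endpoints accept new TCP 
connections. It would not by itself detect a worker that is still holding a 
stale connection after a failover, so a real check would likely also need to 
query the worker or scheduler process itself.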
   
   I am ready to help you debug this problem and find a solution; just tell 
me what I can do for the Airflow developers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]