stepanof commented on issue #24731: URL: https://github.com/apache/airflow/issues/24731#issuecomment-1318461607
@potiuk Hello Jarek.

I'm using a custom Airflow image based on `apache/airflow:2.4.1-python3.8`. I recently built HA clusters for the Postgres database and Redis. Both are used by the Airflow cluster (1 webserver, 2 schedulers, 2 workers).

I have run into a problem in the scheduler and worker: when the virtual IP of the Redis or Postgres cluster moves to another node, tasks get stuck in the 'queued' or 'scheduled' state. I attach the logs of a worker that got stuck when the Redis master moved to another node: [airflow_logs_err.txt](https://github.com/apache/airflow/files/10030701/airflow_logs_err.txt)

Restarting airflow-worker solves the problem. As a workaround, I have added one more service to each Airflow instance, called '[autoheal](https://hub.docker.com/r/willfarrell/autoheal/)'. It restarts a Docker container when it becomes 'unhealthy'. We are using it in production, but it is only a workaround. I think the Airflow scheduler and worker should be able to react to such situations without any additional services.

I am ready to help you debug this problem and find a solution; just tell me what I can do for the Airflow developers.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
