hterik commented on issue #27100: URL: https://github.com/apache/airflow/issues/27100#issuecomment-1290021256
I've seen tasks get stuck silently inside the `airflow db check` command, which is part of the **Entrypoint** of the Airflow Docker container. It has a loop in the entrypoint itself, CONNECTION_CHECK_MAX_COUNT, set to 20, which gets multiplied by your connect timeout, and that timeout can be very long by default, maybe even infinite? I've seen cases where it hangs here for hours, even after the DB has recovered. If you use KubernetesExecutor, this is the first thing that happens whenever a task is started. It doesn't log anything before starting and immediately goes into probing the database, potentially for a very long time.

See https://github.com/apache/airflow/blob/main/Dockerfile#L952

----------------

Another problem with the scheduler is that if one of its internal threads crashes, the process still keeps running. You need to monitor the scheduler heartbeat externally and restart the scheduler whenever it becomes unhealthy. This became a lot easier in 2.4, which now has a dedicated health probe for the scheduler. If this is the problem, it should be visible as a banner at the top of the web page.
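
Going back to the entrypoint loop: to make the "multiplied by your connect timeout" point concrete, here is a rough sketch of the arithmetic. It is not taken from the Airflow code. The function name `worst_case_wait_seconds` and the `sleep_between_s` parameter are hypothetical, and the timeout and sleep values in the examples are made up. Only the 20 attempts come from the CONNECTION_CHECK_MAX_COUNT value mentioned above.

```python
import math


def worst_case_wait_seconds(max_attempts, connect_timeout_s, sleep_between_s=0.0):
    """Upper bound on how long a retry loop can block before giving up.

    Each attempt may block for the full connect timeout and is then
    followed by a pause. If no timeout is configured (None), a single
    attempt can hang forever, so the bound is infinite.
    """
    if connect_timeout_s is None:
        return math.inf
    return max_attempts * (connect_timeout_s + sleep_between_s)


# 20 attempts is the CONNECTION_CHECK_MAX_COUNT value quoted above;
# the timeout and sleep values are illustrative assumptions only.
print(worst_case_wait_seconds(20, 10, 3))    # 260 seconds
print(worst_case_wait_seconds(20, 300, 3))   # 6060 seconds, about 1.7 hours
print(worst_case_wait_seconds(20, None))     # inf
```

Under these assumptions, even a moderate per-attempt timeout adds up to a long silent wait, and a missing timeout makes the loop count irrelevant. An explicit, short connect timeout on the database connection would be the knob to look at first.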
