potiuk commented on issue #27100: URL: https://github.com/apache/airflow/issues/27100#issuecomment-1296472304
> I've seen tasks getting stuck silently inside the airflow db check command, which is part of the Entrypoint of the airflow docker container. It has a loop both in the entrypoint itself, CONNECTION_CHECK_MAX_COUNT, set to 20, that gets multiplied with your connect timeout, which can be very long by default, maybe even infinite? I've seen examples where it gets stuck hanging here for hours even after the DB is recovered. Ideas on other strategies?

What have you seen working for you, @hterik? I think we can improve that. The current defaults were taken from the original Astronomer image, but maybe we can do better? WDYT?

> Another problem with the scheduler is that if one of the threads inside crashes, the process still keeps running. You need to monitor the scheduler heartbeat externally and restart the scheduler whenever it becomes unhealthy. This became a lot easier in 2.4, which now has a dedicated health-probe for the scheduler. If this is the problem, it should be visible with a banner on the top of the web page.

This is interesting and should not (generally) happen. Do you have an example of that, @hterik? IMHO that is exactly my point about "crashing hard" whenever any crash occurs. Seeing examples of when it happened would be super helpful (for reproduction and a fix).
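To make the "retry count multiplied by connect timeout" concern concrete, here is a minimal sketch of a bounded wait. It is not the logic of the Airflow entrypoint or of airflow db check: it only probes a TCP port, and the function name `wait_for_port` and all of its parameters are hypothetical. The idea is that every attempt carries an explicit timeout and the whole loop also respects an overall deadline, so the worst case stays bounded regardless of driver defaults.

```python
import socket
import time


def wait_for_port(host, port, max_attempts=20, attempt_timeout=5.0,
                  sleep_between=1.0, overall_deadline=120.0):
    """Return True once a TCP connection succeeds, False if the budget runs out.

    All names and defaults here are illustrative assumptions, not Airflow settings.
    """
    start = time.monotonic()
    for _ in range(max_attempts):
        remaining = overall_deadline - (time.monotonic() - start)
        if remaining <= 0:
            break
        try:
            # Explicit timeout so a single attempt can never block indefinitely.
            timeout = min(attempt_timeout, remaining)
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, resolution failures, and socket timeouts.
            pass
        # Never sleep past the overall deadline.
        remaining = overall_deadline - (time.monotonic() - start)
        time.sleep(max(0.0, min(sleep_between, remaining)))
    return False
```

With this shape the total wait is capped by `overall_deadline` rather than by `max_attempts` times whatever the connect timeout happens to be. One caveat: the timeout applies to the connect call itself, so slow name resolution can still add time on top.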
