hterik commented on issue #31810: URL: https://github.com/apache/airflow/issues/31810#issuecomment-1715095320
Keeping track of previous_heartbeat sounds like a good idea.

I'd be wary of using the term "_Scheduler_" here. Maybe it's a terminology thing, with each Worker/Executor having a little scheduler in itself, or when running in standalone mode. For me, the scheduler is the central daemon here: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html

Where we observe this error the most is _between the **Workers** and the DB_. This error category can be identified as `psycopg2.OperationalError`. Meanwhile the **Scheduler**<->DB connection, and the Scheduler itself, have no issues. The scheduler only gets involved if it observes that a worker has not sent a heartbeat for a long time.

I would suggest phrasing it something like:

* **First failure:** WARNING: "Worker failed to write heartbeat to database, this will retry and is not harmful if recovery happens within $scheduler_health_check_threshold seconds." + $reason_without_stacktrace
* **Failure after scheduler_health_check_threshold:** ERROR: "Worker failed to write heartbeat to database for $scheduler_health_check_threshold seconds. The Scheduler may mark this task as failed without the worker being informed of it. The task could potentially continue running but the result is going to be ignored by the scheduler." + $reason_without_stacktrace
* **Recovery after failure:** INFO: "Heartbeat recovered after XXX seconds"

Note that I may be mixing up some of the heartbeat timeout parameters; I haven't looked at the details of this for a long time (`local_task_job_heartbeat_sec` vs `scheduler_health_check_threshold` vs `scheduler_zombie_task_threshold` vs `job_heartbeat_sec`). That is another reason for good logs: the interaction of all these parameters is not obvious :)
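
The three log levels suggested above could be sketched roughly as follows. This is only an illustration of the escalation logic, not Airflow's actual code: the class `HeartbeatLogState`, its method names, and the `threshold_sec` argument are all hypothetical, and `threshold_sec` stands in for whichever of the timeout parameters turns out to be the right one. Logging repeated failures between the first WARNING and the ERROR at DEBUG level is also an assumption, made here to avoid flooding the log.

```python
import logging
import time
from typing import Optional

log = logging.getLogger("heartbeat_sketch")


class HeartbeatLogState:
    """Hypothetical helper that picks a log level for each heartbeat outcome."""

    def __init__(self, threshold_sec: float) -> None:
        # Assumption: threshold_sec mirrors the relevant timeout setting,
        # e.g. scheduler_health_check_threshold.
        self.threshold_sec = threshold_sec
        self.first_failure_at: Optional[float] = None
        self.error_logged = False

    def record_failure(self, reason: str, now: Optional[float] = None) -> int:
        """Log a failed heartbeat write and return the level that was used."""
        now = time.monotonic() if now is None else now
        if self.first_failure_at is None:
            # First failure in a row: warn, since a retry may still succeed.
            self.first_failure_at = now
            log.warning(
                "Worker failed to write heartbeat to database, this will retry "
                "and is not harmful if recovery happens within %s seconds. %s",
                self.threshold_sec, reason,
            )
            return logging.WARNING
        elapsed = now - self.first_failure_at
        if elapsed >= self.threshold_sec and not self.error_logged:
            # Threshold exceeded: escalate once to ERROR.
            self.error_logged = True
            log.error(
                "Worker failed to write heartbeat to database for %s seconds. "
                "The Scheduler may mark this task as failed without the worker "
                "being informed of it. %s",
                self.threshold_sec, reason,
            )
            return logging.ERROR
        # Repeated failures in between (or after the ERROR) stay quiet.
        log.debug("Heartbeat still failing after %.0f seconds. %s", elapsed, reason)
        return logging.DEBUG

    def record_success(self, now: Optional[float] = None) -> Optional[int]:
        """Log recovery if there was a preceding failure; otherwise do nothing."""
        now = time.monotonic() if now is None else now
        if self.first_failure_at is None:
            return None
        elapsed = now - self.first_failure_at
        self.first_failure_at = None
        self.error_logged = False
        log.info("Heartbeat recovered after %.0f seconds", elapsed)
        return logging.INFO
```

A worker's heartbeat loop would call `record_failure(str(exc))` when the database write raises and `record_success()` when it goes through, so the stack trace stays out of the message and only the short reason is logged.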
