hterik commented on issue #31810:
URL: https://github.com/apache/airflow/issues/31810#issuecomment-1715095320

   Keeping track of previous_heartbeat sounds like a good idea.
   
   I'd be wary of using the term "_Scheduler_" here. Maybe it's a terminology 
thing, with each Worker/Executor having a little scheduler in itself, or when 
running in standalone mode.
   For me, the scheduler is the central daemon shown here:
   
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html
   
![image](https://github.com/apache/airflow/assets/89977373/e884af6d-83e1-43c7-a74d-56030f97b122)
   
   Where we observe this error the most is _between the **Workers** and the 
DB_. This error category can be identified as `psycopg2.OperationalError`.
   Meanwhile, the **Scheduler**<->DB connection and the Scheduler itself have 
no issues.
   
   The scheduler only gets involved if it observes that a worker has 
not sent a heartbeat for a long time.
   
   
   I would suggest phrasing it something like this:
   * **First failure:** WARNING: "Worker failed to write heartbeat to database. 
This will be retried and is not harmful if recovery happens within 
$scheduler_health_check_threshold seconds." + $reason_without_stacktrace
   * **Failure after scheduler_health_check_threshold:** ERROR: "Worker failed 
to write heartbeat to database for $scheduler_health_check_threshold seconds. 
The Scheduler may mark this task as failed without the worker being informed of 
it. The task could potentially continue running but the result is going to be 
ignored by the scheduler." + $reason_without_stacktrace
   * **Recovery after failure:** INFO: "Heartbeat recovered after XXX seconds"
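   
   A rough sketch of the three-stage logging above, in plain Python. This is 
not Airflow code: the `HeartbeatLogPolicy` class, its method names, and the 
injectable `clock` are hypothetical, and logging repeated failures at DEBUG 
between the first WARNING and the ERROR is my own assumption. Which threshold 
parameter should actually be passed in is still an open question.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("heartbeat_sketch")


class HeartbeatLogPolicy:
    """Hypothetical helper that picks a log level for each heartbeat result."""

    def __init__(self, threshold_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_sec = threshold_sec
        self._clock = clock
        self._first_failure_at: Optional[float] = None
        self._error_logged = False

    def on_failure(self, reason: str) -> int:
        """Record a failed heartbeat write; return the level that was logged."""
        now = self._clock()
        if self._first_failure_at is None:
            # First failure in a row: warn, the write will be retried.
            self._first_failure_at = now
            log.warning(
                "Worker failed to write heartbeat to database. This will be "
                "retried and is not harmful if recovery happens within %s "
                "seconds. %s", self.threshold_sec, reason)
            return logging.WARNING
        elapsed = now - self._first_failure_at
        if elapsed >= self.threshold_sec and not self._error_logged:
            # Threshold exceeded: escalate once, not on every retry.
            self._error_logged = True
            log.error(
                "Worker failed to write heartbeat to database for %s seconds. "
                "The Scheduler may mark this task as failed without the "
                "worker being informed of it. %s", self.threshold_sec, reason)
            return logging.ERROR
        # Assumption: keep the repeated failures quiet in between.
        log.debug("Heartbeat still failing after %.0f seconds. %s",
                  elapsed, reason)
        return logging.DEBUG

    def on_success(self) -> Optional[int]:
        """Record a successful write; log recovery if we were failing."""
        if self._first_failure_at is None:
            return None
        elapsed = self._clock() - self._first_failure_at
        self._first_failure_at = None
        self._error_logged = False
        log.info("Heartbeat recovered after %.0f seconds", elapsed)
        return logging.INFO
```

   The worker's heartbeat loop would call `on_failure(str(exc))` when the 
write raises and `on_success()` otherwise, so the stack trace stays out of the 
message and the ERROR is emitted only once per outage.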
   
   Note that I may be mixing up some of the heartbeat timeout parameters; I 
haven't looked at the details of this for a long time 
(`local_task_job_heartbeat_sec` vs `scheduler_health_check_threshold` vs 
`scheduler_zombie_task_threshold` vs `job_heartbeat_sec`). That is another 
reason for good logs: how all these parameters interact is not 
obvious :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
