hterik commented on code in PR #31996:
URL: https://github.com/apache/airflow/pull/31996#discussion_r1234843006
##########
airflow/jobs/job.py:
##########
@@ -213,7 +213,11 @@ def heartbeat(
                 self.log.debug("[heartbeat]")
         except OperationalError:
             Stats.incr(convert_camel_to_snake(self.__class__.__name__) + "_heartbeat_failure", 1, 1)
-            self.log.exception("%s heartbeat got an exception", self.__class__.__name__)
+            if self.is_alive():
+                self.log.error("%s heartbeat failed with error. Scheduler may go into unhealthy state", self.__class__.__name__)
+            else:
+                self.log.error("%s heartbeat failed with error. Scheduler is in unhealthy state", self.__class__.__name__)
Review Comment:
As an Airflow user reading the logs of a DAG, I would not understand what
this means for me. Is this something I have to react to? Do I need to contact
my admins? Are the DAG results corrupted? Should I restart the scheduler?
This error isn't necessarily a problem with the scheduler. More often it is
a problem of the executor not being able to reach the database due to
transient network problems. As long as the error is transient and recovers
shortly, there is usually no consequence. The log message should reflect
this. If this is too much to fit into a log message, linking to the
architecture documentation at airflow.apache.org, as suggested by potiuk
above, sounds like a reasonable proposal.
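
To make the suggestion concrete, here is a minimal sketch of what a more
actionable message could look like. It is not the code from this PR: the
helper name `heartbeat_failure_message`, the `still_alive` flag, and the
`DOCS_HINT` text are hypothetical and chosen only for illustration. The
wording is derived from the points above (likely transient, usually no
consequence, escalate only if it persists).

```python
import logging

log = logging.getLogger("heartbeat_sketch")

# Hypothetical pointer text; the exact documentation page is not named in this thread.
DOCS_HINT = "See the architecture documentation at airflow.apache.org."


def heartbeat_failure_message(job_name: str, still_alive: bool) -> str:
    """Build a log message that tells the reader whether action is needed.

    `still_alive` is assumed to mean that the last successful heartbeat is
    recent enough for the job to still count as healthy.
    """
    if still_alive:
        # Likely a transient failure: explain the cause and that no action is needed yet.
        return (
            f"{job_name} heartbeat failed: the database could not be reached. "
            "This is often caused by a transient network problem and usually "
            "has no consequence if it recovers shortly. No action is needed "
            f"unless this message keeps repeating. {DOCS_HINT}"
        )
    # Failure has lasted long enough that the job is no longer considered healthy.
    return (
        f"{job_name} heartbeat failed: the database could not be reached and "
        "the job is now considered unhealthy. If this persists, contact your "
        f"Airflow administrators. {DOCS_HINT}"
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    log.error(heartbeat_failure_message("SchedulerJob", still_alive=True))
    log.error(heartbeat_failure_message("SchedulerJob", still_alive=False))
```

Keeping the text in a small helper like this would also make the wording easy
to unit-test, but whether that fits the structure of `job.py` is a separate
question.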