potiuk commented on code in PR #31996:
URL: https://github.com/apache/airflow/pull/31996#discussion_r1268935150
##########
airflow/jobs/job.py:
##########
@@ -207,7 +207,16 @@ def heartbeat(
self.log.debug("[heartbeat]")
except OperationalError:
Stats.incr(convert_camel_to_snake(self.__class__.__name__) + "_heartbeat_failure", 1, 1)
- self.log.exception("%s heartbeat got an exception", self.__class__.__name__)
+ if self.is_alive():
+ self.log.error(
+ "%s heartbeat failed with error. Scheduler may go into unhealthy state",
Review Comment:
I think, again, maybe @BShraman can help here. They seem engaged and are
proposing improvements in this area, so they may have some experience and
could propose concrete wording and an explanation, along with a list of cases.
Possibly @hterik, the creator of #31810, also has some past experience with this?
I think the best approach might be to add a section to the documentation (in
the scheduler docs) describing the heartbeat-loss and error situations that
can happen here. Rather than explaining them in detail in this log message,
they could be explained there, and a link to that documentation section could
be added to the message as "See more at http://airflow.apache.org/.....".
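
To make the suggestion concrete, here is a rough sketch of how the message could point at the docs instead of spelling out the cases inline. It is only an illustration, not the actual change: `HEARTBEAT_DOCS_URL`, `HEARTBEAT_FAILURE_MSG` and `log_heartbeat_failure` are hypothetical names, the wording still needs to be agreed on, and the URL is left elided because the docs section does not exist yet.

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical constant: the target docs section has not been written yet,
# so the URL is kept elided exactly as in the comment above.
HEARTBEAT_DOCS_URL = "http://airflow.apache.org/....."

# Short message that defers the list of failure cases to the documentation.
HEARTBEAT_FAILURE_MSG = (
    "%s heartbeat failed with error. Scheduler may go into unhealthy state. "
    "See more at %s"
)


def log_heartbeat_failure(job_name: str) -> str:
    """Log the heartbeat failure and return the rendered message (hypothetical helper)."""
    # Lazy %-style arguments, matching the logging style used in the diff.
    log.error(HEARTBEAT_FAILURE_MSG, job_name, HEARTBEAT_DOCS_URL)
    return HEARTBEAT_FAILURE_MSG % (job_name, HEARTBEAT_DOCS_URL)
```

Keeping the URL in one constant would let the link be updated in a single place once the docs section gets its final address.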