potiuk commented on code in PR #31996:
URL: https://github.com/apache/airflow/pull/31996#discussion_r1268562633


##########
airflow/jobs/job.py:
##########
@@ -207,7 +207,16 @@ def heartbeat(
                 self.log.debug("[heartbeat]")
         except OperationalError:
             Stats.incr(convert_camel_to_snake(self.__class__.__name__) + "_heartbeat_failure", 1, 1)
-            self.log.exception("%s heartbeat got an exception", self.__class__.__name__)
+            if self.is_alive():
+                self.log.error(
+                    "%s heartbeat failed with error. Scheduler may go into unhealthy state",
+                    self.__class__.__name__,
+                )
+            else:
+                self.log.error(
+                    "%s heartbeat failed with error. Scheduler is in unhealthy state", self.__class__.__name__
+                )

Review Comment:
   I think it would be useful to dump the stack trace of the last such
   failure and append it, with a timestamp, to a file in a known location
   ("heartbeat_exceptions.dump"). We could also limit the size of that file
   and truncate its beginning when it grows too large. We would have to
   hard-code some reasonable defaults; there is no need to make them
   configurable. Then, instead of dumping the stack trace into the log, we
   could write: "Details of the error are available in {THIS_DUMP_FILE} for
   further inspection if it repeats often", or something like that.
   
   I also think we will need a test case covering both the log messages and
   the dumping of this information.
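
   To make the suggestion concrete, here is a minimal sketch of the dump
   helper, not a proposed final implementation. The function name
   `dump_heartbeat_exception`, the constant `MAX_DUMP_BYTES`, the 1 MiB limit
   and the relative default path are all assumptions made for illustration;
   only the file name "heartbeat_exceptions.dump" comes from the comment above.

```python
import os
import traceback
from datetime import datetime, timezone

# Hypothetical hard-coded defaults (assumptions, not existing Airflow settings).
DUMP_FILE = "heartbeat_exceptions.dump"
MAX_DUMP_BYTES = 1024 * 1024  # keep at most the newest 1 MiB


def dump_heartbeat_exception(exc, path=DUMP_FILE, max_bytes=MAX_DUMP_BYTES):
    """Append a timestamped stack trace of ``exc`` to ``path``.

    If the file grows beyond ``max_bytes``, drop the beginning of the file
    and keep only the newest ``max_bytes`` bytes. Returns ``path`` so the
    caller can mention it in a log message.
    """
    stamp = datetime.now(timezone.utc).isoformat()
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"--- {stamp} ---\n{trace}\n")

    # Truncate the beginning of the file once it exceeds the limit.
    # The cut is made on a byte boundary, so the oldest remaining entry
    # (and possibly a multi-byte character in it) may be partial.
    if os.path.getsize(path) > max_bytes:
        with open(path, "rb") as f:
            f.seek(-max_bytes, os.SEEK_END)
            tail = f.read()
        with open(path, "wb") as f:
            f.write(tail)
    return path
```

   In the `except OperationalError as e:` branch, the handler could then call
   `dump_file = dump_heartbeat_exception(e)` and pass `dump_file` to
   `self.log.error(...)` in place of the stack trace. A test could point
   `path` at a temporary directory and assert on both the log output and the
   file contents.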


