hussein-awala commented on code in PR #68068:
URL: https://github.com/apache/airflow/pull/68068#discussion_r3395731837


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -1223,12 +1223,16 @@ def _is_tracing_enabled():
         return conf.getboolean("traces", "otel_on")
 
     def _process_executor_events(self, executor: BaseExecutor, session: Session) -> int:
-        return SchedulerJobRunner.process_executor_events(
-            executor=executor,
-            job_id=self.job.id,
-            scheduler_dag_bag=self.scheduler_dag_bag,
-            session=session,
-        )
+        try:
+            return SchedulerJobRunner.process_executor_events(
+                executor=executor,
+                job_id=self.job.id,
+                scheduler_dag_bag=self.scheduler_dag_bag,
+                session=session,
+            )
+        except Exception as exc:
+            stats.incr("scheduler.executor_events.failed", tags={"reason": type(exc).__name__})
+            raise

Review Comment:
   Why do you use two different tags for the same data in 
`scheduler.executor_events.failed` and `scheduler.loop_exceptions`?
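
A minimal sketch of one way to keep the two metrics aligned: build the tags for both from a single helper. The helper name `exception_tags`, the tag key `exception_type`, and the `FakeStats` recorder are hypothetical and not taken from the PR; which key the two metrics should actually share is for the PR to decide.

```python
from collections import Counter


class FakeStats:
    """Stand-in recorder for illustration only; not Airflow's stats API."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, name, count=1, tags=None):
        # Key each counter by metric name plus its sorted tag pairs.
        key = (name, tuple(sorted((tags or {}).items())))
        self.counts[key] += count


def exception_tags(exc):
    # Hypothetical helper: a single place decides the tag key used
    # by every exception-related metric.
    return {"exception_type": type(exc).__name__}


stats = FakeStats()
try:
    raise ValueError("boom")
except ValueError as exc:
    # Both metrics carry the same tag key and value for the same exception.
    stats.incr("scheduler.executor_events.failed", tags=exception_tags(exc))
    stats.incr("scheduler.loop_exceptions", tags=exception_tags(exc))
```

With a shared helper, a dashboard can group both metrics by the same tag without a per-metric mapping.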



##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -3221,8 +3233,12 @@ def adopt_or_reset_orphaned_tasks(self, *, session: Session = NEW_SESSION) -> in
 
                     stats.incr("scheduler.orphaned_tasks.cleared", len(to_reset))
                     stats.incr("scheduler.orphaned_tasks.adopted", len(tis_to_adopt_or_reset) - len(to_reset))
-
                     if to_reset:
+                        stats.incr(
+                            "scheduler.zombies.detected",
+                            len(to_reset),
+                            tags={"reason": "adopt_failure"},
+                        )

Review Comment:
   I believe there is a distinction between a zombie task (a task that remains 
stuck in a running state even though its associated job is inactive) and an 
orphaned task (a task that has lost its executor). This metric appears to track 
orphaned tasks that the executor failed to adopt for any reason.
   
   Since this overlaps with `scheduler.orphaned_tasks.cleared`, I would prefer 
to remove the new metric.
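
The overlap can be sketched with a plain `Counter` in place of the real stats client. The function `record_reset_metrics` and the sample task names are hypothetical; the three increments mirror the `stats.incr` calls in the hunk above, with the `reason` tag left out for brevity.

```python
from collections import Counter


def record_reset_metrics(counts, to_reset, tis_to_adopt_or_reset):
    # Mirrors the increments in the hunk, recorded into a plain Counter.
    counts["scheduler.orphaned_tasks.cleared"] += len(to_reset)
    counts["scheduler.orphaned_tasks.adopted"] += len(tis_to_adopt_or_reset) - len(to_reset)
    if to_reset:
        # The new metric is incremented by the same len(to_reset) value.
        counts["scheduler.zombies.detected"] += len(to_reset)
    return counts


counts = record_reset_metrics(
    Counter(),
    to_reset=["ti_a", "ti_b"],
    tis_to_adopt_or_reset=["ti_a", "ti_b", "ti_c"],
)
```

In this sketch `scheduler.zombies.detected` and `scheduler.orphaned_tasks.cleared` always receive the same increment, since both use `len(to_reset)` and adding zero when `to_reset` is empty changes nothing.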



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

