health endpoint does not give meaningful information into Scheduler statuses

Ferruzzi, Dennis Thu, 25 Jun 2026 09:53:27 -0700

Thanks for looking into this. I'm a strong +1 with a caveat (see below).

One alarming side effect of the way this is calculated and reported is that in 
a multi-team environment, Team_1's scheduler may be down entirely and the 
current dashboard will still report HEALTHY as long as at least one other 
team's scheduler is live.  I wonder whether we can squeeze this in as a bug 
fix, because that seems like a gap we should fix.  Teghveer confirmed last 
night that the same calculation/reporting is used for the Triggerers, so I am 
amending your proposal: whatever we decide here will also be applied to 
Triggerers. (Teghveer is willing to do that part of the work in parallel 
once we have a consensus.)


That said, here's my opinion on the proposal:

Caveat:  I like the proposal, on the condition that the value reported by the 
existing "scheduler" field is unchanged.  We can (and should?) deprecate and 
remove that field in a future release, with instructions to move to 
"schedulers" instead, but for now we can't break existing monitoring.

Additional non-blocking suggestion:  let's add "team_name" to the 
individual scheduler schema, since it's available:

"schedulers": {
    "status": "DEGRADED",
    "instances": [
      {
        "hostname": "scheduler-ha-instance-1",
        "status": "HEALTHY",
        "team_name": "team_1",
        "latest_heartbeat": "2026-06-24T23:15:02+00:00"
      },
      {
        "hostname": "scheduler-ha-instance-2",
        "status": "DOWN",
        "team_name": "team_2",
        "latest_heartbeat": "2026-06-24T23:10:14+00:00"
      },
      etc...
    ]
}
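
To illustrate why "team_name" helps, here is a minimal sketch, assuming instance entries shaped like the snippet above, of how a dashboard could roll statuses up per team so a Team_1 outage is not masked by another team. The helper name status_by_team is hypothetical and not part of Airflow.

```python
from collections import defaultdict


def status_by_team(instances: list[dict[str, str]]) -> dict[str, str]:
    """Roll per-instance statuses up into one status per team (hypothetical helper)."""
    alive_by_team: dict[str, list[bool]] = defaultdict(list)
    for instance in instances:
        # An instance counts as alive only if its own status is HEALTHY.
        alive_by_team[instance["team_name"]].append(instance["status"] == "HEALTHY")

    summary: dict[str, str] = {}
    for team, alive_flags in alive_by_team.items():
        if all(alive_flags):
            summary[team] = "HEALTHY"
        elif any(alive_flags):
            summary[team] = "DEGRADED"
        else:
            summary[team] = "DOWN"
    return summary


instances = [
    {"hostname": "scheduler-ha-instance-1", "status": "HEALTHY", "team_name": "team_1"},
    {"hostname": "scheduler-ha-instance-2", "status": "DOWN", "team_name": "team_2"},
]
print(status_by_team(instances))
```

With the two instances from the snippet, team_2 shows up as DOWN even though team_1 is HEALTHY.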

- ferruzzi
________________________________
From: Jung-Hyun Kim <[email protected]>
Sent: Wednesday, June 24, 2026 5:01 PM
To: [email protected] <[email protected]>
Subject: [EXT] [DISCUSS] /api/v2/monitor/health endpoint does not give 
meaningful information into Scheduler statuses

The Problem
In distributed Airflow environments running multiple schedulers, the health 
endpoint has a significant monitoring blind spot.
Currently, the health check determines the status of the scheduler by querying 
the metadata database using the most_recent_job method found in job.py:

@provide_session
def most_recent_job(job_type: str, *, session: Session = NEW_SESSION) -> Job | None:
    """
    Return the most recent job of this type, if any, based on last heartbeat received.

    Jobs in "running" state take precedence over others to make sure alive
    job is returned if it is available.

    :param job_type: job type to query for to get the most recent job for
    :param session: Database session
    :end_date: None
    """
    return session.scalar(
        select(Job)
        .where(Job.job_type == job_type)
        .order_by(
            # Put "running" jobs at the front.
            case({JobState.RUNNING: 0}, value=Job.state, else_=1),
            Job.latest_heartbeat.desc(),
        )
        .limit(1)
    )


This database query puts jobs in the RUNNING state first, then orders by latest 
heartbeat (newest first), and applies .limit(1), so it returns only a single 
job record: the running job with the most recent heartbeat, if any is running.
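
The effect of that ordering can be sketched in plain Python without a database. This is only an illustration: the rows, the tuple layout, and the helper pick_most_recent are hypothetical stand-ins, and the plain string "running" stands in for JobState.RUNNING.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical job rows as (hostname, state, latest_heartbeat) tuples.
jobs = [
    ("scheduler-ha-instance-1", "running", datetime(2026, 6, 24, 23, 15, 2, tzinfo=timezone.utc)),
    ("scheduler-ha-instance-2", "running", datetime(2026, 6, 24, 23, 10, 14, tzinfo=timezone.utc)),
    ("scheduler-ha-instance-3", "running", datetime(2026, 6, 24, 23, 14, 59, tzinfo=timezone.utc)),
]


def pick_most_recent(rows: list[tuple[str, str, datetime]]) -> Optional[tuple[str, str, datetime]]:
    """Mimic the ORDER BY ... LIMIT 1: running rows first, then newest heartbeat."""
    ordered = sorted(
        rows,
        key=lambda row: (0 if row[1] == "running" else 1, -row[2].timestamp()),
    )
    return ordered[0] if ordered else None


print(pick_most_recent(jobs))
```

Only the row for scheduler-ha-instance-1 comes back; the stale heartbeat of scheduler-ha-instance-2 never reaches the caller.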
The single job returned is then processed in airflow_health.py by the 
get_airflow_health function:

def get_airflow_health() -> dict[str, Any]:
    """Get the health for Airflow metadatabase, scheduler and triggerer."""
    metadatabase_status = HEALTHY
    latest_scheduler_heartbeat = None
    latest_triggerer_heartbeat = None
    latest_dag_processor_heartbeat = None

    scheduler_status = UNHEALTHY
    triggerer_status: str | None = UNHEALTHY
    dag_processor_status: str | None = UNHEALTHY

    try:
        latest_scheduler_job = SchedulerJobRunner.most_recent_job()

        if latest_scheduler_job:
            if latest_scheduler_job.latest_heartbeat:
                latest_scheduler_heartbeat = latest_scheduler_job.latest_heartbeat.isoformat()
            if latest_scheduler_job.is_alive():
                scheduler_status = HEALTHY
    except Exception:
        metadatabase_status = UNHEALTHY


Because the health endpoint evaluates only the single job returned by 
most_recent_job(), the check can only ever validate the health of one scheduler 
at a time.
In a distributed deployment with multiple active schedulers, as long as even 
one instance is running cleanly, the endpoint reports healthy even if every 
other scheduler instance has gone down.
To get meaningful scheduler status from the health endpoint, it should monitor 
every scheduler in the distributed environment instead of just a single one.

The Proposed Solution
To deal with this problem we can add a new field called schedulers (plural, for 
multiple schedulers) to the health endpoint response. It returns a 3-tier 
aggregated status with the following values:

  * HEALTHY: All registered scheduler instances are fully operational and 
    actively heartbeating.
  * DEGRADED: At least one scheduler instance is down or failing, but at least 
    one remaining instance is still working.
  * DOWN: All scheduler instances have failed or stopped working.
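
The three tiers above can be sketched as a small function over per-instance liveness flags. This is a sketch, not the proposed implementation: the name aggregate_status is hypothetical, and treating an empty set of instances as DOWN is an assumption the proposal does not spell out.

```python
from collections.abc import Iterable

HEALTHY = "HEALTHY"
DEGRADED = "DEGRADED"
DOWN = "DOWN"


def aggregate_status(instance_alive: Iterable[bool]) -> str:
    """Collapse per-instance liveness flags into one of the three tiers."""
    flags = list(instance_alive)
    # Assumption: no registered instances at all is reported as DOWN.
    if not flags or not any(flags):
        return DOWN
    if all(flags):
        return HEALTHY
    # Some instances alive, some not.
    return DEGRADED


print(aggregate_status([True, False, True]))
```

For three instances where one is down, as in the example response, this yields DEGRADED.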

Per-Instance Diagnostic Breakdown
We should also add a per-instance breakdown as a nested list that shows the 
following:

  1. hostname
  2. status: Individual status
  3. latest_heartbeat
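
A minimal sketch of how one entry in that list could be built from a job row. SchedulerJobRow and instance_entry are hypothetical names, and the 30-second liveness threshold is an assumption for illustration; a real implementation would presumably reuse the job's own is_alive() check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SchedulerJobRow:
    """Hypothetical stand-in for one scheduler row in the job table."""

    hostname: str
    latest_heartbeat: datetime


def instance_entry(
    row: SchedulerJobRow,
    now: datetime,
    threshold: timedelta = timedelta(seconds=30),  # assumed threshold
) -> dict[str, str]:
    """Build the per-instance dict: hostname, status, latest_heartbeat."""
    alive = now - row.latest_heartbeat <= threshold
    return {
        "hostname": row.hostname,
        "status": "HEALTHY" if alive else "DOWN",
        "latest_heartbeat": row.latest_heartbeat.isoformat(),
    }


now = datetime(2026, 6, 24, 23, 15, 10, tzinfo=timezone.utc)
row = SchedulerJobRow("scheduler-ha-instance-2", datetime(2026, 6, 24, 23, 10, 14, tzinfo=timezone.utc))
print(instance_entry(row, now))
```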

Example

{
  "metadatabase": {
    "status": "healthy"
  },
  "scheduler": {
    "scheduler_status": "healthy",
    "latest_scheduler_heartbeat": "2026-06-24T23:15:02+00:00"
  },
  "schedulers": {
    "status": "DEGRADED",
    "instances": [
      {
        "hostname": "scheduler-ha-instance-1",
        "status": "HEALTHY",
        "latest_heartbeat": "2026-06-24T23:15:02+00:00"
      },
      {
        "hostname": "scheduler-ha-instance-2",
        "status": "DOWN",
        "latest_heartbeat": "2026-06-24T23:10:14+00:00"
      },
      {
        "hostname": "scheduler-ha-instance-3",
        "status": "HEALTHY",
        "latest_heartbeat": "2026-06-24T23:14:59+00:00"
      }
    ]
  }
}

The response could end up looking roughly like the example above, giving a 
more meaningful health endpoint that makes it easier to diagnose issues with 
the scheduler. This is a LAZY CONSENSUS proposal.
