The GitHub Actions job "Tests" on airflow.git/fix-k8s-multi-scheduler-thrash 
has succeeded.
Run started by GitHub user potiuk (triggered by potiuk).

Head commit for run:
8bc3c73e5b4b87ea66604600280bfc4d4a8c8213 / Jarek Potiuk <[email protected]>
KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers

PR #61839 (cncf-kubernetes 10.15.0) added a periodic call to
`_adopt_completed_pods` from inside `KubernetesExecutor.sync()`, gated by
`[scheduler] orphaned_tasks_check_interval` (default 300 s). The query
selects every Succeeded pod whose `airflow-worker` label is not the current
scheduler's label and PATCHes it with the current scheduler's label so its
KubernetesJobWatcher will see the change and DELETE the pod.

In multi-scheduler deployments this caused thrashing: once per
`orphaned_tasks_check_interval`, each scheduler iterated over every Succeeded
pod that did not carry its own label and PATCHed it. Schedulers fought each
other:

  * Scheduler A relabels every Succeeded pod owned by B and C → A's watcher
    DELETEs them.
  * Scheduler B does the same a few seconds later → relabels A's freshly
    patched pods to B → B's watcher takes over.
  * Scheduler C does the same.
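
The relabeling loop described in the list above can be illustrated with a
small toy model. This is only a sketch: `simulate_unscoped_tick` and the
pod/scheduler names are hypothetical, and the model ignores the watcher's
DELETE, so pods stay around for the whole tick.

```python
def simulate_unscoped_tick(pod_labels, schedulers):
    """Toy model of one interval tick with the unscoped selector.

    Each scheduler, in turn, relabels every Succeeded pod whose
    `airflow-worker` label is not its own. Returns the number of PATCH
    calls issued and the final pod -> label mapping.
    """
    labels = dict(pod_labels)  # copy so the caller's mapping is untouched
    patches = 0
    for scheduler in schedulers:
        for pod, owner in labels.items():
            if owner != scheduler:
                labels[pod] = scheduler  # stands in for patch_namespaced_pod
                patches += 1
    return patches, labels


pods = {"pod-1": "A", "pod-2": "B", "pod-3": "C"}
patches, final_labels = simulate_unscoped_tick(pods, ["A", "B", "C"])
print(patches, final_labels)
```

In this toy run three pods cost eight PATCH calls in a single tick, and every
pod ends up labeled with whichever scheduler ran last.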

At steady state with high pod churn this manifested as heavy
PATCH /api/v1/namespaces/.../pods/... traffic, expensive `_list_pods` calls
on every interval tick (#35599 already documents that these take 15-30 s with
500 pods), and tasks stalling in `scheduled` / `queued` because every scheduler
loop was burning seconds inside `_list_pods` and `patch_namespaced_pod`
instead of doing useful scheduling. Setting `delete_worker_pods=False` did
NOT help: the periodic adoption code path doesn't gate on it; it goes
through the watcher's delete.

Fix: scope the periodic adoption to pods owned by no-longer-alive
schedulers. New helper `_alive_other_scheduler_job_ids` queries the
`Job` table for SchedulerJobs whose `state == RUNNING` and whose
`latest_heartbeat` is within `[scheduler] scheduler_health_check_threshold`
(matching the alive-scheduler definition already used by
`SchedulerJobRunner.adopt_or_reset_orphaned_tasks`). The label selector
in `_adopt_completed_pods` is then built to exclude self + every alive
sibling using K8s set-based syntax `airflow-worker notin (a,b,c)`:

  * Single-scheduler deployment: no behavior change. Helper returns empty
    set, selector falls back to the original equality form
    `airflow-worker!=<self_label>`.
  * Multi-scheduler deployment: each scheduler only adopts pods whose
    owning scheduler is gone — preserving the original goal of #61839
    (cleanup after a scheduler restart) without the thrash.
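
The two selector forms above can be sketched as follows. This is a minimal
illustration, not the code from the commit: `build_adoption_selector` is a
hypothetical name, it produces only the `airflow-worker` part of the
selector, and sorting the labels is an assumption made to keep the output
deterministic.

```python
def build_adoption_selector(self_label: str, alive_sibling_labels: set[str]) -> str:
    """Build the `airflow-worker` label-selector fragment (hypothetical helper).

    With no alive siblings, keep the original equality form; otherwise use
    the Kubernetes set-based `notin` form to exclude self and every alive
    sibling scheduler.
    """
    if not alive_sibling_labels:
        return f"airflow-worker!={self_label}"
    excluded = sorted({self_label, *alive_sibling_labels})
    return f"airflow-worker notin ({','.join(excluded)})"


print(build_adoption_selector("1", set()))       # single-scheduler form
print(build_adoption_selector("1", {"2", "3"}))  # multi-scheduler form
```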

If the DB query fails, the helper returns an empty set so the caller
falls back to the pre-#61839 "exclude self only" selector — a transient
DB issue must not break completed-pod cleanup.
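
A rough sketch of the alive-scheduler check and its failure fallback,
without a real database. Everything here is assumed for illustration:
`SchedulerJobRow`, the `fetch_scheduler_jobs` callable (standing in for the
`Job` table query), and the `"running"` state string are hypothetical
stand-ins, not the commit's actual code.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable

log = logging.getLogger(__name__)


@dataclass
class SchedulerJobRow:
    """Hypothetical stand-in for a row of the `Job` table."""

    id: int
    state: str
    latest_heartbeat: datetime


def alive_other_scheduler_job_ids(
    fetch_scheduler_jobs: Callable[[], Iterable[SchedulerJobRow]],
    self_job_id: int,
    threshold_seconds: int,
    now: datetime | None = None,
) -> set[int]:
    """Return ids of other schedulers that are running with a fresh heartbeat.

    Any failure while fetching rows yields an empty set, so the caller
    falls back to the "exclude self only" selector.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=threshold_seconds)
    try:
        rows = list(fetch_scheduler_jobs())
    except Exception:
        log.warning("Could not list alive schedulers; excluding self only", exc_info=True)
        return set()
    return {
        row.id
        for row in rows
        if row.id != self_job_id
        and row.state == "running"
        and row.latest_heartbeat >= cutoff
    }
```

Passing the query in as a callable keeps the sketch testable without a
database; a caller whose query raises simply gets an empty set back.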

Two new unit tests cover the multi-scheduler set-based selector and
confirm the single-scheduler equality form is unchanged. Existing
`test_adopt_completed_pods` and `test_adopt_completed_pods_api_exception`
keep their original assertions because the new helper falls back to an
empty set when `scheduler_job_id` is the test's non-numeric string.

Closes: #66396

Report URL: https://github.com/apache/airflow/actions/runs/25368272798

With regards,
GitHub Actions via GitBox

