The GitHub Actions job "Tests" on airflow.git/fix-k8s-multi-scheduler-thrash has succeeded. Run started by GitHub user potiuk (triggered by potiuk).
Head commit for run:
8bc3c73e5b4b87ea66604600280bfc4d4a8c8213 / Jarek Potiuk <[email protected]>
KubernetesExecutor: scope periodic completed-pod adoption to dead schedulers

PR #61839 (cncf-kubernetes 10.15.0) added a periodic call to `_adopt_completed_pods` from inside `KubernetesExecutor.sync()`, gated by `[scheduler] orphaned_tasks_check_interval` (default 300 s). The query selects every Succeeded pod whose `airflow-worker` label is not the current scheduler's label and PATCHes it with the current scheduler's label so its KubernetesJobWatcher will see the change and DELETE the pod.

With multi-scheduler deployments that caused thrashing: every `orphaned_tasks_check_interval` each scheduler iterated over every Succeeded pod that did not carry its own label and PATCHed it. Schedulers fought each other:

* Scheduler A relabels every Succeeded pod owned by B and C → A's watcher DELETEs them.
* Scheduler B does the same a few seconds later → relabels A's freshly patched pods to B → B's watcher takes over.
* Scheduler C the same.

At steady state with high pod churn this manifested as heavy PATCH /api/v1/namespaces/.../pods/... traffic, expensive `_list_pods` calls on every interval tick (#35599 already documents this is 15-30 s with 500 pods), and tasks stalling in `scheduled` / `queued` because every scheduler loop was burning seconds inside `_list_pods` and `patch_namespaced_pod` instead of doing useful scheduling.

Setting `delete_worker_pods=False` did NOT help: the periodic adoption code path doesn't gate on it; it goes through the watcher's delete.

Fix: scope the periodic adoption to pods owned by no-longer-alive schedulers. New helper `_alive_other_scheduler_job_ids` queries the `Job` table for SchedulerJobs whose `state == RUNNING` and whose `latest_heartbeat` is within `[scheduler] scheduler_health_check_threshold` (matching the alive-scheduler definition already used by `SchedulerJobRunner.adopt_or_reset_orphaned_tasks`).
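
The liveness rule described above can be illustrated with a minimal sketch. It is not the code from the commit: the real helper queries the `Job` table, whereas this version filters in-memory records so it can run on its own. The `SchedulerJobRow` type, the function signature, the `"running"` state string, and the 30-second threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SchedulerJobRow:
    """Hypothetical stand-in for one scheduler row of the Job table."""

    id: int
    state: str
    latest_heartbeat: datetime


def alive_other_scheduler_job_ids(rows, self_job_id, threshold_seconds, now):
    """Return ids of schedulers other than self that still count as alive.

    A scheduler counts as alive when its state is "running" (assumed value)
    and its latest heartbeat is no older than threshold_seconds.
    """
    cutoff = now - timedelta(seconds=threshold_seconds)
    return {
        row.id
        for row in rows
        if row.id != self_job_id
        and row.state == "running"
        and row.latest_heartbeat >= cutoff
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    rows = [
        SchedulerJobRow(1, "running", now),                           # self
        SchedulerJobRow(2, "running", now - timedelta(seconds=5)),    # alive sibling
        SchedulerJobRow(3, "running", now - timedelta(seconds=120)),  # stale heartbeat
        SchedulerJobRow(4, "failed", now - timedelta(seconds=1)),     # not running
    ]
    print(alive_other_scheduler_job_ids(rows, 1, 30, now))  # prints {2}
```

In this sketch only scheduler 2 is treated as an alive sibling, so pods labelled for schedulers 3 and 4 would remain eligible for adoption.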

The label selector in `_adopt_completed_pods` is then built to exclude self + every alive sibling using K8s set-based syntax `airflow-worker notin (a,b,c)`:

* Single-scheduler deployment: no behavior change. Helper returns empty set, selector falls back to the original equality form `airflow-worker!=<self_label>`.
* Multi-scheduler deployment: each scheduler only adopts pods whose owning scheduler is gone, preserving the original goal of #61839 (cleanup after a scheduler restart) without the thrash.

If the DB query fails, the helper returns an empty set so the caller falls back to the pre-#61839 "exclude self only" selector; a transient DB issue must not break completed-pod cleanup.

Two new unit tests cover the multi-scheduler set-based selector and confirm the single-scheduler equality form is unchanged. Existing `test_adopt_completed_pods` and `test_adopt_completed_pods_api_exception` keep their original assertions because the new helper falls back to an empty set when `scheduler_job_id` is the test's non-numeric string.

Closes: #66396

Report URL: https://github.com/apache/airflow/actions/runs/25368272798

With regards,
GitHub Actions via GitBox

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
