github-actions[bot] opened a new pull request, #67068:
URL: https://github.com/apache/airflow/pull/67068
test_integration_run_dag_with_scheduler_failure is intermittently flaky
on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes,
so the new scheduler pod (after `kubectl rollout status` returns
successfully)
sometimes has to handle the very first scheduling step itself before the
40s monitor_task timeout expires. push stays in `queued` for the full 40s
and the test fails with `assert 'queued' == 'success'`.
Two adjustments:
1. Before killing the scheduler, wait until the `push` task instance has
reached a `queued`-or-later state. That way the original scheduler has
already handed the task to the executor and the post-restart scheduler
only needs to drive the downstream dependency for `puller`, not pick up
`push` from scratch.
2. Bump the post-restart monitor_task timeout from 40s to 120s. The
previous "fail fast if failing" budget races with scheduler-loop warm-up
under load; 120s is still fast for a successful run and gives a clear
margin for the legitimate cases.
This is a residual flake left after #46502 — the rollout-status wait fixed
the worst of it, but the race between "pod is running" and "scheduler loop
is actually scheduling" remained.
Reopens the spirit of #45145.
(cherry picked from commit 246c19f113f7e5ca9b676567a9cfdde13c25c00e)
Co-authored-by: Jarek Potiuk <[email protected]>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]