github-actions[bot] opened a new pull request, #67068:
URL: https://github.com/apache/airflow/pull/67068

   test_integration_run_dag_with_scheduler_failure is intermittently flaky
   on ARM CI: the scheduler is killed immediately after start_job_in_kubernetes,
   so the new scheduler pod (after `kubectl rollout status` returns 
successfully)
   sometimes has to handle the very first scheduling step itself before the
   40s monitor_task timeout expires. push stays in `queued` for the full 40s
   and the test fails with `assert 'queued' == 'success'`.
   
   Two adjustments:
   
   1. Before killing the scheduler, wait until the `push` task instance has
      reached a `queued`-or-later state. That way the original scheduler has
      already handed the task to the executor and the post-restart scheduler
      only needs to drive the downstream dependency for `puller`, not pick up
      `push` from scratch.
   
   2. Bump the post-restart monitor_task timeout from 40s to 120s. The
      previous "fail fast if failing" budget races with scheduler-loop warm-up
      under load; 120s is still fast for a successful run and gives a clear
      margin for the legitimate cases.
   
   This is a residual flake left after #46502 — the rollout-status wait fixed
   the worst of it, but the race between "pod is running" and "scheduler loop
   is actually scheduling" remained.
   
   Reopens the spirit of #45145.
   (cherry picked from commit 246c19f113f7e5ca9b676567a9cfdde13c25c00e)
   
   Co-authored-by: Jarek Potiuk <[email protected]>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to