xBis7 opened a new pull request, #52936:
URL: https://github.com/apache/airflow/pull/52936

   Related issue: https://github.com/apache/airflow/issues/52906
   
   Although the test always passes locally, it has a 20-30% failure rate on
   the remote CI.
   
   To test it, I created a custom CI workflow that runs the test class on repeat.
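
A minimal sketch of what "running on repeat" could look like outside of CI, assuming each run is a separate process whose exit code signals pass or fail. The `repeat_command` helper and the pytest path in `PYTEST_CMD` are hypothetical placeholders, not the workflow used in this PR.

```python
import subprocess
import sys

# Hypothetical placeholder command; the module path and class name are not from the PR.
PYTEST_CMD = [sys.executable, "-m", "pytest", "path/to/test_module.py::TestClass"]


def repeat_command(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited with code 0."""
    passed = 0
    for _ in range(runs):
        # Each run is a fresh process, so state cannot leak between runs.
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            passed += 1
    return passed
```

Calling `repeat_command(PYTEST_CMD, 40)` would return a pass count that can be compared across runs with and without the suspected test method.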
   
   I determined that `test_scheduler_change_after_the_first_task_finishes` was
   the flaky test method. Once it failed, every test that ran after it failed as
   well. With that method commented out, the class passed 40/40 runs, whereas
   before it passed only 25/40.
   
   https://github.com/xBis7/airflow/actions/runs/16096527407
   
   https://github.com/xBis7/airflow/actions/runs/16096660040
   
   https://github.com/xBis7/airflow/actions/runs/16096742809
   
   https://github.com/xBis7/airflow/actions/runs/16096823166
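
As a rough sanity check on the 40/40 result, the sketch below estimates how likely 40 consecutive passes would be if the pass rate were still the observed 25/40. It assumes the runs are independent, which is a simplification; `chance_all_pass` is a hypothetical helper name.

```python
def chance_all_pass(pass_rate, runs):
    """Probability that `runs` independent runs all pass at the given per-run pass rate."""
    return pass_rate ** runs


# Observed pass rate with the flaky method enabled (25 of 40 runs).
baseline_pass_rate = 25 / 40

# Chance of seeing 40/40 passes if nothing had actually changed.
p = chance_all_pass(baseline_pass_rate, 40)
print(f"{p:.1e}")
```

Under that assumption the probability is vanishingly small, so the clean 40/40 result is unlikely to be luck.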
   
   The actual issue causing the flakiness is the value of the 
`scheduler_health_check_threshold` flag.
   
   The test works as follows:
   * it uses 2 schedulers
   * scheduler1 processes the dag_run
   * in the middle of the dag_run, scheduler1 becomes idle
     * This simulates a very long-running dag_run where one scheduler stops
       processing it so that another scheduler with more resources can pick it up
     * In that case, scheduler2 finishes the dag_run and detects that another
       scheduler started the dag_run (dr) spans
     * scheduler2 marks the dr spans in the DB so that the original scheduler,
       which holds the span objects in memory, knows to end them
   
   The rest of the tests need a low `scheduler_health_check_threshold` so that
   scheduler2 can mark scheduler1 as unhealthy quickly. This test needs the
   opposite: scheduler1 must still be considered healthy while it is idle.
   
   In the 20-30% of runs where the test fails, scheduler2 marks scheduler1 as
   unhealthy and therefore recreates the older spans because they are considered
   lost. The test then times out waiting for the span status to change from
   `ENDED` to `SHOULD_END`, which will never happen.
   
   After increasing the flag just for this test, the flakiness is gone. The
   test passed in 99 of 100 runs. The remaining run was canceled because the
   workflow didn't have enough resources and the test exceeded the 30-minute
   threshold.
   
   https://github.com/xBis7/airflow/actions/runs/16098256324
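
One way to raise the value for a single test is a scoped environment override. The sketch below assumes the option lives in the `[scheduler]` section and relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` environment variable convention. The `run_with_threshold` helper and the value 90 are hypothetical, and this is not necessarily how the PR implements the change.

```python
import os
from unittest import mock

# Assumed env var name, derived from the AIRFLOW__{SECTION}__{KEY} convention.
ENV_KEY = "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"


def run_with_threshold(seconds, fn):
    """Call fn() with the threshold env var overridden, then restore the environment."""
    with mock.patch.dict(os.environ, {ENV_KEY: str(seconds)}):
        # Processes started inside fn() inherit the overridden value.
        return fn()
```

For example, `run_with_threshold(90, start_schedulers_and_run_test)` would apply the higher value only while the hypothetical `start_schedulers_and_run_test` callable runs, leaving the low value in place for the remaining tests.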
   
   
   
   
   

