potiuk opened a new issue, #35204:
URL: https://github.com/apache/airflow/issues/35204

   ### Body
   
   ## Problem 
   
   The test in question:
   
   ```
   tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job
   ```
   
   Started to time out, mysteriously, on October 18, 2023:
   
   - only on our self-hosted runners (not on Public runners)
   - only for sqlite, not for Postgres / MySQL
   - it can also be reproduced locally on Linux, again only with sqlite
   - it can also be reproduced locally on MacOS, again only with sqlite
   
   
   ## Successes in the (recent) past
   
   The last run known to have succeeded was
   https://github.com/apache/airflow/actions/runs/6638965943/job/18039945807
   
   The test took just 2.77s:
   
   ```
   2.77s call     tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job
   ```
   
   Since then it consistently hangs in all runs on our self-hosted runners, while it consistently succeeds on Public runners.
   
   ## Reproducing locally
   
   Reproducing is super easy with breeze:
   
   ```
   pytest tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job -s --with-db-init
   ```
   
   Pressing Ctrl-C (i.e. sending SIGINT to all processes in the group) "unhangs" the test, and it then succeeds quickly (????)
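Since SIGINT "unhangs" the test, it would help to see exactly where the process is stuck without killing it. A hypothetical debugging helper (not part of the Airflow test suite) is to register stdlib `faulthandler` on a spare signal, then send that signal to the hung pytest process from another terminal:

```python
import faulthandler
import os
import signal

# Write the dump to a file so it survives pytest's capture of stderr.
log = open("hang_debug.log", "w")

# faulthandler.register() is POSIX-only, which matches where the hang
# reproduces (Linux and MacOS). On SIGUSR1, dump all thread tracebacks.
faulthandler.register(signal.SIGUSR1, file=log, all_threads=True)

# Demo: signal ourselves. In real use you would run
#   kill -USR1 <pid-of-hung-pytest>
# from another terminal while the test is hanging.
os.kill(os.getpid(), signal.SIGUSR1)

print(open("hang_debug.log").read())
```

The dump shows which thread is blocked (e.g. inside a sqlite call or a lock wait), which should narrow down the race considerably.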
   
   ## What's so strange
   
   It is super-mysterious:
   
   * There does not seem to be any significant difference in the dependencies. A few dependencies were upgraded in main, but going back to the versions they came from does not change anything:
   
   ```diff
    --- /files/constraints-3.8/original-constraints-3.8.txt      2023-10-26 11:32:47.167610348 +0000
    +++ /files/constraints-3.8/constraints-3.8.txt       2023-10-26 11:32:48.763610466 +0000
   @@ -184 +184 @@
   -asttokens==2.4.0
   +asttokens==2.4.1
   @@ -249 +249 @@
   -confluent-kafka==2.2.0
   +confluent-kafka==2.3.0
   @@ -352 +352 @@
   -greenlet==3.0.0
   +greenlet==3.0.1
   @@ -510 +510 @@
   -pyOpenSSL==23.2.0
   +pyOpenSSL==23.3.0
   @@ -619 +619 @@
   -spython==0.3.0
   +spython==0.3.1
   @@ -687 +687 @@
   -yandexcloud==0.238.0
   +yandexcloud==0.240.0
   ```
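When re-testing with the downgraded versions, it is worth verifying that the environment really ended up with the pre-upgrade versions, since a partial downgrade would invalidate the experiment. An illustrative sanity check (the `expected` pins are taken from the left-hand side of the diff above):

```python
# Illustrative sanity check: confirm installed versions match the
# pre-upgrade pins from the constraints diff before re-running the test.
from importlib.metadata import PackageNotFoundError, version

expected = {
    "asttokens": "2.4.0",
    "greenlet": "3.0.0",
    "pyOpenSSL": "23.2.0",
}

for name, want in expected.items():
    try:
        have = version(name)
    except PackageNotFoundError:
        have = "not installed"
    marker = "OK" if have == want else "MISMATCH"
    print(f"{name}: want {want}, have {have} [{marker}]")
```

Only the three packages shown here are spot-checked; the same loop works for the full diff.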
   
   * Even going back to the very same image that was used in the job that succeeded does not fix the problem. It still hangs.
   
   To reproduce, do this (020691f5cc0935af91a09b052de6122073518b4e is the image used in the successful job above):
   ```
    docker pull ghcr.io/apache/airflow/main/ci/python3.8:020691f5cc0935af91a09b052de6122073518b4e
    docker tag ghcr.io/apache/airflow/main/ci/python3.8:020691f5cc0935af91a09b052de6122073518b4e ghcr.io/apache/airflow/main/ci/python3.8:latest
    breeze
    pytest tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job -s --with-db-init
   ```
   
   It looks like something very strange is going on in the environment of the test - something is apparently triggering a very nasty race condition (kernel version? - this is the only idea I have) that has not yet reached the public runners.
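To test the kernel-version hypothesis, one could capture an environment fingerprint on both runner types and diff them. A minimal sketch (a diagnostic snippet, not part of the test suite) that records the suspects relevant here - kernel, libc, Python, and the sqlite library, since sqlite is the only backend that hangs:

```python
# Diagnostic sketch: print an environment fingerprint to compare
# self-hosted vs public runners.
import platform
import sqlite3
import sys

info = {
    "kernel": platform.release(),           # the main suspect
    "libc": " ".join(platform.libc_ver()),
    "python": sys.version.split()[0],
    "sqlite": sqlite3.sqlite_version,       # only sqlite runs hang
}

for key, value in info.items():
    print(f"{key}: {value}")
```

Running this inside the breeze container on a self-hosted runner and on a public runner, then diffing the output, would confirm or rule out the kernel as the variable.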
   
   
   
   
   ### Committer
   
   - [X] I acknowledge that I am a maintainer/committer of the Apache Airflow 
project.

