potiuk opened a new issue, #35204: URL: https://github.com/apache/airflow/issues/35204
### Body ## Problem The test in question: ``` tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job ``` Started to timeout - mysteriously - on October 18, 2023: - only for self-hosted instances od ours (not for Public runners) - only for sqlite not for Postgres / MySQL - for local execution on Llinux it can be reproduced as well only with sqlite - for local execution on MacOS it can be reproduced as well only with sqlite ## Successes in the (recent past) The last time it's known to succeeded was https://github.com/apache/airflow/actions/runs/6638965943/job/18039945807 This test toook just 2.77s ``` 2.77s call tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job ``` Since then it is consistently handling for all runs on self-hosted runners of ours, while it consistenly succeeds on Public runnners. ## Reproducing locally Reproducing is super easy with breeze: ``` pytest tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job -s --with-db-init ``` Pressing Ctrl-C (so sending INT to all processes in the group) "unhangs" the test and it succeeds quickly (????) ## What's so strange It is super-mysterious: * There does not seem to be any significant difference in the dependencies. there are a few dependencies beign upgraded in main - but going back to the versions they are coming from does not change anything: ```diff --- /files/constraints-3.8/original-constraints-3.8.txt 2023-10-26 11:32:47.167610348 +0000 +++ /files/constraints-3.8/constraints-3.8.txt 2023-10-26 11:32:48.763610466 +0000 @@ -184 +184 @@ -asttokens==2.4.0 +asttokens==2.4.1 @@ -249 +249 @@ -confluent-kafka==2.2.0 +confluent-kafka==2.3.0 @@ -352 +352 @@ -greenlet==3.0.0 +greenlet==3.0.1 @@ -510 +510 @@ -pyOpenSSL==23.2.0 +pyOpenSSL==23.3.0 @@ -619 +619 @@ -spython==0.3.0 +spython==0.3.1 @@ -687 +687 @@ -yandexcloud==0.238.0 +yandexcloud==0.240.0 ``` * Even going back the very same image that was used in the job that succeeded does not fix the problem. It still hangs. Do this (020691f5cc0935af91a09b052de6122073518b4e is the image used in ``` docker pull ghcr.io/apache/airflow/main/ci/python3.8:020691f5cc0935af91a09b052de6122073518b4e docker tag ghcr.io/apache/airflow/main/ci/python3.8:020691f5cc0935af91a09b052de6122073518b4e ghcr.io/apache/airflow/main/ci/python3.8:latest breeze pytest tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job -s --with-db-init ``` Looks like there is something very strange going on with the environment of the test - something is apparently triggering a very nasty race condition (kernel version ? - this is the only idea I have) that is not yet avaiale on public runners. ### Committer - [X] I acknowledge that I am a maintainer/committer of the Apache Airflow project. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
