potiuk commented on PR #37446:
URL: https://github.com/apache/airflow/pull/37446#issuecomment-1946130824

   Hey @pankajastro - I believe (and have demonstrable proof) that #37363 
broke some of the behaviour of the KPO tests. The tests started to fail 
yesterday in main after the change was merged. While the tests work in 
isolation, the change seems to introduce heavily flaky behaviour in the async 
KPO when the system is busy, and I think it can have an impact in production. 
So I think it is better to revert it now and then try to find the root cause 
and fix it. 
   
   I was not able to pinpoint the root cause, but I believe it's caused by this 
change: 
   
   
https://github.com/apache/airflow/pull/37363/files#diff-9e582418b4d9f5578d2466b2e5ccd32faa3d8bb651fd7873c488469ccf599c0dR220
   
   It seems that on a busy system, the loop finishes much faster than it should 
and errors out with 
   `Pod did not leave 'Pending' phase within specified timeout`
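   
   This is not a diagnosis (the root cause is still unknown), but as a point of comparison while debugging, here is a minimal sketch of a wait loop whose timeout cannot expire early just because the machine is loaded. It starts the deadline only when waiting actually begins and measures it with a monotonic clock. `wait_until` and its parameters are hypothetical names for illustration, not the code from #37363 or any Airflow API.

```python
import time


def wait_until(predicate, timeout, poll_interval=0.01,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Hypothetical helper, not Airflow code. The deadline is computed when
    waiting actually starts, so time spent queued before this function
    runs does not eat into the timeout budget.
    """
    deadline = clock() + timeout
    while True:
        # Check the condition first, so one check always happens
        # even if the timeout is zero.
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

   If the real loop measures its timeout from a timestamp taken earlier (for example before the work was scheduled), a busy system could use up the budget before the first check. That is only one possibility worth ruling out.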
   
   The problem is that it does not happen even if you run a complete 
`Providers[-amazon,-google]` test type in isolation. It only happens when you 
run all test types in parallel.
   
   
   The issue is relatively easy (but a bit time consuming) to reproduce. On my 
local env, I can rather repeatably (same on CI) reproduce it with:
   
   ```
   breeze testing tests --run-in-parallel
   ```
   
   The error is the same as in the last few `main` builds, for example 
this one: 
https://github.com/apache/airflow/actions/runs/7911809922/job/21596873757 
   
   
   When I revert this one, it works nicely without errors.
   
   It's rather fishy and difficult to guess what the problem is.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
