potiuk commented on PR #37446: URL: https://github.com/apache/airflow/pull/37446#issuecomment-1946130824
Hey @pankajastro - I believe (and have demonstrable proof) that #37363 broke some of the behaviour of the KPO tests. The tests started to fail in main yesterday after the change was merged, and while the tests work in isolation, the change seems to introduce heavily flaky behaviour in the async KPO path when the system is busy. I think this can have an impact in production, so I think it is better to revert it now and then try to find the root cause and fix it.

I was not able to pinpoint the root cause, but I believe it's caused by this change: https://github.com/apache/airflow/pull/37363/files#diff-9e582418b4d9f5578d2466b2e5ccd32faa3d8bb651fd7873c488469ccf599c0dR220

It seems that on a busy system the loop finishes much faster than it should and errors out with `Pod did not leave 'Pending' phase within specified timeout`.

The problem is that it does not happen even if you run a complete `Providers[-amazon,-google]` test type in isolation. It only happens when you run all test types in parallel.

The issue is relatively easy (but a bit time-consuming) to reproduce. In my local env I can reproduce it fairly reliably (same on CI) with:

```
breeze testing tests --run-in-parallel
```

The error is the same as in the last few `main` builds, for example this one: https://github.com/apache/airflow/actions/runs/7911809922/job/21596873757

When I revert this change, it works nicely without errors. It's rather fishy, and it is difficult to guess what the problem is.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
