potiuk commented on pull request #11538: URL: https://github.com/apache/airflow/pull/11538#issuecomment-708990762
@kaxil @ashb @turbaszek -> this one should solve the Kubernetes problems we started to experienced recently. They apparently were not related to the provider split as I originally suspected - but to some changes in the way how port forwarding started to interact with GA runner. So looking forward to reviews :+1: One more thing and maybe you can help me verify my theory. I believe GA is kinda reusing workers without full restarts between them - that might be the reason for 137 errors and resource exhaustion because they do not clean up the machines fully. It could be an accident this is the only explanation for an error I saw yesterday that some other jobs were affected by the kubectl background processes that we started in other jobs. This was an earlier version of the fix, but it did not have the trap that kills (first gently and then forcefully) all kubectl instances running in the background: https://github.com/apache/airflow/runs/1256383093?check_suite_focus=true There were seemingly unrelated errors (in several other jobs). Seems like for other jobs (theoretically in different machines!), the tests were affected by the background-running hanging kubectls, as if the 8080 port numbers continued to be be "taken". I am not 100% sure of that, but that is the only explanation I have for this. The errors went completely away when I added the trap to kill the kubectls (in unrelated jobs !). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
