potiuk commented on pull request #11538:
URL: https://github.com/apache/airflow/pull/11538#issuecomment-708990762


   @kaxil @ashb @turbaszek -> this one should solve the Kubernetes problems we 
started to experience recently. Apparently they were not related to the 
provider split, as I originally suspected, but to some changes in how 
port forwarding interacts with the GA runner. So looking forward to 
reviews :+1: 
   
   
   One more thing and maybe you can help me verify my theory. 
   
   I believe GA is kinda reusing workers without fully restarting them between 
jobs. That might be the reason for the 137 errors and resource exhaustion, 
because the machines are not fully cleaned up.
   
   It could be a coincidence, but this is the only explanation I have for an 
error I saw yesterday, where some other jobs were affected by the kubectl 
background processes that we started in other jobs. This was an earlier version 
of the fix, which did not yet have the trap that kills (first gently and then 
forcefully) all kubectl instances running in the background:
   
   https://github.com/apache/airflow/runs/1256383093?check_suite_focus=true
   
   There were seemingly unrelated errors (in several other jobs). It seems 
that for other jobs (theoretically on different machines!), the tests were 
affected by the hanging kubectl processes running in the background, as if 
port 8080 continued to be "taken". I am not 100% sure of that, but that is the 
only explanation I have for this. The errors went away completely when I added 
the trap to kill the kubectl processes (in unrelated jobs!).
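
   For illustration only, here is a minimal sketch of a "first gently, then 
forcefully" cleanup of background processes. It is not the code from this PR: 
the function name `stop_gently_then_forcefully`, the grace period and the 
`kubectl port-forward` arguments in the commented usage are all assumptions.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from the PR): ask the given PIDs to exit
# with SIGTERM, wait up to a grace period, then send SIGKILL to any survivors.
# Usage: stop_gently_then_forcefully <grace_seconds> <pid>...
stop_gently_then_forcefully() {
    local grace_seconds="$1"
    shift
    local pid alive

    # Gentle phase: SIGTERM lets each process shut down cleanly.
    for pid in "$@"; do
        kill -TERM "${pid}" 2>/dev/null || true
    done

    # Poll once per second until all processes are gone or the grace period ends.
    local deadline=$((SECONDS + grace_seconds))
    while ((SECONDS < deadline)); do
        alive=0
        for pid in "$@"; do
            if kill -0 "${pid}" 2>/dev/null; then
                alive=1
            fi
        done
        if ((alive == 0)); then
            return 0
        fi
        sleep 1
    done

    # Forceful phase: SIGKILL cannot be caught or ignored.
    for pid in "$@"; do
        kill -KILL "${pid}" 2>/dev/null || true
    done
}

# Example wiring (commented out; service name and ports are assumptions):
#   kubectl port-forward svc/airflow-webserver 8080:8080 &
#   forward_pid=$!
#   trap 'stop_gently_then_forcefully 5 "${forward_pid}"' EXIT
```

   Tracking the PID of each background process keeps the cleanup limited to 
what the job itself started. Killing every kubectl instance on the machine 
(for example by process name) is the broader variant described above.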
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

