potiuk commented on PR #27214:
URL: https://github.com/apache/airflow/pull/27214#issuecomment-1293437980

   > Not related to this PR, but a bit of an annoying thing.
   > 
   > Does anyone know what might be the nature of this CI error, which happens from time to time? Is it possible that another `airflow-test-integration_trino_1` runs at the same time on the CI worker?
   > 
   > ```
   >   Host is already in use by another container
   >   Creating airflow-test-integration_trino_1                  ... error
   >   
   >   ERROR: for airflow-test-integration_trino_1  Cannot start service trino: driver failed programming external connectivity on endpoint airflow-test-integration_trino_1 (9718b96633b0c3913bc614cb7fc4577496eae3ae9a0d5872f8aa14333f8816b8): Error starting userland proxy: listen tcp4 0.0.0.0:38080: bind: address already in use
   > ```
   
   I have been chasing that one for a long time, and I was never able to come up with a plausible hypothesis for why it happens or to implement a workaround. Any ideas or input are more than welcome.
   
   It happens intermittently, which makes it very difficult to diagnose. I was never able to reproduce it locally, and it is extremely annoying to see. Theoretically it should not happen on GitHub public runners: we should get a clean public runner every time we run a job there, so a busy port is not something that should happen.
   
   What I THINK happens is that docker-compose, which starts several services for our integration tests (with multiple containers), either hits some race condition or runs into resource problems (memory, open sockets, etc.).
   
   One idea I have is that we might eventually want to make some exclusions for the integration tests, and maybe run them on a single runner only (sqlite? on public runners). I might experiment with it and open a PR for that after I re-run those failures.
   
   It's interesting to see it happen in 3 jobs out of 4, as was the case in your build. Theoretically those are independent runners, and yet all of them failed at about the same time; only sqlite passed. I will restart those jobs and see whether it reproduces or whether (as usual) it is a race/intermittent failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
