Abhishekmishra2808 opened a new pull request, #61242:
URL: https://github.com/apache/airflow/pull/61242

   ### Description
   
   This PR fixes a flaky integration test: 
`test_scheduler_change_after_the_first_task_finishes` in 
`tests/integration/otel/test_otel.py`.
   
   **The Problem:**
   The test frequently failed in CI and local Breeze environments with an 
`AssertionError` (missing `task2` span) and a 
`urllib3.exceptions.NameResolutionError` for the host `breeze-otel-collector`. 
   
   This was caused by a race condition: the Airflow test components tried to 
connect to the OpenTelemetry (OTel) collector before Docker's internal DNS 
could resolve its hostname or before the collector service was ready to accept 
connections. Spans sent during that window were dropped, so the assertions failed.
   
   **The Fix:**
   I added a health check, `wait_for_otel_collector()`, to the 
`TestOtelIntegration` class.
   * The function uses `socket.create_connection` to poll the collector's 
availability.
   * It retries on `socket.gaierror` (DNS resolution failure) and 
`ConnectionRefusedError`, and gives up after a 60-second timeout.
   * The `setup_class` method now calls this health check before any tests 
execute, ensuring the infrastructure is stable.
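
A minimal sketch of what such a health check could look like, for readers who have not opened the diff. It is not the code from this PR: the port `4318`, the `interval` parameter, the module-level placement, and the exact error message are assumptions for illustration. Only the host name, the use of `socket.create_connection`, the two handled exceptions, and the 60-second timeout come from the description above.

```python
import socket
import time


def wait_for_otel_collector(host="breeze-otel-collector", port=4318,
                            timeout=60.0, interval=1.0):
    """Block until a TCP connection to the collector succeeds.

    The port and polling interval are assumed values, not taken from the PR.
    Raises TimeoutError if the collector is still unreachable after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            # A successful connect means DNS resolved and the port is accepting.
            with socket.create_connection((host, port), timeout=interval):
                return
        except (socket.gaierror, ConnectionRefusedError, socket.timeout) as exc:
            # gaierror: the host name does not resolve yet.
            # ConnectionRefusedError: the host is up but nothing is listening yet.
            last_error = exc
            time.sleep(interval)
    raise TimeoutError(
        f"OTel collector at {host}:{port} not reachable after {timeout}s: {last_error!r}"
    )


class TestOtelIntegration:
    """Simplified stand-in for the real test class."""

    @classmethod
    def setup_class(cls):
        # Run the health check once, before any test in the class executes.
        wait_for_otel_collector()
```

Polling a plain TCP connection only shows that the port is open, not that the collector is fully initialized, so a check like this narrows the race window rather than proving readiness.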
   
   This is a targeted fix that addresses the root cause of the network 
flakiness and infrastructure timing issues without modifying core production 
code.
   
   
   
   ---
   
   ### Related Issues
   
   * **closes:** #61070
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [ ] Yes (please specify the tool below)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

