Abhishekmishra2808 commented on issue #61070:
URL: https://github.com/apache/airflow/issues/61070#issuecomment-3818094701

   Hi @jason810496,
   ### Analysis of the Flaky Failure
   I've analyzed the logs from the recent CI failures for 
`test_scheduler_change_after_the_first_task_finishes`. The `AssertionError` 
about the missing `task2` span appears to be caused by a race condition rather 
than a logic bug in the OTel integration itself.
   
   **Observed Issues:**
   1. **Collector Latency:** The integration tests rely on the 
`breeze-otel-collector`. Logs indicate occasional `NameResolutionError` or 
connection timeouts, suggesting the collector isn't always ready when the test 
begins its assertions.
   2. **Eventual Consistency:** Since OTel spans are exported asynchronously, 
the test can assert the existence of `task2` before the collector has 
finished processing the trace from the second scheduler.
   
   ### Proposed Approach
   I would like to work on a fix to make this test more robust. My proposed 
approach is:
   * **Implement a Polling/Retry Mechanism:** Instead of a single assertion, 
I'll wrap the span validation in a retry loop (using a helper or `tenacity`) 
with a reasonable timeout (e.g., 10s). This allows for the inherent delay in 
telemetry propagation.
   * **Environment Check:** Add a pre-test check to ensure the OTel collector 
endpoint is reachable before the test suite proceeds.
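
   As a rough illustration of the two ideas above, here is a minimal, 
stdlib-only sketch. The names `wait_for` and `collector_reachable` are 
hypothetical helpers, not existing Airflow or test-suite functions, and the 
defaults (10s timeout, 0.5s interval) are assumptions that would need tuning 
against real CI timings.

```python
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_for(check: Callable[[], T], timeout: float = 10.0, interval: float = 0.5) -> T:
    """Call check() repeatedly until it stops raising AssertionError.

    Hypothetical helper: re-raises the last AssertionError once `timeout`
    seconds have elapsed, so a real failure still surfaces in the test report.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


def collector_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical pre-test check: DNS failures, refused connections and
    timeouts are all subclasses of OSError, so each one yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

   In the test, the existing span assertion could be moved into a small 
function and passed to `wait_for`, and `collector_reachable` could be called 
from a fixture with whatever host and port the `breeze-otel-collector` 
actually exposes (not assumed here). A TCP connect only shows that the port is 
open, not that the collector is ready to accept spans, so the polling step 
would still be needed.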
   
   I'm happy to open a PR for this if the maintainers agree this is the right 
direction to reduce CI noise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
