gnodet commented on PR #2971:
URL: https://github.com/apache/cxf/pull/2971#issuecomment-4135324651

   _Claude Code on behalf of Guillaume Nodet_
   
   @reta Thanks for the clarification about retries vs. silent ignoring. I 
investigated the root cause of why these tests can fail. Here's what I found:
   
   ### Root Cause 1: Thread-safety bug in `TestSpanHandler` (Brave/Micrometer)
   
   `TestSpanHandler.SPANS` 
([source](https://github.com/apache/cxf/blob/main/systests/tracing/src/test/java/org/apache/cxf/systest/brave/TestSpanHandler.java#L29))
 uses a plain `ArrayList`:
   
   ```java
   private static final List<MutableSpan> SPANS = new ArrayList<>(12);
   ```
   
   The client span is added from the main thread (via 
`BraveClientStartInterceptor.handleFault()`) and the server span is added from 
the server thread (via `BraveProvider` ContainerResponseFilter). Without 
synchronization:
   - `ArrayList.add()` from two threads can lose an element
   - `ArrayList.size()` polled by Awaitility may never see the update due to 
JMM visibility
   
   This affects `MicrometerTracingTest` too since it extends 
`AbstractBraveTracingTest` and uses the same `TestSpanHandler`.
   
   **Fix**: Change to `CopyOnWriteArrayList` (or 
`Collections.synchronizedList()`). Writes are rare (span completions), reads 
frequent (Awaitility polling), making COW ideal.
   
   ### Root Cause 2: Timing under CI load (all implementations)
   
   The `testThatNewSpanIsCreatedOnClientTimeout` test sets a 100ms client 
timeout against a server endpoint that does `Thread.sleep(500)`. The server 
span is recorded ~400-500ms after the client span. Under CI load with CPU 
starvation and GC pauses, this 500ms sleep can take significantly longer.
   
   The OTel tests use `InMemorySpanExporter` (thread-safe), so their failures 
are purely timing-related. This explains why bumping the Awaitility timeout 
helps but doesn't eliminate failures.
   
   ### Verified: Both spans ARE properly ended
   
   I traced through the full chain and confirmed that both client and server 
spans are correctly ended:
   - **Client span**: `MessageSenderEndingInterceptor` catches 
`SocketTimeoutException`, wraps it in `Fault` (RuntimeException), 
`PhaseInterceptorChain.doIntercept()` catches it, calls `unwind()` → 
`handleFault()` on `BraveClientStartInterceptor` → span is ended
   - **Server span**: The `ContainerResponseFilter` runs *before* response 
entity writing, so even if the client disconnected, the server span is ended 
before any I/O error
   
   ### Suggested improvement
   
   Making `TestSpanHandler` thread-safe should eliminate the Brave/Micrometer 
failures. The OTel timing-related failures would still need the increased 
timeout (or a redesign of the test to reduce the timing gap, e.g. using a 
shorter `Thread.sleep` in `BookStore.getBooksLong()`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to