[ 
https://issues.apache.org/jira/browse/CAMEL-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077640#comment-18077640
 ] 

AJ Danelz commented on CAMEL-23380:
-----------------------------------

*Re: error logs*

We turned on {{-Dio.opentelemetry.context.enableStrictContext=true}} to surface 
any scope-propagation issues. Over an 18-minute window on one pod we captured 
*1102 {{IllegalStateException}} events* logged by 
{{org.apache.camel.telemetry.Tracer}} ("Tracing: Failed to capture tracing 
data. This exception is ignored."). 100% have the same shape: scope opened on 
the caller thread, closed on {{ForkJoinPool.commonPool-worker-N}}. Span URI in 
every case is {{spring-rabbitmq://...?exchangePattern=InOut}}.

Representative stack:

{code:java}
java.lang.IllegalStateException: Thread [http-nio-8080-exec-60] opened scope, 
but thread [ForkJoinPool.commonPool-worker-2081] closed it
    at 
io.opentelemetry.context.StrictContextStorage$StrictScope.close(StrictContextStorage.java:205)
    at 
io.micrometer.tracing.otel.bridge.EventPublishingContextWrapper$1.lambda$attach$0(EventPublishingContextWrapper.java:56)
    at 
org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.close(OpenTelemetrySpanAdapter.java:56)
    at 
org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.close(OpenTelemetryTracer.java:153)
    at org.apache.camel.telemetry.Tracer.endEventSpan(Tracer.java:297)
    at 
org.apache.camel.telemetry.Tracer$TracingEventNotifier.notify(Tracer.java:203)
    at 
org.apache.camel.support.EventHelper.notifyExchangeSent(EventHelper.java:1048)
    at 
org.apache.camel.processor.SendProcessor.lambda$sendUsingProducer$2(SendProcessor.java:236)
    at 
org.apache.camel.component.springrabbit.SpringRabbitMQProducer.lambda$processInOut$0(SpringRabbitMQProducer.java:168)
    at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:884)
    ...
    at 
java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1450)
{code}

{{SpringRabbitMQProducer.processInOut}} chains 
{{CompletableFuture.whenComplete}} without an explicit executor, so 
{{ExchangeSent}} fires on {{ForkJoinPool.commonPool}} while the scope was 
opened on the caller. With strict mode off, the scope still leaks silently 
(caller thread's {{ThreadLocal}} isn't cleared); strict mode just makes it 
visible.

*Possibly related: Spring AMQP listener thread accumulation*

Attached thread dump from a separate pod (uptime 44m, 1768 routes running, 1 
RabbitMQ connection, 15 channels) shows 24 idle Spring AMQP consumer threads, 
all parked in {{BlockingQueueConsumer.nextMessage}}:

{noformat}
not.a.Spring.bean-1                   x10
not.a.Spring.bean-11                  x3
not.a.Spring.bean-12                  x1
not.a.Spring.bean-51..59              x9 (one each)
not.a.Spring.bean-consumerMonitor-1   x5
{noformat}

Top frames identical for all 24:

{noformat}
LockSupport.parkNanos
LinkedBlockingQueue.poll
SimpleMessageListenerContainer.doReceiveAndExecute
SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.mainLoop
{noformat}

The repeated {{-1}} suffix indicates 10 distinct {{SimpleAsyncTaskExecutor}} 
instances were created. {{not.a.Spring.bean}} is the default thread-name prefix 
Spring assigns when the executor isn't a managed bean and {{threadNamePrefix}} 
isn't set, suggesting {{camel-spring-rabbitmq}} instantiates a fresh executor 
per container without naming or lifecycle ownership. The threads outlive their 
containers across reload/recovery.

*Environment*

* Apache Camel 4.18.2
* camel-opentelemetry2-starter, camel-spring-rabbitmq
* OpenTelemetry SDK 1.61.0, instrumentation 2.27.0
* micrometer-tracing-bridge-otel
* Spring Boot 3.x, OpenJDK 25
* {{camel.opentelemetry2.enabled=true}}, 
{{camel.opentelemetry2.trace-processors=true}}

Full thread dump (Spring Boot actuator format) attached.


> `TracingRoutePolicy.onExchangeBegin` opens scopes and spans that are never 
> closed/ended
> ---------------------------------------------------------------------------------------
>
>                 Key: CAMEL-23380
>                 URL: https://issues.apache.org/jira/browse/CAMEL-23380
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-opentelemetry
>    Affects Versions: 4.14.5
>            Reporter: Nikita
>            Assignee: Pasquale Congiusti
>            Priority: Major
>
> *Summary*
> `Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which 
> opens an OTel scope via `makeCurrent()`, but the corresponding 
> `endEventSpan()` is never reached. The span is never ended 
> (`endEpochNanos=0`) and the scope is never closed. On pooled Tomcat threads, 
> this causes `ArrayBasedContext` objects to accumulate in `ThreadLocal` 
> storage, growing ~8 MB/hour until OOM.
> *Evidence*
> With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, *{*}71 
> `Scope garbage collected before being closed` errors{*}* were captured over 
> 15 minutes of normal traffic. All 71 share an identical stack trace and all 
> 71 leaked spans have `endEpochNanos=0`.
> *{*}Stack trace (all 71 entries):{*}*
> ```
> io.opentelemetry.context.Context.makeCurrent (Context.java:231)
> io.opentelemetry.context.ImplicitContextKeyed.makeCurrent 
> (ImplicitContextKeyed.java:33)
> org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent 
> (OpenTelemetrySpanAdapter.java:47)
> org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
>  (OpenTelemetryTracer.java:141)
> org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
> org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin 
> (Tracer.java:220)
> org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before 
> (CamelInternalProcessor.java:554)
> org.apache.camel.impl.engine.CamelInternalProcessor.process 
> (CamelInternalProcessor.java:324)
> ```
> *{*}Per-exchange pattern:{*}*
>  - 41 exchanges produced 71 leaks
>  - 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
>  - 15 exchanges leaked 3 scopes: `[router-route span, router-route span, 
> child-route span]`
> All 71 leaks occur on the same Tomcat thread as the originating exchange (no 
> async/cross-thread handoff for the HTTP side)
> The double router-route leak in the 3-scope pattern may indicate 
> `TracingRoutePolicy` is registered twice on some routes, but the core issue 
> (spans never ended) affects all patterns.
> *{*}Additional observation — wireTap executor threads also affected:{*}*
> Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating 
> OTel contexts via the same ThreadLocal mechanism. In the most leaked pod 
> observed (879 MB live heap), two of the four wireTap executor threads each 
> retained ~15.9 MB of OTel context chains while the other two remained at 
> baseline (~147 KB). The selectively affected threads are those that have 
> processed exchanges on the child handler routes (route 3 above), which 
> contain multiple `.wireTap()` calls. The wireTap executor threads are not 
> Tomcat servlet threads, so the leak on these threads follows a different call 
> path than the HTTP side but appears to share the same root cause: 
> `TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan` 
> never closes.
> *{*}Span names from the leaking routes:{*}*
> |Leaks|Span Name|
> |—|—|
> |31|`corsair/carrier-direct-documents` (child route)|
> |18|`carrier-direct-tracking-rest-consumer-*-router` (router route)|
> |12|`carrier-direct-document-rest-consumer-*-router` (router route)|
> |10|`corsair/carrier-direct-tracking` (child route)|
> *Route structure*
> The leaking routes follow a standard REST DSL pattern with three routes per 
> endpoint:
> 1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls 
> `.to("direct:X-router")`
> 2. `direct:X-router` (`routeId = X-router`) dispatches via 
> `toD("direct:${header.ROUTE_TO}")`
> 3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with 
> multiple `.wireTap()` calls
> Spans for routes 2 and 3 leak. Route 1's span does not appear in the 
> strict-context output.
> *Heap dump confirmation*
> Production heap dumps across multiple pods and dates show linear growth of 
> OTel objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with 
> `wireTapIoExecutor` threads also affected at higher uptimes:
> |Date|Uptime|TaskThread Retained|Live Heap (MAT)|`SdkSpan` Count|
> |—|—|—|—|—|
> |Apr 17|0.2h|not measurable|~470 MB (baseline)|—|
> |Apr 17|10.8h|104 MB (16.8%)|593 MB|23,188|
> |Apr 17|23.9h|201 MB (26.2%)|733 MB|47,145|
> |Apr 24|23.9h|209 MB (30.1%)|697 MB|51,444|
> |Apr 29|high uptime|*{*}309 MB (33.5%){*}*|*{*}879 MB{*}*|*{*}83,138{*}*|
> Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM 
> after approximately 40-50 hours. `SdkSpan` count (unended spans) grows 
> proportionally and serves as the most direct indicator of leak severity.
> *Configuration*
> ```yaml
> camel:
>   opentelemetry2:
>     enabled: true
>     trace-processors: false   # only route-level spans, no per-processor spans
> ```
> *Environment*
> |Dependency|Version|
> |—|—|
> |`camel-opentelemetry2-starter`|4.14.5|
> |`camel-telemetry`|4.14.5 (transitive)|
> |`io.opentelemetry:opentelemetry-bom`|1.60.1|
> |`io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom`|2.26.1|
> |`io.micrometer:micrometer-tracing-bridge-otel`|1.5.10|
> |Spring Boot|3.5.13|
> |JDK|25.0.1+8-LTS|
> Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not 
> cover this case. Our leak is synchronous, same-thread, and caused by 
> `beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread 
> scope closure.
> *Expected behavior*
> Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should 
> have a matching `endEventSpan()` that ends the span and closes the scope when 
> the exchange completes on that route.
> *Workaround*
> A servlet filter calling `Context.root().makeCurrent().close()` after each 
> request caps the heap impact on Tomcat HTTP threads but does not fix the 
> underlying issue (spans still go unended, trace data is lost). It also has no 
> effect on `wireTapIoExecutor` threads or other non-servlet thread pools, 
> which accumulate OTel contexts via the same mechanism but are not covered by 
> the filter chain.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to