[
https://issues.apache.org/jira/browse/CAMEL-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077640#comment-18077640
]
AJ Danelz commented on CAMEL-23380:
-----------------------------------
*Re: error logs*
We turned on {{-Dio.opentelemetry.context.enableStrictContext=true}} to surface
any scope-propagation issues. Over an 18-minute window on one pod we captured
*1102 {{IllegalStateException}} events* logged by
{{org.apache.camel.telemetry.Tracer}} ("Tracing: Failed to capture tracing
data. This exception is ignored."). 100% have the same shape: scope opened on
the caller thread, closed on {{ForkJoinPool.commonPool-worker-N}}. Span URI in
every case is {{spring-rabbitmq://...?exchangePattern=InOut}}.
Representative stack:
{code:java}
java.lang.IllegalStateException: Thread [http-nio-8080-exec-60] opened scope,
but thread [ForkJoinPool.commonPool-worker-2081] closed it
at
io.opentelemetry.context.StrictContextStorage$StrictScope.close(StrictContextStorage.java:205)
at
io.micrometer.tracing.otel.bridge.EventPublishingContextWrapper$1.lambda$attach$0(EventPublishingContextWrapper.java:56)
at
org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.close(OpenTelemetrySpanAdapter.java:56)
at
org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.close(OpenTelemetryTracer.java:153)
at org.apache.camel.telemetry.Tracer.endEventSpan(Tracer.java:297)
at
org.apache.camel.telemetry.Tracer$TracingEventNotifier.notify(Tracer.java:203)
at
org.apache.camel.support.EventHelper.notifyExchangeSent(EventHelper.java:1048)
at
org.apache.camel.processor.SendProcessor.lambda$sendUsingProducer$2(SendProcessor.java:236)
at
org.apache.camel.component.springrabbit.SpringRabbitMQProducer.lambda$processInOut$0(SpringRabbitMQProducer.java:168)
at
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:884)
...
at
java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1450)
{code}
{{SpringRabbitMQProducer.processInOut}} chains
{{CompletableFuture.whenComplete}} without an explicit executor, so
{{ExchangeSent}} fires on {{ForkJoinPool.commonPool}} while the scope was
opened on the caller. With strict mode off, the scope still leaks silently
(caller thread's {{ThreadLocal}} isn't cleared); strict mode just makes it
visible.
*Possibly related: Spring AMQP listener thread accumulation*
Attached thread dump from a separate pod (uptime 44m, 1768 routes running, 1
RabbitMQ connection, 15 channels) shows 24 idle Spring AMQP consumer threads,
all parked in {{BlockingQueueConsumer.nextMessage}}:
{noformat}
not.a.Spring.bean-1 x10
not.a.Spring.bean-11 x3
not.a.Spring.bean-12 x1
not.a.Spring.bean-51..59 x9 (one each)
not.a.Spring.bean-consumerMonitor-1 x5
{noformat}
Top frames identical for all 24:
{noformat}
LockSupport.parkNanos
LinkedBlockingQueue.poll
SimpleMessageListenerContainer.doReceiveAndExecute
SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.mainLoop
{noformat}
The repeated {{-1}} suffix indicates 10 distinct {{SimpleAsyncTaskExecutor}}
instances were created. {{not.a.Spring.bean}} is the default thread-name prefix
Spring assigns when the executor isn't a managed bean and {{threadNamePrefix}}
isn't set, suggesting {{camel-spring-rabbitmq}} instantiates a fresh executor
per container without naming or lifecycle ownership. The threads outlive their
containers across reload/recovery.
*Environment*
* Apache Camel 4.18.2
* camel-opentelemetry2-starter, camel-spring-rabbitmq
* OpenTelemetry SDK 1.61.0, instrumentation 2.27.0
* micrometer-tracing-bridge-otel
* Spring Boot 3.x, OpenJDK 25
* {{camel.opentelemetry2.enabled=true}},
{{camel.opentelemetry2.trace-processors=true}}
Full thread dump (Spring Boot actuator format) attached.
> `TracingRoutePolicy.onExchangeBegin` opens scopes and spans that are never
> closed/ended
> ---------------------------------------------------------------------------------------
>
> Key: CAMEL-23380
> URL: https://issues.apache.org/jira/browse/CAMEL-23380
> Project: Camel
> Issue Type: Bug
> Components: camel-opentelemetry
> Affects Versions: 4.14.5
> Reporter: Nikita
> Assignee: Pasquale Congiusti
> Priority: Major
>
> *Summary*
> `Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which
> opens an OTel scope via `makeCurrent()`, but the corresponding
> `endEventSpan()` is never reached. The span is never ended
> (`endEpochNanos=0`) and the scope is never closed. On pooled Tomcat threads,
> this causes `ArrayBasedContext` objects to accumulate in `ThreadLocal`
> storage, growing ~8 MB/hour until OOM.
> *Evidence*
> With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, *{*}71
> `Scope garbage collected before being closed` errors{*}* were captured over
> 15 minutes of normal traffic. All 71 share an identical stack trace and all
> 71 leaked spans have `endEpochNanos=0`.
> *{*}Stack trace (all 71 entries):{*}*
> ```
> io.opentelemetry.context.Context.makeCurrent (Context.java:231)
> io.opentelemetry.context.ImplicitContextKeyed.makeCurrent
> (ImplicitContextKeyed.java:33)
> org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent
> (OpenTelemetrySpanAdapter.java:47)
> org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
> (OpenTelemetryTracer.java:141)
> org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
> org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin
> (Tracer.java:220)
> org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before
> (CamelInternalProcessor.java:554)
> org.apache.camel.impl.engine.CamelInternalProcessor.process
> (CamelInternalProcessor.java:324)
> ```
> *{*}Per-exchange pattern:{*}*
> - 41 exchanges produced 71 leaks
> - 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
> - 15 exchanges leaked 3 scopes: `[router-route span, router-route span,
> child-route span]`
> All 71 leaks occur on the same Tomcat thread as the originating exchange (no
> async/cross-thread handoff for the HTTP side)
> The double router-route leak in the 3-scope pattern may indicate
> `TracingRoutePolicy` is registered twice on some routes, but the core issue
> (spans never ended) affects all patterns.
> *{*}Additional observation — wireTap executor threads also affected:{*}*
> Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating
> OTel contexts via the same ThreadLocal mechanism. In the most leaked pod
> observed (879 MB live heap), two of the four wireTap executor threads each
> retained ~15.9 MB of OTel context chains while the other two remained at
> baseline (~147 KB). The selectively affected threads are those that have
> processed exchanges on the child handler routes (route 3 above), which
> contain multiple `.wireTap()` calls. The wireTap executor threads are not
> Tomcat servlet threads, so the leak on these threads follows a different call
> path than the HTTP side but appears to share the same root cause:
> `TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan`
> never closes.
> *{*}Span names from the leaking routes:{*}*
> |Leaks|Span Name|
> |—|—|
> |31|`corsair/carrier-direct-documents` (child route)|
> |18|`carrier-direct-tracking-rest-consumer-*-router` (router route)|
> |12|`carrier-direct-document-rest-consumer-*-router` (router route)|
> |10|`corsair/carrier-direct-tracking` (child route)|
> *Route structure*
> The leaking routes follow a standard REST DSL pattern with three routes per
> endpoint:
> 1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls
> `.to("direct:X-router")`
> 2. `direct:X-router` (`routeId = X-router`) dispatches via
> `toD("direct:${header.ROUTE_TO}")`
> 3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with
> multiple `.wireTap()` calls
> Spans for routes 2 and 3 leak. Route 1's span does not appear in the
> strict-context output.
> *Heap dump confirmation*
> Production heap dumps across multiple pods and dates show linear growth of
> OTel objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with
> `wireTapIoExecutor` threads also affected at higher uptimes:
> |Date|Uptime|TaskThread Retained|Live Heap (MAT)|`SdkSpan` Count|
> |—|—|—|—|—|
> |Apr 17|0.2h|not measurable|~470 MB (baseline)|—|
> |Apr 17|10.8h|104 MB (16.8%)|593 MB|23,188|
> |Apr 17|23.9h|201 MB (26.2%)|733 MB|47,145|
> |Apr 24|23.9h|209 MB (30.1%)|697 MB|51,444|
> |Apr 29|high uptime|*{*}309 MB (33.5%){*}*|*{*}879 MB{*}*|*{*}83,138{*}*|
> Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM
> after approximately 40-50 hours. `SdkSpan` count (unended spans) grows
> proportionally and serves as the most direct indicator of leak severity.
> *Configuration*
> ```yaml
> camel:
> opentelemetry2:
> enabled: true
> trace-processors: false # only route-level spans, no per-processor spans
> ```
> *Environment*
> |Dependency|Version|
> |—|—|
> |`camel-opentelemetry2-starter`|4.14.5|
> |`camel-telemetry`|4.14.5 (transitive)|
> |`io.opentelemetry:opentelemetry-bom`|1.60.1|
> |`io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom`|2.26.1|
> |`io.micrometer:micrometer-tracing-bridge-otel`|1.5.10|
> |Spring Boot|3.5.13|
> |JDK|25.0.1+8-LTS|
> Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not
> cover this case. Our leak is synchronous, same-thread, and caused by
> `beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread
> scope closure.
> *Expected behavior*
> Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should
> have a matching `endEventSpan()` that ends the span and closes the scope when
> the exchange completes on that route.
> *Workaround*
> A servlet filter calling `Context.root().makeCurrent().close()` after each
> request caps the heap impact on Tomcat HTTP threads but does not fix the
> underlying issue (spans still go unended, trace data is lost). It also has no
> effect on `wireTapIoExecutor` threads or other non-servlet thread pools,
> which accumulate OTel contexts via the same mechanism but are not covered by
> the filter chain.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)