[
https://issues.apache.org/jira/browse/CAMEL-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pasquale Congiusti closed CAMEL-23380.
--------------------------------------
Resolution: Fixed
> `TracingRoutePolicy.onExchangeBegin` opens scopes and spans that are never
> closed/ended
> ---------------------------------------------------------------------------------------
>
> Key: CAMEL-23380
> URL: https://issues.apache.org/jira/browse/CAMEL-23380
> Project: Camel
> Issue Type: Bug
> Components: camel-opentelemetry
> Affects Versions: 4.14.5
> Reporter: Nikita
> Assignee: Pasquale Congiusti
> Priority: Major
> Fix For: 4.21.0
>
> Attachments: threaddump-bbjjj-1777575894 json.txt
>
>
> *Summary*
> `Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which
> opens an OTel scope via `makeCurrent()`, but the corresponding
> `endEventSpan()` is never reached. The span is never ended
> (`endEpochNanos=0`) and the scope is never closed. On pooled Tomcat threads,
> this causes `ArrayBasedContext` objects to accumulate in `ThreadLocal`
> storage, growing ~8 MB/hour until OOM.
> *Evidence*
> With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, *{*}71
> `Scope garbage collected before being closed` errors{*}* were captured over
> 15 minutes of normal traffic. All 71 share an identical stack trace and all
> 71 leaked spans have `endEpochNanos=0`.
> *{*}Stack trace (all 71 entries):{*}*
> ```
> io.opentelemetry.context.Context.makeCurrent (Context.java:231)
> io.opentelemetry.context.ImplicitContextKeyed.makeCurrent
> (ImplicitContextKeyed.java:33)
> org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent
> (OpenTelemetrySpanAdapter.java:47)
> org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
> (OpenTelemetryTracer.java:141)
> org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
> org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin
> (Tracer.java:220)
> org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before
> (CamelInternalProcessor.java:554)
> org.apache.camel.impl.engine.CamelInternalProcessor.process
> (CamelInternalProcessor.java:324)
> ```
> *{*}Per-exchange pattern:{*}*
> - 41 exchanges produced 71 leaks
> - 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
> - 15 exchanges leaked 3 scopes: `[router-route span, router-route span,
> child-route span]`
> All 71 leaks occur on the same Tomcat thread as the originating exchange (no
> async/cross-thread handoff for the HTTP side)
> The double router-route leak in the 3-scope pattern may indicate
> `TracingRoutePolicy` is registered twice on some routes, but the core issue
> (spans never ended) affects all patterns.
> *{*}Additional observation — wireTap executor threads also affected:{*}*
> Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating
> OTel contexts via the same ThreadLocal mechanism. In the most leaked pod
> observed (879 MB live heap), two of the four wireTap executor threads each
> retained ~15.9 MB of OTel context chains while the other two remained at
> baseline (~147 KB). The selectively affected threads are those that have
> processed exchanges on the child handler routes (route 3 above), which
> contain multiple `.wireTap()` calls. The wireTap executor threads are not
> Tomcat servlet threads, so the leak on these threads follows a different call
> path than the HTTP side but appears to share the same root cause:
> `TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan`
> never closes.
> *{*}Span names from the leaking routes:{*}*
> |Leaks|Span Name|
> |—|—|
> |31|`corsair/carrier-direct-documents` (child route)|
> |18|`carrier-direct-tracking-rest-consumer-*-router` (router route)|
> |12|`carrier-direct-document-rest-consumer-*-router` (router route)|
> |10|`corsair/carrier-direct-tracking` (child route)|
> *Route structure*
> The leaking routes follow a standard REST DSL pattern with three routes per
> endpoint:
> 1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls
> `.to("direct:X-router")`
> 2. `direct:X-router` (`routeId = X-router`) dispatches via
> `toD("direct:${header.ROUTE_TO}")`
> 3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with
> multiple `.wireTap()` calls
> Spans for routes 2 and 3 leak. Route 1's span does not appear in the
> strict-context output.
> *Heap dump confirmation*
> Production heap dumps across multiple pods and dates show linear growth of
> OTel objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with
> `wireTapIoExecutor` threads also affected at higher uptimes:
> |Date|Uptime|TaskThread Retained|Live Heap (MAT)|`SdkSpan` Count|
> |—|—|—|—|—|
> |Apr 17|0.2h|not measurable|~470 MB (baseline)|—|
> |Apr 17|10.8h|104 MB (16.8%)|593 MB|23,188|
> |Apr 17|23.9h|201 MB (26.2%)|733 MB|47,145|
> |Apr 24|23.9h|209 MB (30.1%)|697 MB|51,444|
> |Apr 29|high uptime|*{*}309 MB (33.5%){*}*|*{*}879 MB{*}*|*{*}83,138{*}*|
> Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM
> after approximately 40-50 hours. `SdkSpan` count (unended spans) grows
> proportionally and serves as the most direct indicator of leak severity.
> *Configuration*
> ```yaml
> camel:
> opentelemetry2:
> enabled: true
> trace-processors: false # only route-level spans, no per-processor spans
> ```
> *Environment*
> |Dependency|Version|
> |—|—|
> |`camel-opentelemetry2-starter`|4.14.5|
> |`camel-telemetry`|4.14.5 (transitive)|
> |`io.opentelemetry:opentelemetry-bom`|1.60.1|
> |`io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom`|2.26.1|
> |`io.micrometer:micrometer-tracing-bridge-otel`|1.5.10|
> |Spring Boot|3.5.13|
> |JDK|25.0.1+8-LTS|
> Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not
> cover this case. Our leak is synchronous, same-thread, and caused by
> `beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread
> scope closure.
> *Expected behavior*
> Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should
> have a matching `endEventSpan()` that ends the span and closes the scope when
> the exchange completes on that route.
> *Workaround*
> A servlet filter calling `Context.root().makeCurrent().close()` after each
> request caps the heap impact on Tomcat HTTP threads but does not fix the
> underlying issue (spans still go unended, trace data is lost). It also has no
> effect on `wireTapIoExecutor` threads or other non-servlet thread pools,
> which accumulate OTel contexts via the same mechanism but are not covered by
> the filter chain.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)