[ 
https://issues.apache.org/jira/browse/CAMEL-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pasquale Congiusti closed CAMEL-23380.
--------------------------------------
    Resolution: Fixed

> `TracingRoutePolicy.onExchangeBegin` opens scopes and spans that are never 
> closed/ended
> ---------------------------------------------------------------------------------------
>
>                 Key: CAMEL-23380
>                 URL: https://issues.apache.org/jira/browse/CAMEL-23380
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-opentelemetry
>    Affects Versions: 4.14.5
>            Reporter: Nikita
>            Assignee: Pasquale Congiusti
>            Priority: Major
>             Fix For: 4.21.0
>
>         Attachments: threaddump-bbjjj-1777575894 json.txt
>
>
> *Summary*
> `Tracer$TracingRoutePolicy.onExchangeBegin()` calls `beginEventSpan()` which 
> opens an OTel scope via `makeCurrent()`, but the corresponding 
> `endEventSpan()` is never reached. The span is never ended 
> (`endEpochNanos=0`) and the scope is never closed. On pooled Tomcat threads, 
> this causes `ArrayBasedContext` objects to accumulate in `ThreadLocal` 
> storage, growing ~8 MB/hour until OOM.
> *Evidence*
> With `-Dio.opentelemetry.context.enableStrictContext=true` enabled, *{*}71 
> `Scope garbage collected before being closed` errors{*}* were captured over 
> 15 minutes of normal traffic. All 71 share an identical stack trace and all 
> 71 leaked spans have `endEpochNanos=0`.
> *{*}Stack trace (all 71 entries):{*}*
> ```
> io.opentelemetry.context.Context.makeCurrent (Context.java:231)
> io.opentelemetry.context.ImplicitContextKeyed.makeCurrent 
> (ImplicitContextKeyed.java:33)
> org.apache.camel.opentelemetry2.OpenTelemetrySpanAdapter.makeCurrent 
> (OpenTelemetrySpanAdapter.java:47)
> org.apache.camel.opentelemetry2.OpenTelemetryTracer$OpentelemetrySpanLifecycleManager.activate
>  (OpenTelemetryTracer.java:141)
> org.apache.camel.telemetry.Tracer.beginEventSpan (Tracer.java:267)
> org.apache.camel.telemetry.Tracer$TracingRoutePolicy.onExchangeBegin 
> (Tracer.java:220)
> org.apache.camel.impl.engine.CamelInternalProcessor$RoutePolicyAdvice.before 
> (CamelInternalProcessor.java:554)
> org.apache.camel.impl.engine.CamelInternalProcessor.process 
> (CamelInternalProcessor.java:324)
> ```
> *{*}Per-exchange pattern:{*}*
>  - 41 exchanges produced 71 leaks
>  - 26 exchanges leaked 2 scopes: `[router-route span, child-route span]`
>  - 15 exchanges leaked 3 scopes: `[router-route span, router-route span, 
> child-route span]`
> All 71 leaks occur on the same Tomcat thread as the originating exchange (no 
> async/cross-thread handoff for the HTTP side)
> The double router-route leak in the 3-scope pattern may indicate 
> `TracingRoutePolicy` is registered twice on some routes, but the core issue 
> (spans never ended) affects all patterns.
> *{*}Additional observation — wireTap executor threads also affected:{*}*
> Heap dump analysis from Apr 29 shows `wireTapIoExecutor` threads accumulating 
> OTel contexts via the same ThreadLocal mechanism. In the most leaked pod 
> observed (879 MB live heap), two of the four wireTap executor threads each 
> retained ~15.9 MB of OTel context chains while the other two remained at 
> baseline (~147 KB). The selectively affected threads are those that have 
> processed exchanges on the child handler routes (route 3 above), which 
> contain multiple `.wireTap()` calls. The wireTap executor threads are not 
> Tomcat servlet threads, so the leak on these threads follows a different call 
> path than the HTTP side but appears to share the same root cause: 
> `TracingRoutePolicy.onExchangeBegin` opening a scope that `endEventSpan` 
> never closes.
> *{*}Span names from the leaking routes:{*}*
> |Leaks|Span Name|
> |—|—|
> |31|`corsair/carrier-direct-documents` (child route)|
> |18|`carrier-direct-tracking-rest-consumer-*-router` (router route)|
> |12|`carrier-direct-document-rest-consumer-*-router` (router route)|
> |10|`corsair/carrier-direct-tracking` (child route)|
> *Route structure*
> The leaking routes follow a standard REST DSL pattern with three routes per 
> endpoint:
> 1. `rest:post:/path` (auto-generated, `routeId = X--rest`) calls 
> `.to("direct:X-router")`
> 2. `direct:X-router` (`routeId = X-router`) dispatches via 
> `toD("direct:${header.ROUTE_TO}")`
> 3. `direct:X-<version>` (`routeId = X::<version>`) is the handler with 
> multiple `.wireTap()` calls
> Spans for routes 2 and 3 leak. Route 1's span does not appear in the 
> strict-context output.
> *Heap dump confirmation*
> Production heap dumps across multiple pods and dates show linear growth of 
> OTel objects retained by Tomcat `http-nio-*-exec` ThreadLocals, with 
> `wireTapIoExecutor` threads also affected at higher uptimes:
> |Date|Uptime|TaskThread Retained|Live Heap (MAT)|`SdkSpan` Count|
> |—|—|—|—|—|
> |Apr 17|0.2h|not measurable|~470 MB (baseline)|—|
> |Apr 17|10.8h|104 MB (16.8%)|593 MB|23,188|
> |Apr 17|23.9h|201 MB (26.2%)|733 MB|47,145|
> |Apr 24|23.9h|209 MB (30.1%)|697 MB|51,444|
> |Apr 29|high uptime|*{*}309 MB (33.5%){*}*|*{*}879 MB{*}*|*{*}83,138{*}*|
> Growth rate: ~8 MB/hour across HTTP threads. With an 8 GB max heap, pods OOM 
> after approximately 40-50 hours. `SdkSpan` count (unended spans) grows 
> proportionally and serves as the most direct indicator of leak severity.
> *Configuration*
> ```yaml
> camel:
>   opentelemetry2:
>     enabled: true
>     trace-processors: false   # only route-level spans, no per-processor spans
> ```
> *Environment*
> |Dependency|Version|
> |—|—|
> |`camel-opentelemetry2-starter`|4.14.5|
> |`camel-telemetry`|4.14.5 (transitive)|
> |`io.opentelemetry:opentelemetry-bom`|1.60.1|
> |`io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom`|2.26.1|
> |`io.micrometer:micrometer-tracing-bridge-otel`|1.5.10|
> |Spring Boot|3.5.13|
> |JDK|25.0.1+8-LTS|
> Note: `CAMEL-21302` (async producer context leak, fixed in 4.4.5) does not 
> cover this case. Our leak is synchronous, same-thread, and caused by 
> `beginEventSpan`/`endEventSpan` lifecycle mismatch rather than cross-thread 
> scope closure.
> *Expected behavior*
> Every `beginEventSpan()` call in `TracingRoutePolicy.onExchangeBegin` should 
> have a matching `endEventSpan()` that ends the span and closes the scope when 
> the exchange completes on that route.
> *Workaround*
> A servlet filter calling `Context.root().makeCurrent().close()` after each 
> request caps the heap impact on Tomcat HTTP threads but does not fix the 
> underlying issue (spans still go unended, trace data is lost). It also has no 
> effect on `wireTapIoExecutor` threads or other non-servlet thread pools, 
> which accumulate OTel contexts via the same mechanism but are not covered by 
> the filter chain.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to