qianye1001 commented on issue #10395:
URL: https://github.com/apache/rocketmq/issues/10395#issuecomment-4560468682

   ## 🤖 Automated Fix Proposal
   
   I've analyzed this issue against the latest upstream code and generated a 
fix specification.
   
   ### Summary
   
   **Root Cause**: When TLS certificates are dynamically reloaded via 
`FileWatchService`, the `loadSslContext()` methods in `NettyRemotingServer` and 
`ProxyAndTlsProtocolNegotiator` directly overwrite the `sslContext` field 
without releasing the old instance. Since `ReferenceCountedOpenSslContext` 
allocates native off-heap memory (SSL_CTX, X509 chain, EVP_PKEY), this native 
memory is leaked on every certificate rotation cycle.
   
   **Fix Strategy**:
   1. Save the old `SslContext` reference before building the replacement
   2. After successful assignment of the new context, call 
`ReferenceCountUtil.release(oldSslContext)` to free native memory
   3. Use "build new, then release old" ordering to ensure `sslContext` is 
never null or prematurely released
   
   **Files to Modify**:
   - 
`remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingServer.java`
   - 
`proxy/src/main/java/org/apache/rocketmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java`
   
   **Safety**:
   - `ReferenceCountUtil.release()` is safe for JDK provider `SslContext` 
(no-op for non-ReferenceCounted)
   - In-flight connections using the old context are unaffected (refCount 
managed by Netty channel pipeline)
   - Added `volatile` to `sslContext` field in `ProxyAndTlsProtocolNegotiator` 
for thread safety
   
   <details>
   <summary>Full Specification (click to expand)</summary>
   
   ```markdown
   # Fix Specification: Issue #10395 - Dynamic Certificate Reload Native Memory 
Leak
   
   ## 1. Root Cause Analysis
   
   When TLS certificates are dynamically reloaded, `loadSslContext()` in both 
`NettyRemotingServer` and `ProxyAndTlsProtocolNegotiator` directly overwrites 
the `sslContext` field with a newly built `SslContext` instance. The previous 
`SslContext` is never released.
   
   Netty's `SslContext` (particularly when backed by `SslProvider.OPENSSL` / 
`SslProvider.OPENSSL_REFCNT`) implements `ReferenceCounted`. Each instance 
holds native OpenSSL structures (SSL_CTX, X509, EVP_PKEY, etc.) allocated 
outside the JVM heap. When the old reference is discarded without calling 
`release()`, these native allocations are leaked permanently — the GC cannot 
reclaim them because they live in native memory managed by reference counting.
   
   Over time, repeated certificate reloads (e.g., every few minutes via a 
file-watcher or scheduled task) cause unbounded native memory growth, 
eventually leading to OOM kills or container evictions.
   
   **Affected locations:**
   
   | File | Method | Line |
   |------|--------|------|
   | 
`remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingServer.java`
 | `loadSslContext()` | 186 |
   | 
`proxy/src/main/java/org/apache/rocketmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java`
 | `loadSslContext()` | 118, 131 |
   
   **Field declarations:**
   
   - `NettyRemotingAbstract.java:121` — `protected volatile SslContext 
sslContext;`
   - `ProxyAndTlsProtocolNegotiator.java:80` — `private static SslContext 
sslContext;`
   
   ## 2. Fix Strategy
   
   ### Step 1: Save Old Reference Before Replacement
   
   Before building the new `SslContext`, capture the current value of 
`sslContext` into a local variable.
   
   ### Step 2: Build and Assign New SslContext
   
   Build the new `SslContext` and assign it to the field (existing behavior, 
unchanged).
   
   ### Step 3: Release Old SslContext
   
   After the new context is successfully assigned, check if the old reference 
is non-null and implements `ReferenceCounted`, th
   ```
   
   </details>
   
   ### Next Steps
   
   - Reply `/approve` to proceed with the PR review process
   - Reply `/revise <feedback>` to request changes to the fix approach
   - Reply `/reject` to close this proposal
   
   **Note**: A PR (#10396) was submitted prematurely without community 
approval. If this fix direction is acceptable, please `/approve` here. If not, 
we can close that PR and revise the approach based on your feedback.
   
   *This proposal will expire in 72 hours if no response is received.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to