[I] [Enhancement] Synchronize metrics shutdown to prevent JVM crashes during broker shutdown [rocketmq]

via GitHub Sun, 14 Sep 2025 19:57:46 -0700


guyinyou opened a new issue, #9701:
URL: https://github.com/apache/rocketmq/issues/9701


   ### Before Creating the Enhancement Request
   
   - [x] I have confirmed that this should be classified as an enhancement 
rather than a bug/feature.
   
   
   ### Summary
   
   Add a synchronous blocking wait to the shutdown of metrics components to 
prevent JVM crashes caused by race conditions during the broker shutdown process.
   
   ### Motivation
   
   Currently, the metrics shutdown process in BrokerMetricsManager uses 
asynchronous operations without proper synchronization. This creates race 
conditions where:
   
   1. Dependencies (like periodicMetricReader, metricExporter) may shut down 
before the services that depend on them
   2. Services continue to access already-shutdown dependencies, causing JVM 
crashes
   3. Data loss may occur due to incomplete flush operations during shutdown
   
   This enhancement is critical for production stability, as JVM crashes during 
broker shutdown can lead to:
   - Data corruption
   - Incomplete metrics export
   - Service unavailability
   - Difficult troubleshooting in production environments
   
   The enhancement benefits the entire RocketMQ community by ensuring graceful 
and reliable broker shutdowns, especially in high-throughput production 
environments where metrics collection is heavily utilized.
   
   ### Describe the Solution You'd Like
   
   Implement synchronous blocking wait for all metrics-related shutdown 
operations in BrokerMetricsManager.shutdown():
   
   1. **Replace async calls with sync blocking**: Block on the 
`CompletableResultCode` returned by each shutdown operation, using `join()` 
with an appropriate timeout
   2. **Ensure proper shutdown order**: Force each component to complete 
shutdown before proceeding to the next
   3. **Add retry mechanism**: Use while loops to retry failed operations until 
successful
   4. **Apply to all exporter types**: Implement the fix for OTLP_GRPC, PROM, 
and LOG metrics exporters
   
   **Implementation details:**
   - Use `join(Integer.MAX_VALUE, TimeUnit.DAYS)` to ensure completion
   - Add `isSuccess()` checks to verify operation completion
   - Maintain the same shutdown sequence but with proper synchronization
   - Ensure forceFlush() completes before shutdown() for each component
   
   **Code changes:**
   ```java
   // Before (async - causes race conditions):
   // the returned results are ignored, so shutdown() can start before the flush finishes
   periodicMetricReader.forceFlush();
   periodicMetricReader.shutdown();
   
   // After (sync - prevents race conditions):
   // block on each result and retry until it reports success
   while (!periodicMetricReader.forceFlush()
           .join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
   while (!periodicMetricReader.shutdown()
           .join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
   ```
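
The blocking-and-retry pattern above can also be sketched with only JDK types. This is a minimal illustration, not the actual BrokerMetricsManager change: `AsyncComponent`, `awaitSuccess`, and `flushThenShutdown` are hypothetical names, and `CompletableFuture<Boolean>` stands in for whatever result type the real `forceFlush()` and `shutdown()` calls return. Unlike the unbounded loop proposed above, the sketch caps the number of attempts so that a permanently failing component cannot hang the caller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class SyncShutdownSketch {

    // Hypothetical stand-in for a metrics component; not a RocketMQ or OpenTelemetry type.
    interface AsyncComponent {
        CompletableFuture<Boolean> forceFlush();
        CompletableFuture<Boolean> shutdown();
    }

    // Starts the operation and blocks on its result, retrying until it reports
    // success or maxAttempts is exhausted.
    static boolean awaitSuccess(Supplier<CompletableFuture<Boolean>> op,
                                int maxAttempts, long timeoutMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                Boolean ok = op.get().get(timeoutMillis, TimeUnit.MILLISECONDS);
                if (Boolean.TRUE.equals(ok)) {
                    return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                return false;
            } catch (ExecutionException | TimeoutException e) {
                // treat as a failed attempt and try again
            }
        }
        return false;
    }

    // Flushes first, then shuts down, blocking after each step so that
    // shutdown() never starts while a flush is still in flight.
    static boolean flushThenShutdown(AsyncComponent c, int maxAttempts, long timeoutMillis) {
        boolean flushed = awaitSuccess(c::forceFlush, maxAttempts, timeoutMillis);
        boolean stopped = awaitSuccess(c::shutdown, maxAttempts, timeoutMillis);
        return flushed && stopped;
    }
}
```

Calling `flushThenShutdown` once per component, in dependency order, gives the "complete one component before proceeding to the next" behavior described in the solution; the attempt cap and timeout are illustrative parameters and would need tuning for a real broker.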
   
   ### Describe Alternatives You've Considered
   
   1. **Add shutdown hooks**: Considered using JVM shutdown hooks, but this 
doesn't solve the core race condition issue and may introduce additional 
complexity.
   
   2. **Implement timeout-based shutdown**: Instead of infinite wait, use 
configurable timeouts. However, this could lead to incomplete shutdowns in slow 
environments and doesn't address the fundamental synchronization issue.
   
   3. **Add dependency tracking**: Track component dependencies and shut them 
down in reverse dependency order. This would be more complex and doesn't 
guarantee that async operations complete before dependencies are accessed.
   
   4. **Use CountDownLatch or similar synchronization primitives**: While this 
could work, joining the `CompletableResultCode` that each operation already 
returns is more appropriate for this use case, as it's already part of the 
async operation chain.
   
   The chosen solution (synchronous blocking wait) is the most straightforward 
and reliable approach that directly addresses the root cause of the race 
condition without introducing unnecessary complexity.
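
For comparison, the timeout-based alternative (option 2) might look roughly like the sketch below. It is an illustration only: `BoundedShutdownSketch` and `runWithinBudget` are hypothetical names, and `CompletableFuture<Boolean>` again stands in for the real result type. Each step is started only after the previous one has completed, and all steps share one overall time budget.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class BoundedShutdownSketch {

    // Runs the steps one after another under a single overall time budget and
    // returns how many reported success before a failure or the deadline.
    static int runWithinBudget(List<Supplier<CompletableFuture<Boolean>>> steps,
                               long budgetMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        int succeeded = 0;
        for (Supplier<CompletableFuture<Boolean>> step : steps) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                break; // budget exhausted: the remaining steps are never started
            }
            try {
                Boolean ok = step.get().get(remainingNanos, TimeUnit.NANOSECONDS);
                if (!Boolean.TRUE.equals(ok)) {
                    break; // a step reported failure
                }
                succeeded++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException | TimeoutException e) {
                break; // the step failed or did not finish within the budget
            }
        }
        return succeeded;
    }
}
```

When the return value is smaller than the number of steps, some component was left unflushed or still running, which is the incomplete-shutdown risk that led to this option being set aside.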
   
   ### Additional Context
   
   **Current Issue:**
   - Broker shutdown process has race conditions in metrics components
   - JVM crashes occur when services access already-shutdown dependencies
   - Affects all metrics exporter types (OTLP_GRPC, PROM, LOG)
   
   **Environment:**
   - RocketMQ 5.3.2-SNAPSHOT
   - Java 8+ environments
   - Production environments with high metrics throughput
   
   **Testing:**
   - The fix has been implemented and tested locally
   - Commit: 5cd58a537f - "fix: synchronize metrics shutdown to prevent JVM 
crash"
   - No breaking changes to existing APIs
   - Backward compatible with existing configurations
   
   **Related Components:**
   - `org.apache.rocketmq.broker.metrics.BrokerMetricsManager`
   - Metrics exporters (OTLP, Prometheus, Logging)
   - Periodic metric readers
   
   This enhancement is essential for production stability and should be 
prioritized for the next release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
