guyinyou opened a new issue, #9705: URL: https://github.com/apache/rocketmq/issues/9705
### Before Creating the Enhancement Request

- [x] I have confirmed that this should be classified as an enhancement rather than a bug/feature.

### Summary

Enhance the `persist()` methods in `TimerMetrics`, `TransactionMetrics`, and `ConfigManager` so that a power outage cannot leave the broker unable to start.

Currently, when the broker starts with a corrupted config file and an intact bak file, `persist()` overwrites the config file directly without backing it up first. If power is lost during that write, both the config file and the bak file end up corrupted, and the broker cannot start.

The enhancement includes:

- Add an atomic file backup step before writing the new config
- Delete corrupted config files during startup to prevent bak file pollution
- Add a directory sync so that file operations are visible on disk
- Improve error handling with comprehensive exception catching

### Motivation

This enhancement is critical for production environments where power outages can occur. The current implementation has a serious flaw:

1. When the broker starts with a corrupted config file, it falls back to the bak file and starts normally
2. During the first `persist()`, the method overwrites the config file directly without a backup
3. If a power outage occurs during that write, both the config and bak files become corrupted
4. The broker cannot start on the next restart, causing a service interruption

This affects system reliability and high availability, especially in distributed environments where broker downtime can impact the entire message queue system. The enhancement ensures the broker can always recover from a power outage without manual intervention.

### Describe the Solution You'd Like

Implement the following changes:

1. **Atomic File Backup**: Before writing the new config, atomically move the existing config to the bak file using `Files.move()` with `StandardCopyOption.ATOMIC_MOVE`
2. **Startup Cleanup**: In `load()`, delete config files that cannot be parsed, so that they are never used as a backup source
3. **Directory Sync**: Add `MixAll.fsyncDirectory()` calls after file operations to ensure the changes are visible on disk
4. **Improved Error Handling**: Catch `Throwable` instead of specific exceptions for comprehensive error handling
5. **Force Sync**: Use `RandomAccessFile` with `force(true)` to ensure data is written to disk

The implementation should be applied to:

- `TimerMetrics#persist()`
- `TransactionMetrics#persist()`
- `ConfigManager#persist()`
- `ConfigManager#load()`

### Describe Alternatives You've Considered

1. **WAL (Write-Ahead Log)**: Could implement a write-ahead log, but this would be overkill for simple config persistence and adds significant complexity
2. **Database Storage**: Moving config to a database would solve the problem but requires major architectural changes and adds external dependencies
3. **Multiple Backup Files**: Creating multiple backup files (bak1, bak2, etc.) would provide redundancy but adds complexity and storage overhead
4. **Checksum Validation**: Adding checksums to validate file integrity could help detect corruption but doesn't prevent the core issue of bak file pollution
5. **External Config Management**: Using an external config management system would solve the problem but requires infrastructure changes and adds operational complexity

The proposed solution is the most appropriate as it:

- Requires minimal code changes
- Maintains backward compatibility
- Solves the core problem without adding complexity
- Follows existing patterns in the codebase
- Provides immediate benefits with low risk

### Additional Context

_No response_
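To make the proposed sequence concrete, here is a minimal sketch of the persist and load steps from "Describe the Solution You'd Like". It is not the actual RocketMQ patch: the class `SafeConfigStore`, its method signatures, the `.bak` suffix, and the `isValid` predicate are all hypothetical, and the local `fsyncDirectory` helper only stands in for the `MixAll.fsyncDirectory()` named in the issue.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.function.Predicate;

// Hypothetical illustration class; not part of RocketMQ.
final class SafeConfigStore {

    // Assumed stand-in for the proposed MixAll.fsyncDirectory(): best-effort
    // sync of the directory entry so renames and deletes reach the disk.
    static void fsyncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
        } catch (IOException | RuntimeException e) {
            // Some platforms cannot open a directory as a channel; ignore.
        }
    }

    static Path bakOf(Path config) {
        return config.resolveSibling(config.getFileName() + ".bak");
    }

    static void persist(Path config, String content) throws IOException {
        Path dir = config.toAbsolutePath().getParent();
        // Step 1: move the current config to .bak in one atomic rename.
        // Whether an existing .bak is replaced is platform-specific
        // (POSIX rename replaces it).
        if (Files.exists(config)) {
            Files.move(config, bakOf(config), StandardCopyOption.ATOMIC_MOVE);
            fsyncDirectory(dir);
        }
        // Step 2: write the new config and force data and metadata to disk.
        try (RandomAccessFile raf = new RandomAccessFile(config.toFile(), "rw")) {
            raf.setLength(0);
            raf.write(content.getBytes(StandardCharsets.UTF_8));
            raf.getChannel().force(true);
        }
        fsyncDirectory(dir);
    }

    // Returns the first readable, valid content (config, then .bak), or null.
    static String load(Path config, Predicate<String> isValid) throws IOException {
        if (Files.exists(config)) {
            String text = readOrNull(config);
            if (text != null && isValid.test(text)) {
                return text;
            }
            // Startup cleanup: a corrupted config must never be moved over
            // the good .bak by a later persist(), so delete it now.
            Files.deleteIfExists(config);
            fsyncDirectory(config.toAbsolutePath().getParent());
        }
        String backup = Files.exists(bakOf(config)) ? readOrNull(bakOf(config)) : null;
        return (backup != null && isValid.test(backup)) ? backup : null;
    }

    private static String readOrNull(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException | RuntimeException e) {
            return null;
        }
    }
}
```

In this sketch the ordering is what matters: once `load()` has removed an unparseable config, the next `persist()` has nothing to move, so the intact bak file stays in place until a fully synced new config exists. A power loss at any point should therefore leave at least one of the two files readable.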
