guyinyou opened a new issue, #9705:
URL: https://github.com/apache/rocketmq/issues/9705

   ### Before Creating the Enhancement Request
   
   - [x] I have confirmed that this should be classified as an enhancement 
rather than a bug/feature.
   
   
   ### Summary
   
   Enhance the persist() methods in TimerMetrics, TransactionMetrics, and 
ConfigManager to prevent broker startup failure after a power outage. 
Currently, when the broker starts with a corrupted config file and an intact 
bak file, persist() overwrites the config file without a proper backup 
mechanism. If a power outage occurs during the write, both the config and 
bak files can end up corrupted, leaving the broker unable to start.
   
   The enhancement includes:
   - Add an atomic file backup step before writing the new config
   - Delete corrupted config files during startup to prevent bak file pollution
   - Sync the directory so that file moves and deletes survive a crash
   - Improve error handling with comprehensive exception catching
   
   ### Motivation
   
   This enhancement is critical for production environments where power outages 
can occur. The current implementation has a serious flaw:
   
   1. When the broker starts with a corrupted config file, it falls back to 
the bak file and starts normally
   2. During the first persist(), the method overwrites the config file 
without a safe backup
   3. If a power outage occurs during the write, both the config and bak 
files can end up corrupted
   4. The broker cannot start on the next restart, causing a service 
interruption
   
   This affects system reliability and high availability, especially in 
distributed environments where broker downtime can impact the entire message 
queue system. The enhancement lets the broker recover from a power outage 
without manual intervention.
   
   ### Describe the Solution You'd Like
   
   Implement the following changes:
   
   1. **Atomic File Backup**: Before writing the new config, atomically move 
the existing config to the bak file using `Files.move()` with 
`StandardCopyOption.ATOMIC_MOVE`
   
   2. **Startup Cleanup**: In the load() method, delete config files that 
cannot be parsed, so that a corrupted file is never used as the backup source
   
   3. **Directory Sync**: Call `MixAll.fsyncDirectory()` after file 
operations so that moves and deletes are durable on disk
   
   4. **Improved Error Handling**: Catch `Throwable` instead of specific 
exceptions so that no failure escapes handling
   
   5. **Force Sync**: Write through a `RandomAccessFile` and call 
`force(true)` on its channel to ensure data is written to disk
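
The write order described in items 1, 3, and 5 could look roughly like the sketch below. It is an illustration only, not the RocketMQ code: `SafeConfigWriter`, the `.tmp` and `.bak` suffixes, and the local `fsyncDirectory` helper are hypothetical names chosen here, and the real `MixAll.fsyncDirectory()` may differ. It also assumes a POSIX-like filesystem, where an atomic rename replaces an existing target.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not part of RocketMQ) illustrating the proposed write order. */
final class SafeConfigWriter {

    private SafeConfigWriter() { }

    /** Persists content so that a complete copy (config, bak, or tmp) is on disk at every step. */
    static void persist(Path configFile, String content) throws IOException {
        Path dir = configFile.toAbsolutePath().getParent();
        Path tmp = dir.resolve(configFile.getFileName() + ".tmp"); // assumed suffix
        Path bak = dir.resolve(configFile.getFileName() + ".bak"); // assumed suffix
        Files.createDirectories(dir);

        // Step 1: write the new content to a temp file and force it to disk.
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw")) {
            raf.setLength(0); // drop leftovers from an earlier interrupted write
            raf.write(content.getBytes(StandardCharsets.UTF_8));
            raf.getChannel().force(true);
        }

        // Step 2: atomically move the current config to the bak file.
        // Whether ATOMIC_MOVE replaces an existing target is implementation
        // specific; POSIX rename semantics (assumed here) do replace it.
        if (Files.exists(configFile)) {
            Files.move(configFile, bak, StandardCopyOption.ATOMIC_MOVE);
            fsyncDirectory(dir);
        }

        // Step 3: atomically move the fully written temp file into place.
        Files.move(tmp, configFile, StandardCopyOption.ATOMIC_MOVE);
        fsyncDirectory(dir);
    }

    /** Best-effort directory sync so that renames are durable. */
    static void fsyncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
        } catch (IOException e) {
            // Some platforms (for example Windows) cannot open a directory; ignore.
        }
    }
}
```

A call such as `SafeConfigWriter.persist(Paths.get("store/config/topics.json"), json)` (path made up for illustration) leaves the previous content in the bak file. Between steps 2 and 3 the config file is briefly absent, so startup has to be able to fall back to the bak file.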
   
   These changes should be applied to:
   - `TimerMetrics#persist()`
   - `TransactionMetrics#persist()`  
   - `ConfigManager#persist()`
   - `ConfigManager#load()`
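
For the load() side, the startup cleanup might be sketched as follows. Again this is a hedged illustration rather than the actual `ConfigManager#load()`: `SafeConfigLoader`, the `isValid` predicate (standing in for whatever parsing the real code does), and the `.bak` suffix are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Predicate;

/** Hypothetical helper (not part of RocketMQ) illustrating startup cleanup. */
final class SafeConfigLoader {

    private SafeConfigLoader() { }

    /**
     * Returns the config content if it passes isValid, otherwise the bak content,
     * otherwise null. A config file that fails validation is deleted so that a
     * later persist() cannot move it over a good bak file.
     */
    static String load(Path configFile, Predicate<String> isValid) {
        String primary = readOrNull(configFile);
        if (primary != null && isValid.test(primary)) {
            return primary;
        }
        try {
            // Remove the corrupted (or unreadable) config, then sync the directory.
            if (Files.deleteIfExists(configFile)) {
                fsyncDirectory(configFile.toAbsolutePath().getParent());
            }
        } catch (IOException e) {
            // Best effort: loading from the bak file can still proceed.
        }
        Path bak = configFile.resolveSibling(configFile.getFileName() + ".bak"); // assumed suffix
        String backup = readOrNull(bak);
        return (backup != null && isValid.test(backup)) ? backup : null;
    }

    private static String readOrNull(Path file) {
        try {
            return Files.exists(file)
                ? new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
                : null;
        } catch (IOException e) {
            return null;
        }
    }

    /** Best-effort directory sync so that the delete is durable. */
    private static void fsyncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
        } catch (IOException e) {
            // Some platforms cannot open a directory for reading; ignore.
        }
    }
}
```

With the corrupted file gone, the next persist() has nothing to move over the bak file, so the bak file keeps its last good content until a new, complete config has been written.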
   
   ### Describe Alternatives You've Considered
   
   1. **WAL (Write-Ahead Log)**: Could implement a write-ahead log system, but 
this would be overkill for simple config persistence and adds significant 
complexity
   
   2. **Database Storage**: Moving the config to a database would solve the 
problem but requires major architectural changes and adds external dependencies
   
   3. **Multiple Backup Files**: Creating multiple backup files (bak1, bak2, 
etc.) would provide redundancy but adds complexity and storage overhead
   
   4. **Checksum Validation**: Adding checksums to validate file integrity 
could help detect corruption but doesn't prevent the core issue of bak file 
pollution
   
   5. **External Config Management**: Using external config management systems 
would solve the problem but requires infrastructure changes and adds 
operational complexity
   
   The proposed solution is the most appropriate as it:
   - Requires minimal code changes
   - Maintains backward compatibility
   - Solves the core problem without adding complexity
   - Follows existing patterns in the codebase
   - Provides immediate benefits with low risk
   
   ### Additional Context
   
   _No response_

