ShivramSriramulu opened a new pull request, #20515:
URL: https://github.com/apache/kafka/pull/20515
## Summary

This PR enhances MirrorMaker 2 (MM2) with fault-tolerance capabilities to address critical data-loss scenarios in cross-cluster replication setups.

## Problem Statement

Vanilla MM2 has two critical gaps:

1. **Silent Data Loss**: Retention policies may purge messages on the source cluster before they are replicated, leaving gaps that go undetected.
2. **Service Disruption**: Deleting and recreating a topic can cause replication to fail or stall.

## Solution

Added fault-tolerance enhancements to `MirrorSourceTask`:

### Fail-Fast Truncation Detection

- Catches `OffsetOutOfRangeException` during consumer polling
- Logs detailed diagnostics, including partition assignments and earliest available offsets
- Throws `ConnectException` to fail fast and alert operators immediately
- Configurable via `mirrorsource.fail.on.truncation=true` (default)

### Graceful Topic Reset Handling

- Uses `AdminClient` to track topic IDs and detect delete/recreate events
- Automatically seeks to the beginning offset of recreated topics
- Handles `UnknownTopicOrPartitionException` with retry logic
- Configurable via `mirrorsource.auto.recover.on.reset=true` (default)

## Technical Details

- **File Modified**: `connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java`
- **Lines Added**: ~75 LOC (well under the 500 LOC limit)
- **Backward Compatibility**: Maintained; all changes are additive
- **Configuration**: New properties with sensible defaults
- **Logging**: Uses a dedicated logger, `mm2.fault.tolerance`, for easy filtering

## Testing

- Comprehensive test scenarios in a companion repository
- Docker-based demo with Primary/DR clusters
- Validates both fail-fast and auto-recovery behaviors
- Test repository: https://github.com/ShivramSriramulu/Tiger_Graph_MM2

## Impact

- **RPO Improvement**: Makes data loss immediately visible instead of silent
- **RTO Improvement**: Reduces manual intervention during maintenance
- **Operational**: Clear error messages for troubleshooting
- **Production Ready**: Minimal performance impact, configurable behavior
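
The fail-fast truncation check can be illustrated with a small, dependency-free sketch. This is not the code from this PR: `TruncationDetector` and `TruncationException` are hypothetical names (the latter stands in for Connect's `ConnectException` so the sketch compiles without Kafka), the `failOnTruncation` flag mirrors the proposed `mirrorsource.fail.on.truncation` property, and the caller is assumed to know both the next offset it expected to replicate and the source partition's earliest retained offset. If the earliest retained offset has moved past the expected offset, the records in between were purged before they could be replicated.

```java
// Hypothetical stand-in for Connect's ConnectException, so the sketch
// compiles without Kafka on the classpath.
class TruncationException extends RuntimeException {
    TruncationException(String message) {
        super(message);
    }
}

// Hypothetical helper illustrating the idea; not the PR's implementation.
class TruncationDetector {
    // Mirrors the proposed mirrorsource.fail.on.truncation property.
    private final boolean failOnTruncation;

    TruncationDetector(boolean failOnTruncation) {
        this.failOnTruncation = failOnTruncation;
    }

    /**
     * Decides where replication of one partition should resume.
     *
     * @param partition      partition label, used only in diagnostics
     * @param expectedOffset next offset the task expected to replicate
     * @param earliestOffset earliest offset still retained on the source
     * @return the offset to resume from
     * @throws TruncationException if records were purged and fail-fast is enabled
     */
    long resolveStartOffset(String partition, long expectedOffset, long earliestOffset) {
        if (expectedOffset >= earliestOffset) {
            return expectedOffset; // nothing purged: resume where replication left off
        }
        long lost = earliestOffset - expectedOffset;
        String diagnostics = String.format(
                "Truncation detected on %s: expected offset %d, earliest available %d (%d records lost)",
                partition, expectedOffset, earliestOffset, lost);
        if (failOnTruncation) {
            throw new TruncationException(diagnostics); // fail fast so the gap is visible
        }
        // Lenient mode: report the gap, then skip to the first retained record.
        System.err.println("WARN " + diagnostics);
        return earliestOffset;
    }
}
```

For context on the real client: `KafkaConsumer#poll` only surfaces `OffsetOutOfRangeException` when no automatic reset policy applies (`auto.offset.reset=none`); with `earliest` or `latest` the consumer repositions itself without raising an error, which is exactly the silent-gap scenario described in the problem statement.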

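Topic-reset detection can be sketched the same way, assuming the task can look up a topic's current ID. With the real Admin API that ID would come from `Admin#describeTopics(...)` and `TopicDescription#topicId()` (a Kafka `Uuid`); here a `Function<String, Optional<UUID>>` stands in for that lookup, and `TopicResetDetector`, `TopicState`, and `awaitTopic` are hypothetical names. A changed ID under the same topic name means the topic was deleted and recreated; a missing topic is retried with a fixed backoff, loosely mirroring the `UnknownTopicOrPartitionException` retry logic described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.function.Function;

// Result of comparing a topic's current ID with the last one observed.
enum TopicState { FIRST_SEEN, UNCHANGED, RECREATED, MISSING }

// Hypothetical helper illustrating the idea; not the PR's implementation.
class TopicResetDetector {
    // Stand-in for an Admin#describeTopics lookup; empty means the topic does not exist.
    private final Function<String, Optional<UUID>> topicIdLookup;
    // Last topic ID observed per topic name.
    private final Map<String, UUID> knownIds = new HashMap<>();

    TopicResetDetector(Function<String, Optional<UUID>> topicIdLookup) {
        this.topicIdLookup = topicIdLookup;
    }

    TopicState check(String topic) {
        Optional<UUID> current = topicIdLookup.apply(topic);
        if (!current.isPresent()) {
            return TopicState.MISSING; // keep the old ID so a later recreate is still detected
        }
        UUID previous = knownIds.put(topic, current.get());
        if (previous == null) {
            return TopicState.FIRST_SEEN;
        }
        return previous.equals(current.get()) ? TopicState.UNCHANGED : TopicState.RECREATED;
    }

    // Retries while the topic is missing, sleeping backoffMs between attempts.
    TopicState awaitTopic(String topic, int maxAttempts, long backoffMs) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            TopicState state = check(topic);
            if (state != TopicState.MISSING) {
                return state;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoffMs);
            }
        }
        return TopicState.MISSING;
    }
}
```

On `RECREATED`, a task would reposition its consumer at the start of the new topic's partitions, for example with `KafkaConsumer#seekToBeginning(Collection<TopicPartition>)`, rather than resuming from offsets recorded against the deleted topic.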