liumingjian opened a new issue, #10427: URL: https://github.com/apache/rocketmq/issues/10427
### Search before creation

- [x] I had searched in the issues and found no similar issues.

### Documentation Related

I would like to request clearer official documentation for RocketMQ deployments across two same-city availability zones or data centers.

This is not intended as environment-specific consulting. The goal is to make the official documentation more explicit about the supported production topology, minimum node count, failure recovery boundaries, and active-active limitations for same-city dual-site deployments.

The scenario is:

- two availability zones or data centers in the same city;
- both sites are expected to serve production traffic;
- producers and consumers may connect to either site;
- the deployment should tolerate single-node failures and, where possible, one-site failures;
- the architecture should avoid split-brain, ambiguous failover behavior, or message availability assumptions that are not officially supported.

It would be very helpful if the documentation could clarify the recommended approach for this scenario, especially:

1. Whether RocketMQ recommends a single logical cluster stretched across the two sites, or separate RocketMQ clusters with replication/application-level routing.
2. The minimum production-ready node count for NameServer, Broker, Controller, and/or DLedger-based deployments in this scenario.
3. Whether a third failure domain, arbitration node, or witness-like deployment is required for quorum and split-brain avoidance.
4. How Broker master/slave replicas, Controller nodes, DLedger groups, and NameServer nodes should be distributed across the two sites.
5. Which failure scenarios can recover automatically, for example single Broker failure, Controller leader failure, one-site failure, NameServer failure, cross-site network partition, or loss of an arbitration node.
6. Whether active-active writes to the same logical topic from both sites are supported, discouraged, or intentionally out of scope.
7. If active-active writes are not recommended, what the official alternative is, such as active-passive disaster recovery, dual clusters with application-level routing, or another documented pattern.
8. Whether there are special limitations for ordered messages, transactional messages, delayed messages, consumer offset consistency, and message duplication during failover.

A reference architecture or decision matrix in the documentation would be valuable. For example, it could compare:

- a single RocketMQ cluster deployed across two same-city sites;
- a two-site deployment plus a third quorum or arbitration failure domain;
- two independent RocketMQ clusters with replication or application-level routing;
- active-passive disaster recovery;
- patterns that are not recommended, such as unsupported active-active writes to the same logical topic.

This clarification would help production users avoid incorrect assumptions about quorum, failover, data consistency, and message availability when designing same-city dual-site RocketMQ architectures.

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
