liumingjian opened a new issue, #10427:
URL: https://github.com/apache/rocketmq/issues/10427

   ### Search before creation
   
   - [x] I have searched the issues and found no similar issues.
   
   ### Documentation Related
   
   I would like to request clearer official documentation for RocketMQ 
deployments across two same-city availability zones or data centers.
   
   This is not intended as environment-specific consulting. The goal is to make 
the official documentation more explicit about the supported production 
topology, minimum node count, failure recovery boundaries, and active-active 
limitations for same-city dual-site deployments.
   
   The scenario is:
   
   - two availability zones or data centers in the same city;
   - both sites are expected to serve production traffic;
   - producers and consumers may connect to either site;
   - the deployment should tolerate single-node failures and, where possible, 
one-site failures;
   - the architecture should avoid split-brain, ambiguous failover behavior, or 
message availability assumptions that are not officially supported.
   
   It would be very helpful if the documentation could clarify the recommended 
approach for this scenario, especially:
   
   1. Whether RocketMQ recommends a single logical cluster stretched across the 
two sites, or separate RocketMQ clusters with replication/application-level 
routing.
   2. The minimum production-ready node count for NameServer, Broker, 
Controller, and/or DLedger-based deployments in this scenario.
   3. Whether a third failure domain, arbitration node, or witness-like 
deployment is required for quorum and split-brain avoidance.
   4. How Broker master/slave replicas, Controller nodes, DLedger groups, and 
NameServer nodes should be distributed across the two sites.
   5. Which failure scenarios can recover automatically, for example single 
Broker failure, Controller leader failure, one-site failure, NameServer 
failure, cross-site network partition, or loss of an arbitration node.
   6. Whether active-active writes to the same logical topic from both sites 
are supported, discouraged, or intentionally out of scope.
   7. If active-active writes are not recommended, what the official 
alternative is, such as active-passive disaster recovery, dual clusters with 
application-level routing, or another documented pattern.
   8. Whether there are special limitations for ordered messages, transactional 
messages, delayed messages, consumer offset consistency, and message 
duplication during failover.
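   
   To illustrate the concern behind question 3, here is a minimal sketch of 
generic majority-quorum arithmetic. It assumes the voting components 
(for example Controller or DLedger nodes) need a strict majority of voters to 
keep electing a leader. The function names, site names, and node layouts are 
hypothetical, and this is not a statement of what RocketMQ officially 
supports.

```python
def majority(total_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return total_voters // 2 + 1


def survives_site_loss(voters_per_site: dict, lost_site: str) -> bool:
    """True if the surviving sites still hold a strict majority of voters."""
    total = sum(voters_per_site.values())
    remaining = total - voters_per_site[lost_site]
    return remaining >= majority(total)


# Hypothetical voter layouts, used only for illustration.
layouts = {
    "two sites, 2+1": {"site-a": 2, "site-b": 1},
    "two sites, 2+2": {"site-a": 2, "site-b": 2},
    "two sites plus third domain, 1+1+1": {"site-a": 1, "site-b": 1, "site-c": 1},
}

for name, layout in layouts.items():
    # For each layout, check every possible single-site failure.
    outcome = {site: survives_site_loss(layout, site) for site in layout}
    print(name, outcome)
```

   Under this assumption, no two-site layout keeps a majority after the loss 
of either site, while a layout with a third failure domain does. This is why 
the documentation should state explicitly whether a third domain is required.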
   
   A reference architecture or decision matrix in the documentation would be 
valuable. For example, it could compare:
   
   - a single RocketMQ cluster deployed across two same-city sites;
   - a two-site deployment plus a third quorum or arbitration failure domain;
   - two independent RocketMQ clusters with replication or application-level 
routing;
   - active-passive disaster recovery;
   - patterns that are not recommended, such as unsupported active-active 
writes to the same logical topic.
   
   This clarification would help production users avoid incorrect assumptions 
about quorum, failover, data consistency, and message availability when 
designing same-city dual-site RocketMQ architectures.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   

