Cluster Design Note

A Qpid cluster is a group of brokers co-operating to provide the appearance of a single "virtual" broker.
This design discusses clustering at the AMQP protocol layer, i.e. members of a cluster have distinct AMQP addresses and AMQP protocol commands are exchanged to negotiate reconnection and failover. "Transparent" failover in this context means transparent to the application; the AMQP client will be aware of disconnects and must take action to fail over. More precisely, we define failover to be transparent when the following holds: if all failover-related commands are removed from the conversation, what remains is exactly a legal conversation the client could have had with a single, failure-free broker. Given this definition, the failover component of an AMQP client library can simply hide all failover-related commands from the application (indeed from the rest of the library) without breaking any semantics. The sketch below illustrates this layering.
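To make the transparent-failover definition concrete, here is a minimal sketch of the layering it permits (C++ with hypothetical types; this is not the real Qpid client API): a failover component consumes every failover-related command itself and passes everything else through, so the rest of the library sees an ordinary single-broker conversation.

```cpp
// Sketch only: hypothetical command representation, not Qpid's codec.
#include <functional>
#include <iostream>
#include <string>

struct Command {
    std::string name;      // e.g. "basic.deliver", "cluster.membership"
    bool failoverRelated;  // true for reconnect/failover negotiation traffic
};

// The failover layer sits between the wire and the rest of the client library.
class FailoverFilter {
public:
    explicit FailoverFilter(std::function<void(const Command&)> app)
        : deliverToApp_(std::move(app)) {}

    void onCommand(const Command& cmd) {
        if (cmd.failoverRelated)
            handleFailover(cmd);   // consumed here, invisible above
        else
            deliverToApp_(cmd);    // unchanged single-broker semantics
    }

private:
    void handleFailover(const Command& cmd) {
        // e.g. update the failover-candidate list, trigger reconnect, ...
        std::cout << "failover layer handled: " << cmd.name << "\n";
    }
    std::function<void(const Command&)> deliverToApp_;
};

int main() {
    FailoverFilter filter([](const Command& c) {
        std::cout << "application sees: " << c.name << "\n";
    });
    filter.onCommand({"basic.deliver", false});
    filter.onCommand({"cluster.membership", true});  // hidden from the app
    filter.onCommand({"basic.deliver", false});
}
```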
Requirements

TODO: Define levels of reliability we want to provide - survive a single failure? Multiple failures? Total failure?

TODO: The requirements triangle. Concrete performance data.

Clients only need to use standard AMQP to talk to a cluster; they need no cluster-specific extensions. Ultimately we may propose extensions to the AMQP spec, but for the initial implementation we can use existing extension points (e.g. the system exchanges and specially identified connections described below).

Abstract Model and Terms

A quick re-cap of AMQP terminology and an introduction to some new terms.

A broker is a container for 3 types of broker components: queues, exchanges and bindings. Broker components represent resources available to multiple clients and are not affected by clients connecting and disconnecting. Persistent broker components are unaffected by shut-down and re-start of a broker. A client uses the components contained in a broker via the AMQP protocol.

TODO: Where do transactions fit in the model? They are also a kind of relationship component, but Dtx transactions may span more than just client/broker.

A client's interaction with an unclustered individual broker [[Footnote(An individual broker by this definition is really a broker behind a single AMQP address. Such a broker might in fact be a cluster using technologies like TCP failover/load balancing. This is outside the scope of this design, which focusses on clustering at the AMQP protocol layer, where cluster members have separate AMQP addresses.)]] is specified by the AMQP protocol.

A broker cluster (or just cluster) is a "virtual" broker implemented by several member brokers (or just members). A cluster has many AMQP addresses - the addresses of all its members - all semantically equivalent for client connection. [[Footnote(They may not be equivalent on other grounds, e.g. network distance from client, load etc.)]]
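As a rough illustration of the terms above (all type and field names here are ours, purely illustrative), the model might be sketched as data structures:

```cpp
// Illustrative data-structure view of the abstract model; not Qpid code.
#include <iostream>
#include <string>
#include <vector>

struct Queue    { std::string name; bool durable = false; };
struct Exchange { std::string name; std::string type; };
struct Binding  { std::string exchange, queue, key; };

// A broker is a container for the three kinds of broker components.
struct Broker {
    std::string amqpAddress;          // each member has its own address
    std::vector<Queue>    queues;
    std::vector<Exchange> exchanges;
    std::vector<Binding>  bindings;
};

// A cluster is a "virtual" broker: many addresses, all semantically
// equivalent for client connection.
struct Cluster { std::vector<Broker> members; };

int main() {
    Cluster c{{{"amqp://a:5672"}, {"amqp://b:5672"}}};
    std::cout << "cluster of " << c.members.size() << " members\n";
}
```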
A session is an identity for a collection of client components (i.e. a client-broker relationship) that can outlive a single connection. AMQP 0-9 provides some support for sessions in the `resume` command. If a connection is closed in an orderly manner by either end, the session is closed with it. A session is like an extended connection: if you cut out the failover commands, the conversation on a session is exactly the conversation the client would have had on a single connection to an individual broker with no failures. If a connection is part of a session, events that the AMQP 0-8 spec ties to connection closure (e.g. exclusive or auto-delete queues are deleted on disconnect) are deferred until the session itself closes.

Note the session concept could also be used in the absence of clustering to allow a client to disconnect and resume a long-running session with the same broker. This is outside the scope of this design.
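A sketch of how a client-side session might work (hypothetical types, loosely modelled on the 0-9 `resume` idea rather than the actual spec mechanics): the session keeps an identity plus a buffer of commands the broker has not yet confirmed, and replays that buffer over a new connection.

```cpp
// Sketch only: illustrative session-resume bookkeeping.
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>

struct SentCommand { uint64_t seq; std::string body; };

class Session {
public:
    explicit Session(std::string id) : id_(std::move(id)) {}

    void send(const std::string& body) {
        unconfirmed_.push_back({nextSeq_++, body});
        // ... would also write to the current connection here ...
    }

    // Broker confirmed everything up to `seq`; those commands can be freed.
    void confirmedUpTo(uint64_t seq) {
        while (!unconfirmed_.empty() && unconfirmed_.front().seq <= seq)
            unconfirmed_.pop_front();
    }

    // On a new connection: resume the session and retransmit whatever the
    // broker may not have seen. The application never notices the gap.
    void resumeOnNewConnection() {
        std::cout << "resume session " << id_ << "\n";
        for (const auto& c : unconfirmed_)
            std::cout << "retransmit #" << c.seq << ": " << c.body << "\n";
    }

private:
    std::string id_;
    uint64_t nextSeq_ = 1;
    std::deque<SentCommand> unconfirmed_;
};

int main() {
    Session s("client-42/session-1");
    s.send("basic.publish A");
    s.send("basic.publish B");
    s.confirmedUpTo(1);        // broker acknowledged A
    s.resumeOnNewConnection(); // after failover: only B is retransmitted
}
```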
Cluster State

We have to consider several kinds of state.

Wiring and content need to be replicated in a fault-tolerant way among the brokers so all brokers can provide access to all resources. Persistent data also needs to be stored persistently for recovery from total failure or deliberate shutdown. Cluster membership needs to be replicated among brokers and passed to clients so they know the current set of failover candidates. Conversational state relates to a particular client-broker relationship.
To resume successfully, the conversational state must be re-established on both sides. There are several options for how much state is stored where. We'll outline the solution that minimizes broker-side replication, but it's not clear yet if this is the optimal choice. To minimize conversational replication, the broker must remember only the session identity and its request/response history.
Everything else - open channels, consumers and unacknowledged commands in flight - can be re-established by the client.
Note that all of the above can be accomplished with normal AMQP commands. The broker could replicate more conversational state to relieve the client of some of this work.

Replication

The different types of state are replicated in different ways.

Cluster membership and Wiring

Membership changes are replicated to the entire cluster symmetrically. Connected clients are also notified of changes to the set of fail-over candidates for that client. Clients are notified over AMQP by binding a queue to a special system exchange; a sketch of the client side follows. Wiring is low volume and can also be replicated via virtual synchrony.
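On the client side this might look like the following sketch; the exchange name and the update format are assumptions for illustration, not anything the spec or Qpid defines.

```cpp
// Sketch only: fold membership updates (arriving on a private queue bound to
// an assumed "amq.cluster" system exchange) into a failover-candidate list.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct MembershipUpdate {
    std::string member;  // AMQP address of a cluster member
    bool joined;         // false => member left or failed
};

class FailoverCandidates {
public:
    // Called for each message consumed from the bound queue.
    void onUpdate(const MembershipUpdate& u) {
        auto it = std::find(addrs_.begin(), addrs_.end(), u.member);
        if (u.joined && it == addrs_.end()) addrs_.push_back(u.member);
        if (!u.joined && it != addrs_.end()) addrs_.erase(it);
    }
    const std::vector<std::string>& addresses() const { return addrs_; }
private:
    std::vector<std::string> addrs_;
};

int main() {
    FailoverCandidates fc;
    fc.onUpdate({"amqp://broker-a:5672", true});
    fc.onUpdate({"amqp://broker-b:5672", true});
    fc.onUpdate({"amqp://broker-a:5672", false});  // broker-a failed
    for (const auto& a : fc.addresses()) std::cout << a << "\n";
}
```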
Queue Content

Message content is too high volume to replicate to the entire cluster. A single cluster member may contain a mix of primary, backup and proxy queues. (TODO: mechanism for establishing primary, backup etc.) The client is unaware of the distinction; it sees an identical picture whichever member it connects to.

TODO: Ordering issues with proxies and put-back messages (reject, transaction rollback) or selectors.

Fragmented Shared Queues

A shared queue has reduced ordering requirements and increased scope for parallelism. Each fragment's content is replicated to backups independently, just like a normal queue. The appearance of a single queue is created by collaboration between the fragments. Proxies to a fragmented queue will consume from the "nearest" fragment if possible, as the sketch below illustrates.

TODO: Proxies can play a more active role in ordering guarantees.
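A sketch of the "nearest fragment" preference (all types are illustrative, and "nearest" is simplified to "co-located with the proxy"):

```cpp
// Sketch only: a proxy prefers its local fragment, falls back to remote ones.
#include <iostream>
#include <optional>
#include <queue>
#include <string>
#include <vector>

struct Fragment {
    std::string member;            // broker holding this fragment
    std::queue<std::string> msgs;  // local portion of the shared queue
};

std::optional<std::string> consume(std::vector<Fragment>& frags,
                                   const std::string& localMember) {
    // First pass: the local fragment, if it has messages (no network hop;
    // ordering within a fragment is preserved like a normal queue).
    for (auto& f : frags)
        if (f.member == localMember && !f.msgs.empty()) {
            auto m = f.msgs.front(); f.msgs.pop(); return m;
        }
    // Otherwise consume from any remote fragment.
    for (auto& f : frags)
        if (!f.msgs.empty()) {
            auto m = f.msgs.front(); f.msgs.pop(); return m;
        }
    return std::nullopt;  // the whole shared queue is empty
}

int main() {
    std::vector<Fragment> frags = {{"broker-a", {}}, {"broker-b", {}}};
    frags[1].msgs.push("m1");
    frags[0].msgs.push("m2");
    // A proxy on broker-a takes m2 (local) before m1 (remote).
    while (auto m = consume(frags, "broker-a")) std::cout << *m << "\n";
}
```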
Conversational State

The minimum conversational state a broker needs to replicate for failover is the session identity and the request/response history. Session identities and processed request history are very low volume and could be replicated with virtual synchrony. However, commands in flight and open incoming references are probably too large to be replicated this way. Enter the session backup: the broker directly connected to a client (the session primary) replicates the session's conversational state to a designated session backup.

Private queues are a bit special: they're not exactly conversational state, but they are closely tied to the session. Private queues will always be backed up by the same node as the session that created them.

Shared storage

In the case of high-performance, reliable shared storage (e.g. GFS), state can be recovered from the shared store rather than replicated between brokers. For commodity hardware we need a solution with no shared store, as described above.

TODO: Would we want to use shared store for conversational state?

Client-Broker Protocol

TODO: How it looks in client protocol terms - system queues & exchanges, membership updates, choosing a replica, reconnection, rebuilding conversational state (the client stores a representation of its state), unacked messages & duplicates, hiding failover.

Broker-Broker Protocol

Broker-broker communication uses normal AMQP over specially identified connections.

Proxying: proxies simply forward methods between client and primary and create consumers on behalf of the client. TODO: any issues with message put-back and transactions?

Queue/fragment replication: use AMQP transfer commands to transfer content from primary to backup.

Session replication: AMQP on special connections. The primary forwards all outgoing requests and incoming responses to the session backup, so the backup can track the primary's request/response tables and retransmit messages; see the sketch after the list of persistence reasons below.

Persistence and Recovery

Competing failure modes:

Tibco: fast when running clean, but performance over time has GC "spikes". Single journal for all queues; "holes" in the log have to be garbage collected to re-use the log. One slow consumer affects everyone because it causes fragmentation of the log.

MQ: write to journal, write journal to DB, read from DB.

Street homegrown solutions: transient MQ with home-grown persistence.

Can we get more design details for these solutions?

Persistence overview

There are 3 reasons to persist a message:

Durable messages: must be stored to disk across broker shutdowns or failures.
Reliability: recover after a crash.
Flow-to-disk: to reduce memory use for full queues.
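Returning to session replication from the Broker-Broker Protocol section above, here is a minimal sketch of the backup side (hypothetical types): the backup mirrors the primary's table of unanswered requests, so on takeover it can retransmit exactly the commands whose fate is unknown.

```cpp
// Sketch only: backup-side bookkeeping for primary->backup session replication.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

class SessionBackup {
public:
    // Primary forwarded an outgoing request it sent to the client.
    void onForwardedRequest(uint64_t id, const std::string& body) {
        pending_[id] = body;  // sent to client, not yet responded to
    }
    // Primary forwarded the client's incoming response.
    void onForwardedResponse(uint64_t id) {
        pending_.erase(id);   // answered; nothing to retransmit
    }
    // Invoked when the backup takes over the session after primary failure.
    void retransmitPending() {
        for (const auto& [id, body] : pending_)
            std::cout << "retransmit request #" << id << ": " << body << "\n";
    }
private:
    std::map<uint64_t, std::string> pending_;  // mirrors primary's table
};

int main() {
    SessionBackup backup;
    backup.onForwardedRequest(1, "message.transfer payload-1");
    backup.onForwardedRequest(2, "message.transfer payload-2");
    backup.onForwardedResponse(1);
    backup.retransmitPending();  // only #2 is still outstanding
}
```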
Durable and reliable cases are very similar: storage time is performance-critical (it blocks the response to the sender) but reading is not, and cleanup can be done by an async thread or process. For flow-to-disk, when queues are full, both storing and reading are performance-critical.

So it looks like the same solution will work for durable and reliable. Flow-to-disk has different requirements, but it would be desirable to re-use the same framework for it.

We also need to persist wiring (Queues/Exchanges/Bindings), but this is much less performance-critical. The entire wiring model is held in memory, so wiring is only read at startup, and updates are low volume and not performance-critical. A simple database should suffice.

Message Journals

For reliability and durability we will use a message journal per queue. A cleanup agent (thread or process) removes, recycles or compacts journal files that have no live (undelivered) messages. (References complicate the book-keeping a little but don't alter the conceptual model.) Recovery or restart reconstructs the queue contents from the journals.

Flow-to-disk can re-use the journal framework with a simple extension: the broker keeps an in-memory index of live messages in the journal. If flow-to-disk is combined with reliability, messages are already journalled, so only the in-memory copy needs to be released. Without reliability, flow-to-disk is similar except that messages are only journalled if memory gets tight.

Disk thrashing: why do we think skipping disk heads around between multiple journals will be better than seeking up and down a single journal? Are we assuming that we only need to optimize the case where long sequences of traffic tend to be for the same queue?

No write on fast consume: optimization - if we can deliver a message (and get the consumer's acknowledgement) before its journal write has happened, we can skip the write entirely.

Async journalling: writing to client, writing to journal, acks from client, acks from journal are separate async streams? So if we get the client ack before the journalling stream has written the journal, we cancel the write? But what kind of ack info do we need? Need a diagram of interactions, failure points and responses at each point. Start simple, but don't rule out optimizations. A sketch of the cancel-on-ack idea follows.
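A sketch of that cancel-on-ack idea (illustrative types; a real implementation would have an async journaller thread and on-disk record formats):

```cpp
// Sketch only: per-queue journal where an early consumer ack cancels the
// pending disk write, and flushed-then-acked entries become dead entries
// for the cleanup agent.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

class QueueJournal {
public:
    // Enqueue: record the message as pending; an async journaller would
    // pick pending entries up and flush them to the journal file.
    void publish(uint64_t id, const std::string& body) {
        pending_[id] = body;
    }
    // Consumer acknowledged the message.
    void acknowledged(uint64_t id) {
        if (pending_.erase(id))
            std::cout << "#" << id << ": ack before flush, write skipped\n";
        else
            std::cout << "#" << id << ": already journalled, mark dead\n";
        // Dead entries are reclaimed later by the cleanup agent that
        // recycles or compacts files with no live messages.
    }
    // Periodic flush point of the async journaller.
    void flush() {
        for (auto& [id, body] : pending_)
            std::cout << "journal write #" << id
                      << " (" << body.size() << " bytes)\n";  // stand-in I/O
        pending_.clear();
    }
private:
    std::map<uint64_t, std::string> pending_;  // published, not yet flushed
};

int main() {
    QueueJournal j;
    j.publish(1, "fast");
    j.acknowledged(1);  // consumed before any flush: no disk write at all
    j.publish(2, "slow");
    j.flush();          // #2 hits the journal
    j.acknowledged(2);  // now a dead entry for the cleanup agent
}
```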
What about persistence-free reliability?

Is memory-only replication with no disk a viable option for high-speed reliable messaging?

Virtual synchrony

TODO: Wiring & membership via virtual synchrony.
TODO: Journalling, speed. Will file-per-queue really help with disk burnout?

Configuration

Simplifying patterns

Possible ways to configure a cluster:

Dynamic cluster configuration
Issue: the unit of failover/redirect is a connection/channel, but the "working set" of queues and exchanges is unrelated. Use the virtual host as the unit for failover/relocation? It's also a queue namespace... If a queue moves we have to redirect its consumers, but we can't redirect at a finer granularity than the connection/channel.

Backups: chained backups rather than multi-backup? Ring backup (see the sketch at the end of this note)? What about split brain, elections, quorums etc.? Should new backups acquire state from the primary, from disk, or possibly from another backup?

Open Questions

Issues: double failure in a backup ring A -> B -> C: if A and B fail simultaneously, A's state (backed up only on B) is lost.

Java/C++ interworking - is there a requirement? Fail over from C++ to Java? Common persistence formats?

Implementation breakdown

The following are independently useful units of work that combine to implement the full design:

Proxy Queues: useful in federation. Pure-AMQP proxies for exchanges might also be useful, but are not needed for the current purpose as we will use virtual synchrony to replicate wiring.

Fragmented queues: over pure AMQP (no VS); useful by themselves for unreliable (non-replicated) shared queues.

Virtual Synchrony Cluster: multicast membership and total ordering protocol for brokers. Not useful alone, but useful with proxies and/or fragments for dynamic federations.

Primary-backup replication: over AMQP, no persistence. Still offers some protection against single failures.

Persistence: useful on its own for flow-to-disk and durable messages. Must meet the performance requirements of reliable journalling.
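To illustrate the ring-backup question raised above (names and structure are ours, one possible reading of "ring backup"): members are ordered in a ring, each node is backed up by its successor, and takeover after a failure goes to the next live node. The double-failure issue falls out directly.

```cpp
// Sketch only: who takes over for a failed node in a backup ring.
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Given a ring A -> B -> C (each node backed up by the next), the takeover
// candidate for `failed` is the first live node after it in ring order.
std::string takeoverFor(const std::vector<std::string>& ring,
                        const std::set<std::string>& live,
                        const std::string& failed) {
    for (size_t i = 0; i < ring.size(); ++i) {
        if (ring[i] != failed) continue;
        for (size_t step = 1; step < ring.size(); ++step) {
            const std::string& cand = ring[(i + step) % ring.size()];
            if (live.count(cand)) return cand;
        }
    }
    return "";  // no live successor: cluster lost
}

int main() {
    std::vector<std::string> ring = {"A", "B", "C"};
    std::cout << takeoverFor(ring, {"B", "C"}, "A") << "\n";  // B takes over
    // Double failure: A and B fail together. C is the next live node, but
    // B held A's backup, so A's state may be gone - the open question above.
    std::cout << takeoverFor(ring, {"C"}, "A") << "\n";       // C
}
```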
