Author: aconway
Date: Mon Oct 1 18:07:47 2012
New Revision: 1392489
URL: http://svn.apache.org/viewvc?rev=1392489&view=rev
Log:
NO-JIRA: HA Updated design document.
Modified:
qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt
Modified: qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt
URL: http://svn.apache.org/viewvc/qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt?rev=1392489&r1=1392488&r2=1392489&view=diff
==============================================================================
--- qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt (original)
+++ qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt Mon Oct 1 18:07:47 2012
@@ -84,12 +84,6 @@ retry on a single address to fail over.
support configuring a fixed list of broker addresses when qpid is run
outside of a resource manager.
-Aside: Cold-standby is also possible using rgmanager with shared
-storage for the message store (e.g. GFS). If the broker fails, another
-broker is started on a different node and and recovers from the
-store. This bears investigation but the store recovery times are
-likely too long for failover.
-
** Replicating configuration
New queues and exchanges and their bindings also need to be replicated.
@@ -109,13 +103,9 @@ configuration.
Explicit exchange/queue qpid.replicate argument:
- none: the object is not replicated
- configuration: queues, exchanges and bindings are replicated but messages
are not.
-- messages: configuration and messages are replicated.
-
-TODO: provide configurable default for qpid.replicate
+- all: configuration and messages are replicated.
-[GRS: current prototype relies on queue sequence for message identity
-so selectively replicating certain messages on a given queue would be
-challenging. Selectively replicating certain queues however is trivial.]
+Provide a configurable broker-wide default: all/configuration/none.
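The replication-level resolution described above (an explicit qpid.replicate argument on the queue or exchange, falling back to a configurable broker-wide default) could be sketched as follows. The helper name and dictionary shape are illustrative assumptions, not the actual qpid C++ implementation:

```python
# Sketch: resolve the effective replication level for a declared object.
# An explicit qpid.replicate argument wins; otherwise the broker-wide
# configurable default applies. Names here are illustrative only.

VALID_LEVELS = ("none", "configuration", "all")

def replication_level(declare_args, broker_default="none"):
    """Return the effective replication level for a queue/exchange."""
    level = declare_args.get("qpid.replicate", broker_default)
    if level not in VALID_LEVELS:
        raise ValueError("invalid qpid.replicate value: %r" % (level,))
    return level

# Explicit argument overrides the broker default:
replication_level({"qpid.replicate": "all"}, "none")   # -> "all"
# No argument: the configurable default applies:
replication_level({}, "configuration")                 # -> "configuration"
```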
** Inconsistent errors
@@ -125,12 +115,13 @@ eliminates the need to stall the whole c
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.
-We have 2 options (configurable) for handling inconsistent errors,
+We have 3 options (configurable) for handling inconsistent errors,
on the backup that fails to store a message from primary we can:
- Abort the backup broker allowing it to be re-started.
- Raise a critical error on the backup broker but carry on with the message
lost.
-We can configure the option to abort or carry on per-queue, we
-will also provide a broker-wide configurable default.
+- Reset and re-try replication for just the affected queue.
+
+The choice of behaviour will be configurable per-queue, with a
+broker-wide configurable default.
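The three options above, resolved per-queue with a broker-wide default, might look like the following sketch. The policy names and configuration shape are assumptions for illustration, not actual qpid configuration keys:

```python
# Sketch: pick the reaction to an inconsistent store error on a backup.
# A per-queue setting wins; otherwise the broker-wide default applies.
# Policy names are illustrative, not real qpid options.

ABORT, CARRY_ON, RESET_QUEUE = "abort", "carry-on", "reset-queue"

def effective_policy(per_queue, queue, broker_default=ABORT):
    """Return the configured policy for an inconsistent error on `queue`."""
    policy = per_queue.get(queue, broker_default)
    if policy not in (ABORT, CARRY_ON, RESET_QUEUE):
        raise ValueError("unknown policy: %r" % (policy,))
    return policy

# Queue "q1" is configured to carry on; "q2" falls back to the default.
per_queue = {"q1": CARRY_ON}
effective_policy(per_queue, "q1")   # -> "carry-on"
effective_policy(per_queue, "q2")   # -> "abort"
```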
** New backups connecting to primary.
@@ -156,8 +147,8 @@ In backup mode, brokers reject connectio
so clients will fail over to the primary. HA admin tools mark their
connections so they are allowed to connect to backup brokers.
-Clients discover the primary by re-trying connection to the client URL
-until the successfully connect to the primary. In the case of a
+Clients discover the primary by re-trying connection to all addresses in the
+client URL until they successfully connect to the primary. In the case of a
virtual IP they re-try the same address until it is relocated to the
primary. In the case of a list of IPs the client tries each in
turn. Clients do multiple retries over a configured period of time
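The client-side discovery loop described above (try each address in turn, backups reject the connection, retry over a configured period) can be sketched as follows. `connect` is a stand-in for the real messaging client's connect call; the function name and parameters are illustrative:

```python
# Sketch of the client failover loop: cycle through every address in the
# client URL until a connection reaches the primary (backups reject client
# connections), giving up after a configured period. `connect` is a
# stand-in for the real messaging API, not qpid's actual client code.
import itertools
import time

def discover_primary(addresses, connect, timeout=30.0, delay=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Return (address, connection) for the first successful connect."""
    deadline = clock() + timeout
    for addr in itertools.cycle(addresses):
        try:
            return addr, connect(addr)   # only the primary accepts clients
        except ConnectionError:
            if clock() >= deadline:
                raise                     # configured retry period exhausted
            sleep(delay)
```

With a virtual IP the list has a single address that is retried until the IP is relocated; with a list of real IPs each is tried in turn, exactly as the text describes.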
@@ -168,12 +159,6 @@ is a separate broker URL for brokers sin
over a different network. The broker URL has to be a list of real
addresses rather than a virtual address.
-Brokers have the following states:
-- connecting: Backup broker trying to connect to primary - loops retrying
broker URL.
-- catchup: Backup connected to primary, catching up on pre-existing
configuration & messages.
-- ready: Backup fully caught-up, ready to take over as primary.
-- primary: Acting as primary, serving clients.
-
** Interaction with rgmanager
rgmanager interacts with qpid via 2 service scripts: backup &
@@ -190,8 +175,8 @@ the primary state. Backups discover the
*** Failover
-primary broker or node fails. Backup brokers see disconnect and go
-back to connecting mode.
+primary broker or node fails. Backup brokers see the disconnect and
+start trying to re-connect to the new primary.
rgmanager notices the failure and starts the primary service on a new node.
This tells qpidd to go to primary mode. Backups re-connect and catch up.
@@ -225,71 +210,30 @@ to become a ready backup.
** Interaction with the store.
-Clean shutdown: entire cluster is shut down cleanly by an admin tool:
-- primary stops accepting client connections till shutdown is complete.
-- backups come fully up to speed with primary state.
-- all shut down marking stores as 'clean' with an identifying UUID.
-
-After clean shutdown the cluster can re-start automatically as all nodes
-have equivalent stores. Stores starting up with the wrong UUID will fail.
-
-Stored status: clean(UUID)/dirty, primary/backup, generation number.
-- All stores are marked dirty except after a clean shutdown.
-- Generation number: passed to backups and incremented by new primary.
-
-After total crash must manually identify the "best" store, provide admin tool.
-Best = highest generation number among stores in primary state.
-
-Recovering from total crash: all brokers will refuse to start as all stores
are dirty.
-Check the stores manually to find the best one, then either:
-1. Copy stores:
- - copy good store to all hosts
- - restart qpidd on all hosts.
-2. Erase stores:
- - Erase the store on all other hosts.
- - Restart qpidd on the good store and wait for it to become primary.
- - Restart qpidd on all other hosts.
-
-Broker startup with store:
-- Dirty: refuse to start
-- Clean:
- - Start and load from store.
- - When connecting as backup, check UUID matches primary, shut down if not.
-- Empty: start ok, no UUID check with primary.
+Needs more detail:
+
+We want backup brokers to be able to use their stored messages on restart
+so they don't have to download everything from the primary.
+This requires an HA sequence number to be stored with each message
+so the backup can identify which messages it has in common with the primary.
+
+This will work very similarly to the way live backups can use in-memory
+messages to reduce the download.
+
+Need to determine which broker is chosen as initial primary based on the
+currency of stores, probably using stored generation numbers and status
+flags. Ideally automated with rgmanager, or some intervention might be
+required.
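The idea of using stored HA sequence numbers to limit the download could work roughly as sketched below: the restarting backup keeps the messages whose sequence numbers fall within the primary's current queue range and fetches only the rest. The function and its parameters are hypothetical, for illustration only:

```python
# Hypothetical sketch: a restarting backup compares its stored HA sequence
# numbers against the primary's current queue range and downloads only what
# it is missing, instead of the whole queue. Not actual qpid code.

def resume_plan(backup_seqs, primary_front, primary_back):
    """backup_seqs: HA sequence numbers the backup already holds on disk.
    primary_front/primary_back: inclusive sequence range on the primary.
    Returns (keep, fetch): sequences to keep locally and to download."""
    keep = [s for s in backup_seqs if primary_front <= s <= primary_back]
    have = set(keep)
    fetch = [s for s in range(primary_front, primary_back + 1)
             if s not in have]
    return keep, fetch

# Backup holds 3..6; primary queue currently spans 5..9:
resume_plan([3, 4, 5, 6], 5, 9)   # -> ([5, 6], [7, 8, 9])
```

Messages 3 and 4 were dequeued on the primary while the backup was down, so they are dropped; only 7-9 need to be transferred.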
* Current Limitations
(In no particular order at present)
-For message replication:
-
-LM1a - On failover, backups delete their queues and download the full queue
state from the
-primary. There was code to use messags already on the backup for
re-synchronisation, it
-was removed in early development (r1214490) to simplify the logic while
getting basic
-replication working. It needs to be re-introduced.
-
-LM1b - This re-synchronisation does not handle the case where a newly elected
primary is *behind*
-one of the other backups. To address this I propose a new event for restting
the sequence
-that the new primary would send out on detecting that a replicating browser is
ahead of
-it, requesting that the replica revert back to a particular sequence number.
The replica
-on receiving this event would then discard (i.e. dequeue) all the messages
ahead of that
-sequence number and reset the counter to correctly sequence any subsequently
delivered
-messages.
-
-LM2 - There is a need to handle wrap-around of the message sequence to avoid
-confusing the resynchronisation where a replica has been disconnected
-for a long time, sufficient for the sequence numbering to wrap around.
+For message replication (missing items have been fixed):
LM3 - Transactional changes to queue state are not replicated atomically.
-LM4 - Acknowledgements are confirmed to clients before the message has been
-dequeued from replicas or indeed from the local store if that is
-asynchronous.
-
-LM5 - During failover, messages (re)published to a queue before there are
-the requisite number of replication subscriptions established will be
-confirmed to the publisher before they are replicated, leaving them
-vulnerable to a loss of the new primary before they are replicated.
+LM4 - (No worse than store) Acknowledgements are confirmed to clients before
+the message has been dequeued from replicas or indeed from the local store if
+that is asynchronous.
LM6 - persistence: In the event of a total cluster failure there are
no tools to automatically identify the "latest" store.
@@ -323,21 +267,11 @@ case (b) can be addressed in a simple ma
(c) would require changes to the broker to allow client to simply
determine when the command has fully propagated.
-LC3 - Queues that are not in the query response received when a
-replica establishes a propagation subscription but exist locally are
-not deleted. I.e. Deletion of queues/exchanges while a replica is not
-connected will not be propagated. Solution is to delete any queues
-marked for propagation that exist locally but do not show up in the
-query response.
-
LC4 - It is possible on failover that the new primary did not
previously receive a given QMF event while a backup did (sort of an
analogous situation to LM1 but without an easy way to detect or remedy
it).
-LC5 - Need richer control over which queues/exchanges are propagated, and
-which are not.
-
LC6 - The events and query responses are not fully synchronized.
In particular it *is* possible to not receive a delete event but
@@ -356,12 +290,11 @@ LC6 - The events and query responses are
LC7 Federated links from the primary will be lost in failover; they will not
be re-connected on the new primary. Federation links to the primary can fail
over.
-LC8 Only plain FIFO queues can be replicated. LVQs and ring queues are not yet
supported.
-
LC9 The "last man standing" feature of the old cluster is not available.
* Benefits compared to previous cluster implementation.
+- Allows per queue/exchange control over what is replicated.
- Does not depend on openais/corosync, does not require multicast.
- Can be integrated with different resource managers: for example rgmanager,
PaceMaker, Veritas.
- Can be ported to/implemented in other environments: e.g. Java, Windows