Author: aconway
Date: Mon Oct 1 18:07:47 2012
New Revision: 1392489
URL: http://svn.apache.org/viewvc?rev=1392489&view=rev
Log:
NO-JIRA: HA Updated design document.
Modified:
qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt
Modified: qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt
URL: http://svn.apache.org/viewvc/qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt?rev=1392489&r1=1392488&r2=1392489&view=diff
==============================================================================
--- qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt (original)
+++ qpid/trunk/qpid/cpp/design_docs/new-ha-design.txt Mon Oct 1 18:07:47 2012
@@ -84,12 +84,6 @@ retry on a single address to fail over.
support configuring a fixed list of broker addresses when qpid is run
outside of a resource manager.
-Aside: Cold-standby is also possible using rgmanager with shared
-storage for the message store (e.g. GFS). If the broker fails, another
-broker is started on a different node and and recovers from the
-store. This bears investigation but the store recovery times are
-likely too long for failover.
-
** Replicating configuration
New queues and exchanges and their bindings also need to be replicated.
@@ -109,13 +103,9 @@ configuration.
Explicit exchange/queue qpid.replicate argument:
- none: the object is not replicated
- configuration: queues, exchanges and bindings are replicated but messages
are not.
-- messages: configuration and messages are replicated.
-
-TODO: provide configurable default for qpid.replicate
+- all: configuration and messages are replicated.
-[GRS: current prototype relies on queue sequence for message identity
-so selectively replicating certain messages on a given queue would be
-challenging. Selectively replicating certain queues however is trivial.]
+Provide a configurable broker-wide default: all/configuration/none.
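The replication-level resolution described above (an explicit qpid.replicate argument on the queue or exchange, falling back to a configurable broker-wide default) could be sketched as follows. The helper name and dictionary shape are illustrative assumptions, not the actual qpid C++ implementation:

```python
# Sketch: resolve the effective replication level for a declared object.
# An explicit qpid.replicate argument wins; otherwise the broker-wide
# configurable default applies. Names here are illustrative only.

VALID_LEVELS = ("none", "configuration", "all")

def replication_level(declare_args, broker_default="none"):
    """Return the effective replication level for a queue/exchange."""
    level = declare_args.get("qpid.replicate", broker_default)
    if level not in VALID_LEVELS:
        raise ValueError("invalid qpid.replicate value: %r" % (level,))
    return level

# Explicit argument overrides the broker default:
replication_level({"qpid.replicate": "all"}, "none")   # -> "all"
# No argument: the configurable default applies:
replication_level({}, "configuration")                 # -> "configuration"
```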
** Inconsistent errors
@@ -125,12 +115,13 @@ eliminates the need to stall the whole c
resolved. We still have to handle inconsistent store errors when store
and cluster are used together.
-We have 2 options (configurable) for handling inconsistent errors,
+We have 3 options (configurable) for handling inconsistent errors,
on the backup that fails to store a message from primary we can:
- Abort the backup broker allowing it to be re-started.
- Raise a critical error on the backup broker but carry on with the message
lost.
-We can configure the option to abort or carry on per-queue, we
-will also provide a broker-wide configurable default.
+- Reset and re-try replication for just the affected queue.
+
+The choice of behaviour will be configurable per-queue, with a
+broker-wide configurable default.
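The three options above, resolved per-queue with a broker-wide default, might look like the following sketch. The policy names and configuration shape are assumptions for illustration, not actual qpid configuration keys:

```python
# Sketch: pick the reaction to an inconsistent store error on a backup.
# A per-queue setting wins; otherwise the broker-wide default applies.
# Policy names are illustrative, not real qpid options.

ABORT, CARRY_ON, RESET_QUEUE = "abort", "carry-on", "reset-queue"

def effective_policy(per_queue, queue, broker_default=ABORT):
    """Return the configured policy for an inconsistent error on `queue`."""
    policy = per_queue.get(queue, broker_default)
    if policy not in (ABORT, CARRY_ON, RESET_QUEUE):
        raise ValueError("unknown policy: %r" % (policy,))
    return policy

# Queue "q1" is configured to carry on; "q2" falls back to the default.
per_queue = {"q1": CARRY_ON}
effective_policy(per_queue, "q1")   # -> "carry-on"
effective_policy(per_queue, "q2")   # -> "abort"
```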
** New backups connecting to primary.
@@ -156,8 +147,8 @@ In backup mode, brokers reject connectio
so clients will fail over to the primary. HA admin tools mark their
connections so they are allowed to connect to backup brokers.
-Clients discover the primary by re-trying connection to the client URL
-until the successfully connect to the primary. In the case of a
+Clients discover the primary by re-trying connection to all addresses in the
+client URL until they successfully connect to the primary. In the case of a
virtual IP they re-try the same address until it is relocated to the
primary. In the case of a list of IPs the client tries each in
turn. Clients do multiple retries over a configured period of time
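The client-side discovery loop described above (try each address in turn, backups reject the connection, retry over a configured period) can be sketched as follows. `connect` is a stand-in for the real messaging client's connect call; the function name and parameters are illustrative:

```python
# Sketch of the client failover loop: cycle through every address in the
# client URL until a connection reaches the primary (backups reject client
# connections), giving up after a configured period. `connect` is a
# stand-in for the real messaging API, not qpid's actual client code.
import itertools
import time

def discover_primary(addresses, connect, timeout=30.0, delay=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Return (address, connection) for the first successful connect."""
    deadline = clock() + timeout
    for addr in itertools.cycle(addresses):
        try:
            return addr, connect(addr)   # only the primary accepts clients
        except ConnectionError:
            if clock() >= deadline:
                raise                     # configured retry period exhausted
            sleep(delay)
```

With a virtual IP the list has a single address that is retried until the IP is relocated; with a list of real IPs each is tried in turn, exactly as the text describes.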
@@ -168,12 +159,6 @@ is a separate broker URL for brokers sin
over a different network. The broker URL has to be a list of real
addresses rather than a virtual address.
-Brokers have the following states:
-- connecting: Backup broker trying to connect to primary - loops retrying
broker URL.
-- catchup: Backup connected to primary, catching up on pre-existing
configuration & messages.
-- ready: Backup fully caught-up, ready to take over as primary.
-- primary: Acting as primary, serving clients.
-
** Interaction with rgmanager
rgmanager interacts with qpid via 2 service scripts: backup &
@@ -190,8 +175,8 @@ the primary state. Backups discover the
*** Failover
-primary broker or node fails. Backup brokers see disconnect and go
-back to connecting mode.
+primary broker or node fails. Backup brokers see the disconnect and
+start trying to re-connect to the new primary.
rgmanager notices the failure and starts the primary service on a new node.
This tells qpidd to go to primary mode. Backups re-connect and catch up.
@@ -225,71 +210,30 @@ to become a ready backup.
** Interaction with the store.
-Clean shutdown: entire cluster is shut down cleanly by an admin tool:
-- primary stops accepting client connections till shutdown is complete.
-- backups come fully up to speed with primary state.
-- all shut down marking stores as 'clean' with an identifying UUID.
-
-After clean shutdown the cluster can re-start automatically as all nodes
-have equivalent stores. Stores starting up with the wrong UUID will fail.
-
-Stored status: clean(UUID)/dirty, primary/backup, generation number.
-- All stores are marked dirty except after a clean shutdown.
-- Generation number: passed to backups and incremented by new primary.
-
-After total crash must manually identify the "best" store, provide admin tool.
-Best = highest generation number among stores in primary state.
-
-Recovering from total crash: all brokers will refuse to start as all stores
are dirty.
-Check the stores manually to find the best one, then either:
-1. Copy stores:
- - copy good store to all hosts
- - restart qpidd on all hosts.
-2. Erase stores:
- - Erase the store on all other hosts.
- - Restart qpidd on the good store and wait for it to become primary.
- - Restart qpidd on all other hosts.
-
-Broker startup with store:
-- Dirty: refuse to start
-- Clean:
- - Start and load from store.
- - When connecting as backup, check UUID matches primary, shut down if not.
-- Empty: start ok, no UUID check with primary.
+Needs more detail:
+
+We want backup brokers to be able to use their stored messages on restart
+so they don't have to download everything from the primary.
+This requires an HA sequence number to be stored with each message
+so the backup can identify which messages it has in common with the primary.
+
+This will work very similarly to the way live backups can use in-memory
+messages to reduce the download.
+
+Need to determine which broker is chosen as initial primary based on the
+currency of stores, probably using stored generation numbers and status
+flags. Ideally automated with rgmanager, or some intervention might be
+required.
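The idea of using stored HA sequence numbers to limit the download could work roughly as sketched below: the restarting backup keeps the messages whose sequence numbers fall within the primary's current queue range and fetches only the rest. The function and its parameters are hypothetical, for illustration only:

```python
# Hypothetical sketch: a restarting backup compares its stored HA sequence
# numbers against the primary's current queue range and downloads only what
# it is missing, instead of the whole queue. Not actual qpid code.

def resume_plan(backup_seqs, primary_front, primary_back):
    """backup_seqs: HA sequence numbers the backup already holds on disk.
    primary_front/primary_back: inclusive sequence range on the primary.
    Returns (keep, fetch): sequences to keep locally and to download."""
    keep = [s for s in backup_seqs if primary_front <= s <= primary_back]
    have = set(keep)
    fetch = [s for s in range(primary_front, primary_back + 1)
             if s not in have]
    return keep, fetch

# Backup holds 3..6; primary queue currently spans 5..9:
resume_plan([3, 4, 5, 6], 5, 9)   # -> ([5, 6], [7, 8, 9])
```

Messages 3 and 4 were dequeued on the primary while the backup was down, so they are dropped; only 7-9 need to be transferred.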
* Current Limitations
(In no particular order at present)
-For message replication:
-
-LM1a - On failover, backups delete their queues and download the full queue
state from the
-primary. There was code to use messags already on the backup for
re-synchronisation, it
-was removed in early development (r1214490) to simplify the logic while
getting basic
-replication working. It needs to be re-introduced.
-
-LM1b - This re-synchronisation does not handle the case where a newly elected
primary is *behind*
-one of the other backups. To address this I propose a new event for restting
the sequence
-that the new primary would send out on detecting that a replicating browser is
ahead of
-it, requesting that the replica revert back to a particular sequence number.
The replica
-on receiving this event would then discard (i.e. dequeue) all the messages
ahead of that
-sequence number and reset the counter to correctly sequence any subsequently
delivered
-messages.
-
-LM2 - There is a need to handle wrap-around of the message sequence to avoid
-confusing the resynchronisation where a replica has been disconnected
-for a long time, sufficient for the sequence numbering to wrap around.
+For message replication (missing items have been fixed):
LM3 - Transactional changes to queue state are not replicated atomically.
-LM4 - Acknowledgements are confirmed to clients before the message has been
-dequeued from replicas or indeed from the local store if that is
-asynchronous.
-
-LM5 - During failover, messages (re)published to a queue before there are
-the requisite number of replication subscriptions established will be
-confirmed to the publisher before they are replicated, leaving them
-vulnerable to a loss of the new primary before they are replicated.
+LM4 - (No worse than store) Acknowledgements are confirmed to clients before
+the message has been dequeued from replicas or indeed from the local store if
+that is asynchronous.
LM6 - persistence: In the event of a total cluster failure there are
no tools to automatically identify the "latest" store.
@@ -323,21 +267,11 @@ case (b) can be addressed in a simple ma
(c) would require changes to the broker to allow client to simply
determine when the command has fully propagated.
-LC3 - Queues that are not in the query response received when a
-replica establishes a propagation subscription but exist locally are
-not deleted. I.e. Deletion of queues/exchanges while a replica is not
-connected will not be propagated. Solution is to delete any queues
-marked for propagation that exist locally but do not show up in the
-query response.
-
LC4 - It is possible on failover that the new primary did not
previously receive a given QMF event while a backup did (sort of an
analogous situation to LM1 but without an easy way to detect or remedy
it).
-LC5 - Need richer control over which queues/exchanges are propagated, and
-which are not.
-
LC6 - The events and query responses are not fully synchronized.
In particular it *is* possible to not receive a delete event but
@@ -356,12 +290,11 @@ LC6 - The events and query responses are
LC7 Federated links from the primary will be lost in failover; they will not
be re-connected on the new primary. Federation links to the primary can fail
over.
-LC8 Only plain FIFO queues can be replicated. LVQs and ring queues are not yet
supported.
-
LC9 The "last man standing" feature of the old cluster is not available.
* Benefits compared to previous cluster implementation.
+- Allows per queue/exchange control over what is replicated.
- Does not depend on openais/corosync, does not require multicast.
- Can be integrated with different resource managers: for example rgmanager,
PaceMaker, Veritas.
- Can be ported to/implemented in other environments: e.g. Java, Windows