Re: [PR] PHOENIX-7843 TLA+ specification of Consistent Failover [phoenix]

via GitHub Mon, 11 May 2026 11:33:20 -0700


ritegarg commented on code in PR #2461:
URL: https://github.com/apache/phoenix/pull/2461#discussion_r3220315747



##########
src/main/tla/ConsistentFailover/markdown/Admin.md:
##########
@@ -0,0 +1,151 @@
+# Admin -- Operator-Initiated Actions
+
+**Source:** [`Admin.tla`](../Admin.tla)
+
+## Overview
+
+`Admin` models the human operator (Admin actor) who drives failover and abort 
via the `PhoenixHAAdminTool` CLI, which delegates to `HAGroupStoreManager` 
coprocessor endpoints. This module contains four actions: `AdminStartFailover` 
(initiate failover), `AdminAbortFailover` (abort an in-progress failover), 
`AdminGoOffline` (take a standby cluster offline), and `AdminForceRecover` 
(force-recover from OFFLINE). The last two are gated on the 
`UseOfflinePeerDetection` feature flag and model the proactive design for peer 
OFFLINE detection using `PhoenixHAAdminTool update --force`.
+
+These are the only actions in the specification that represent deliberate 
human decisions rather than automated system behavior. All four 
(`AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, 
`AdminForceRecover`) receive no fairness in the liveness specifications. The 
admin is genuinely non-deterministic. The admin might never initiate a 
failover, or might abort every failover attempt. Imposing fairness on admin 
actions would force unrealistic guarantees about human behavior.
+
+### Modeling Choice: Direct ZK Writes
+
+Unlike the peer-reactive transitions in [HAGroupStore.md](HAGroupStore.md), 
admin actions are direct ZK writes -- they are not watcher-dependent. The admin 
CLI writes directly to the local ZK znode via the coprocessor endpoint. No 
`zkPeerConnected` or `zkLocalConnected` guard is needed because the admin tool 
manages its own ZK connection independently of the `HAGroupStoreClient` watcher 
infrastructure. `AdminGoOffline` and `AdminForceRecover` also use the `--force` 
path which writes directly to ZK, so no `zkLocalConnected` guard is needed for 
those actions either.
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `AdminStartFailover(c)` | 
`HAGroupStoreManager.initiateFailoverOnActiveCluster()` L375-400 |
+| `AdminAbortFailover(c)` | 
`HAGroupStoreManager.setHAGroupStatusToAbortToStandby()` L419-425; also clears 
`failoverPending` (models `abortFailoverListener` L173-185) |
+| `AdminGoOffline(c)` | `PhoenixHAAdminTool update --state OFFLINE` (gated on 
`UseOfflinePeerDetection`) |
+| `AdminForceRecover(c)` | `PhoenixHAAdminTool update --force --state STANDBY` 
(OFFLINE -> S; gated on `UseOfflinePeerDetection`) |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## AdminStartFailover -- Initiate Failover
+
+The admin initiates failover on the active cluster. Two paths depending on 
current state:
+
+### AIS Path: AIS -> ATS
+
+The cluster is fully in sync. The OUT directory must be empty and all live RS 
must be in SYNC mode. DEAD RSes are allowed -- an RS can crash while the 
cluster is AIS without changing the HA group state. The implementation checks 
`clusterState = AIS`, not per-RS modes; a DEAD RS is not writing, so the 
remaining SYNC RSes and empty OUT dir ensure safety.
+
+### ANIS Path: ANIS -> ANISTS
+
+The cluster is not in sync (at least one RS is in STORE_AND_FWD). The 
implementation only validates the current state (ANIS) and peer state -- no 
`outDirEmpty` or writer-mode guards are needed because the forwarder will drain 
OUT after the transition. The ANISTS -> ATS transition (`ANISTSToATS` in 
[HAGroupStore.md](HAGroupStore.md)) guards on `outDirEmpty` and the 
anti-flapping gate.
+
+### Peer-State Guard (Both Paths)
+
+The peer must be in a stable standby state (S or DS) to prevent initiating a 
new failover during the non-atomic window of a previous failover (where the 
peer may still be in ATS). Without this guard, the admin could produce an 
irrecoverable `(ATS, ATS)` or `(ANISTS, ATS)` deadlock where both clusters are 
transitioning to standby with mutations blocked on both sides.
+
+### Post-Condition
+
+Cluster `c` transitions to ATS or ANISTS, both of which map to the 
ACTIVE_TO_STANDBY role, blocking mutations (`isMutationBlocked() = true`).

Review Comment:
   This will change, shall I update it later when I make the implementation 
change? or do we want to right now? ANISTS will map to ACTIVE. This will 
optimize the failover time



##########
src/main/tla/ConsistentFailover/markdown/HAGroupStore.md:
##########
@@ -0,0 +1,445 @@
+# HAGroupStore -- Peer-Reactive Transitions and Auto-Completion
+
+**Source:** [`HAGroupStore.tla`](../HAGroupStore.tla)
+
+## Overview
+
+`HAGroupStore` models the peer-reactive transitions and auto-completion 
actions of the Phoenix Consistent Failover protocol. These actions correspond 
to the `FailoverManagementListener` (`HAGroupStoreManager.java` L633-706), 
which reacts to peer ZK state changes via `PathChildrenCache` watchers, and the 
local auto-completion resolvers from `createLocalStateTransitions()` (L140-150).

Review Comment:
   There is no discussion of SYSTEM.HA_GROUP system table.  Is that intentional?



##########
src/main/tla/ConsistentFailover/HAGroupStore.tla:
##########
@@ -0,0 +1,583 @@
+-------------------- MODULE HAGroupStore 
----------------------------------------
+(*
+ * Peer-reactive transitions and auto-completion actions for the
+ * Phoenix Consistent Failover protocol.
+ *
+ * Actions model the FailoverManagementListener (HAGroupStoreManager.java
+ * L633-706) which reacts to peer ZK state changes via PathChildrenCache
+ * watchers, and the local auto-completion resolvers from
+ * createLocalStateTransitions() (L140-150).
+ *
+ * ZK WATCHER DELIVERY DEPENDENCY: All PeerReact* actions depend on
+ * the peer ZK connection and session being alive (guarded by
+ * zkPeerConnected[c] and zkPeerSessionAlive[c]). AutoComplete
+ * actions depend on the local ZK connection (guarded by
+ * zkLocalConnected[c]). Without these connections, watcher
+ * notifications cannot be delivered.
+ *
+ * RETRY EXHAUSTION: The FailoverManagementListener retries each
+ * reactive transition exactly 2 times (HAGroupStoreManager.java
+ * L653-704). After exhaustion, the method returns silently. This
+ * is modeled by the ReactiveTransitionFail(c) action, which
+ * non-deterministically "consumes" a pending peer-reactive
+ * transition without updating clusterState.
+ *
+ * Notification chain (peer-reactive transitions):
+ *   Peer ZK znode change
+ *     -> Curator peerPathChildrenCache
+ *       -> HAGroupStoreClient.handleStateChange() [L1088-1110]
+ *         -> notifySubscribers() [L1119-1151]
+ *           -> FailoverManagementListener.onStateChange() [L653-705]
+ *             -> setHAGroupStatusIfNeeded() (2-retry limit)
+ *
+ * Notification chain (auto-completion transitions):
+ *   Local ZK znode change
+ *     -> Curator pathChildrenCache (local)
+ *       -> HAGroupStoreClient.handleStateChange()
+ *         -> notifySubscribers()
+ *           -> FailoverManagementListener.onStateChange()
+ *
+ * LISTENER FOLDS: The recoveryListener (L147-157) and degradedListener
+ * (L136-145) from ReplicationLogDiscoveryReplay fire synchronously on
+ * the local PathChildrenCache event thread during state entry. Their
+ * effects are folded atomically into the S-entry and DS-entry actions:
+ *   - S entry (PeerReactToANIS ATS->S, PeerReactToAIS ATS->S,
+ *     PeerReactToAIS DS->S, AutoComplete AbTS->S): sets replayState
+ *     to SYNCED_RECOVERY.
+ *   - DS entry (PeerReactToANIS S->DS): sets replayState to DEGRADED.
+ *
+ * Implementation traceability:
+ *
+ *   TLA+ action               | Java source
+ *   --------------------------+-----------------------------------------------
+ *   PeerReactToATS(c)         | createPeerStateTransitions() L109
+ *   PeerReactToANIS(c)        | createPeerStateTransitions() L123, L126
+ *   PeerReactToAbTS(c)        | createPeerStateTransitions() L132
+ *   PeerReactToAIS(c)         | createPeerStateTransitions() L112-120
+ *   AutoComplete(c)           | createLocalStateTransitions() L144, L145, L147
+ *   ANISTSToATS(c)            | HAGroupStoreManager.setHAGroupStatusToSync()
+ *                             |   L341-355 (ANISTS -> ATS drain completion)
+ *   PeerReactToOFFLINE(c)     | (proactive, Iteration 18) intended peer
+ *                             |   OFFLINE detection; no impl trigger yet;
+ *                             |   gated on UseOfflinePeerDetection
+ *   PeerRecoverFromOFFLINE(c) | (proactive, Iteration 18) intended peer
+ *                             |   OFFLINE recovery detection; no impl
+ *                             |   trigger yet; gated on
+ *                             |   UseOfflinePeerDetection
+ *   ReactiveTransitionFail(c) | FailoverManagementListener.onStateChange()
+ *                             |   L653-704 (2 retries exhausted, returns
+ *                             |   silently)
+ *
+ * Failover completion (STA -> AIS) is modeled in Reader.tla
+ * (TriggerFailover action), not in this module.
+ *)
+EXTENDS SpecState, Types
+
+---------------------------------------------------------------------------
+
+(*
+ * Standby entry from ATS (S-entry side effects).
+ *
+ * Shared by the ATS -> S branches of PeerReactToAIS and
+ * PeerReactToANIS. On standby entry from ATS, three side effects
+ * fire atomically:
+ *
+ *   1. Live writers reset to INIT (DEAD preserved). Models the
+ *      replication subsystem restart on standby entry: the entire
+ *      ReplicationLogGroup is destroyed, so any SYNC/SYNC_AND_FWD/
+ *      STORE_AND_FWD writers become INIT. DEAD writers (from
+ *      RSCrash or CAS failure) are preserved because crashed RSes
+ *      cannot process the state-change notification; they are
+ *      handled by RSRestart independently.
+ *
+ *   2. OUT directory cleared. The forwarder drains before ATS
+ *      auto-completes, so outDirEmpty is TRUE by the time the
+ *      peer's AIS triggers the local S entry.
+ *
+ *   3. recoveryListener (L147-157) fires synchronously on the
+ *      local PathChildrenCache event thread during S entry,
+ *      unconditionally setting replayState to SYNCED_RECOVERY.
+ *
+ * Not applied to AdminForceRecover (resets all writers to INIT
+ * with no DEAD-preservation) or AutoComplete AbTS->S (only sets
+ * replayState; no writer/OUT reset needed because AbTS was never
+ * active).
+ *)
+ResetToStandbyEntry(c) ==
+    /\ writerMode' = [writerMode EXCEPT ![c] =
+            [rs \in RS |-> IF writerMode[c][rs] = "DEAD"
+                           THEN "DEAD"
+                           ELSE "INIT"]]
+    /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE]
+    /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+
+---------------------------------------------------------------------------
+
+(*
+ * Peer transitions to ATS (ACTIVE_IN_SYNC_TO_STANDBY).
+ *
+ * When the standby detects its peer has entered ATS, it begins the
+ * failover process by transitioning to STA (STANDBY_TO_ACTIVE).
+ * This fires from either S or DS -- the DS case supports the ANIS
+ * failover path where the standby is in DEGRADED_STANDBY when
+ * failover proceeds.
+ *
+ * ZK watcher dependency: Delivered via peerPathChildrenCache.
+ * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c].
+ * If the peer ZK session expires or the notification is lost, the
+ * standby never learns of the failover. The active cluster remains
+ * in ATS with mutations blocked indefinitely. No polling fallback.
+ *
+ * Source: createPeerStateTransitions() L109 -- resolver is
+ *         unconditional: currentLocal -> STANDBY_TO_ACTIVE.
+ *
+ * Also sets failoverPending[c] = TRUE, modeling the
+ * triggerFailoverListener (ReplicationLogDiscoveryReplay.java
+ * L159-171) which fires on LOCAL STANDBY_TO_ACTIVE. Folded
+ * into PeerReactToATS because the listener fires
+ * deterministically on every STA entry and PeerReactToATS is
+ * the sole producer of STA.
+ *)
+PeerReactToATS(c) ==
+    /\ PeerZKHealthy(c)
+    /\ clusterState[Peer(c)] = "ATS"
+    /\ clusterState[c] \in {"S", "DS"}
+    /\ clusterState' = [clusterState EXCEPT ![c] = "STA"]
+    /\ failoverPending' = [failoverPending EXCEPT ![c] = TRUE]
+    /\ UNCHANGED <<writerVars, replayVars, envVars,
+                   outDirEmpty, antiFlapTimer, inProgressDirEmpty>>
+
+---------------------------------------------------------------------------
+
+(*
+ * Peer transitions to ANIS (ACTIVE_NOT_IN_SYNC).
+ *
+ * Two reactive transitions triggered by peer entering ANIS:
+ *   1. Local S -> DS: standby degrades because peer's replication is
+ *      degraded. Atomically sets replayState = DEGRADED (degraded-
+ *      Listener fold). Source: L126.
+ *   2. Local ATS -> S: old active (in failover) completes transition
+ *      to standby when peer is ANIS. Atomically sets replayState =
+ *      SYNCED_RECOVERY (recoveryListener fold). Source: L123.
+ *
+ * ZK watcher dependency: Delivered via peerPathChildrenCache.
+ * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c].
+ * If lost: (1) standby stays in S when it should be DS -- consistency
+ * point tracking is incorrect; (2) old active stays in ATS with
+ * mutations blocked. No polling fallback.
+ *)
+PeerReactToANIS(c) ==
+    /\ PeerZKHealthy(c)
+    /\ clusterState[Peer(c)] = "ANIS"
+    /\ \/ /\ clusterState[c] = "S"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "DS"]
+          \* degradedListener: unconditional set(DEGRADED) fires
+          \* synchronously on local PathChildrenCache thread during
+          \* DS entry. Counter advance is handled by ReplayAdvance.
+          /\ replayState' = [replayState EXCEPT ![c] = "DEGRADED"]
+          /\ UNCHANGED <<writerVars, envVars,
+                         outDirEmpty, antiFlapTimer,
+                         failoverPending, inProgressDirEmpty,
+                         lastRoundInSync, lastRoundProcessed>>
+       \/ /\ clusterState[c] = "ATS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+          /\ ResetToStandbyEntry(c)
+          /\ UNCHANGED <<envVars,
+                         antiFlapTimer, failoverPending, inProgressDirEmpty,
+                         lastRoundInSync, lastRoundProcessed>>
+
+---------------------------------------------------------------------------
+
+(*
+ * Peer transitions to AbTS (ABORT_TO_STANDBY).
+ *
+ * When the active cluster (in ATS during failover) detects its peer
+ * has entered AbTS (abort initiated from the standby side), the
+ * active transitions to AbTAIS (ABORT_TO_ACTIVE_IN_SYNC).
+ *
+ * ZK watcher dependency: Delivered via peerPathChildrenCache.
+ * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c].
+ * If lost: active stays in ATS with mutations blocked; abort does
+ * not propagate. No polling fallback.
+ *
+ * Source: createPeerStateTransitions() L132.
+ *)
+PeerReactToAbTS(c) ==
+    /\ PeerZKHealthy(c)
+    /\ clusterState[Peer(c)] = "AbTS"
+    /\ clusterState[c] = "ATS"
+    /\ clusterState' = [clusterState EXCEPT ![c] = "AbTAIS"]
+    /\ UNCHANGED <<writerVars, replayVars, envVars,
+                   outDirEmpty, antiFlapTimer,
+                   failoverPending, inProgressDirEmpty>>
+
+---------------------------------------------------------------------------
+
+(*
+ * Auto-completion transitions (local, no peer trigger).
+ *
+ * These transitions fire automatically once the cluster enters the
+ * corresponding abort state. They return the cluster to its pre-
+ * failover state.
+ *
+ * ZK watcher dependency: Despite being "local" (no peer trigger),
+ * these transitions are driven by the local pathChildrenCache
+ * watcher chain, not an in-process event bus. Guarded on
+ * zkLocalConnected[c]. If the local ZK connection is lost, the
+ * cluster remains in the AbTS/AbTAIS/AbTANIS state indefinitely.
+ *
+ * AbTAIS auto-completion: conditional -- completes to AIS if all
+ * writers are clean (INIT or SYNC) and OUT dir is empty, otherwise
+ * completes to ANIS. This prevents AIS from coexisting with
+ * degraded writers when HDFS fails during the abort window.
+ *
+ * Source: createLocalStateTransitions() L140-150
+ *   AbTS   -> S    (L144) -- atomically sets replayState =
+ *                            SYNCED_RECOVERY (recoveryListener fold)
+ *   AbTAIS -> AIS or ANIS  (L145) -- conditional on writer/outDir state
+ *   AbTANIS -> ANIS (L147)
+ *)
+AutoComplete(c) ==
+    /\ LocalZKHealthy(c)
+    /\ \/ /\ clusterState[c] = "AbTS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+          \* recoveryListener: unconditional set(SYNCED_RECOVERY)
+          \* fires synchronously on local PathChildrenCache thread
+          \* during S entry.
+          /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+          /\ UNCHANGED <<writerVars, envVars,
+                         outDirEmpty, antiFlapTimer,
+                         failoverPending, inProgressDirEmpty,
+                         lastRoundInSync, lastRoundProcessed>>
+       \/ /\ clusterState[c] = "AbTAIS"
+          /\ clusterState' = [clusterState EXCEPT ![c] =
+                 IF outDirEmpty[c] /\ \A rs \in RS : writerMode[c][rs] \in 
{"INIT", "SYNC"}
+                 THEN "AIS"
+                 ELSE "ANIS"]

Review Comment:
   This is not the implementation today. IIRC, this change is also not planned 
yet. Currently we go to AIS unconditionally as we expect the Active (A) cluster 
to be in a sync state when it moved to ATS so no new mutation has happened 
since then. Writer should handle AIS to ANIS transition once new write comes in.



##########
src/main/tla/ConsistentFailover/markdown/Admin.md:
##########
@@ -0,0 +1,151 @@
+# Admin -- Operator-Initiated Actions
+
+**Source:** [`Admin.tla`](../Admin.tla)
+
+## Overview
+
+`Admin` models the human operator (Admin actor) who drives failover and abort 
via the `PhoenixHAAdminTool` CLI, which delegates to `HAGroupStoreManager` 
coprocessor endpoints. This module contains four actions: `AdminStartFailover` 
(initiate failover), `AdminAbortFailover` (abort an in-progress failover), 
`AdminGoOffline` (take a standby cluster offline), and `AdminForceRecover` 
(force-recover from OFFLINE). The last two are gated on the 
`UseOfflinePeerDetection` feature flag and model the proactive design for peer 
OFFLINE detection using `PhoenixHAAdminTool update --force`.
+
+These are the only actions in the specification that represent deliberate 
human decisions rather than automated system behavior. All four 
(`AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, 
`AdminForceRecover`) receive no fairness in the liveness specifications. The 
admin is genuinely non-deterministic. The admin might never initiate a 
failover, or might abort every failover attempt. Imposing fairness on admin 
actions would force unrealistic guarantees about human behavior.
+
+### Modeling Choice: Direct ZK Writes
+
+Unlike the peer-reactive transitions in [HAGroupStore.md](HAGroupStore.md), 
admin actions are direct ZK writes -- they are not watcher-dependent. The admin 
CLI writes directly to the local ZK znode via the coprocessor endpoint. No 
`zkPeerConnected` or `zkLocalConnected` guard is needed because the admin tool 
manages its own ZK connection independently of the `HAGroupStoreClient` watcher 
infrastructure. `AdminGoOffline` and `AdminForceRecover` also use the `--force` 
path which writes directly to ZK, so no `zkLocalConnected` guard is needed for 
those actions either.
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `AdminStartFailover(c)` | 
`HAGroupStoreManager.initiateFailoverOnActiveCluster()` L375-400 |

Review Comment:
   The line number citations can become stale, should we use a commit hash to 
pin these or we expect the reader to figure out from the method name?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] PHOENIX-7843 TLA+ specification of Consistent Failover [phoenix]

Reply via email to