amaliujia commented on code in PR #10503:
URL: https://github.com/apache/ozone/pull/10503#discussion_r3425428549

##########
hadoop-hdds/docs/content/design/leader-planned-execution.md:
##########
@@ -0,0 +1,1036 @@
+---
+title: Leader-Planned State Transition Execution
+summary: Leader computes DB changes once and sends them to followers, instead of every node repeating the same work
+date: 2026-06-12
+jira: HDDS-11898
+status: proposed
+author: Abhishek Pal
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Leader-Planned State Transition Execution
+
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Industry Precedent](#2-industry-precedent)
+   * [2.1 CockroachDB — Proposer-Evaluated KV](#21-cockroachdb--proposer-evaluated-kv)
+   * [2.2 YugabyteDB — Leader-Computed Write Batches](#22-yugabytedb--leader-computed-write-batches)
+   * [2.3 TiKV — Classic State Machine Replication](#23-tikv--classic-state-machine-replication)
+3. [Proposal](#3-proposal)
+   * [3.1 High-Level Architecture](#31-high-level-architecture)
+   * [3.2 Proto Format: ReplicatedStateTransition](#32-proto-format-replicatedstatetransition)
+   * [3.3 Managed Index Service](#33-managed-index-service)
+   * [3.4 Leader Planning Framework](#34-leader-planning-framework)
+   * [3.5 Apply Engine](#35-apply-engine)
+   * [3.6 State Machine Integration](#36-state-machine-integration)
+   * [3.7 Granular Locking](#37-granular-locking)
+4. [Detailed Concurrency Model](#4-detailed-concurrency-model)
+   * [4.1 Life of a Request](#41-life-of-a-request-detailed)
+   * [4.2 Concurrent Scenarios](#42-concurrent-scenarios)
+   * [4.3 Lock Granularity Summary](#43-lock-granularity-summary)
+5. [Correctness Arguments](#5-correctness-arguments)
+6. [Testing Strategy](#6-testing-strategy)
+   * [6.1 Unit Tests](#61-unit-tests)
+   * [6.2 Concurrency Tests](#62-concurrency-tests-critical-for-correctness)
+   * [6.3 Integration Tests](#63-integration-tests-3-node-ha-cluster)
+   * [6.4 Determinism Verification Test](#64-determinism-verification-test)
+   * [6.5 Stress/Chaos Tests](#65-stresschaos-tests)
+7. [Comparison: Legacy vs Planned Path](#7-comparison-legacy-vs-planned-path)
+8. [Migration Strategy](#8-migration-strategy)
+   * [8.1 Per-Command Migration](#81-per-command-migration)
+   * [8.2 Zero-Downtime Upgrade (ZDU) Compatibility](#82-zero-downtime-upgrade-zdu-compatibility)
+9. [FAQs](#9-faqs)
+
+---
+
+## 1. Motivation
+
+### Problem with the current OM HA write flow
+
+In the current architecture, every write request follows this path:
+
+```
+Client -> OM Leader (preExecute) -> Ratis replication -> All OMs (validateAndUpdateCache)
+```
+
+Every OM node (leader and followers) runs the full business logic again in
+`validateAndUpdateCache()`. This causes several problems:
+
+| Problem | Impact |
+|:--------|:-------|
+| **Wasted compute on followers** | Every follower repeats the full request processing — lock acquisition, validation, key building, quota computation |
+| **Risk of inconsistency** | `validateAndUpdateCache` reads from a table cache that may have different temporary state on leader vs follower when requests run at the same time |
+| **Double buffer complexity** | Results are queued in `OzoneManagerDoubleBuffer`, which adds batching delay and a separate flush thread. Snapshot barriers add more coordination |
+| **Tight coupling** | External consumers (such as Recon) must understand all command logic to read state changes |
+| **Lock contention** | OBS key operations hold a bucket-level write lock even though most creates/commits do not conflict with each other |
+
+### What we want
+
+1. Leader computes a **fixed DB patch** (a list of table puts and deletes) once.
+2. The patch is sent to all nodes via Ratis as a self-contained payload.
+3. All nodes (leader, followers, listeners) apply the patch directly to RocksDB — no business logic runs again.
+4. Locks are made more fine-grained so independent key operations can run in parallel.
+
+---
+
+## 2. Industry Precedent
+
+Before we detail the proposal, it is useful to see how other production
+Raft-based systems solve the same problem. There are two models used in
+practice:
+
+| Model | Who runs the logic? | What goes through Raft? | Used by |
+|-------|--------------------|-----------------------|---------|
+| **State Machine Replication (SMR)** | All replicas | Commands (the operation) | TiKV, older CockroachDB, existing Ozone replication model |
+| **Leader Execution (Primary-Backup)** | Leader only | Computed results (WriteBatch) | YugabyteDB, newer CockroachDB, **this proposal** |
+
+### 2.1 CockroachDB — Proposer-Evaluated KV
+
+CockroachDB originally used classic state machine replication (all replicas
+evaluate commands independently). They later moved to **Proposer-Evaluated KV
+(PEVK)** — the same pattern we propose here.
+
+**How it works:**
+1. The lease-holder (leader) evaluates the request and produces a `WriteBatch`
+   (the exact bytes to write to storage).
+2. The `WriteBatch` + computed response are proposed to Raft as a single entry.
+3. Followers apply the pre-computed `WriteBatch` directly — they do not
+   re-execute command logic.
+
+**Why they moved to this model** (from their design RFC):
+- Removes duplicate execution across replicas
+- Makes online migrations simpler (only leader needs new evaluation code)
+- Puts all complex logic in one place (the proposer)
+
+**Their concurrency control:**
+- **Latches** (fine-grained per-key read/write locks) are acquired **before**
+  evaluation and released **after** the WriteBatch is built — but **before**
+  Raft replication. This is the same pattern as our striped locks.
+- An in-memory **lock table** handles transaction-level conflict queueing.
+- MVCC timestamps provide version ordering.
+
+**Key insight:** CockroachDB holds latches only during evaluation, NOT during
+replication:
+```
+acquire_latch → evaluate(request) → build WriteBatch → release_latch → propose to Raft
+```
+This matches our design: `acquireLock → plan() → releaseLock → submit to Ratis`.
+
+Source: `pkg/kv/kvserver/replica_proposal.go`, `pkg/kv/kvserver/concurrency/`
+
+### 2.2 YugabyteDB — Leader-Computed Write Batches
+
+YugabyteDB also uses leader execution:
+
+1. The tablet leader computes the write batch (exact RocksDB key-value pairs).
+2. The batch is replicated via Raft to followers.
+3. Followers apply the **pre-computed byte-level changes** to their local
+   RocksDB — no re-execution.
+
+Their concurrency uses "provisional records" (intents) written to a separate
+RocksDB instance as persistent locks, plus hybrid timestamps for ordering.
+
+### 2.3 TiKV — Classic State Machine Replication
+
+TiKV uses the traditional approach: all replicas execute the same Raft log
+entries. However, TiKV's "commands" are already low-level Put/Delete operations
+on key-value pairs. There is no complex business logic to re-execute.
+
+This works for TiKV because the transaction coordination layer (Percolator 2PC)
+runs **above** Raft. By the time something enters the Raft log, it is already a
+simple key-value mutation.
+
+---
+
+## 3. Proposal
+
+### 3.1 High-Level Architecture
+
+```
+                         LEADER                                  FOLLOWER
+                  +------------------+                     +------------------+
+ Client Request   | startTransaction |                     | applyTransaction |
+ ------------->   |                  |                     |                  |
+                  | 1. Create        |  Ratis log entry    | 1. Parse proto   |
+                  |    PlannedReq    |  (serialized        | 2. For each      |
+                  | 2. preProcess    |  OMRequest with     |    DBDelta:      |
+                  | 3. authorize     |  embedded           |    table.put/    |
+                  | 4. acquireLock   |  BatchedState       |    delete        |
+                  | 5. plan()        |  Transitions)       | 3. Commit batch  |
+                  |    -> deltas     | ------------------> |                  |
+                  | 6. releaseLock   |                     | (zero business   |
+                  | 7. Build proto   |                     |  logic)          |
+                  +------------------+                     +------------------+

Review Comment:
   I guess in general I have a question of how to compute the RocksDB delta that can be applied on the follower through RocksDB batch commit. Not an expert in this domain, just curious.
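
To make the question concrete, the flow in the diagram can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the PR's implementation: `LeaderPlannedDeltaSketch`, `DbDelta`, `planCreateKey`, `applyDeltas`, and the string encodings are hypothetical names invented here, and in-memory `TreeMap`s stand in for the RocksDB tables so the example runs without dependencies. The leader reads its current state, runs the validation and quota arithmetic once, and emits a list of puts and deletes; every node then replays that list verbatim.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from the PR or from Ozone.
final class LeaderPlannedDeltaSketch {

  /** One planned mutation: a put (value != null) or a delete (value == null). */
  static final class DbDelta {
    final String table;
    final String key;
    final String value;

    DbDelta(String table, String key, String value) {
      this.table = table;
      this.key = key;
      this.value = value;
    }
  }

  /** In-memory stand-in for one node's RocksDB tables (assumption). */
  static Map<String, TreeMap<String, String>> newDb() {
    Map<String, TreeMap<String, String>> db = new HashMap<>();
    db.put("keyTable", new TreeMap<>());
    db.put("bucketTable", new TreeMap<>());
    return db;
  }

  /** Leader only: validate, compute the new quota once, emit the resulting puts. */
  static List<DbDelta> planCreateKey(Map<String, TreeMap<String, String>> leaderDb,
      String bucket, String keyName, long size) {
    String usedBytes = leaderDb.get("bucketTable").get(bucket);
    if (usedBytes == null) {
      throw new IllegalArgumentException("bucket not found: " + bucket);
    }
    long newUsed = Long.parseLong(usedBytes) + size;
    List<DbDelta> deltas = new ArrayList<>();
    deltas.add(new DbDelta("keyTable", bucket + "/" + keyName, "size=" + size));
    deltas.add(new DbDelta("bucketTable", bucket, Long.toString(newUsed)));
    return deltas;
  }

  /** Every node: apply the deltas verbatim, with no validation or recomputation. */
  static void applyDeltas(Map<String, TreeMap<String, String>> db, List<DbDelta> deltas) {
    for (DbDelta d : deltas) {
      if (d.value == null) {
        db.get(d.table).remove(d.key);
      } else {
        db.get(d.table).put(d.key, d.value);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, TreeMap<String, String>> leader = newDb();
    Map<String, TreeMap<String, String>> follower = newDb();
    leader.get("bucketTable").put("b1", "0");
    follower.get("bucketTable").put("b1", "0");

    // Planned once on the leader, then replayed on both nodes.
    List<DbDelta> deltas = planCreateKey(leader, "b1", "k1", 42L);
    applyDeltas(leader, deltas);
    applyDeltas(follower, deltas);

    System.out.println(leader.equals(follower)); // prints true
  }
}
```

In this sketch the follower never looks at the bucket's quota or checks that the bucket exists; it only replays the puts the leader planned. With real RocksDB tables, the apply step would presumably collect the same puts and deletes into a single write batch and commit it atomically, which is the "Commit batch" step in the diagram.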
