jcshepherd commented on code in PR #1338:
URL: https://github.com/apache/ratis/pull/1338#discussion_r2734175536


##########
ratis-docs/src/site/markdown/concept/index-v2.md:
##########
@@ -0,0 +1,499 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# Apache Ratis Concepts
+
+## Table of Contents
+
+1. [Overview of Raft and Apache Ratis](#overview-of-raft-and-apache-ratis)
+2. [Raft Cluster Topology](#raft-cluster-topology)
+3. [The Raft Log - Foundation of Consensus](#the-raft-log---foundation-of-consensus)
+4. [The State Machine - Your Application's Heart](#the-state-machine---your-applications-heart)
+5. [Consistency Models and Read Patterns](#consistency-models-and-read-patterns)
+6. [Snapshots - Managing Growth and Recovery](#snapshots---managing-growth-and-recovery)
+7. [Logical Organization of Ratis](#logical-organization-of-ratis)
+8. [Leadership and Fault Tolerance](#leadership-and-fault-tolerance)
+9. [Scaling with Multi-Raft Groups](#scaling-with-multi-raft-groups)
+
+## Overview of Raft and Apache Ratis
+
+The Raft consensus algorithm solves a fundamental problem in distributed systems: how do you get
+multiple computers to agree on a sequence of operations, even when some might fail or become
+unreachable? This problem, known as distributed consensus, is at the heart of building reliable
+distributed systems.
+
+Raft ensures that a cluster of servers maintains an identical, ordered log of operations. Each
+server applies these operations to its local state machine in the same order, guaranteeing that
+all servers end up with identical state. This approach, called state machine replication,
+provides both consistency and fault tolerance.
+
+You should consider using Raft when your system needs strong consistency guarantees across
+multiple servers. This typically applies to systems where correctness is more important than
+absolute performance, such as distributed databases, configuration management systems, or any
+application where split-brain scenarios would be unacceptable.
+
+Apache Ratis is a Java library that implements the Raft consensus protocol. The key word here
+is "library" - Ratis is not a standalone service that you communicate with over the network.
+Instead, you embed Ratis directly into your Java application, and it becomes part of your
+application's runtime.
+
+This embedded approach creates tight integration between your application and the consensus
+mechanism. Your application and Ratis run in the same JVM, sharing memory and computational
+resources. Your application provides the business logic (the "state machine" in Raft terminology),
+while Ratis handles the distributed consensus mechanics needed to keep multiple instances of your
+application synchronized.
+
+## Raft Cluster Topology
+
+Understanding the basic building blocks of a Raft deployment matters: how you arrange them
+affects both the correctness and performance of your system.
+
+### Servers, Clusters, and Groups
+
+A Raft server (also known as a "peer") is a single running instance of your application with
+Ratis embedded. Each server runs your state machine and participates in the consensus protocol.
+
+A Raft cluster is a physical collection of servers that can participate in consensus. A Raft
+group is a logical consensus domain that runs across a specific subset of peers in the cluster.
+At any given time, one peer in a group acts as the "leader" while the others are "followers" or
+"listeners". The leader handles all write requests and replicates operations to other peers in
+the group. Both leaders and followers can service read requests, with different consistency
+guarantees.
+
+A single cluster can host multiple independent Raft groups, each with its own leader election,
+log, and state replication. Groups typically consist of an odd number of peers (3, 5, or
+7 are common) to ensure clear majority decisions.
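+
+To make this concrete, below is a minimal sketch of defining a three-peer group and starting one
+of its servers. The host names, port, and fixed group UUID are illustrative placeholders, and the
+builder calls reflect a common Ratis setup rather than a complete production configuration:
+
+```java
+import java.util.UUID;
+
+import org.apache.ratis.conf.RaftProperties;
+import org.apache.ratis.protocol.RaftGroup;
+import org.apache.ratis.protocol.RaftGroupId;
+import org.apache.ratis.protocol.RaftPeer;
+import org.apache.ratis.protocol.RaftPeerId;
+import org.apache.ratis.server.RaftServer;
+import org.apache.ratis.statemachine.impl.BaseStateMachine;
+
+public final class PeerSetup {
+  public static void main(String[] args) throws Exception {
+    // Three peers, typically on separate hosts; this process runs "n0".
+    final RaftPeer p0 = RaftPeer.newBuilder().setId("n0").setAddress("host0:9872").build();
+    final RaftPeer p1 = RaftPeer.newBuilder().setId("n1").setAddress("host1:9872").build();
+    final RaftPeer p2 = RaftPeer.newBuilder().setId("n2").setAddress("host2:9872").build();
+
+    // Every peer must construct the group with the same group id.
+    final RaftGroup group = RaftGroup.valueOf(
+        RaftGroupId.valueOf(UUID.fromString("02511d47-d67c-49a3-9011-abb3109a44c1")), p0, p1, p2);
+
+    // Embedded: the Raft server runs inside this JVM, next to your state machine.
+    final RaftServer server = RaftServer.newBuilder()
+        .setServerId(RaftPeerId.valueOf("n0"))
+        .setGroup(group)
+        .setStateMachine(new BaseStateMachine())  // replace with your own state machine
+        .setProperties(new RaftProperties())
+        .build();
+    server.start();
+  }
+}
+```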
+
+### Majority-Based Decision-Making
+
+Raft's safety guarantees depend on majority agreement within each group. The leader replicates
+each operation to the followers in its group, and an operation is committed once a majority
+(at least N/2 + 1 of N peers, using integer division) acknowledges it. This means a group of 3
+peers can tolerate 1 failure, a group of 5 peers can tolerate 2 failures, and so on.
+
+This majority requirement affects both availability and performance. A group remains available as
+long as a majority of its peers are reachable and functioning. However, every transaction must
+wait for majority acknowledgment, so the slowest server in the majority determines your write
+latency.
+
+### Server Placement and Network Considerations
+
+The physical and network placement of your servers impacts both availability and performance.
+Placing all servers in the same rack or data center provides the lowest latency but risks
+creating a single point of failure. Distributing servers across multiple availability zones or
+data centers improves fault tolerance but can increase latency.
+
+A common approach is to place servers across multiple availability zones within a single region
+for a balance of fault tolerance and performance. For applications requiring geographic
+distribution, you might place servers in different regions, accepting higher latency in exchange
+for better disaster recovery capabilities.
+
+## The Raft Log - Foundation of Consensus
+
+The Raft log is the central data structure that makes distributed consensus possible. Each server
+in a Raft group maintains its own copy of this append-only ledger, which records every operation
+in the exact order it should be applied to the state machine.
+
+Each entry in the log contains three key pieces of information: the operation itself (what should
+be done), a log index (a sequential number indicating the entry's position), and a term number
+(the period during which a leader created this entry). Terms represent periods of leadership and
+increase each time a new leader is elected, preventing old leaders from overwriting newer entries.
+The combination of the term and log index is referred to as a term-index (`TermIndex`) and
+establishes the ordering of entries in the log.
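+
+As a small illustration (assuming the `TermIndex` class from the Ratis server API), comparison is
+by term first and then by index, so an entry written under a newer leader always orders after one
+written under an older leader:
+
+```java
+import org.apache.ratis.server.protocol.TermIndex;
+
+public final class TermIndexDemo {
+  public static void main(String[] args) {
+    final TermIndex a = TermIndex.valueOf(2, 100);  // index 100, written under the term-2 leader
+    final TermIndex b = TermIndex.valueOf(3, 100);  // same index, but a later term
+    // The term takes precedence in the ordering.
+    System.out.println(a.compareTo(b) < 0);  // true
+  }
+}
+```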
+
+The log serves as both the mechanism for replication (leaders send log entries to followers) and
+the source of truth for recovery (servers can rebuild their state by replaying the log). When we
+talk about "committing" an operation, we mean that a majority of servers have acknowledged
+storing that log entry, making it safe to apply to the state machine.
+
+## The State Machine - Your Application's Heart
+
+In Ratis, the state machine is your application's primary integration point: it implements your
+business logic and data storage operations.
+
+The state machine is not a finite state machine with states and transitions. Instead, it's a
+deterministic computation engine that processes a sequence of operations and maintains some
+internal state. The state machine must be deterministic: given the same sequence of operations,
+it must always produce the same results and end up in the same final state. Operations are
+processed sequentially, one at a time, in the order they appear in the Raft log.
+
+### State Machine Responsibilities
+
+Your state machine has three primary responsibilities. First, it processes Raft transactions by
+validating incoming requests before they're replicated and applying committed operations to your
+application state. Second, it maintains your application's actual data, which might be an
+in-memory data structure, a local database, files on disk, or any combination of these. Third,
+it creates point-in-time representations of its state (snapshots) and can restore its state from
+snapshots during recovery.
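+
+A deliberately simplified sketch, assuming an in-memory counter as the application state (in the
+spirit of the Ratis counter example); snapshot persistence and error handling are omitted:
+
+```java
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.atomic.AtomicLong;
+
+import org.apache.ratis.protocol.Message;
+import org.apache.ratis.statemachine.TransactionContext;
+import org.apache.ratis.statemachine.impl.BaseStateMachine;
+
+public class CounterStateMachine extends BaseStateMachine {
+  // Responsibility 2: the application's actual data, here just a counter.
+  private final AtomicLong counter = new AtomicLong();
+
+  // Responsibility 1: apply committed operations to the application state.
+  @Override
+  public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
+    final long value = counter.incrementAndGet();
+    updateLastAppliedTermIndex(trx.getLogEntry().getTerm(), trx.getLogEntry().getIndex());
+    return CompletableFuture.completedFuture(Message.valueOf(String.valueOf(value)));
+  }
+
+  // Read-only queries are answered from local state without touching the log.
+  @Override
+  public CompletableFuture<Message> query(Message request) {
+    return CompletableFuture.completedFuture(Message.valueOf(String.valueOf(counter.get())));
+  }
+
+  // Responsibility 3 (omitted): takeSnapshot() would persist `counter` and
+  // return the index of the last applied log entry.
+}
+```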
+
+### The State Machine Lifecycle
+
+The state machine operates at two different lifecycle levels: an overall peer lifecycle and a
+per-transaction processing lifecycle.
+
+#### Peer Lifecycle
+
+During initialization, when a peer starts up, the state machine loads any existing snapshots and
+prepares its internal data structures. The Raft layer then replays any log entries that occurred
+after the snapshot, bringing the peer up to the current state of the group.
+
+During normal operation, the state machine continuously processes transactions as they're
+committed by the Raft group, responds to leadership changes, and handles read-only queries. For
+read-only operations, the state machine can answer queries directly without going through the
+Raft log, providing better performance for reads but with consistency trade-offs.
+
+Periodically, the state machine creates snapshots of its current state. This happens either
+automatically based on configuration (like log size thresholds) or manually through
+administrative commands.
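+
+A hedged configuration sketch: if your version exposes the snapshot settings under
+`RaftServerConfigKeys.Snapshot`, automatic triggering can be enabled roughly like this (the
+threshold, a count of log entries since the last snapshot, is illustrative):
+
+```java
+import org.apache.ratis.conf.RaftProperties;
+import org.apache.ratis.server.RaftServerConfigKeys;
+
+public final class SnapshotConfig {
+  public static RaftProperties build() {
+    final RaftProperties properties = new RaftProperties();
+    // Take a snapshot automatically once enough log entries accumulate.
+    RaftServerConfigKeys.Snapshot.setAutoTriggerEnabled(properties, true);
+    RaftServerConfigKeys.Snapshot.setAutoTriggerThreshold(properties, 400_000L);
+    return properties;  // pass to RaftServer.newBuilder().setProperties(...)
+  }
+}
+```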
+
+#### Transaction Processing Lifecycle
+
+For each individual transaction, the state machine follows a multistep processing sequence. In
+the validation phase, the leader's state machine examines incoming requests through the
+`startTransaction` method. This is where you validate that the operation is properly structured
+and valid in the current context.
+
+In the pre-append phase, just before the operation is written to the log, the state machine can
+perform any final preparations through the `preAppendTransaction` method. After the operation is
+committed by the Raft group, the state machine is notified via `applyTransactionSerial` and can
+handle any order-sensitive logic that must happen before the main application logic is invoked.
+
+Finally, in the application phase, the operation is applied to the actual application state
+through the `applyTransaction` method. This is where your business logic executes and where the
+operation's effects become visible to future queries.
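+
+The sketch below shows the two hooks most state machines override, with a hypothetical emptiness
+check standing in for real validation; the pre-append and serial-apply hooks are noted in
+comments rather than implemented:
+
+```java
+import java.io.IOException;
+import java.util.concurrent.CompletableFuture;
+
+import org.apache.ratis.protocol.Message;
+import org.apache.ratis.protocol.RaftClientRequest;
+import org.apache.ratis.statemachine.TransactionContext;
+import org.apache.ratis.statemachine.impl.BaseStateMachine;
+
+public class LifecycleStateMachine extends BaseStateMachine {
+  // Validation phase: runs on the leader before the entry is replicated.
+  @Override
+  public TransactionContext startTransaction(RaftClientRequest request) throws IOException {
+    if (request.getMessage().getContent().isEmpty()) {
+      // Rejected here, the operation never enters the Raft log.
+      throw new IOException("empty request");
+    }
+    return TransactionContext.newBuilder()
+        .setStateMachine(this)
+        .setClientRequest(request)
+        .setLogData(request.getMessage().getContent())
+        .build();
+  }
+
+  // preAppendTransaction and applyTransactionSerial could be overridden here
+  // for the pre-append and order-sensitive phases described above.
+
+  // Application phase: runs on every peer after the entry commits.
+  @Override
+  public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
+    // Execute the business logic; its effects are visible to later queries.
+    return CompletableFuture.completedFuture(Message.valueOf("ok"));
+  }
+}
+```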
+
+### Designing Your State Machine
+
+When designing your state machine, ensure your operations are deterministic and can be
+efficiently serialized for replication. Operations must be idempotent, as Raft may occasionally
+replay operations during recovery scenarios.
+
+Plan how you'll represent your application's state for both runtime efficiency and snapshot
+serialization. If your state machine maintains state in external systems (databases, files),
+ensure your snapshot process captures this external state consistently.
+
+Robust error handling is crucial. Server-side errors require distinguishing between recoverable
+errors (like validation failures) and fatal errors (like storage failures). Errors in
+`startTransaction` prevent operations from being committed and replicated. Errors in
+`applyTransaction` are considered fatal since they indicate the state machine cannot process
+already-committed operations.
+
+## Consistency Models and Read Patterns
+
+In a distributed system, consistency refers to the guarantees you have about seeing the effects
+of write operations when you read data. For write operations, Raft and Ratis provide strong
+consistency: once a write operation is acknowledged as committed, all subsequent reads will see
+the effects of that write. Read operations are more complex because Ratis offers several
+different approaches with different consistency and performance characteristics.
+
+### Write Consistency
+
+Write operations in Ratis follow a straightforward path that provides strong consistency. Clients
+send write requests to the leader, which validates the operation through the state machine's
+`startTransaction` method, then replicates it to a majority of followers. Once a majority
+acknowledges, the operation is committed. The leader applies the operation to its state machine
+and returns the result to the client, while followers eventually apply the same operation in the
+same order.
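+
+A minimal client-side sketch of this write path (the group comes from wherever your application
+defines it, and the `"INCREMENT"` payload is a placeholder):
+
+```java
+import org.apache.ratis.client.RaftClient;
+import org.apache.ratis.conf.RaftProperties;
+import org.apache.ratis.protocol.Message;
+import org.apache.ratis.protocol.RaftClientReply;
+import org.apache.ratis.protocol.RaftGroup;
+
+public final class WriteExample {
+  public static void write(RaftGroup group) throws Exception {
+    try (RaftClient client = RaftClient.newBuilder()
+        .setProperties(new RaftProperties())
+        .setRaftGroup(group)
+        .build()) {
+      // Blocks until a majority has committed the entry and the leader applied it.
+      final RaftClientReply reply = client.io().send(Message.valueOf("INCREMENT"));
+      System.out.println("committed: " + reply.isSuccess());
+    }
+  }
+}
+```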
+
+### Read Consistency Options
+
+**Linearizable reads** provide the strongest consistency by going through the Raft protocol to
+ensure you're reading the most up-to-date committed data. Use the client's `sendReadOnly` method,
+which forces the leader to confirm it's still the leader before serving the read.
+
+**Leader reads** offer strong consistency but with caveats: these are reads served directly by
+the leader without going through the Raft protocol. Use `sendReadOnlyNonLinearizable` to query
+the leader's state machine directly. This is faster than linearizable reads but may return stale
+data if the leader has been partitioned from the majority.
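+
+A short sketch contrasting the two read paths described so far (client construction is as in the
+write example; the `"GET"` payload is a placeholder):
+
+```java
+import org.apache.ratis.client.RaftClient;
+import org.apache.ratis.protocol.Message;
+import org.apache.ratis.protocol.RaftClientReply;
+
+public final class ReadExample {
+  public static void read(RaftClient client) throws Exception {
+    final Message query = Message.valueOf("GET");
+
+    // Linearizable: confirms leadership through the Raft protocol before replying.
+    final RaftClientReply strict = client.io().sendReadOnly(query);
+
+    // Leader read: answered from the leader's state machine without that check;
+    // faster, but possibly stale if the leader is partitioned from the majority.
+    final RaftClientReply relaxed = client.io().sendReadOnlyNonLinearizable(query);
+
+    System.out.println(strict.getMessage().getContent().toStringUtf8());
+    System.out.println(relaxed.getMessage().getContent().toStringUtf8());
+  }
+}
+```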
+
+**Follower reads** provide eventual consistency by serving reads directly from followers using
+their local state machine. Call `sendReadOnly(message, serverId)` with a specific follower's

Review Comment:
   Ah, this is more nuanced than I thought. I've attempted to rewrite the section on "Read Consistency Options". Let me know what you think.


