Andrew Wong has posted comments on this change. Change subject: WIP: Kudu Consistency Blog Post Pt1 ......................................................................
Patch Set 3: (32 comments) Really appreciated explicit definitions of what and when things happen. I think the section about design decisions can be shortened a bit but will see after the rest of the post is in. mostly comments about wording, some about content http://gerrit.cloudera.org:8080/#/c/7019/3/_posts/2017-05-30-kudu-consistency-pt1.md File _posts/2017-05-30-kudu-consistency-pt1.md: Line 11: of how they all play together and what are our future plans. would be nice to also include what is discussed in Part 1. PS3, Line 17: users and requests : are executed by many different machines, how about, "...users, and must serve requests across numerous machines,..."? PS3, Line 20: how Kudu implements this "how these systems implement" PS3, Line 29: single thread "single-threaded" Line 45: ## Motivation behind the consistency design decisions in Kudu: Can we assume the reader understands what a tradeoff is? If we can, I think we can take out the first paragraph. Maybe relabels this "Motivation behind some key design decisions in Kudu" since these aren't specifically consistency design decisions The key points here I think are: 1. the overarching goal for Kudu is fast analytic queries 2. we make similar design choices as those made in the Spanner paper w.r.t writes 3. we targeted continuous inserts 4. we wanted to grants users flexibility of the consistency constraint PS3, Line 95: semantics "guarantees", "constraints"? PS3, Line 101: Kudu currently lacks support for atomic multi-row mutation operations (i.e. mutation : operations to more than one row in the same or different tablets, planned as a future feature) so, : for the write path, we’ll be talking about the consistency semantics of single row mutations. : In this context we’ll discuss Kudu’s properties more from a key/value store standpoint. On the : other hand Kudu is an analytical storage engine so, for the read path, we’ll also discuss the : semantics of large (multi-row) scans. This moves the discussion more into the field of traditional : DBMSs. These ingredients make for a non-traditional discussion that is not exactly apples-to-apples : with what the reader might be familiar with, but our hope is that it still provides valuable, or : at least interesting, insight. maybe consider leaving the write and read discussion to their respective sections, or move this entire highlighted portion to the previous section (and leave this section as just "Relevant architectural components") PS3, Line 113: Operation execution "Transaction driver", "TransactionDriver"? trying to have each bullet's title be the component name Line 117: replicated to follower replicas by a leader replica following the Raft consensus algorithm[ref]. ", and thus, each tablet owns its own Consensus module." PS3, Line 121: Timestamp assignment and propagation maybe "Timestamp and Clock"? Or even have a separate bullet for "Clock". PS3, Line 130: landed "been added to"? PS3, Line 130: A request is not considered “committed” until is has landed on : the WAL of a majority of replicas Consider inverting this sentence, since we're introducing a bunch of terms here anyway: "A request is considered "committed" when it has been appended to a majority of replicas' WALs and registered with MVCC as committed" (and then maybe put the definition for WAL after MVCC) PS3, Line 132: This component serves a dual purpose storing the original client : requests and how they mutated each tablet’s internal state after being applied how about: "This component serves two purposes: it stores the original logical client-requested op (i.e. inserts, updates, deletes), and the physical change to the tablet after the op is applied (i.e. flushes and compactions). PS3, Line 136: Scans take a snapshots "A scan will take a snapshot" PS3, Line 136: they start executing and use "it starts executing, and will use" PS3, Line 137: unfinished operations as a tablet is scanned "unfinished tablet operations as it progresses" PS3, Line 139: component remove PS3, Line 159: 2. Timestamp assignment Note that this is done by the leader PS3, Line 161: follow up posts "a later post" PS3, Line 161: now the reader can just think of : it "a timestamp can be thought of" PS3, Line 162: That timestamp This PS3, Line 163: related operations Would "any two operations" be accurate as well? Or does "related operations" refer to something specific (e.g. operations on the same machine, in the same config, etc.)? PS3, Line 164: Register the operation with MVCC super nit: for consistency with previous bullets: "Registration of operation with MVCC" same for the below bullets, or could make the above bullets "Acquire locks" and "Assign timestamps" PS3, Line 166: and they don’t return u missing something? PS3, Line 169: to consensus "to the Consensus module"? "to Consensus"? I don't think "submitting to consensus" has caught on yet with the layman, so maybe specify that consensus is the module you mentioned in the above section. Alternatively just "submit the operation to be replicated" but that loses out on specificity. Line 170: the operation follower replicas will execute the prepare phase in the exact same sequence as the Don't they also send an ACK/vote to the leader? Line 172: the write-ahead-log. And this will only happen when it receives enough votes, right? Could you also note when the followers write to the WAL? PS3, Line 173: the > remove Is successful completion referring to the completion across all replicas? A majority of replicas? Just the leader? PS3, Line 178: For this blog post the relevant aspects : of this are: a) contrary to the Prepare phase this phase is multi-threaded; b) in the large : majority of cases apply only changes in-memory data structures. Regarding a) since operations : that mutate the same rows only move to the apply phase one at a time due to lock acquisition : (step 2.1) it is now safe to have multiple operations be executed at the same I think this a/b is a bit confusing. Maybe: For this blog post, there are a couple of key points to keep in mind. First, in most cases, the Apply phase only changes in-memory data structures, and this is safe even in the event of a crash because logical operations have already been written to the WAL. Second, contrary to the Prepare phase, this phase is multi-threaded, and since operations that mutate the sa... at the same time. Also consider bringing this out of the numbered steps and into its own bullet or paragraph, since b) is already noted below. PS3, Line 184: no need "no immediate need" PS3, Line 184: consensus "the on-disk WAL" PS3, Line 187: write doesn’t "These changes don't" -- To view, visit http://gerrit.cloudera.org:8080/7019 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icaec0c8ace20651a65901a8e1786e785265540d1 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: gh-pages Gerrit-Owner: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-HasComments: Yes
