[bookkeeper] branch master updated: BP-46: Running without journal proposal

ayegorov Mon, 14 Feb 2022 11:12:16 -0800

This is an automated email from the ASF dual-hosted git repository.

ayegorov pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/bookkeeper.git



The following commit(s) were added to refs/heads/master by this push:
     new 8530d5c  BP-46: Running without journal proposal
8530d5c is described below

commit 8530d5c973d1106a4508ac024518df254eac7437
Author: Jack Vanlightly <[email protected]>
AuthorDate: Mon Feb 14 20:11:38 2022 +0100

    BP-46: Running without journal proposal
    
    Includes the BP-46 design proposal markdown document.
    
    Master Issue: #2705
    
    Reviewers: Andrey Yegorov <None>, Enrico Olivelli <[email protected]>
    
    This closes #2706 from Vanlightly/bp-44
---
 site/bps/BP-46-run-without-journal.md | 205 ++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)

diff --git a/site/bps/BP-46-run-without-journal.md 
b/site/bps/BP-46-run-without-journal.md
new file mode 100644
index 0000000..8303035
--- /dev/null
+++ b/site/bps/BP-46-run-without-journal.md
@@ -0,0 +1,205 @@
+---
+title: "BP-46: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety 
guarantees. An add operation is only acked to a client once written to the 
journal and an fsync performed. This however means that every entry must be 
written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be 
provisioned to service requests and makes disk provisioning more complex 
(separating journal from entry log writes onto separate disks). Running without 
the journal would halve the disk IO required (ignoring indexes) thereby 
reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems 
as the BookKeeper Replication Protocol requires that all writes are persistent 
for correctness. Running without the journal introduces the possibility of lost 
writes. In order to continue to offer strong data safety and support running 
without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other 
errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how 
quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- Whole Ensemble
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the 
following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack 
quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property 
within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only 
used when we need guarantees regarding a single entry. For EC, the cohort is 
the ensemble of bookies of the current fragment. EC is required when we need a 
guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before 
starting the read/write phase of recovery. This is true once EC successful 
fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of 
bookies could exist that might have the entry. Doing otherwise could truncate 
committed entries from a ledger. A read is negative once NoSuchEntry responses 
reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC 
reads cannot cause an explicit negative as fencing creates the ledger on the 
bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily 
lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just 
been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a 
last entry id lower than that of previously acknowledged entries - thus losing 
data.
+
+The following scenarios assume existing behaviour but simply skipping the 
writing of entries and fencing ops to the journal.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to 
recover the ledger L1.
+7. L1 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by 
both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has 
acknowledged the fencing message (EC threshold reached), so continues with the 
read/write phase of recovery, finding that E1 is the last entry of the ledger, 
and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge 
positively, so C1 considers E2 as persisted. B1 rejects the message as the 
ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the 
entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirms. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata 
version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty. 
Losing E0.
+
+The problem is that without the journal (and syncing to entry log files before 
acknowledgement) a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an 
entry. 
+
+Undetected data loss could occur when running without the journal. Bookie 
crashes and loses most recent entries and fence statuses that had not yet been 
written and synced to disk.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but 
their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie 
encapsulates the identity of the bookie and should be considered immutable. 
This identity contains the advertised address of the bookie, the disks used for 
the journal, index, and ledger storage, and a unique ID. The bookie writes the 
cookie to ZK and each of the disks in use. On all subsequent boots, if the 
cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is 
also missing. Cookie validation is performed on boot-up and prevents the boot 
from succeeding if the validation fails, thus preventing the bookie starting 
with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a 
cookie validation error which currently requires human intervention to resolve. 
This automated feature will be configurable (enabled or disabled) and 
additionally a CLI command will be made available so an admin can manually run 
the operation (for when this feature is disabled - likely to be the default). 
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, 
instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies (with new config to enable or disable the feature)
+- CLI command for admin to run fix cookie logic in the case that auto fix is 
disabled
+
+In these proposed changes, when running "without" the journal, the journal 
still exists, but add entry operations skip the addition to the journal. The 
boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on 
the config `journalWriteData`. When set to `false`, add operations are not 
added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown 
(aka a crash or abrupt termination of the bookie). When an unclean shutdown is 
detected further measures are taken to prevent data inconsistency.
+
+The unclean shutdown detection will consist of setting a bit in the index on 
start-up and clearing it on shutdown. On subsequent start-up, the value will be 
checked and if it remains set, it knows that the prior shutdown was not clean.
+
+Cookie validation will continue to be used to detect booting with one or more 
missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism 
is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all 
those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can 
serve read requests, but never respond with an explicit negative, all explicit 
negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from 
peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand-in hand 
with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie 
is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:
+     - For each non-closed ledger, mark the ledger as fenced and in-limbo in 
the index.
+     - Update the cookie if it was a cookie failure
+4. Phase two
+   - For each ledger
+     1. If the ledger is in-limbo, open and recover the ledger.
+     2. Check that all entries assigned to this bookie exist in the index.
+     3. For any entries that are missing, copy from another bookie.
+     4. Clear limbo status if set
+
+When booting a bookie with empty disks, only phase one needs to be complete 
before the bookie makes itself available for client requests. 
+
+In phase one, if the cookie check fails, we mark all non-closed ledgers as 
“fenced”. This prevents any future writes to these ledgers on this bookie. This 
solves the problem of an empty bookie disk allowing writes to closed ledgers 
(Scenario 1).
+
+Given that the algorithm solves both the issues that cookies are designed to 
solve, we can now allow the bookie to update its cookie without operator 
intervention. 
+
+### Formal Verification of Proposed Changes
+
+The use of the limbo status and fencing of all ledgers on boot-up when 
detecting an unclean shutdown has been modelled in TLA+. It does not model the 
whole boot-up sequence but a simplified version with only fencing and limbo 
status. 
+
+The specification models the lifetime of a single ledger and includes a single 
bookie crashing, losing all data. The specification allows the testing of:
+
+- enabling/disabling the fencing
+- enabling/disabling the limbo status.
+
+When running without limbo status, the model checker finds the counterexample 
of scenario 2. When running without fencing of all ledgers, the model checker 
finds the counterexample of scenario 1. When running with both enabled, the 
model checker finds no invariant violation.
+
+The specification can be found here: 
https://github.com/Vanlightly/bookkeeper-tlaplus
+
+### Public Interfaces
+
+- Return codes. Addition of a new return code: `EUNKNOWN` which is returned 
when a read hits an in-limbo ledger and that ledger not contain the requested 
entry id.
+- Bookie ledger metadata format (LedgerData). Addition of the limbo status.
+
+### Compatibility, Deprecation, and Migration Plan
+
+- Because we only skip the journal for add operations, there is no impact on 
existing deployments. When a bookie is booted with the new version, and 
`journalWriteData` is set to false, the journal is still replayed on boot-up 
causing no risk of data loss in the transition.
+
+### Test Plan
+
+- There is confidence in the design due to the modelling in TLA+ but this 
model does not include the full boot sequence.
+- The implementation will require aggressive chaos testing to ensure 
correctness.
+
+### Rejected Alternatives
+
+Entry Log Per Ledger (ELPL) without the journal. From our performance testing 
of ELPL, performance degrades significantly with a large number of active 
ledgers and syncing to disk multiple times a second (which is required to offer 
low latency writes).
+
+In the future this design could be extended to offer ledger level 
configuration of journal use. The scope of this BP is limited to cluster level.
\ No newline at end of file

[bookkeeper] branch master updated: BP-46: Running without journal proposal

Reply via email to