This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/bookkeeper.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 14eab41 Updated site at revision 9bce1bf
14eab41 is described below
commit 14eab41703de07e99c1588037eb55510e9d18a11
Author: jenkins <[email protected]>
AuthorDate: Mon Feb 26 09:11:54 2018 +0000
Updated site at revision 9bce1bf
---
content/bps/BP-31-durability | 134 ++++++++++++++++++++++
content/community/bookkeeper_proposals/index.html | 12 +-
2 files changed, 144 insertions(+), 2 deletions(-)
diff --git a/content/bps/BP-31-durability b/content/bps/BP-31-durability
new file mode 100644
index 0000000..7c16eb2
--- /dev/null
+++ b/content/bps/BP-31-durability
@@ -0,0 +1,134 @@
+
+---
+title: "BP-31: BookKeeper Durability (Anchor)"
+issue: https://github.com/apache/bookkeeper/1202
+state: 'Accepted'
+release: "N/A"
+---
+## Motivation
+Apache BookKeeper is transitioning into a full-fledged distributed storage
system that can retain data for the long term. Durability is paramount to
achieving the status of a trusted store. This Anchor BP discusses many gaps and
areas of improvement. Each issue listed here will have its own separate issue,
and this BP is expected to be updated when that issue is created.
+
+## Durability Contract
+1. **Maintain WQ copies all the time**. If any ledger falls into under
replicated state, there needs to be an SLA on how quickly the replication
levels can be brought back to normal levels.
+2. **Enforce Placement Policy** strictly during write and replication.
+3. **Protect the data** against corruption on the wire or at rest.
+
+## Work Grouping (In the order of priority)
+### Detect Durability Validation
+The first step is to understand where durability breaches occur. Design
metrics that record durability contract violations.
+* At the Creation: Validate durability contract when the ledger is being
created
+* At the Deletion: Validate durability contract when the ledger is deleted
+* During lifetime: Validate durability contract during the lifetime of the
ledger (periodic validator)
+* During Read: IO or Checksum errors in the read path
+### Delete Discipline
+* Build a single delete choke point with stringent validations
+* Archival bit in the metadata to assist Two phase Deletes
+* Stateful/Explicit Deletes
+### Metadata Recovery
+* Metadata recovery tool to reconstruct the metadata if the metadata server
gets wiped out. This tool needs to make sure that the data is readable even if
we can't get all the metadata (e.g. ctime) back.
+
+### Plug Durability Violations
+Our first step is to identify durability violations. That gives us the
magnitude of the problem and the areas that we need to focus on. In this phase,
fix high impact areas.
+* Identify the source of problems detected by the work we did in step-1 above
(Detect Durability Validation)
+* Rereplicate under replicated ledgers detected during write
+* Rereplicate under replicated / corrupted ledgers detected during read
+* Rereplicate under replicated ledgers identified by periodic validator.
+### Durability Test
+* Test plan, new tests and integrating it into CI pipeline.
+### Introduce bookie incarnation
+* Design/Implement bookie incarnation mechanism
+### End 2 End Checksum
+* Efficient checksum implementation (crc32c?)
+* Implement checksum validation on bookies in the write path.
+### Soft Deletes
+* Design and implement soft delete feature
+### BitRot detection
+* Design and implement bitrot detection/correction.
+
+## Durability Contract Violations
+### Write errors beyond AQ are ignored.
+The BK client library transparently corrects write errors to a bookie by
changing the ensemble. Take a case where `WQ:3 and AQ:2`. This works only if
the write to a bookie fails before the client has received 2 successful
responses. If the write to the 3rd bookie fails **after** 2 successful
responses, and the response has already been sent to the client, the error is
only logged and no immediate action is taken to restore the replication level
of the entry.
+This case **may not be** detected by the auditor’s periodic ledger check.
Because out of order writes are allowed, and 2 out of 3 responses are enough to
satisfy the client, it is possible to have under replication in the middle of
an ensemble’s entries. Hence the ledger check is not going to find all under
replication cases. On top of that, the periodic ledger check is a complete
sweep of the store, a very expensive and slow crawl, and therefore defaults to
running once a week.
+
+### Strict enforcement of placement policy
+The best effort placement policy increases write availability, but at the
cost of durability. Due to this non-strict placement, BK can’t guarantee data
availability when a fault domain (rack) is lost. This also makes rolling
upgrades across fault domains more difficult or impossible. We need to enforce
strict ensemble placement and fail the write if the WQ copies cannot all be
placed across different fault domains. Minor fix/enhancement if we agree to
give placement higher priority t [...]
+
+The auditor re-replication uses the client library to find a replacement
bookie for each ledger on the lost bookie. But bookies are unaware of the
ledger ensemble placement policy, as this information is not part of the
metadata.
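+
+A minimal sketch of what "strict" placement could mean, assuming a simple
bookie-to-rack mapping is available. `choose_strict_ensemble`,
`PlacementViolation` and the `rack_of` mapping are hypothetical names for
illustration and are not BookKeeper's placement policy interface.
+
```python
class PlacementViolation(Exception):
    """Raised when WQ copies cannot be spread over distinct fault domains."""


def choose_strict_ensemble(candidates, rack_of, write_quorum):
    """Pick write_quorum bookies, each from a different rack, or fail the write.

    candidates: bookie ids in preference order (hypothetical input).
    rack_of: mapping of bookie id -> fault domain name (hypothetical input).
    """
    chosen, used_racks = [], set()
    for bookie in candidates:
        rack = rack_of[bookie]
        if rack in used_racks:
            continue  # a second copy in the same rack adds no fault tolerance
        chosen.append(bookie)
        used_racks.add(rack)
        if len(chosen) == write_quorum:
            return chosen
    # Strict mode: do not fall back to a best-effort ensemble.
    raise PlacementViolation(
        f"only {len(used_racks)} fault domains available, need {write_quorum}"
    )


rack_of = {"b1": "rack-a", "b2": "rack-a", "b3": "rack-b", "b4": "rack-c"}
print(choose_strict_ensemble(["b1", "b2", "b3", "b4"], rack_of, 3))
```
+
+With `b4` unavailable only two racks remain, so the same call raises
`PlacementViolation` instead of silently placing two copies in `rack-a`.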
+
+### Detect and act on Ledger disk problems
+While the Auditor mechanism detects a complete bookie crash, there is no
mechanism to detect individual ledger disk errors. So if a ledger disk goes
bad, the bookie continues to run, and the auditor can’t recognize the under
replication condition until it runs the complete sweep, the periodic ledger
check. On the other hand, a bookie refuses to come up if it finds a bad disk,
which is the right thing to do. This is easy to fix in the interleaved ledger
manager’s bad disk handling.
+
+### Checksum at bookies in the write path
+The lack of checksum calculations on the write path means the store cannot
detect corruption that happens at the source. Imagine NIC issues on the client.
If data gets corrupted at the client NIC level, it is stored on the bookies as
if it were valid (for lack of crc calculations in the write path). This is a
silent corruption of all 3 copies. For additional durability guarantees we can
add checksum verification on bookies in the write path. Checksum calculations
are cpu intensive and will add to the l [...]
+
+### No repair in the read path
+When a checksum error is detected, in addition to finding a good replica,
sfstore needs to repair the bad replica too (replace it with a good one).
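+
+The read-path repair could look roughly like the sketch below. It is a toy
model under stated assumptions: each replica is a plain dict of
`entry_id -> (payload, checksum)`, every listed replica is expected to hold the
entry, and `zlib.crc32` stands in for whatever digest the ledger actually uses.
`read_with_repair` is a hypothetical name.
+
```python
import zlib


def read_with_repair(replicas, entry_id):
    """Return a verified payload and overwrite bad copies seen on the way.

    replicas: list of dicts mapping entry_id -> (payload, checksum).
    """
    bad, good = [], None
    for store in replicas:
        record = store.get(entry_id)
        if record is not None and zlib.crc32(record[0]) == record[1]:
            good = record
            break
        bad.append(store)  # missing or corrupted copy, tried before the good one
    if good is None:
        raise IOError(f"entry {entry_id}: no replica passed checksum validation")
    for store in bad:
        store[entry_id] = good  # repair: replace the bad copy with the good one
    return good[0]


payload = b"hello ledger"
crc = zlib.crc32(payload)
r1 = {7: (b"hellX ledger", crc)}  # corrupted at rest
r2 = {7: (payload, crc)}
r3 = {7: (payload, crc)}
print(read_with_repair([r1, r2, r3], 7))
```
+
+After the call, `r1[7]` holds the good copy again, so a later read from the
first replica succeeds without a fallback.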
+
+
+## Operations
+### No bookie incarnation mechanism
+A bookie `B1 at time t1` and the same bookie `B1 at time t2`, after a bookie
format, are treated in the same way.
+For this to cause any durability issues:
+* The Replication/Auditor mechanism is stopped or not running for some reason.
(A stuck auditor will start a new one due to ZK)
+* One of the bookies (B1) went down (crash or something)
+* B1’s journal dir and all ledger dirs got wiped.
+* B1 came back to life as a fresh bookie
+* Auditor monitoring is enabled again
+
+At this point the auditor has no way to know that the B1 in the cluster is not
the same B1 that it used to be. Hence it doesn’t consider B1 for under
replication. This is a pathological scenario, but even if we do not fully
address the bookie incarnation issue, we at least need a mechanism to identify
and alert on this scenario.
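+
+One way to picture an incarnation mechanism is an id that changes whenever a
bookie is formatted. The sketch below is an assumption for illustration only:
`IncarnationRegistry`, `new_incarnation_id` and the idea of generating the id
at format time are hypothetical and not an existing BookKeeper feature.
+
```python
import uuid


def new_incarnation_id() -> str:
    """Generate a fresh id, e.g. whenever a bookie is formatted (assumption)."""
    return uuid.uuid4().hex


class IncarnationRegistry:
    """Remembers the last incarnation id seen for each bookie address."""

    def __init__(self):
        self._last_seen = {}

    def register(self, bookie: str, incarnation: str) -> str:
        previous = self._last_seen.get(bookie)
        self._last_seen[bookie] = incarnation
        if previous is None:
            return "new"
        if previous == incarnation:
            return "same"
        # Same address, different identity: data written before may be gone,
        # so the ledgers that lived on this bookie need an under replication check.
        return "reincarnated"


registry = IncarnationRegistry()
first = new_incarnation_id()
print(registry.register("B1", first))                 # first registration
print(registry.register("B1", first))                 # restart, disks intact
print(registry.register("B1", new_incarnation_id()))  # came back after a format
```
+
+A `"reincarnated"` result is the signal the auditor lacks today: it can at
least raise an alert, even if automatic re-replication is handled separately.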
+
+## Enhancements
+### Delete Choke Points
+Every delete must go through a single routine/path in the code, and that
routine needs to implement additional checks before performing the physical
delete.
+
+### Archival bit in the metadata to assist Two phase Deletes
+The main aim of this feature is to be as conservative as possible on the
delete path. As explained in the stateful explicit deletes section, the lack of
a ledgerId in the metadata means that the ledger is deleted. A bug in the
client code may erroneously delete the ledger. To protect against that, we want
to introduce an archived/backed-up bit. A separate backup/archival application
can mark the bit after successfully backing up the ledger, and later on the
main client application will send the delete. If this [...]
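+
+A rough sketch of how a single delete path might consult such a bit. Since the
paragraph above is truncated, the exact rule is an assumption: here a delete is
rejected unless the bit is set. `LedgerMetadata`, `mark_archived`,
`delete_ledger` and `DeleteRejected` are hypothetical names.
+
```python
from dataclasses import dataclass


@dataclass
class LedgerMetadata:
    ledger_id: int
    archived: bool = False  # the proposed archived/backed-up bit


class DeleteRejected(Exception):
    """Raised by the delete choke point when a validation fails."""


def mark_archived(metadata_store: dict, ledger_id: int) -> None:
    # Phase 1: called by the backup/archival application after a successful backup.
    metadata_store[ledger_id].archived = True


def delete_ledger(metadata_store: dict, ledger_id: int) -> None:
    """Phase 2 and single delete path: refuse unless the ledger is archived."""
    meta = metadata_store.get(ledger_id)
    if meta is None:
        raise DeleteRejected(f"ledger {ledger_id} not found")
    if not meta.archived:
        raise DeleteRejected(f"ledger {ledger_id} has not been archived")
    del metadata_store[ledger_id]


store = {42: LedgerMetadata(42)}
try:
    delete_ledger(store, 42)
except DeleteRejected as err:
    print("rejected:", err)
mark_archived(store, 42)
delete_ledger(store, 42)
print(42 in store)
```
+
+Routing every delete through one function like this is also what makes the
"delete choke point" idea enforceable: new validations only need to be added in
one place.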
+
+### Stateful explicit deletes
+Currently, a bookkeeper delete synchronously deletes the metadata in
zookeeper. Bookies implicitly assume that a particular ledger is deleted if it
is not present in the metadata. This process has no cross check that the ledger
was actually deleted. Any ZK corruption or loss of the ledger path znodes will
make bookies delete data on the disk. Even bugs in the bookie code that
‘determines’ whether a ledger is present in zk may lead to data deletion.
+
+The right way to deal with this is to delete the metadata asynchronously,
after each bookie has explicitly checked that the ledger is deleted. This way
each bookie explicitly checks the ‘delete state’ of the ledger before deleting
its on-disk data. One proposal is to move the deleted ledgers under
/deleted/<ledgerId>; another idea is to add a delete state,
Open->Closed->Deleted.
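+
+The explicit delete state can be sketched as a small state machine. This
follows the Open->Closed->Deleted order named above; everything else
(`LedgerState`, `transition`, `bookie_may_delete`, the dict used as a metadata
store) is a hypothetical illustration, not the BookKeeper metadata format.
+
```python
from enum import Enum


class LedgerState(Enum):
    OPEN = "open"
    CLOSED = "closed"
    DELETED = "deleted"


# Transitions follow the Open->Closed->Deleted order mentioned in the text.
ALLOWED = {
    LedgerState.OPEN: {LedgerState.CLOSED},
    LedgerState.CLOSED: {LedgerState.DELETED},
    LedgerState.DELETED: set(),
}


def transition(metadata: dict, ledger_id: int, new_state: LedgerState) -> None:
    current = metadata[ledger_id]
    if new_state not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new_state.name}")
    metadata[ledger_id] = new_state


def bookie_may_delete(metadata: dict, ledger_id: int) -> bool:
    """Allow on-disk deletion only when the state is explicitly DELETED.

    A missing entry is NOT treated as deleted, unlike the implicit behaviour
    described above.
    """
    return metadata.get(ledger_id) is LedgerState.DELETED


metadata = {1: LedgerState.OPEN}
print(bookie_may_delete(metadata, 1), bookie_may_delete(metadata, 99))
transition(metadata, 1, LedgerState.CLOSED)
transition(metadata, 1, LedgerState.DELETED)
print(bookie_may_delete(metadata, 1))
```
+
+The important property is the second argument in the first `print`: ledger 99
has no metadata at all, and the bookie still keeps its data.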
+
+As soon as we make the metadata deletions asynchronous, the immediate question
is: who will delete the metadata?
+* Option-1: A centralized process like the auditor will be responsible for
deleting the metadata after each bookie deletes its on-disk data.
+* Option-2: A decentralized, more complicated approach: the last bookie that
deletes its on-disk data deletes the metadata too.
+
+I am sure there can be more ideas. Any path will need a detailed design and
needs to consider many corner cases.
+
+#### Obvious points to consider:
+* ZK as-is is heavily loaded with BK metadata. Keeping these znodes around for
longer than needed puts more pressure on ZK.
+* If a bookie is down for a long time, what would be the delete policy for the
metadata?
+* There will be lots of corner case scenarios we need to deal with. For
example:
+  * A bookie-1 hosting data for ledger-1 is down for a long time
+  * Ledger-1 data has been replicated to other bookies
+  * Ledger-1 is deleted, and its data and metadata are cleared.
+  * Now bookie-1 comes back to life. Since our policy is ‘explicit state check
delete’, bookie-1 can’t delete ledger-1 data as it can’t explicitly validate
that ledger-1 has been deleted.
+* One possible solution: keep tombstones of deleted ledgers around for some
duration. If a bookie is down for more than that duration, it needs to be
decommissioned and added as a new bookie.
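+
+The tombstone idea can be sketched as follows. The retention period, the
`TombstoneStore` class and `on_bookie_return` are hypothetical; the sketch only
shows the rule that a bookie that was down longer than the retention period
cannot trust the absence of a tombstone.
+
```python
class TombstoneStore:
    """Keeps a deletion marker per ledger for a fixed retention period."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._deleted_at = {}

    def mark_deleted(self, ledger_id: int, now: float) -> None:
        self._deleted_at[ledger_id] = now

    def purge_expired(self, now: float) -> None:
        # Drop tombstones older than the retention period to relieve ZK.
        self._deleted_at = {
            lid: t for lid, t in self._deleted_at.items()
            if now - t <= self.retention_seconds
        }

    def is_deleted(self, ledger_id: int) -> bool:
        return ledger_id in self._deleted_at


def on_bookie_return(store, ledger_ids, down_since, now):
    """Decide what a returning bookie may do with its local ledgers."""
    if now - down_since > store.retention_seconds:
        # Tombstones may already be purged, so nothing can be verified.
        return "decommission", []
    # Every delete during the outage still has a tombstone: safe to act on it.
    return "rejoin", [lid for lid in ledger_ids if store.is_deleted(lid)]


WEEK = 7 * 24 * 3600
store = TombstoneStore(retention_seconds=WEEK)
store.mark_deleted(1, now=1000.0)
print(on_bookie_return(store, [1, 2], down_since=0.0, now=2000.0))
print(on_bookie_return(store, [1, 2], down_since=0.0, now=WEEK + 1.0))
```
+
+In the first call the bookie was down for less than the retention period and
may delete ledger 1 locally; in the second it has to be decommissioned and
re-added as a new bookie.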
+
+### Metadata recovery tool
+In case zookeeper is completely wiped, we need a way to reconstruct enough
metadata to read ledgers back. Currently the metadata contains ensemble
information, which is critical for reading ledgers back, and it also has
additional metadata like ctime and custom metadata. Every bookie has one index
file per ledger, and that has enough information to reconstruct the ensemble
information so that the ledgers can be made readable. This tool can be built in
two ways:
+* If ZK is completely wiped, reconstruct the entire metadata from bookie index
files.
+* If ZK is completely wiped, but snapshots are available, restore ZK from
snapshots and build the delta from bookie index files.
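+
+At its simplest, the reconstruction step inverts "which ledgers does each
bookie hold" into "which bookies hold each ledger". The sketch below assumes a
scan of each bookie's index files has already produced a set of ledger ids per
bookie; it does not model the real index file layout, ensemble ordering or
segment boundaries, and `rebuild_ledger_locations` is a hypothetical name.
+
```python
def rebuild_ledger_locations(index_by_bookie):
    """Invert bookie -> ledger ids into ledger id -> sorted bookie list."""
    locations = {}
    for bookie, ledger_ids in index_by_bookie.items():
        for ledger_id in ledger_ids:
            locations.setdefault(ledger_id, set()).add(bookie)
    return {lid: sorted(bookies) for lid, bookies in sorted(locations.items())}


# Hypothetical scan result: ledger ids found in each bookie's index files.
index_by_bookie = {
    "bookie-1": {10, 11},
    "bookie-2": {10, 12},
    "bookie-3": {10, 11, 12},
}
for ledger_id, bookies in rebuild_ledger_locations(index_by_bookie).items():
    print(ledger_id, bookies)
```
+
+For the snapshot variant, the same scan would only need to add ledgers that
are missing from the restored snapshot.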
+
+### Bit Rot Detection (BP-24)
+If the data stays on the disk for a long time (years), it is possible to
experience silent data degradation on the disk. In the current scenario we will
not identify this until the data is read by the application.
+
+### End to end checksum
+Bookies never validate the payload checksum. If the client’s socket has
issues, it might corrupt the data (at the source), and it won’t be detected
until the client reads it back. That will be too late, as the original write
was successful for the application. Use efficient checksum mechanisms and
enforce checksum validations on the bookie’s write path. If checksum validation
fails, the write itself will fail and the application will be notified.
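+
+A minimal sketch of write-path validation, assuming the client sends the
checksum alongside the payload. It uses `zlib.crc32` from the Python standard
library as a stand-in; the proposal mentions crc32c, which is a different
polynomial. `client_prepare`, `bookie_add_entry` and `ChecksumMismatch` are
hypothetical names.
+
```python
import zlib


class ChecksumMismatch(Exception):
    """Raised by the bookie when payload and checksum disagree."""


def client_prepare(payload: bytes):
    """Client side: compute the checksum before the data leaves the process."""
    return payload, zlib.crc32(payload)


def bookie_add_entry(storage: dict, entry_id: int, payload: bytes, checksum: int):
    """Bookie side: verify before persisting, fail the write on mismatch."""
    if zlib.crc32(payload) != checksum:
        raise ChecksumMismatch(f"entry {entry_id}: payload does not match checksum")
    storage[entry_id] = (payload, checksum)


storage = {}
payload, checksum = client_prepare(b"entry-0 data")
bookie_add_entry(storage, 0, payload, checksum)

corrupted = b"entry-1 dat\x00"  # simulates damage between client and bookie
_, checksum1 = client_prepare(b"entry-1 data")
try:
    bookie_add_entry(storage, 1, corrupted, checksum1)
except ChecksumMismatch as err:
    print("write failed:", err)
print(sorted(storage))
```
+
+Only entry 0 is stored. The failed write surfaces to the application right
away, instead of at read time when all copies are already bad.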
+
+
+## Test strategy to validate durability
+BK needs to develop a comprehensive testing strategy to test and validate the
store’s durability. Various methods and levels of tests are needed to gain
confidence for deploying the store in production. Specific points are mentioned
here, and these are in addition to regular functional testing/validation.
+### White box error injection
+* Introduce all possible errors in the write path, kick off the replication
mechanism and make sure the cluster reaches the desired replica levels.
+* Corrupt the first readable copy and make sure that the corruption is
detected on the read path, and that the read ultimately succeeds after trying
the second replica.
+* Corrupt a packet after checksum calculation on the client and make sure that
it is detected in the read path, and that the read ultimately fails as this is
corruption at the source.
+* After a write, make sure that the replicas are distributed across fault
zones.
+* Kill a bookie, make sure that the auditor detects it and replicates all
ledgers on that bookie according to the allocation policy (across fault zones)
+### Black box error injection (Chaos Monkey)
+While running longevity testing that does continuous IO to the store,
introduce the following errors.
+* Kill a random bookie; reads should continue.
+* Kill random bookies, keeping the minimum fault zones needed to satisfy the
AQ quorum, during a write workload.
+* Simulate disk errors in random bookies, allow the bookie to go down and
replication to get started.
+* Make sure that the cluster is running in a fully durable state through the
tools and monitoring built.
diff --git a/content/community/bookkeeper_proposals/index.html
b/content/community/bookkeeper_proposals/index.html
index d727f3a..8d15ded 100644
--- a/content/community/bookkeeper_proposals/index.html
+++ b/content/community/bookkeeper_proposals/index.html
@@ -346,7 +346,7 @@ of the thread is of the format <code
class="highlighter-rouge">[DISCUSS] BP-<
<p>This section lists all the <em>bookkeeper proposals</em> made to
BookKeeper.</p>
-<p><em>Next Proposal Number: 30</em></p>
+<p><em>Next Proposal Number: 32</em></p>
<h3 id="inprogress">Inprogress</h3>
@@ -388,7 +388,7 @@ of the thread is of the format <code
class="highlighter-rouge">[DISCUSS] BP-<
</tr>
<tr>
<td style="text-align: left"><a
href="../../bps/BP-27-new-bookkeeper-cli">BP-27: New BookKeeper CLI</a></td>
- <td style="text-align: left">Draft</td>
+ <td style="text-align: left">Accepted</td>
</tr>
<tr>
<td style="text-align: left"><a
href="../../bps/BP-28-etcd-as-metadata-store">BP-28: use etcd as metadata
store</a></td>
@@ -398,6 +398,14 @@ of the thread is of the format <code
class="highlighter-rouge">[DISCUSS] BP-<
<td style="text-align: left"><a
href="../../bps/BP-29-metadata-store-api-module">BP-29: Metadata API
module</a></td>
<td style="text-align: left">Accepted</td>
</tr>
+ <tr>
+ <td style="text-align: left"><a
href="https://docs.google.com/document/d/155xAwWv5IdOitHh1NVMEwCMGgB28M3FyMiQSxEpjE-Y/edit#heading=h.56rbh52koe3f">BP-30:
BookKeeper Table Service</a></td>
+ <td style="text-align: left">Accepted</td>
+ </tr>
+ <tr>
+ <td style="text-align: left"><a href="../../bps/BP-31-durability">BP-31:
BookKeeper Durability Anchor</a></td>
+ <td style="text-align: left">Accepted</td>
+ </tr>
</tbody>
</table>