sijie commented on a change in pull request #1138: BP-31 BookKeeper Durability 

 File path: site/bps/BP-31-durability
 @@ -0,0 +1,134 @@
+title: "BP-31: BookKeeper Durability(Anchor)"
+state: 'Anchor BP'
+release: "x.y.z"
+## Motivation
+Apache BookKeeper is transitioning into a full-fledged distributed store 
that can keep data for the long term. Durability is paramount to achieving the 
status of a trusted store. This Anchor BP discusses many gaps and areas of 
improvement. Each issue listed here will get its own tracking issue, and this 
BP is expected to be updated when that issue is created.
+## Durability Contract
+1. **Maintain WQ copies all the time**. If any ledger falls into an 
under-replicated state, there needs to be an SLA on how quickly replication 
can be brought back to normal levels.
+2. **Enforce Placement Policy** strictly during write and replication.
+3. **Protect the data** against corruption on the wire or at rest.
+## Work Grouping (In the order of priority)
+### Detect Durability Validation
+The first step is to understand the areas of durability breaches. Design 
metrics that record durability contract violations. 
+* At creation: Validate the durability contract when the extent is being 
created
+* At deletion: Validate the durability contract when the extent is deleted
+* During lifetime: Validate the durability contract during the lifetime of 
the extent (periodic validator)
+* During read: IO or checksum errors in the read path
+### Delete Discipline
+* Build a single delete choke point with stringent validations
+* Archival bit in the metadata to assist Two phase Deletes
+* Stateful/Explicit Deletes
+### Metadata Recovery
+* Metadata recovery tool to reconstruct the metadata if the metadata server 
gets wiped out. This tool needs to make sure that the data is readable even if 
we can't get all the metadata (e.g. ctime) back.
+### Plug Durability Violations
+Our first step is to identify durability violations. That gives us the 
magnitude of the problem and the areas that we need to focus on. In this 
phase, fix high-impact areas.
+* Identify the source of problems detected by the work we did in step-1 above 
(Detect Durability Validation)
+* Rereplicate under-replicated extents detected during write
+* Rereplicate under-replicated / corrupted extents detected during read
+* Rereplicate under-replicated extents identified by the periodic validator.
+### Durability Test
+* Test plan, new tests, and integrating them into the CI pipeline. 
+### Introduce bookie incarnation 
+* Design/Implement bookie incarnation mechanism 
+### End 2 End Checksum
+* Efficient checksum implementation (crc32c?)
+* Implement checksum validation on bookies in the write path. 
+### Soft Deletes
+* Design and implement soft delete feature
+### BitRot detection
+* Design and implement bitrot detection/correction.
+## Durability Contract Violations 
+### Write errors beyond AQ are ignored.
+The BK client library transparently corrects any write errors while writing 
to a bookie by changing the ensemble. Take a case where `WQ:3 and AQ:2`. This 
works only if the write to a bookie fails before the client has received 2 
successful responses. But if the write to the 3rd bookie fails **after** 2 
successful responses and the response has been sent to the client, the error 
is only logged and no immediate action is taken to restore the replication of 
the entry.
+This case **may not be** detected by the auditor's periodic ledger check. 
Given that we allow out-of-order writes, combined with needing only 2 out of 3 
responses to satisfy the client, it is possible to have under-replicated 
entries in the middle of an ensemble. Hence the ledger check is not going to 
find all under-replication cases. On top of that, the periodic ledger check is 
a complete sweep of the store, a very expensive and slow crawl, and hence 
defaults to once a week.
+### Strict enforcement of placement policy 
+The best-effort placement policy increases write availability, but at the 
cost of durability. Due to this non-strict placement, BK can't guarantee data 
availability when a fault domain (rack) is lost. This also makes rolling 
upgrades across fault domains more difficult or impossible. We need to enforce 
strict ensemble placement and fail the write if the WQ copies cannot all be 
placed across different fault domains. This is a minor fix/enhancement if we 
agree to give placement higher priority than a successful write (availability).
+The auditor re-replication uses the client library to find a replacement 
bookie for each ledger on the lost bookie. But bookies are unaware of the 
ledger ensemble placement policy, as this information is not part of the 
metadata. 
 Review comment:
   Placement-policy-based re-replication is already there.
   I think this change is more about making the placement policy part of the 
metadata, if I understand this correctly? @jvrao 
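
A minimal sketch of what carrying the placement policy in ledger metadata could look like, so that re-replication can check a candidate ensemble against the policy the ledger was written with. Everything here is hypothetical and for illustration only: the `LedgerMetadata` class, its `placement_policy` field, the `"rack-aware"` value, and the `rack_of` mapping are assumptions, not BookKeeper's actual metadata format or API. Entries are assumed to be striped round-robin over the ensemble.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LedgerMetadata:
    """Hypothetical ledger metadata that also records the placement policy."""
    ensemble: List[str]        # bookie ids, in ensemble order
    write_quorum: int          # WQ
    ack_quorum: int            # AQ
    placement_policy: str      # assumed new field, e.g. "rack-aware"


def write_set(ensemble: List[str], write_quorum: int, entry_id: int) -> List[str]:
    """Bookies holding a given entry, assuming round-robin striping."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def violates_rack_placement(meta: LedgerMetadata, rack_of: Dict[str, str]) -> bool:
    """True if any entry's WQ copies are not all in different racks.

    Only ledgers whose metadata says "rack-aware" are checked; other
    policies are treated as unconstrained in this sketch.
    """
    if meta.placement_policy != "rack-aware":
        return False
    # Write sets repeat every len(ensemble) entries, so one cycle is enough.
    for entry_id in range(len(meta.ensemble)):
        racks = [rack_of[b] for b in write_set(meta.ensemble, meta.write_quorum, entry_id)]
        if len(set(racks)) < len(racks):
            return True
    return False
```

With a field like this in the metadata, whoever picks a replacement bookie could reject an ensemble for which `violates_rack_placement` returns `True`, instead of relying on a policy configured only on the client side.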

