jvrao commented on a change in pull request #927: BP-24: BookieScanner: Enhance 
Data Integrity
URL: https://github.com/apache/bookkeeper/pull/927#discussion_r159817608
 
 

 ##########
 File path: site/bps/BP-24-BookieScanner.md
 ##########
 @@ -0,0 +1,92 @@
+---
+title: "BP-24: BookieScanner: Enhance Data Integrity"
+issue: https://github.com/apache/bookkeeper/<issue-number>
+state: "Under Discussion"
+release: "N/A"
+---
+
+
+### Motivation
+
+
+Currently a bookie cannot handle entry loss gracefully: AutoRecovery is 
restricted to the bookie level, which means it takes effect only 
after a bookie is down. However, when a disk fails, either or both of the ledger 
index files and entry log files could become corrupt. BookKeeper 
needs to provide mechanisms to identify and handle these problems.
+
+
+### Proposed Changes
+
+
+We introduce Bookie Scanner, which is a background task, to scan index files 
and entry log files to detect possible corruptions. Since data corruption may 
happen at any time on any block on any Bookie, it is important to identify 
these errors in a timely manner. This way, the bookie can remove/compact 
corrupted entries and re-replicate entries from other replicas, to maintain 
data integrity and reduce client errors. 
+
+
+The Bookie Scanner needs to detect and cover the following conditions:
+
+
+- a ledger is missing locally (no index file is found for a given ledger); we can 
detect this by consulting the ledger metadata.
+- a ledger exists, but some entries are missing (no index entries are found in the 
index file); we can check the fragment's metadata to verify this.
+- a ledger exists and its entries are found in the index file, but the entries in the entry 
log files are corrupted; we can use each entry's checksum to verify this.
+
+
+The Bookie Scanner is integrated into, and runs as part of, the compaction thread, which 
already scans the entry log files.
+
+
+#### Suspicious List
+
+
+Besides the regular scan, the scanner also maintains a list of suspicious ledgers 
and a list of suspicious entry log files. These are the ledgers and entry log 
files that caused specific types of exceptions to be thrown when entries were 
read from disk. The suspicious lists take priority over the regular ledgers and 
entry log files during scans. Moreover, the scanner should track which 
suspicious ledgers and entry log files it has scanned in the past x minutes, to 
avoid repeatedly scanning the same ones.
+
+
+The mechanism the Bookie Scanner uses to decide which ledgers and entry log files to 
scan is as follows:
+
+
+* When a bookie is serving read requests and an IOException is caught, 
the entry log file is marked as suspicious and added to the scanner's 
suspicious entry log list. If a NoSuchLedger or NoSuchEntry exception is 
caught, the given ledger is marked as suspicious and added to the 
scanner's suspicious ledger list.
 
 Review comment:
   There are APIs and situations where a tailer can try to read an entry before 
it is acknowledged, and even readLAC can cause these exceptions. So maybe you 
should add to the suspicious ledgers list only if the ledger is closed, but the 
bookie has no idea whether the ledger is closed or not.
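
One way to read the concern above, as a minimal sketch rather than anything taken from the BookKeeper code base: a read miss only marks a ledger as suspicious if the ledger is known to be closed, and a ledger scanned within the last "x minutes" is not queued again. The class `SuspiciousLedgers`, the `isClosed` predicate, and `rescanIntervalMs` are all hypothetical names. In particular, `isClosed` assumes some source of truth (for example a ledger metadata lookup) that the bookie does not have today, which is exactly the open question.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.LongPredicate;

/** Hypothetical sketch; none of these names exist in BookKeeper. */
class SuspiciousLedgers {
    // Assumption: answers "is this ledger closed?", e.g. from ledger metadata.
    private final LongPredicate isClosed;
    // The "x minutes" cool-down from the proposal, in milliseconds.
    private final long rescanIntervalMs;
    // Ledgers waiting for a priority scan, in arrival order.
    private final Set<Long> pending = new LinkedHashSet<>();
    // ledgerId -> time the scanner last picked it up.
    private final Map<Long, Long> lastScannedAtMs = new HashMap<>();

    SuspiciousLedgers(LongPredicate isClosed, long rescanIntervalMs) {
        this.isClosed = isClosed;
        this.rescanIntervalMs = rescanIntervalMs;
    }

    /**
     * Called when a read fails with a "no such ledger/entry" style error.
     * Returns true only if the ledger was newly queued.
     */
    synchronized boolean onReadMiss(long ledgerId, long nowMs) {
        // An open ledger can miss legitimately (tailing reads, readLAC).
        if (!isClosed.test(ledgerId)) {
            return false;
        }
        // Skip ledgers that were scanned recently.
        Long last = lastScannedAtMs.get(ledgerId);
        if (last != null && nowMs - last < rescanIntervalMs) {
            return false;
        }
        return pending.add(ledgerId);
    }

    /** Hands the next suspicious ledger to the scanner, or null if none. */
    synchronized Long pollNext(long nowMs) {
        Iterator<Long> it = pending.iterator();
        if (!it.hasNext()) {
            return null;
        }
        Long ledgerId = it.next();
        it.remove();
        lastScannedAtMs.put(ledgerId, nowMs);
        return ledgerId;
    }
}
```

The sketch deliberately pushes the "closed or not" decision behind a predicate, so the filtering logic stays the same whichever component ends up answering that question.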

