> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ.
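The quoted scenario can be sketched in a few lines. This is a hypothetical model of the replay rule described above, not BookKeeper's actual client code; the function and variable names are illustrative:

```python
# Hypothetical sketch of the ensemble-change replay rule described above:
# on bookie failure, only pending adds whose entry id is above the LAC
# are re-sent to the replacement bookie.

WQ = 3   # write quorum
AQ = 2   # ack quorum

def entries_replayed(pending_adds, lac):
    """Return the entry ids replayed to the replacement bookie.

    Entries at or below the LAC have already been ack'd by AQ bookies,
    so the client considers them confirmed and does not replay them.
    """
    return [e for e in pending_adds if e > lac]

# e100 reached AQ bookies, the callback fired and LAC = 100; then the
# 3rd bookie times out while e100..e102 are still outstanding on it.
pending = [100, 101, 102]
print(entries_replayed(pending, lac=100))  # [101, 102]
# e100 is skipped, so it ends up on only AQ (2) bookies instead of WQ (3).
```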
This is an example of a scenario corresponding to what we suspect is a bug
introduced earlier, but Enrico is arguing that this is not the intended
behavior, and at this point, I agree.

> This is alluded to in the docs, as they state
> that AQ is also the minimum guaranteed replication factor.

By the time a successful callback is received, the client might only have
replicated AQ ways, so at that point the guarantee can only be the ability
to tolerate AQ - 1 crashes. The ledger configuration states that the
application wants WQ copies of each entry, though. I'd expect a ledger to
have WQ copies of each entry up to the final entry number when it is closed.
Do you see it differently?

> I'd be happy to set up a meeting to discuss the spec and its findings.

That'd be great, I'm interested.

-Flavio

> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID>
> wrote:
>
>> No you cannot miss data, if the client is not able to find a bookie that
>> is able to answer with the entry it receives an error.
>
> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ. This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.
>
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies (that is, it allows WQ-AQ read failures),
>> and it does not attempt to "repair" (truncate) the ledger if it does not
>> find enough bookies.
>
> This is not quite accurate. A single successful read is enough.
> However, there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
> with an explicit NoSuchEntry/Ledger, the read is considered failed and the
> ledger recovery process ends there. This means that given the responses
> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> considered successful is non-deterministic. If the response from b1 is
> received last, then the read is already considered failed; otherwise the
> read succeeds.
>
> I have come to the above conclusions through my reverse engineering
> process for creating the TLA+ specification. I have yet to reproduce the
> AQ rep factor behaviour via tests, but I have verified via tests the
> conclusion about ledger recovery reads.
>
> Note that I have found two defects in the BookKeeper protocol, most
> notably data loss due to the fact that fencing does not prevent further
> successful adds. Currently the specification and associated documentation
> are on a private Splunk repo, but I'd be happy to set up a meeting to
> discuss the spec and its findings.
>
> Best
> Jack
>
>
> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
> wrote:
>
>> [ External sender. Exercise caution. ]
>>
>> Jonathan,
>>
>> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jbel...@apache.org> wrote:
>>
>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>> Hi,
>>>>
>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm
>>>> that once confirmed, an entry is not replayed to another bookie. This
>>>> leaves a "hole", as the entry is now replicated to only 2 bookies;
>>>> however, the new data integrity check that Ivan worked on, when run
>>>> periodically, will be able to repair that hole.
>>>
>>> Can I read from the bookie with a hole in the meantime, and silently
>>> miss data that it doesn't know about?
>>
>> No, you cannot miss data: if the client is not able to find a bookie that
>> is able to answer with the entry, it receives an error.
>>
>> The ledger has a known tail (the LastAddConfirmed entry) and this value
>> is stored in the ledger metadata once the ledger is "closed".
>>
>> When the ledger is still open, that is, while the writer is writing to
>> it, the reader is allowed to read only up to the LastAddConfirmed entry.
>> This LAC value is returned to the reader using a piggyback mechanism,
>> without reading from metadata.
>> The reader cannot read beyond the latest position that has been confirmed
>> to the writer by AQ bookies.
>>
>> We have a third case, the "recovery read".
>> A reader starts a recovery read when it wants to recover a ledger that
>> has been abandoned by a dead writer, or when it is a new leader (Pulsar
>> Bundle Owner) that wants to fence out the old leader.
>> In this case the reader merges the current status of the ledger on ZK
>> with the result of a scan of the whole ledger.
>> Basically it reads the ledger from the beginning up to the tail, for as
>> long as it is able to read entries; this way it sets the 'fenced' flag on
>> the ledger on every bookie, and it is also able to detect the actual tail
>> of the ledger (because the writer died and was not able to flush metadata
>> to ZK).
>>
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies (that is, it allows WQ-AQ read failures),
>> and it does not attempt to "repair" (truncate) the ledger if it does not
>> find enough bookies.
>>
>> I hope that helps
>> Enrico
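The order-dependent recovery-read outcome Jack describes above can be sketched as follows. This is a minimal hypothetical model of the termination rule, not the actual BookKeeper client code; the function name and response labels are illustrative:

```python
# Hypothetical sketch of the recovery-read termination rule discussed in the
# thread: a single successful read completes the entry, but (WQ - AQ) + 1
# explicit NoSuchEntry/NoSuchLedger responses fail it. Whichever threshold is
# reached first wins, hence the dependence on response arrival order.

WQ, AQ = 3, 2
NEEDED_NACKS = (WQ - AQ) + 1  # 2 negative responses fail the read

def recovery_read_outcome(responses):
    """responses: (bookie, result) pairs in arrival order,
    where result is 'success' or 'no_such_entry'."""
    nacks = 0
    for bookie, result in responses:
        if result == 'success':
            return 'read ok'          # a single positive read is enough
        nacks += 1
        if nacks >= NEEDED_NACKS:
            return 'read failed'      # recovery stops here
    return 'pending'

# The same three responses, in two different arrival orders:
print(recovery_read_outcome([('b1', 'success'),
                             ('b2', 'no_such_entry'),
                             ('b3', 'no_such_entry')]))  # read ok
print(recovery_read_outcome([('b2', 'no_such_entry'),
                             ('b3', 'no_such_entry'),
                             ('b1', 'success')]))        # read failed
```

With b1's success arriving first, recovery proceeds; with b1's success arriving last, the two negative responses have already terminated the read, which is the non-determinism described above.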