Hi Flavio,

>> This is an example of a scenario corresponding to what we suspect is a bug
>> introduced earlier, but Enrico is arguing that this is not the intended
>> behavior, and at this point, I agree.
>> By the time a successful callback is received, the client might only have
>> replicated AQ ways, so the guarantee at that point can only be the ability
>> to tolerate AQ - 1 crashes. The ledger configuration states that the
>> application wants to have WQ copies of each entry, though. I'd expect a
>> ledger to have WQ copies of each entry up to the final entry number when
>> it is closed. Do you see it differently?

I also agree and was pretty surprised when I discovered the behaviour. It is
not something that users expect, and I think we need to correct it. So I'm
with you.

What's the best way to share the TLA+ findings?

Jack

On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:

> [ External sender. Exercise caution. ]
>
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> > the 3rd bookie times out. Ensemble change is executed and all pending
> > adds that are above the LAC of 100 are replayed to another bookie,
> > meaning that the entry e100 is not replayed to another bookie, causing
> > this entry to meet the rep factor of only AQ.
>
> This is an example of a scenario corresponding to what we suspect is a bug
> introduced earlier, but Enrico is arguing that this is not the intended
> behavior, and at this point, I agree.
>
> > This is alluded to in the docs as they state that AQ is also the minimum
> > guaranteed replication factor.
>
> By the time a successful callback is received, the client might only have
> replicated AQ ways, so the guarantee at that point can only be the ability
> to tolerate AQ - 1 crashes. The ledger configuration states that the
> application wants to have WQ copies of each entry, though. I'd expect a
> ledger to have WQ copies of each entry up to the final entry number when
> it is closed. Do you see it differently?
>
> > I'd be happy to set up a meeting to discuss the spec and its findings.
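[Editor's note: to make the disputed scenario concrete, here is a minimal Python sketch. This is not BookKeeper client code; all names (`ensemble_change`, the bookie labels) are hypothetical. It models only the replay rule under discussion: on ensemble change, pending adds strictly above the LAC are replayed to the replacement bookie, so an entry acked at exactly AQ copies is left with AQ copies instead of WQ.]

```python
# Sketch of the suspected bug, with WQ=3, AQ=2 (hypothetical model,
# not BookKeeper code). e100 was acked once b1 and b2 confirmed, so
# LAC = 100; the write of e100 to b3 is still outstanding when b3
# times out. e101 is a pending add above the LAC.
WQ, AQ = 3, 2

def ensemble_change(copies, lac, failed_bookie, new_bookie):
    """Replace failed_bookie; replay only entries strictly above the LAC."""
    for entry, bookies in copies.items():
        bookies.discard(failed_bookie)
        if entry > lac:          # entries <= LAC are NOT replayed
            bookies.add(new_bookie)

copies = {100: {"b1", "b2"},     # acked at AQ, LAC advanced to 100
          101: {"b1"}}           # still pending, above the LAC
ensemble_change(copies, lac=100, failed_bookie="b3", new_bookie="b4")

print(sorted(copies[100]))  # ['b1', 'b2'] -> stuck at AQ copies, never reaches WQ
print(sorted(copies[101]))  # ['b1', 'b4'] -> replayed, still on track to WQ
```

The contrast between e100 and e101 is the point of the thread: only entries above the LAC benefit from the ensemble change, so the "hole" sits exactly at the confirmed tail.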
>
> That'd be great, I'm interested.
>
> -Flavio
>
>
> > On 15 Jan 2021, at 15:30, Jack Vanlightly
> > <jvanligh...@splunk.com.INVALID> wrote:
> >
> >> No you cannot miss data, if the client is not able to find a bookie
> >> that is able to answer with the entry it receives an error.
> >
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> > the 3rd bookie times out. Ensemble change is executed and all pending
> > adds that are above the LAC of 100 are replayed to another bookie,
> > meaning that the entry e100 is not replayed to another bookie, causing
> > this entry to meet the rep factor of only AQ. This is alluded to in the
> > docs as they state that AQ is also the minimum guaranteed replication
> > factor.
> >
> >> The recovery read fails if it is not possible to read every entry from
> >> at least AQ bookies (that is, it allows WQ-AQ read failures),
> >> and it does not attempt to "repair" (truncate) the ledger if it does
> >> not find enough bookies.
> >
> > This is not quite accurate. A single successful read is enough. However,
> > there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> > an explicit NoSuchEntry/Ledger, the read is considered failed and the
> > ledger recovery process ends there. This means that given the responses
> > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > considered successful is non-deterministic. If the response from b1 is
> > received last, then the read is already considered failed; otherwise the
> > read succeeds.
> >
> > I have come to the above conclusions through my reverse engineering
> > process for creating the TLA+ specification. I still need to reproduce
> > the AQ rep factor behaviour via some tests, but have verified via tests
> > the conclusion about ledger recovery reads.
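[Editor's note: the race Jack describes can be shown in a few lines of Python. This is a hypothetical model, not BookKeeper code: it encodes only the decision rule stated above — one successful read is enough, but the read is declared failed as soon as (WQ-AQ)+1 bookies answer with an explicit NoSuchEntry/NoSuchLedger — and shows that the same three responses give different outcomes depending on arrival order.]

```python
# Sketch of the recovery-read decision rule (hypothetical model).
# With WQ=3, AQ=2, two explicit negative responses fail the read.
WQ, AQ = 3, 2
FAIL_THRESHOLD = (WQ - AQ) + 1   # = 2 explicit NoSuchEntry/Ledger responses

def recovery_read_outcome(responses_in_arrival_order):
    negatives = 0
    for r in responses_in_arrival_order:
        if r == "success":
            return "read ok"     # a single successful read is enough
        negatives += 1           # explicit NoSuchEntry/NoSuchLedger
        if negatives >= FAIL_THRESHOLD:
            return "read failed" # ledger recovery ends here
    return "read failed"

# Same three responses (b1->success, b2/b3->NoSuchLedger),
# two different arrival orders:
print(recovery_read_outcome(["success", "no_such", "no_such"]))  # read ok
print(recovery_read_outcome(["no_such", "no_such", "success"]))  # read failed
```

If b1's success arrives last, the threshold has already been crossed and recovery stops, which is the non-determinism Jack points out.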
> >
> > Note that I have found two defects in the BookKeeper protocol, most
> > notably data loss due to the fact that fencing does not prevent further
> > successful adds. Currently the specification and associated
> > documentation are on a private Splunk repo, but I'd be happy to set up
> > a meeting to discuss the spec and its findings.
> >
> > Best
> > Jack
> >
> >
> > On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
> > wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >> Jonathan,
> >>
> >> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis
> >> <jbel...@apache.org> ha scritto:
> >>
> >>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>> Hi,
> >>>>
> >>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> >>>> confirm that once confirmed, an entry is not replayed to another
> >>>> bookie. This leaves a "hole" as the entry is now replicated only to
> >>>> 2 bookies; however, the new data integrity check that Ivan worked
> >>>> on, when run periodically, will be able to repair that hole.
> >>>
> >>> Can I read from the bookie with a hole in the meantime, and silently
> >>> miss data that it doesn't know about?
> >>
> >> No you cannot miss data: if the client is not able to find a bookie
> >> that is able to answer with the entry, it receives an error.
> >>
> >> The ledger has a known tail (LastAddConfirmed entry) and this value is
> >> stored in ledger metadata once the ledger is "closed".
> >>
> >> When the ledger is still open, that is, while the writer is writing to
> >> it, the reader is allowed to read only up to the LastAddConfirmed
> >> entry. This LAC value is returned to the reader using a piggyback
> >> mechanism, without reading from metadata. The reader cannot read
> >> beyond the latest position that has been confirmed to the writer by AQ
> >> bookies.
> >>
> >> We have a third case, the 'recovery read'.
> >> A reader starts a "recovery read" when you want to recover a ledger
> >> that has been abandoned by a dead writer, or when you are a new leader
> >> (Pulsar Bundle Owner) and you want to fence out the old leader.
> >> In this case the reader merges the current status of the ledger on ZK
> >> with the result of a scan of the whole ledger.
> >> Basically it reads the ledger from the beginning up to the tail, for
> >> as long as it is able to "read" entries. This way it sets the 'fenced'
> >> flag on the ledger on every bookie, and it is also able to detect the
> >> actual tail of the ledger (because the writer died and was not able to
> >> flush metadata to ZK).
> >>
> >> The recovery read fails if it is not possible to read every entry from
> >> at least AQ bookies (that is, it allows WQ-AQ read failures),
> >> and it does not attempt to "repair" (truncate) the ledger if it does
> >> not find enough bookies.
> >>
> >> I hope that helps
> >> Enrico
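[Editor's note: the recovery flow Enrico outlines — fence every bookie, then scan forward from the start until no entry can be read, taking the last readable entry as the tail to store in metadata — can be sketched as below. This is a deliberately simplified hypothetical model, not BookKeeper code: it treats "any bookie holds the entry" as a successful read (per Jack's point that a single successful read is enough) and ignores the quorum failure threshold shown earlier.]

```python
# Sketch of ledger recovery (hypothetical model, not BookKeeper code).
# Bookie contents are made up for illustration.
def recover_ledger(bookies, ledger_id):
    for b in bookies.values():   # step 1: set the fenced flag everywhere,
        b["fenced"] = True       # so the old writer's adds are rejected

    tail, entry = -1, 0
    while True:                  # step 2: scan forward until no bookie
        if not any(entry in b["entries"][ledger_id]
                   for b in bookies.values()):
            break                # actual tail found: the dead writer never
        tail = entry             # flushed it to ZK metadata
        entry += 1
    return tail                  # step 3: stored in ZK when the ledger closes

bookies = {
    "b1": {"fenced": False, "entries": {7: {0, 1, 2}}},
    "b2": {"fenced": False, "entries": {7: {0, 1}}},
    "b3": {"fenced": False, "entries": {7: {0, 1, 2}}},
}
print(recover_ledger(bookies, ledger_id=7))  # 2 -> LastAddConfirmed = e2
```

Fencing first, then scanning, is what makes the discovered tail safe to publish: once every bookie is fenced, the old writer can no longer extend the ledger past the tail the recovering reader finds.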