Hi Jack,

Thanks for getting back to me.
> What's the best way to share the TLA+ findings?

Would you be able to share the spec? I'm OK with reading TLA+.

As for sharing your specific findings, I'd suggest one of the following:

1. Create an email thread describing the scenarios that trigger a bug.
2. Create issues, one for each problem you found.
3. Create a discussion on the project Slack, perhaps in a channel specific to it.
4. Set up a Zoom call to present and discuss with the community.

Option 2 is ideal from a community perspective, but we can also set up a call inviting everyone and create issues out of that discussion. In fact, we can set up a call even if we create the issues ahead of time.

Does it make sense?

-Flavio

> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
>
> Hi Flavio,
>
>>> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
>
>>> By the time a successful callback is received, the client might only have replicated AQ ways, so at that point the guarantee can only be the ability to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
>
> I also agree and was pretty surprised when I discovered the behaviour. It is not something that users expect and I think we need to correct it. So I'm with you.
>
> What's the best way to share the TLA+ findings?
>
> Jack
>
> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:
>
>> [ External sender. Exercise caution. ]
>>
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out.
>>> Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ.
>>
>> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
>>
>>> This is alluded to in the docs as they state that AQ is also the minimum guaranteed replication factor.
>>
>> By the time a successful callback is received, the client might only have replicated AQ ways, so at that point the guarantee can only be the ability to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
>>
>>> I'd be happy to set up a meeting to discuss the spec and its findings.
>>
>> That'd be great, I'm interested.
>>
>> -Flavio
>>
>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
>>>
>>>> No you cannot miss data, if the client is not able to find a bookie that is able to answer with the entry it receives an error.
>>>
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ. This is alluded to in the docs as they state that AQ is also the minimum guaranteed replication factor.
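To make the quoted scenario concrete, here is a toy Python sketch (illustrative only, not BookKeeper code; all names are invented) of the "replay only pending adds above the LAC" rule, showing that e100 is left with only AQ copies:

```python
# Toy model of the ensemble-change behaviour described above (WQ=3, AQ=2).
# e100 reached AQ acks, so the LAC is 100 and the confirm callback fired.
# On ensemble change, only pending adds strictly above the LAC are replayed.

WQ, AQ = 3, 2
lac = 100

# copies[entry_id] -> number of bookies currently holding the entry
copies = {100: 2, 101: 2, 102: 1}

def ensemble_change_replay(copies, lac):
    """Replay to the replacement bookie only entries strictly above the LAC."""
    for entry_id in copies:
        if entry_id > lac:
            copies[entry_id] += 1  # re-sent to the new ensemble member

ensemble_change_replay(copies, lac)

assert copies[101] == 3 and copies[102] == 2  # replayed entries gain a copy
assert copies[100] == AQ < WQ  # e100 is never replayed: stuck at AQ copies
```

Under this rule the confirmed entry at the LAC is exactly the one the ensemble change never repairs, which is why it can sit at AQ copies until the data integrity check runs.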
>>>
>>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not hazard to "repair" (truncate) the ledger if it does not find enough bookies.
>>>
>>> This is not quite accurate. A single successful read is enough. However, there is an odd behaviour: as soon as (WQ-AQ)+1 reads fail with an explicit NoSuchEntry/Ledger, the read is considered failed and the ledger recovery process ends there. This means that given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is considered successful is non-deterministic. If the response from b1 is received last, the read has already been considered failed; otherwise the read succeeds.
>>>
>>> I have come to the above conclusions through my reverse-engineering process for creating the TLA+ specification. I have yet to reproduce the AQ rep factor behaviour via tests, but I have verified via tests the conclusion about ledger recovery reads.
>>>
>>> Note that I have found two defects in the BookKeeper protocol, most notably data loss due to the fact that fencing does not prevent further successful adds. Currently the specification and associated documentation are in a private Splunk repo, but I'd be happy to set up a meeting to discuss the spec and its findings.
>>>
>>> Best
>>> Jack
>>>
>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>
>>>> [ External sender. Exercise caution. ]
>>>>
>>>> Jonathan,
>>>>
>>>> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jbel...@apache.org> wrote:
>>>>
>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that, once confirmed, an entry is not replayed to another bookie.
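Jack's point about response ordering can be sketched in a few lines of Python (a toy model of the rule he describes, not the actual client code; the names are mine):

```python
# Toy model of the recovery-read termination rule described above:
# a single successful read is enough, but (WQ-AQ)+1 explicit
# NoSuchEntry/Ledger replies fail the read on the spot, so the outcome
# can depend on the order in which replies arrive.

WQ, AQ = 3, 2
FAILURE_THRESHOLD = (WQ - AQ) + 1  # 2 negative replies fail the read

def recovery_read(replies):
    """replies: 'success' or 'no_such_entry' strings, in arrival order."""
    negatives = 0
    for reply in replies:
        if reply == "success":
            return "read_ok"
        negatives += 1
        if negatives >= FAILURE_THRESHOLD:
            return "read_failed"  # ledger recovery stops here
    return "read_failed"

# Same three replies (b1->success, b2/b3->NoSuchLedger), different orders:
assert recovery_read(["success", "no_such_entry", "no_such_entry"]) == "read_ok"
assert recovery_read(["no_such_entry", "no_such_entry", "success"]) == "read_failed"
```

The second assertion is the case Jack describes: if b1's success arrives after the two negatives, recovery has already given up.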
>>>>>> This leaves a "hole", as the entry is now replicated to only 2 bookies; however, the new data integrity check that Ivan worked on, when run periodically, will be able to repair that hole.
>>>>>
>>>>> Can I read from the bookie with a hole in the meantime, and silently miss data that it doesn't know about?
>>>>
>>>> No, you cannot miss data: if the client is not able to find a bookie that can answer with the entry, it receives an error.
>>>>
>>>> The ledger has a known tail (the LastAddConfirmed, or LAC, entry), and this value is stored in the ledger metadata once the ledger is "closed".
>>>>
>>>> While the ledger is still open, that is, while the writer is writing to it, the reader is allowed to read only up to the LastAddConfirmed entry. This LAC value is returned to the reader using a piggyback mechanism, without reading from metadata. The reader cannot read beyond the latest position that has been confirmed to the writer by AQ bookies.
>>>>
>>>> We have a third case, the "recovery read". A reader starts a recovery read when it wants to recover a ledger that has been abandoned by a dead writer, or when it is a new leader (a Pulsar bundle owner, for instance) that wants to fence out the old leader. In this case the reader merges the current status of the ledger on ZK with the result of a scan of the whole ledger. Basically, it reads the ledger from the beginning up to the tail, for as long as it is able to read entries; this way it sets the "fenced" flag on the ledger on every bookie, and it is also able to detect the actual tail of the ledger (because the writer died and was not able to flush metadata to ZK).
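Enrico's visibility rule for readers of an open ledger amounts to a one-line predicate; a toy Python sketch (names invented for illustration, not the BookKeeper client API):

```python
# Toy sketch of the LAC rule above: a reader of an open ledger may only
# read entries up to the LastAddConfirmed it has learned, a value that is
# piggybacked on read responses rather than fetched from metadata.

def can_read(entry_id, lac):
    """True if the entry is visible to a reader of an open ledger."""
    return entry_id <= lac

lac = 100  # learned via piggyback on an earlier read response
assert can_read(100, lac)      # confirmed by AQ bookies: visible
assert not can_read(101, lac)  # written but not yet confirmed: hidden
```

This is why a reader never observes an entry the writer has not had confirmed by AQ bookies, even if some bookie already holds it.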
>>>>
>>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not hazard to "repair" (truncate) the ledger if it does not find enough bookies.
>>>>
>>>> I hope that helps
>>>> Enrico