Let's set up a call and create any issues that come out of it. I have already created the patches in our (Splunk) fork; it may or may not be easiest to wait until we re-sync with the open source repo. We can include the fixes in the discussion.
Jack

On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org> wrote:

> [ External sender. Exercise caution. ]
>
> Hi Jack,
>
> Thanks for getting back.
>
> > What's the best way to share the TLA+ findings?
>
> Would you be able to share the spec? I'm ok with reading TLA+.
>
> As for sharing your specific findings, I'd suggest one of the following:
>
> 1- Create an email thread describing the scenarios that trigger a bug.
> 2- Create issues, one for each problem you found.
> 3- Create a discussion on the project Slack, perhaps a channel specific for it.
> 4- Set up a zoom call to present and discuss with the community.
>
> Option 2 is ideal from a community perspective, but we can also set up a call inviting everyone and create issues out of that discussion. We can in fact set up a call even if we create the issues ahead of time.
>
> Does it make sense?
>
> -Flavio
>
> > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
> >
> > Hi Flavio,
> >
> >>> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
> >
> >>> By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
> >
> > I also agree and was pretty surprised when I discovered the behaviour. It is not something that users expect and I think we need to correct it. So I'm with you.
> >
> > What's the best way to share the TLA+ findings?
> >
> > Jack
> >
> > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ.
> >>
> >> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
> >>
> >>> This is alluded to in the docs as they state that AQ is also the minimum guaranteed replication factor.
> >>
> >> By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
> >>
> >>> I'd be happy to set up a meeting to discuss the spec and its findings.
> >>
> >> That'd be great, I'm interested.
> >>
> >> -Flavio
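To make the WQ/AQ distinction above concrete, here is a minimal sketch using the public BookKeeper client API (the ZooKeeper connect string and password are placeholders): with ensemble 3, WQ 3, AQ 2, addEntry() returns and the LAC advances once two bookies have acked the entry, even though the ledger is configured for three copies.

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class WqAqExample {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper connect string.
            BookKeeper bk = new BookKeeper("localhost:2181");

            // Ensemble = 3, write quorum (WQ) = 3, ack quorum (AQ) = 2.
            LedgerHandle lh = bk.createLedger(3, 3, 2, DigestType.CRC32, "passwd".getBytes());

            // Returns (and the LAC advances) once AQ = 2 bookies have acked,
            // even though the application asked for WQ = 3 copies of the entry.
            long entryId = lh.addEntry("hello".getBytes());
            System.out.println("confirmed entry " + entryId + ", LAC = " + lh.getLastAddConfirmed());

            lh.close();
            bk.close();
        }
    }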
> >>
> >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
> >>>
> >>>> No, you cannot miss data: if the client is not able to find a bookie that is able to answer with the entry, it receives an error.
> >>>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ. This is alluded to in the docs, as they state that AQ is also the minimum guaranteed replication factor.
> >>>
> >>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not attempt to "repair" (truncate) the ledger if it does not find enough bookies.
> >>>
> >>> This is not quite accurate. A single successful read is enough. However, there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with explicit NoSuchEntry/Ledger, the read is considered failed and the ledger recovery process ends there. This means that given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is considered successful is non-deterministic. If the response from b1 is received last, then the read is already considered failed; otherwise the read succeeds.
> >>>
> >>> I have come to the above conclusions through my reverse engineering process for creating the TLA+ specification. I still need to reproduce the AQ rep factor behaviour via some tests, but I have verified via tests the conclusion about ledger recovery reads.
> >>>
> >>> Note that I have found two defects in the BookKeeper protocol, most notably data loss due to the fact that fencing does not prevent further successful adds. Currently the specification and associated documentation are on a private Splunk repo, but I'd be happy to set up a meeting to discuss the spec and its findings.
> >>>
> >>> Best
> >>> Jack
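To make the arrival-order dependence Jack describes concrete, here is an illustrative sketch (invented names, not the actual BookKeeper recovery-read code) of the response accounting for a single recovery read of one entry with WQ 3 and AQ 2: one positive response is enough to succeed, but (WQ-AQ)+1 explicit negatives arriving first mark the read as failed.

    // Illustrative only: models the accounting described above, not BookKeeper internals.
    public class RecoveryReadOutcome {

        enum Response { OK, NO_SUCH_ENTRY }

        static String outcome(int writeQuorum, int ackQuorum, Response... inArrivalOrder) {
            int negatives = 0;
            for (Response r : inArrivalOrder) {
                if (r == Response.OK) {
                    return "SUCCESS";                                  // one positive read is enough
                }
                if (++negatives >= (writeQuorum - ackQuorum) + 1) {
                    return "FAILED";                                   // enough explicit negatives arrived first
                }
            }
            return "PENDING";
        }

        public static void main(String[] args) {
            // Same three responses (b1 -> success, b2/b3 -> NoSuchLedger), two arrival orders, WQ=3, AQ=2:
            System.out.println(outcome(3, 2, Response.OK, Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY)); // SUCCESS
            System.out.println(outcome(3, 2, Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY, Response.OK)); // FAILED
        }
    }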
> >>>
> >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> >>>
> >>>> [ External sender. Exercise caution. ]
> >>>>
> >>>> Jonathan,
> >>>>
> >>>> On Thu, Jan 14, 2021 at 20:57 Jonathan Ellis <jbel...@apache.org> wrote:
> >>>>
> >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that, once confirmed, an entry is not replayed to another bookie. This leaves a "hole", as the entry is now replicated to only 2 bookies; however, the new data integrity check that Ivan worked on, when run periodically, will be able to repair that hole.
> >>>>>
> >>>>> Can I read from the bookie with a hole in the meantime, and silently miss data that it doesn't know about?
> >>>>
> >>>> No, you cannot miss data: if the client is not able to find a bookie that is able to answer with the entry, it receives an error.
> >>>>
> >>>> The ledger has a known tail (the LastAddConfirmed entry) and this value is stored in the ledger metadata once the ledger is "closed".
> >>>>
> >>>> While the ledger is still open, that is, while the writer is writing to it, the reader is allowed to read only up to the LastAddConfirmed entry. This LAC value is returned to the reader using a piggyback mechanism, without reading from metadata. The reader cannot read beyond the latest position that has been confirmed to the writer by AQ bookies.
> >>>>
> >>>> We have a third case, the 'recovery read'. A reader starts a "recovery read" when it wants to recover a ledger that has been abandoned by a dead writer, or when it is a new leader (Pulsar Bundle Owner) and wants to fence out the old leader. In this case the reader merges the current status of the ledger on ZK with the result of a scan of the whole ledger. Basically it reads the ledger from the beginning up to the tail, for as long as it is able to "read" entries; this way it sets the 'fenced' flag on the ledger on every bookie, and it is also able to detect the actual tail of the ledger (because the writer died and was not able to flush metadata to ZK).
> >>>>
> >>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not attempt to "repair" (truncate) the ledger if it does not find enough bookies.
> >>>>
> >>>> I hope that helps
> >>>> Enrico
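A rough sketch of the two read paths Enrico describes, using the public BookKeeper client API (the ledger id, connect string, and password are placeholders): openLedgerNoRecovery() does not fence the writer and only reads up to the piggybacked LAC, while openLedger() performs the recovery read, fencing the ledger on the bookies and sealing its tail.

    import java.util.Enumeration;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerEntry;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class TailingReader {
        public static void main(String[] args) throws Exception {
            BookKeeper bk = new BookKeeper("localhost:2181");   // placeholder connect string
            long ledgerId = Long.parseLong(args[0]);             // ledger written by some other client
            byte[] passwd = "passwd".getBytes();                 // placeholder password

            // Open without recovery: does not fence the writer; only entries up to the
            // LastAddConfirmed (LAC) advertised by the bookies are readable.
            LedgerHandle lh = bk.openLedgerNoRecovery(ledgerId, DigestType.CRC32, passwd);
            long lac = lh.readLastConfirmed();
            if (lac >= 0) {
                Enumeration<LedgerEntry> entries = lh.readEntries(0, lac);
                while (entries.hasMoreElements()) {
                    LedgerEntry e = entries.nextElement();
                    System.out.println("entry " + e.getEntryId() + ": " + new String(e.getEntry()));
                }
            }
            lh.close();

            // By contrast, openLedger() (without "NoRecovery") runs the recovery read:
            // it fences the ledger on the bookies and seals the tail.
            LedgerHandle recovered = bk.openLedger(ledgerId, DigestType.CRC32, passwd);
            System.out.println("sealed ledger, last entry = " + recovered.getLastAddConfirmed());
            recovered.close();
            bk.close();
        }
    }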