Thanks for the feedback, Sijie:

> On 18 Jan 2021, at 19:53, Sijie Guo <guosi...@gmail.com> wrote:
>
>> One concern for me in this thread is case (3). I'd expect a client that doesn't crash to not give up, and eventually replace the bookie if it is unresponsive.
>>
> The current implementation doesn't retry replacing a bookie if an entry is already acknowledged (receiving AQ responses). It relies on inspection to repair the hole.
Ok, that's good information, let me think a bit more about it. I'd like to understand why we can't keep a pending add op reference until it is fully replicated, which as I understand would enable bookie replacement for entries with fewer than WQ acks.

> So the memory pressure is not coming from retrying. It is simply that the bookkeeper client references the sendBuffers until it receives any responses from the slow bookie. The bookkeeper client allows enqueuing addEntry operations because the operations meet the AQ requirements.

I see, the entries queuing in the bookkeeper client are inducing memory pressure in the presence of a slow bookie.

> Pulsar does add a `maxPendingPublishdRequestsPerConnection` mechanism to throttle the add operations. But this won't work as bookkeeper will notify the callbacks once the operations meet the AQ requirements. But there is a huge amount of memory (throughput * timeout period) referenced by a slow bookie. Hence we have to add a memory-based throttling mechanism as Matteo suggested.

Thanks for pointing to Matteo's message, I reviewed it again. He actually makes two observations:

1- It is difficult to throttle from outside the bookkeeper client because the application using it does not have visibility into what has been fully replicated. A back pressure mechanism internal to the bookie (and possibly configurable) might be necessary.
2- There is some Pulsar work (PIP-74) that could be leveraged to throttle from outside the bookkeeper client based on memory limits.

> If we want to add the retry logic to replace a bookie, this will add more pressure to the memory. But it can still be solved by a memory-based back-pressure mechanism.

I don't know much about (2), but I'll have a look to form an opinion. At a high level, it seems reasonable. We might still want to consider doing (1) to simplify the job of the application.
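[Editor's sketch] The memory-based throttling discussed above can be pictured as a byte-budget admission check around addEntry: adds are admitted only while the bytes still referenced by in-flight sends (including sends to a slow bookie that has not responded, even after AQ acks) fit a budget. This is an illustrative sketch under assumed names (`MemoryLimiter`, `try_acquire`), not the actual BookKeeper client API.

```python
# Hypothetical memory-based backpressure for a bookkeeper-like client.
# Buffers stay referenced until the LAST bookie responds or times out,
# not until AQ acks, so a slow bookie can pin (throughput * timeout)
# bytes unless adds are throttled on a memory budget.

class MemoryLimiter:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_flight = 0  # bytes referenced by pending sends

    def try_acquire(self, size):
        """Admit an add only if it fits the budget; else apply backpressure."""
        if self.in_flight + size > self.max_bytes:
            return False
        self.in_flight += size
        return True

    def release(self, size):
        """Called when the last write-set bookie answers (or times out),
        not when the AQ callback fires."""
        self.in_flight -= size

limiter = MemoryLimiter(max_bytes=100)
assert limiter.try_acquire(60)
assert limiter.try_acquire(30)
assert not limiter.try_acquire(20)  # slow bookie pins 90 bytes: throttle
limiter.release(60)
assert limiter.try_acquire(20)
```

With such a limiter inside the client, the application would not need visibility into what is fully replicated, which is observation (1) above.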
-Flavio

> Thanks,
> Sijie
>
> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira <f...@apache.org> wrote:
>
>> In the scenario that WQ > AQ, a client acknowledges the add of an entry e to the application once it receives AQ bookie acks. Say now that the client is not able to write a copy of e to at least one bookie b. It could be because:
>>
>> 1- The client crashed before it is able to do it
>> 2- Bookie b crashed
>> 3- The client gave up trying
>>
>> In case (1), the client crashed and the ledger will be recovered by some reader. For all entries that have been acknowledged, including e, I'd expect them to be readable from the closed ledger. Each one of these entries that haven't been written to bookie b should be written there as part of the recovery process.
>>
>> In case (2), the client is not able to write entry e to the crashed bookie b, so it will replace the bookie and write e to the new bookie. I see in this discussion that there is an option to disable bookie replacement; I'm ignoring that for this discussion.
>>
>> In case (3), say the client discards the entry after adding it successfully to AQ bookies, and gives up at some point because it can't reach the bookie. The client maybe replaces bookie b, or bookie b eventually comes back and the client proceeds with the adds. In either case, there is a hole that can only be fixed by inspecting the ledger.
>>
>> One concern for me in this thread is case (3). I'd expect a client that doesn't crash to not give up, and eventually replace the bookie if it is unresponsive. But that certainly leads to the memory pressure problem that was also mentioned in the thread, for which one potential direction, also mentioned, is to apply back pressure.
>>
>> Thanks,
>> -Flavio
>>
>>> On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
>>>
>>>> Did you guys see any issues with the ledger auditor?
>>>
>>>> The active writer can't guarantee writing entries to WQ because it can crash during retrying adding entries to (WQ - AQ) bookies.
>>>
>>> The need to repair AQ-replicated entries is clear, and the auditor is one such strategy. Ivan has also worked on a self-healing bookie strategy where each bookie itself is able to detect these holes and is able to obtain the missing entries itself. The detection of these holes using this strategy is more efficient, as it only requires network calls for the ledger metadata scanning (to zk) and the missing entry reads (to other bookies). The auditor, as I understand it, reads all entries of all ledgers from all bookies (of an entry's ensemble), meaning these entries cross the network. The auditor approach is therefore likely to be run less frequently due to the network cost.
>>>
>>> I do also wonder if the writer, on performing an ensemble change, should replay "AQ but not WQ" entries; this would leave just writer failures causing these AQ-replicated entries.
>>>
>>>> Regarding recovery reads, a recovery read doesn't need to be deterministic. For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger), either including it or excluding it in the sealed ledger is correct behavior. The bookkeeper client guarantees that once a ledger is sealed, the entries in the sealed ledger can always be read and can be read consistently.
>>>
>>>> I am not sure it is a problem unless I misunderstand it.
>>>
>>> It is true that it doesn't violate any safety property, but it is a strange check to me. It looks like an implementation artefact rather than an explicit protocol design choice. But not a huge deal.
>>>
>>> Jack
>>>
>>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
>>>
>>>> [ External sender. Exercise caution. ]
>>>>
>>>> Sorry for being late in this thread.
>>>>
>>>> If I understand this correctly, the main topic is about the "hole" when WQ > AQ.
>>>>
>>>>> This leaves a "hole" as the entry is now replicated only to 2 bookies,
>>>>
>>>> We do have one hole when ensemble change is enabled and WQ > AQ. That was a known behavior. But the hole will be repaired by the ledger auditor as JV said. Did you guys see any issues with the ledger auditor?
>>>>
>>>>> I'd think that we guarantee that an entry that is acknowledged is eventually written WQ ways and that it is observable by readers when the ledger is closed.
>>>>
>>>> To Flavio's question, we don't guarantee (and can't guarantee) that the active writer will eventually write the entries to WQ. For the active writers, we only guarantee entries are written to AQ. The ledger auditor is to ensure all the entries are written to WQ.
>>>>
>>>> The active writer can't guarantee writing entries to WQ because it can crash during retrying adding entries to (WQ - AQ) bookies.
>>>>
>>>>> A single successful read is enough. However there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with explicit NoSuchEntry/Ledger, the read is considered failed and the ledger recovery process ends there. This means that given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is considered successful is non-deterministic.
>>>>
>>>> Regarding recovery reads, a recovery read doesn't need to be deterministic. For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger), either including it or excluding it in the sealed ledger is correct behavior. The bookkeeper client guarantees that once a ledger is sealed, the entries in the sealed ledger can always be read and can be read consistently.
>>>>
>>>> I am not sure it is a problem unless I misunderstand it.
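[Editor's sketch] The recovery-read termination rule debated above (success on any positive response, failure as soon as (WQ - AQ) + 1 bookies answer with an explicit NoSuchEntry/NoSuchLedger) can be modelled to show the order dependence Jack describes. This is an illustrative model, not the real client code; names and signatures are assumptions.

```python
# Model of the described recovery-read rule: a single successful read is
# enough, but the read is declared failed once (WQ - AQ) + 1 explicit
# negative responses arrive. The outcome can therefore depend on the
# order in which bookie responses are received.

def recovery_read(responses, wq, aq):
    """responses: list of (bookie, 'ok' | 'no_such_entry') in arrival order."""
    needed_negatives = (wq - aq) + 1
    negatives = 0
    for _, result in responses:
        if result == 'ok':
            return 'success'          # one positive read suffices
        negatives += 1
        if negatives >= needed_negatives:
            return 'failed'           # recovery ends here
    return 'failed'

# Same three responses, different arrival order, different outcome:
early = [('b1', 'ok'), ('b2', 'no_such_entry'), ('b3', 'no_such_entry')]
late  = [('b2', 'no_such_entry'), ('b3', 'no_such_entry'), ('b1', 'ok')]
assert recovery_read(early, wq=3, aq=2) == 'success'
assert recovery_read(late,  wq=3, aq=2) == 'failed'
```

As Sijie notes, either outcome seals a ledger that reads consistently afterwards, so this non-determinism is a design oddity rather than a safety violation.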
>>>>
>>>> - Sijie
>>>>
>>>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly <jvanligh...@splunk.com.invalid> wrote:
>>>>
>>>>> Let's set up a call and create any issues from that. I have already created the patches in our (Splunk) fork and it might be easiest or not to wait until we re-sync up with the open source repo. We can include the fixes in the discussion.
>>>>>
>>>>> Jack
>>>>>
>>>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org> wrote:
>>>>>
>>>>>> [ External sender. Exercise caution. ]
>>>>>>
>>>>>> Hi Jack,
>>>>>>
>>>>>> Thanks for getting back.
>>>>>>
>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>
>>>>>> Would you be able to share the spec? I'm ok with reading TLA+.
>>>>>>
>>>>>> As for sharing your specific findings, I'd suggest one of the following:
>>>>>>
>>>>>> 1- Create an email thread describing the scenarios that trigger a bug.
>>>>>> 2- Create issues, one for each problem you found.
>>>>>> 3- Create a discussion on the project Slack, perhaps a channel specific for it.
>>>>>> 4- Set up a zoom call to present and discuss with the community.
>>>>>>
>>>>>> Option 2 is ideal from a community perspective, but we can also set up a call inviting everyone and create issues out of that discussion. We can in fact set up a call even if we create the issues ahead of time.
>>>>>>
>>>>>> Does it make sense?
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>>>> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
>>>>>>>
>>>>>>>>> By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
>>>>>>>
>>>>>>> I also agree and was pretty surprised when I discovered the behaviour. It is not something that users expect and I think we need to correct it. So I'm with you.
>>>>>>>
>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>>
>>>>>>> Jack
>>>>>>>
>>>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:
>>>>>>>
>>>>>>>> [ External sender. Exercise caution. ]
>>>>>>>>
>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ.
>>>>>>>>
>>>>>>>> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
>>>>>>>>
>>>>>>>>> This is alluded to in the docs as they state that AQ is also the minimum guaranteed replication factor.
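[Editor's sketch] The ensemble-change behaviour in the quoted scenario (WQ 3, AQ 2, e100 acknowledged and LAC advanced to 100) comes down to a filter over pending adds: only entries strictly above the LAC are replayed to the replacement bookie. The following is an illustrative reduction of that rule, not the real client code.

```python
# Sketch of the replay rule behind the e100 "hole": on ensemble change,
# pending adds strictly above the LAC are replayed to the new bookie,
# so an entry already confirmed at AQ (e100, with LAC = 100) is NOT
# replayed and is left with AQ copies instead of the configured WQ.

def entries_replayed(pending_adds, lac):
    """pending_adds: entry ids outstanding on the failed bookie."""
    return [e for e in pending_adds if e > lac]

# e100 reached AQ=2 and the LAC became 100 before bookie 3 timed out:
pending_on_failed_bookie = [100, 101, 102]
assert entries_replayed(pending_on_failed_bookie, lac=100) == [101, 102]
# e100 keeps only 2 of the configured WQ=3 copies: the "hole".
```

Jack's suggestion above amounts to widening the filter to include acknowledged-but-not-fully-replicated ("AQ but not WQ") entries, leaving only writer crashes as a source of such holes.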
>>>>>>>>
>>>>>>>> By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
>>>>>>>>
>>>>>>>>> I'd be happy to set up a meeting to discuss the spec and its findings.
>>>>>>>>
>>>>>>>> That'd be great, I'm interested.
>>>>>>>>
>>>>>>>> -Flavio
>>>>>>>>
>>>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
>>>>>>>>>
>>>>>>>>>> No, you cannot miss data; if the client is not able to find a bookie that is able to answer with the entry, it receives an error.
>>>>>>>>>
>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ. This is alluded to in the docs as they state that AQ is also the minimum guaranteed replication factor.
>>>>>>>>>
>>>>>>>>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not hazard to "repair" (truncate) the ledger if it does not find enough bookies.
>>>>>>>>>
>>>>>>>>> This is not quite accurate.
>>>>>>>>> A single successful read is enough. However, there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with explicit NoSuchEntry/Ledger, the read is considered failed and the ledger recovery process ends there. This means that given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is considered successful is non-deterministic. If the response from b1 is received last, then the read is already considered failed; otherwise the read succeeds.
>>>>>>>>>
>>>>>>>>> I have come to the above conclusions through my reverse engineering process for creating the TLA+ specification. I still need to reproduce the AQ rep factor behaviour via some tests, but I have verified via tests the conclusion about ledger recovery reads.
>>>>>>>>>
>>>>>>>>> Note that I have found two defects with the BookKeeper protocol, most notably data loss due to the fact that fencing does not prevent further successful adds. Currently the specification and associated documentation are on a private Splunk repo, but I'd be happy to set up a meeting to discuss the spec and its findings.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Jack
>>>>>>>>>
>>>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> [ External sender. Exercise caution. ]
>>>>>>>>>>
>>>>>>>>>> Jonathan,
>>>>>>>>>>
>>>>>>>>>> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jbel...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that once confirmed, an entry is not replayed to another bookie. This leaves a "hole" as the entry is now replicated only to 2 bookies; however, the new data integrity check that Ivan worked on, when run periodically, will be able to repair that hole.
>>>>>>>>>>>
>>>>>>>>>>> Can I read from the bookie with a hole in the meantime, and silently miss data that it doesn't know about?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No, you cannot miss data; if the client is not able to find a bookie that is able to answer with the entry, it receives an error.
>>>>>>>>>>
>>>>>>>>>> The ledger has a known tail (the LastAddConfirmed entry) and this value is stored in the ledger metadata once the ledger is "closed".
>>>>>>>>>>
>>>>>>>>>> When the ledger is still open, that is, when the writer is writing to it, the reader is allowed to read only up to the LastAddConfirmed entry. This LAC value is returned to the reader using a piggyback mechanism, without reading from metadata. The reader cannot read beyond the latest position that has been confirmed to the writer by AQ bookies.
>>>>>>>>>>
>>>>>>>>>> We have a third case, the 'recovery read'.
>>>>>>>>>> A reader starts a "recovery read" when you want to recover a ledger that has been abandoned by a dead writer, or when you are a new leader (Pulsar Bundle Owner) and you want to fence out the old leader.
>>>>>>>>>>
>>>>>>>>>> In this case the reader merges the current status of the ledger on ZK with the result of a scan of the whole ledger. Basically, it reads the ledger from the beginning up to the tail, for as long as it is able to "read" entries. This way it sets the 'fenced' flag on the ledger on every bookie, and it is also able to detect the actual tail of the ledger (because the writer died and was not able to flush metadata to ZK).
>>>>>>>>>>
>>>>>>>>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not hazard to "repair" (truncate) the ledger if it does not find enough bookies.
>>>>>>>>>>
>>>>>>>>>> I hope that helps
>>>>>>>>>> Enrico
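[Editor's sketch] Enrico's recovery scan can be pictured as a forward walk that stops at the first entry that can no longer be read, taking the last readable entry as the tail to seal into metadata. The `read_entry` callable below is an assumed stand-in for a quorum read that also sets the fencing flag; this is an illustrative sketch, not the real recovery implementation.

```python
# Sketch of the recovery scan: walk the ledger forward from the start,
# fencing as we read, until an entry is unreadable; the last readable
# entry id becomes the tail recorded when closing the ledger.

def recover_tail(read_entry, start=0):
    """read_entry(i) -> True if entry i is readable (and fences bookies)."""
    last = start - 1
    i = start
    while read_entry(i):
        last = i
        i += 1
    return last  # -1 (for start=0) means the ledger is sealed empty

# Entries 0..4 were written; the writer died before flushing the tail to ZK:
stored = {0, 1, 2, 3, 4}
assert recover_tail(lambda i: i in stored) == 4
```

Jack's point earlier in the thread concerns exactly how `read_entry` decides unreadability: one positive response suffices, but (WQ-AQ)+1 explicit negatives end the scan, which makes the detected tail order-dependent.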