Let's set up a call and create any issues that come out of it. I have already created the patches in our (Splunk) fork; it may or may not be easiest to wait until we re-sync with the open source repo. We can include the fixes in the discussion.
Jack

On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org> wrote:

> [ External sender. Exercise caution. ]
>
> Hi Jack,
>
> Thanks for getting back.
>
> > What's the best way to share the TLA+ findings?
>
> Would you be able to share the spec? I'm ok with reading TLA+.
>
> As for sharing your specific findings, I'd suggest one of the following:
>
> 1- Create an email thread describing the scenarios that trigger a bug.
> 2- Create issues, one for each problem you found.
> 3- Create a discussion on the project Slack, perhaps a channel specific for it.
> 4- Set up a zoom call to present and discuss with the community.
>
> Option 2 is ideal from a community perspective, but we can also set up a call inviting everyone and create issues out of that discussion. We can in fact set up a call even if we create the issues ahead of time.
>
> Does it make sense?
>
> -Flavio
>
> > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
> >
> > Hi Flavio,
> >
> >>> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
> >
> >>> By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
> >
> > I also agree and was pretty surprised when I discovered the behaviour. It is not something that users expect and I think we need to correct it. So I'm with you.
> >
> > What's the best way to share the TLA+ findings?
> >
> > Jack
> >
> > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ.
> >>
> >> This is an example of a scenario corresponding to what we suspect is a bug introduced earlier, but Enrico is arguing that this is not the intended behavior, and at this point, I agree.
> >>
> >>> This is alluded to in the docs as they state that AQ is also the minimum guaranteed replication factor.
> >>
> >> By the time a successful callback is received, the client might only have replicated AQ ways, so the guarantee can only be at that point of being able to tolerate AQ - 1 crashes. The ledger configuration states that the application wants to have WQ copies of each entry, though. I'd expect a ledger to have WQ copies of each entry up to the final entry number when it is closed. Do you see it differently?
> >>
> >>> I'd be happy to set up a meeting to discuss the spec and its findings.
> >>
> >> That'd be great, I'm interested.
> >>
> >> -Flavio
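To make the WQ/AQ distinction above concrete, here is a minimal sketch using the public BookKeeper client API (the ZooKeeper connect string and password are placeholders): with ensemble 3, WQ 3, AQ 2, addEntry() returns and the LAC advances once two bookies have acked the entry, even though the ledger is configured for three copies.

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class WqAqExample {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper connect string.
            BookKeeper bk = new BookKeeper("localhost:2181");

            // Ensemble = 3, write quorum (WQ) = 3, ack quorum (AQ) = 2.
            LedgerHandle lh = bk.createLedger(3, 3, 2, DigestType.CRC32, "passwd".getBytes());

            // Returns (and the LAC advances) once AQ = 2 bookies have acked,
            // even though the application asked for WQ = 3 copies of the entry.
            long entryId = lh.addEntry("hello".getBytes());
            System.out.println("confirmed entry " + entryId + ", LAC = " + lh.getLastAddConfirmed());

            lh.close();
            bk.close();
        }
    }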
> >>
> >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID> wrote:
> >>>
> >>>> No, you cannot miss data: if the client is not able to find a bookie that is able to answer with the entry, it receives an error.
> >>>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the confirm callback to the client is called and the LAC is set to 100. Now the 3rd bookie times out. Ensemble change is executed and all pending adds that are above the LAC of 100 are replayed to another bookie, meaning that the entry e100 is not replayed to another bookie, causing this entry to meet the rep factor of only AQ. This is alluded to in the docs, as they state that AQ is also the minimum guaranteed replication factor.
> >>>
> >>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not attempt to "repair" (truncate) the ledger if it does not find enough bookies.
> >>>
> >>> This is not quite accurate. A single successful read is enough. However, there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with explicit NoSuchEntry/Ledger, the read is considered failed and the ledger recovery process ends there. This means that given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is considered successful is non-deterministic. If the response from b1 is received last, then the read is already considered failed; otherwise the read succeeds.
> >>>
> >>> I have come to the above conclusions through my reverse engineering process for creating the TLA+ specification. I still need to reproduce the AQ rep factor behaviour via some tests, but I have verified via tests the conclusion about ledger recovery reads.
> >>>
> >>> Note that I have found two defects in the BookKeeper protocol, most notably data loss due to the fact that fencing does not prevent further successful adds. Currently the specification and associated documentation are on a private Splunk repo, but I'd be happy to set up a meeting to discuss the spec and its findings.
> >>>
> >>> Best
> >>> Jack
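To make the arrival-order dependence Jack describes concrete, here is an illustrative sketch (invented names, not the actual BookKeeper recovery-read code) of the response accounting for a single recovery read of one entry with WQ 3 and AQ 2: one positive response is enough to succeed, but (WQ-AQ)+1 explicit negatives arriving first mark the read as failed.

    // Illustrative only: models the accounting described above, not BookKeeper internals.
    public class RecoveryReadOutcome {

        enum Response { OK, NO_SUCH_ENTRY }

        static String outcome(int writeQuorum, int ackQuorum, Response... inArrivalOrder) {
            int negatives = 0;
            for (Response r : inArrivalOrder) {
                if (r == Response.OK) {
                    return "SUCCESS";                                  // one positive read is enough
                }
                if (++negatives >= (writeQuorum - ackQuorum) + 1) {
                    return "FAILED";                                   // enough explicit negatives arrived first
                }
            }
            return "PENDING";
        }

        public static void main(String[] args) {
            // Same three responses (b1 -> success, b2/b3 -> NoSuchLedger), two arrival orders, WQ=3, AQ=2:
            System.out.println(outcome(3, 2, Response.OK, Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY)); // SUCCESS
            System.out.println(outcome(3, 2, Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY, Response.OK)); // FAILED
        }
    }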
> >>>
> >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> >>>
> >>>> [ External sender. Exercise caution. ]
> >>>>
> >>>> Jonathan,
> >>>>
> >>>> On Thu, Jan 14, 2021 at 20:57 Jonathan Ellis <jbel...@apache.org> wrote:
> >>>>
> >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that, once confirmed, an entry is not replayed to another bookie. This leaves a "hole", as the entry is now replicated to only 2 bookies; however, the new data integrity check that Ivan worked on, when run periodically, will be able to repair that hole.
> >>>>>
> >>>>> Can I read from the bookie with a hole in the meantime, and silently miss data that it doesn't know about?
> >>>>
> >>>> No, you cannot miss data: if the client is not able to find a bookie that is able to answer with the entry, it receives an error.
> >>>>
> >>>> The ledger has a known tail (the LastAddConfirmed entry) and this value is stored in the ledger metadata once the ledger is "closed".
> >>>>
> >>>> While the ledger is still open, that is, while the writer is writing to it, the reader is allowed to read only up to the LastAddConfirmed entry. This LAC value is returned to the reader using a piggyback mechanism, without reading from metadata. The reader cannot read beyond the latest position that has been confirmed to the writer by AQ bookies.
> >>>>
> >>>> We have a third case, the 'recovery read'. A reader starts a "recovery read" when it wants to recover a ledger that has been abandoned by a dead writer, or when it is a new leader (Pulsar Bundle Owner) and wants to fence out the old leader. In this case the reader merges the current status of the ledger on ZK with the result of a scan of the whole ledger. Basically it reads the ledger from the beginning up to the tail, for as long as it is able to "read" entries; this way it sets the 'fenced' flag on the ledger on every bookie, and it is also able to detect the actual tail of the ledger (because the writer died and was not able to flush metadata to ZK).
> >>>>
> >>>> The recovery read fails if it is not possible to read every entry from at least AQ bookies (that is, it allows WQ-AQ read failures), and it does not attempt to "repair" (truncate) the ledger if it does not find enough bookies.
> >>>>
> >>>> I hope that helps
> >>>> Enrico
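A rough sketch of the two read paths Enrico describes, using the public BookKeeper client API (the ledger id, connect string, and password are placeholders): openLedgerNoRecovery() does not fence the writer and only reads up to the piggybacked LAC, while openLedger() performs the recovery read, fencing the ledger on the bookies and sealing its tail.

    import java.util.Enumeration;
    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerEntry;
    import org.apache.bookkeeper.client.LedgerHandle;

    public class TailingReader {
        public static void main(String[] args) throws Exception {
            BookKeeper bk = new BookKeeper("localhost:2181");   // placeholder connect string
            long ledgerId = Long.parseLong(args[0]);             // ledger written by some other client
            byte[] passwd = "passwd".getBytes();                 // placeholder password

            // Open without recovery: does not fence the writer; only entries up to the
            // LastAddConfirmed (LAC) advertised by the bookies are readable.
            LedgerHandle lh = bk.openLedgerNoRecovery(ledgerId, DigestType.CRC32, passwd);
            long lac = lh.readLastConfirmed();
            if (lac >= 0) {
                Enumeration<LedgerEntry> entries = lh.readEntries(0, lac);
                while (entries.hasMoreElements()) {
                    LedgerEntry e = entries.nextElement();
                    System.out.println("entry " + e.getEntryId() + ": " + new String(e.getEntry()));
                }
            }
            lh.close();

            // By contrast, openLedger() (without "NoRecovery") runs the recovery read:
            // it fences the ledger on the bookies and seals the tail.
            LedgerHandle recovered = bk.openLedger(ledgerId, DigestType.CRC32, passwd);
            System.out.println("sealed ledger, last entry = " + recovered.getLastAddConfirmed());
            recovered.close();
            bk.close();
        }
    }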