Hi Flavio,

>> This is an example of a scenario corresponding to what we suspect is a bug
>> introduced earlier, but Enrico is arguing that this is not the intended
>> behavior, and at this point, I agree.
>> By the time a successful callback is received, the client might only have
>> replicated AQ ways, so the guarantee at that point can only be the ability
>> to tolerate AQ - 1 crashes. The ledger configuration states that the
>> application wants to have WQ copies of each entry, though. I'd expect a
>> ledger to have WQ copies of each entry up to the final entry number when
>> it is closed. Do you see it differently?

I also agree and was pretty surprised when I discovered the behaviour. It is
not something that users expect, and I think we need to correct it. So I'm
with you.

What's the best way to share the TLA+ findings?

Jack

On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:

> [ External sender. Exercise caution. ]
>
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> > the 3rd bookie times out. Ensemble change is executed and all pending
> > adds that are above the LAC of 100 are replayed to another bookie,
> > meaning that the entry e100 is not replayed to another bookie, causing
> > this entry to meet the rep factor of only AQ.
>
> This is an example of a scenario corresponding to what we suspect is a bug
> introduced earlier, but Enrico is arguing that this is not the intended
> behavior, and at this point, I agree.
>
> > This is alluded to in the docs as they state that AQ is also the minimum
> > guaranteed replication factor.
>
> By the time a successful callback is received, the client might only have
> replicated AQ ways, so the guarantee at that point can only be the ability
> to tolerate AQ - 1 crashes. The ledger configuration states that the
> application wants to have WQ copies of each entry, though. I'd expect a
> ledger to have WQ copies of each entry up to the final entry number when
> it is closed. Do you see it differently?
>
> > I'd be happy to set up a meeting to discuss the spec and its findings.
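[Editor's note: to make the disputed scenario concrete, here is a minimal Python sketch. This is not BookKeeper client code; all names (`ensemble_change`, the bookie labels) are hypothetical. It models only the replay rule under discussion: on ensemble change, pending adds strictly above the LAC are replayed to the replacement bookie, so an entry acked at exactly AQ copies is left with AQ copies instead of WQ.]

```python
# Sketch of the suspected bug, with WQ=3, AQ=2 (hypothetical model,
# not BookKeeper code). e100 was acked once b1 and b2 confirmed, so
# LAC = 100; the write of e100 to b3 is still outstanding when b3
# times out. e101 is a pending add above the LAC.
WQ, AQ = 3, 2

def ensemble_change(copies, lac, failed_bookie, new_bookie):
    """Replace failed_bookie; replay only entries strictly above the LAC."""
    for entry, bookies in copies.items():
        bookies.discard(failed_bookie)
        if entry > lac:          # entries <= LAC are NOT replayed
            bookies.add(new_bookie)

copies = {100: {"b1", "b2"},     # acked at AQ, LAC advanced to 100
          101: {"b1"}}           # still pending, above the LAC
ensemble_change(copies, lac=100, failed_bookie="b3", new_bookie="b4")

print(sorted(copies[100]))  # ['b1', 'b2'] -> stuck at AQ copies, never reaches WQ
print(sorted(copies[101]))  # ['b1', 'b4'] -> replayed, still on track to WQ
```

The contrast between e100 and e101 is the point of the thread: only entries above the LAC benefit from the ensemble change, so the "hole" sits exactly at the confirmed tail.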
>
> That'd be great, I'm interested.
>
> -Flavio
>
>
> > On 15 Jan 2021, at 15:30, Jack Vanlightly
> > <jvanligh...@splunk.com.INVALID> wrote:
> >
> >> No you cannot miss data, if the client is not able to find a bookie
> >> that is able to answer with the entry it receives an error.
> >
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> > the 3rd bookie times out. Ensemble change is executed and all pending
> > adds that are above the LAC of 100 are replayed to another bookie,
> > meaning that the entry e100 is not replayed to another bookie, causing
> > this entry to meet the rep factor of only AQ. This is alluded to in the
> > docs as they state that AQ is also the minimum guaranteed replication
> > factor.
> >
> >> The recovery read fails if it is not possible to read every entry from
> >> at least AQ bookies (that is, it allows WQ-AQ read failures),
> >> and it does not attempt to "repair" (truncate) the ledger if it does
> >> not find enough bookies.
> >
> > This is not quite accurate. A single successful read is enough. However,
> > there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> > an explicit NoSuchEntry/Ledger, the read is considered failed and the
> > ledger recovery process ends there. This means that given the responses
> > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > considered successful is non-deterministic. If the response from b1 is
> > received last, then the read is already considered failed; otherwise the
> > read succeeds.
> >
> > I have come to the above conclusions through my reverse engineering
> > process for creating the TLA+ specification. I still need to reproduce
> > the AQ rep factor behaviour via some tests, but have verified via tests
> > the conclusion about ledger recovery reads.
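[Editor's note: the race Jack describes can be shown in a few lines of Python. This is a hypothetical model, not BookKeeper code: it encodes only the decision rule stated above — one successful read is enough, but the read is declared failed as soon as (WQ-AQ)+1 bookies answer with an explicit NoSuchEntry/NoSuchLedger — and shows that the same three responses give different outcomes depending on arrival order.]

```python
# Sketch of the recovery-read decision rule (hypothetical model).
# With WQ=3, AQ=2, two explicit negative responses fail the read.
WQ, AQ = 3, 2
FAIL_THRESHOLD = (WQ - AQ) + 1   # = 2 explicit NoSuchEntry/Ledger responses

def recovery_read_outcome(responses_in_arrival_order):
    negatives = 0
    for r in responses_in_arrival_order:
        if r == "success":
            return "read ok"     # a single successful read is enough
        negatives += 1           # explicit NoSuchEntry/NoSuchLedger
        if negatives >= FAIL_THRESHOLD:
            return "read failed" # ledger recovery ends here
    return "read failed"

# Same three responses (b1->success, b2/b3->NoSuchLedger),
# two different arrival orders:
print(recovery_read_outcome(["success", "no_such", "no_such"]))  # read ok
print(recovery_read_outcome(["no_such", "no_such", "success"]))  # read failed
```

If b1's success arrives last, the threshold has already been crossed and recovery stops, which is the non-determinism Jack points out.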
> >
> > Note that I have found two defects in the BookKeeper protocol, most
> > notably data loss due to the fact that fencing does not prevent further
> > successful adds. Currently the specification and associated
> > documentation are on a private Splunk repo, but I'd be happy to set up
> > a meeting to discuss the spec and its findings.
> >
> > Best
> > Jack
> >
> >
> > On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
> > wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >> Jonathan,
> >>
> >> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis
> >> <jbel...@apache.org> ha scritto:
> >>
> >>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> >>>> Hi,
> >>>>
> >>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> >>>> confirm that once confirmed, an entry is not replayed to another
> >>>> bookie. This leaves a "hole" as the entry is now replicated only to
> >>>> 2 bookies; however, the new data integrity check that Ivan worked
> >>>> on, when run periodically, will be able to repair that hole.
> >>>
> >>> Can I read from the bookie with a hole in the meantime, and silently
> >>> miss data that it doesn't know about?
> >>
> >> No you cannot miss data: if the client is not able to find a bookie
> >> that is able to answer with the entry, it receives an error.
> >>
> >> The ledger has a known tail (LastAddConfirmed entry) and this value is
> >> stored in ledger metadata once the ledger is "closed".
> >>
> >> When the ledger is still open, that is, while the writer is writing to
> >> it, the reader is allowed to read only up to the LastAddConfirmed
> >> entry. This LAC value is returned to the reader using a piggyback
> >> mechanism, without reading from metadata. The reader cannot read
> >> beyond the latest position that has been confirmed to the writer by AQ
> >> bookies.
> >>
> >> We have a third case, the 'recovery read'.
> >> A reader starts a "recovery read" when you want to recover a ledger
> >> that has been abandoned by a dead writer, or when you are a new leader
> >> (Pulsar Bundle Owner) and you want to fence out the old leader.
> >> In this case the reader merges the current status of the ledger on ZK
> >> with the result of a scan of the whole ledger.
> >> Basically it reads the ledger from the beginning up to the tail, for
> >> as long as it is able to "read" entries. This way it sets the 'fenced'
> >> flag on the ledger on every bookie, and it is also able to detect the
> >> actual tail of the ledger (because the writer died and was not able to
> >> flush metadata to ZK).
> >>
> >> The recovery read fails if it is not possible to read every entry from
> >> at least AQ bookies (that is, it allows WQ-AQ read failures),
> >> and it does not attempt to "repair" (truncate) the ledger if it does
> >> not find enough bookies.
> >>
> >> I hope that helps
> >> Enrico
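[Editor's note: the recovery flow Enrico outlines — fence every bookie, then scan forward from the start until no entry can be read, taking the last readable entry as the tail to store in metadata — can be sketched as below. This is a deliberately simplified hypothetical model, not BookKeeper code: it treats "any bookie holds the entry" as a successful read (per Jack's point that a single successful read is enough) and ignores the quorum failure threshold shown earlier.]

```python
# Sketch of ledger recovery (hypothetical model, not BookKeeper code).
# Bookie contents are made up for illustration.
def recover_ledger(bookies, ledger_id):
    for b in bookies.values():   # step 1: set the fenced flag everywhere,
        b["fenced"] = True       # so the old writer's adds are rejected

    tail, entry = -1, 0
    while True:                  # step 2: scan forward until no bookie
        if not any(entry in b["entries"][ledger_id]
                   for b in bookies.values()):
            break                # actual tail found: the dead writer never
        tail = entry             # flushed it to ZK metadata
        entry += 1
    return tail                  # step 3: stored in ZK when the ledger closes

bookies = {
    "b1": {"fenced": False, "entries": {7: {0, 1, 2}}},
    "b2": {"fenced": False, "entries": {7: {0, 1}}},
    "b3": {"fenced": False, "entries": {7: {0, 1, 2}}},
}
print(recover_ledger(bookies, ledger_id=7))  # 2 -> LastAddConfirmed = e2
```

Fencing first, then scanning, is what makes the discovered tail safe to publish: once every bookie is fenced, the old writer can no longer extend the ledger past the tail the recovering reader finds.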