> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ.

This is an example of a scenario corresponding to what we suspect is a bug
introduced earlier. Enrico is arguing that this is not the intended
behavior, and at this point, I agree.
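
To make the scenario concrete, here is a minimal sketch of the replay rule
as I understand it (my own simplification, with made-up names, not the
actual client code):

    import java.util.List;

    class EnsembleChangeSketch {
        record PendingAdd(long entryId) {}

        // Hypothetical replay rule: only pending adds strictly above the
        // LAC are replayed to the replacement bookie on ensemble change.
        static List<PendingAdd> toReplay(List<PendingAdd> pending, long lac) {
            return pending.stream().filter(op -> op.entryId() > lac).toList();
        }

        public static void main(String[] args) {
            // WQ 3, AQ 2: e100 acked by two bookies, LAC = 100, then the
            // 3rd bookie times out and an ensemble change kicks in.
            List<PendingAdd> pending =
                    List.of(new PendingAdd(100), new PendingAdd(101));
            // Prints [PendingAdd[entryId=101]]: e100 is not replayed and
            // is left with only AQ copies until something repairs it.
            System.out.println(toReplay(pending, 100));
        }
    }

If that matches the actual code path, any entry that reached AQ but not WQ
at the time of the ensemble change stays at AQ copies until re-replication
or the periodic integrity check repairs it.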

> This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.

By the time a successful callback is received, the client might only have
replicated AQ ways, so at that point the guarantee can only be the ability to
tolerate AQ - 1 crashes. The ledger configuration states that the application
wants WQ copies of each entry, though. I'd expect a ledger to have WQ
copies of each entry up to the final entry number when it is closed. Do you see
it differently?
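
Just to spell out the arithmetic I'm using (plain quorum counting, nothing
BookKeeper-specific):

    class QuorumMathSketch {
        public static void main(String[] args) {
            int wq = 3, aq = 2;
            // With only AQ copies (the state right after the add callback),
            // the entry survives at most AQ - 1 bookie crashes.
            System.out.println("crashes tolerated at ack time:  " + (aq - 1)); // 1
            // With WQ copies, which is what I'd expect by ledger close,
            // it survives WQ - 1 crashes.
            System.out.println("crashes tolerated at WQ copies: " + (wq - 1)); // 2
        }
    }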

>  I'd be happy to set up a meeting to discuss the spec and its findings. 


That'd be great, I'm interested.
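
As an aside on the recovery read behaviour described further down the
thread: the order-dependence is easy to see if the responses are counted the
way Jack describes (a sketch of my reading, not the actual read logic):

    class RecoveryReadSketch {
        // Process responses in arrival order: one success completes the
        // read, while (WQ - AQ) + 1 explicit negatives fail it immediately.
        static String outcome(String[] arrivals, int wq, int aq) {
            int negatives = 0;
            for (String r : arrivals) {
                if (r.equals("success")) return "success";
                if (++negatives >= (wq - aq) + 1) return "failed";
            }
            return "pending";
        }

        public static void main(String[] args) {
            // b1 -> success, b2/b3 -> NoSuchLedger, WQ 3, AQ 2.
            // If b1 answers last, the read has already failed:
            System.out.println(outcome(
                new String[] {"NoSuchLedger", "NoSuchLedger", "success"}, 3, 2));
            // If b1 answers first, the read succeeds:
            System.out.println(outcome(
                new String[] {"success", "NoSuchLedger", "NoSuchLedger"}, 3, 2));
        }
    }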

-Flavio

> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID> 
> wrote:
> 
>> No, you cannot miss data: if the client is not able to find a bookie that
>> can answer with the entry, it receives an error.
> 
> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ. This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.
> 
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies (that is, it allows WQ-AQ read failures),
>> and it does not risk "repairing" (truncating) the ledger if it does not
>> find enough bookies.
> 
> This is not quite accurate. A single successful read is enough. However,
> there is an odd behaviour: as soon as (WQ-AQ)+1 reads fail with an
> explicit NoSuchEntry/Ledger response, the read is considered failed and the
> ledger recovery process ends there. This means that given the responses
> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> considered successful is non-deterministic. If the response from b1 is
> received last, then the read has already been considered failed; otherwise
> the read succeeds.
> 
> I have come to the above conclusions through the reverse engineering
> process of creating the TLA+ specification. I have yet to reproduce the
> AQ rep factor behaviour via tests, but I have verified the conclusion
> about ledger recovery reads through testing.
> 
> Note that I have found two defects in the BookKeeper protocol, most
> notably data loss caused by the fact that fencing does not prevent further
> successful adds. Currently the specification and associated documentation
> are on a private Splunk repo, but I'd be happy to set up a meeting to
> discuss the spec and its findings.
> 
> Best
> Jack
> 
> 
> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
> wrote:
> 
>> Jonathan,
>> 
>> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jbel...@apache.org>
>> wrote:
>> 
>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>> Hi,
>>>> 
>>>> I've recently modelled the BookKeeper protocol in TLA+ and can confirm
>>>> that, once confirmed, an entry is not replayed to another bookie. This
>>>> leaves a "hole", as the entry is now replicated to only 2 bookies;
>>>> however, the new data integrity check that Ivan worked on will, when
>>>> run periodically, be able to repair that hole.
>>> 
>>> Can I read from the bookie with a hole in the meantime, and silently miss
>>> data that it doesn't know about?
>>> 
>> 
>> No, you cannot miss data: if the client is not able to find a bookie that
>> can answer with the entry, it receives an error.
>> 
>> The ledger has a known tail (the LastAddConfirmed entry) and this value is
>> stored in the ledger metadata once the ledger is "closed".
>> 
>> When the ledger is still open, that is, while the writer is writing to it,
>> the reader is allowed to read only up to the LastAddConfirmed entry.
>> This LAC value is returned to the reader using a piggyback mechanism,
>> without reading from metadata.
>> The reader cannot read beyond the latest position that has been confirmed
>> to the writer by AQ bookies.
>> 
>> We have a third case, the "recovery read".
>> A reader starts a "recovery read" when it wants to recover a ledger that
>> has been abandoned by a dead writer, or when it is a new leader (e.g. a
>> Pulsar Bundle Owner) that wants to fence out the old leader.
>> In this case the reader merges the current status of the ledger on ZK with
>> the result of a scan of the whole ledger.
>> Basically it reads the ledger from the beginning up to the tail, for as
>> long as it is able to "read" entries; this way it sets the "fenced" flag on
>> the ledger on every bookie, and it is also able to detect the actual tail
>> of the ledger (because the writer died and was not able to flush metadata
>> to ZK).
>> 
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies (that is, it allows WQ-AQ read failures),
>> and it does not risk "repairing" (truncating) the ledger if it does not
>> find enough bookies.
>> 
>> I hope that helps
>> Enrico
>> 
