Hi Jack,

Thanks for getting back.

> What's the best way to share the TLA+ findings?

Would you be able to share the spec? I'm ok with reading TLA+. 

As for sharing your specific findings, I'd suggest one of the following:

1- Create an email thread describing the scenarios that trigger each bug.
2- Create issues, one for each problem you found.
3- Create a discussion on the project Slack, perhaps in a dedicated channel.
4- Set up a Zoom call to present and discuss with the community.

Option 2 is ideal from a community perspective, but we can also set up a call 
inviting everyone and create issues out of that discussion. In fact, we can 
set up a call even if we create the issues ahead of time.

Does that make sense?

-Flavio

> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID> 
> wrote:
> 
> Hi Flavio,
> 
>> This is an example of a scenario corresponding to what we suspect is a
>> bug introduced earlier, but Enrico is arguing that this is not the
>> intended behavior, and at this point, I agree.
> 
>> By the time a successful callback is received, the client might only
>> have replicated AQ ways, so the guarantee can only be at that point of
>> being able to tolerate AQ - 1 crashes. The ledger configuration states
>> that the application wants to have WQ copies of each entry, though. I'd
>> expect a ledger to have WQ copies of each entry up to the final entry
>> number when it is closed. Do you see it differently?
> 
> I also agree and was pretty surprised when I discovered the behaviour. It
> is not something that users expect and I think we need to correct it. So
> I'm with you.
> 
> What's the best way to share the TLA+ findings?
> 
> Jack
> 
> 
> 
> 
> 
> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org> wrote:
> 
>> [ External sender. Exercise caution. ]
>> 
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
>>> confirm callback to the client is called and the LAC is set to 100. Now
>>> the 3rd bookie times out. Ensemble change is executed and all pending
>>> adds that are above the LAC of 100 are replayed to another bookie,
>>> meaning that the entry e100 is not replayed to another bookie, causing
>>> this entry to meet the rep factor of only AQ.
>> 
>> This is an example of a scenario corresponding to what we suspect is a bug
>> introduced earlier, but Enrico is arguing that this is not the intended
>> behavior, and at this point, I agree.
>> 
>>> This is alluded to in the docs as they state
>>> that AQ is also the minimum guaranteed replication factor.
>> 
>> By the time a successful callback is received, the client might only have
>> replicated AQ ways, so the guarantee can only be at that point of being
>> able to tolerate AQ - 1 crashes. The ledger configuration states that the
>> application wants to have WQ copies of each entry, though. I'd expect a
>> ledger to have WQ copies of each entry up to the final entry number when it
>> is closed. Do you see it differently?
>> 
>>> I'd be happy to set up a meeting to discuss the spec and its findings.
>> 
>> 
>> That'd be great, I'm interested.
>> 
>> -Flavio
>> 
>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID>
>>> wrote:
>>> 
>>>> No you cannot miss data, if the client is not able to find a bookie
>>>> that is able to answer with the entry it receives an error.
>>> 
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
>>> confirm callback to the client is called and the LAC is set to 100. Now
>>> the 3rd bookie times out. Ensemble change is executed and all pending
>>> adds that are above the LAC of 100 are replayed to another bookie,
>>> meaning that the entry e100 is not replayed to another bookie, causing
>>> this entry to meet the rep factor of only AQ. This is alluded to in the
>>> docs as they state that AQ is also the minimum guaranteed replication
>>> factor.
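
[Inline sketch, for clarity: a toy model of the replay rule described above, in plain Python with made-up names, not BookKeeper's actual client code.]

```python
# Toy model of the ensemble-change scenario above. WQ = write quorum,
# AQ = ack quorum. Names are hypothetical, not the real client API.

WQ, AQ = 3, 2

def replay_on_ensemble_change(pending_adds, lac):
    """Return the entry ids written to the replacement bookie.

    Only entries strictly above the LAC are replayed, which is exactly
    the behaviour under discussion: an entry that already reached AQ
    acks (and so advanced the LAC) is *not* re-replicated.
    """
    return [entry for entry in pending_adds if entry > lac]

# e100 reached AQ acks, the LAC advanced to 100, then the 3rd bookie
# timed out while e100..e102 were still pending on it.
lac = 100
pending = [100, 101, 102]

replayed = replay_on_ensemble_change(pending, lac)
print(replayed)          # [101, 102]
print(100 in replayed)   # False: e100 stays at AQ (2) copies, not WQ (3)
```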
>>> 
>>>> The recovery read fails if it is not possible to read every entry
>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>> and it does not hazard to "repair" (truncate) the ledger if it does
>>>> not find enough bookies.
>>> 
>>> This is not quite accurate. A single successful read is enough.
>>> However, there is an odd behaviour in that as soon as (WQ-AQ)+1 reads
>>> fail with an explicit NoSuchEntry/Ledger, the read is considered
>>> failed and the ledger recovery process ends there. This means that,
>>> given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger,
>>> whether the read is considered successful is non-deterministic. If the
>>> response from b1 is received last, then the read is already considered
>>> failed; otherwise the read succeeds.
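
[Inline sketch: the arrival-order dependence Jack describes, as a toy model with hypothetical names, not the real recovery code.]

```python
# Toy model of the recovery-read rule: the first success completes the
# read, but (WQ-AQ)+1 explicit NoSuchEntry/Ledger responses fail it.
# Whichever threshold is crossed first in arrival order wins, which is
# what makes the outcome non-deterministic for the same response set.

WQ, AQ = 3, 2
NEEDED_NEGATIVES = (WQ - AQ) + 1  # here: 2 explicit negatives fail the read

def recovery_read_outcome(responses_in_arrival_order):
    negatives = 0
    for response in responses_in_arrival_order:
        if response == "success":
            return "success"
        if response in ("NoSuchEntry", "NoSuchLedger"):
            negatives += 1
            if negatives >= NEEDED_NEGATIVES:
                return "failed"
    return "pending"

# Same three responses, different arrival orders, different outcomes:
print(recovery_read_outcome(["success", "NoSuchLedger", "NoSuchLedger"]))
# -> success
print(recovery_read_outcome(["NoSuchLedger", "NoSuchLedger", "success"]))
# -> failed: the negative threshold is hit before b1's success arrives
```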
>>> 
>>> I have come to the above conclusions through my reverse engineering
>>> process for creating the TLA+ specification. I still need to reproduce
>>> the AQ rep factor behaviour via some tests, but have verified via
>>> tests the conclusion about ledger recovery reads.
>>> 
>>> Note that I have found two defects in the BookKeeper protocol, most
>>> notably data loss due to the fact that fencing does not prevent
>>> further successful adds. Currently the specification and associated
>>> documentation are on a private Splunk repo, but I'd be happy to set up
>>> a meeting to discuss the spec and its findings.
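
[Inline sketch: the fencing guarantee at stake, stated as a toy invariant with hypothetical names, not the actual bookie implementation.]

```python
# Toy model of the fencing invariant Jack's finding concerns: once a
# bookie has fenced a ledger, it must reject further adds to it.
# Otherwise a recovery client can decide the tail while the old writer
# keeps appending, i.e. acknowledged data can be lost. Hypothetical
# names, not BookKeeper code.

class Bookie:
    def __init__(self):
        self.fenced_ledgers = set()
        self.entries = {}  # ledger_id -> list of entry ids

    def fence(self, ledger_id):
        self.fenced_ledgers.add(ledger_id)

    def add_entry(self, ledger_id, entry_id):
        if ledger_id in self.fenced_ledgers:
            return "FencedException"  # intended behaviour: reject the add
        self.entries.setdefault(ledger_id, []).append(entry_id)
        return "OK"

b = Bookie()
print(b.add_entry(ledger_id=1, entry_id=0))  # OK
b.fence(ledger_id=1)
print(b.add_entry(ledger_id=1, entry_id=1))  # FencedException
```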
>>> 
>>> Best
>>> Jack
>>> 
>>> 
>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> Jonathan,
>>>> 
>>>> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis <jbel...@apache.org>
>>>> wrote:
>>>> 
>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
>>>>>> confirm that, once confirmed, an entry is not replayed to another
>>>>>> bookie. This leaves a "hole", as the entry is now replicated only
>>>>>> to 2 bookies; however, the new data integrity check that Ivan
>>>>>> worked on, when run periodically, will be able to repair that hole.
>>>>> 
>>>>> Can I read from the bookie with a hole in the meantime, and
>>>>> silently miss data that it doesn't know about?
>>>>> 
>>>> 
>>>> No, you cannot miss data: if the client is not able to find a bookie
>>>> that is able to answer with the entry, it receives an error.
>>>> 
>>>> The ledger has a known tail (LastAddConfirmed entry) and this value is
>>>> stored on ledger metadata once the ledger is "closed".
>>>> 
>>>> When the ledger is still open, that is, while the writer is writing
>>>> to it, the reader is allowed to read only up to the LastAddConfirmed
>>>> entry. This LAC value is returned to the reader using a piggyback
>>>> mechanism, without reading from metadata.
>>>> The reader cannot read beyond the latest position that has been
>>>> confirmed to the writer by AQ bookies.
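
[Inline sketch: the read bound Enrico describes, as a toy model, not the actual reader API.]

```python
# Toy model of the reader-visibility rule: while a ledger is open, a
# reader may only read up to the LastAddConfirmed (LAC) value it learned
# via piggyback; once the ledger is closed, the tail comes from ledger
# metadata. Hypothetical function, not the BookKeeper client API.

def can_read(entry_id, piggybacked_lac, ledger_closed, closed_tail=None):
    if ledger_closed:
        return entry_id <= closed_tail   # tail from ledger metadata (ZK)
    return entry_id <= piggybacked_lac   # open ledger: bounded by the LAC

# Writer has AQ acks up to entry 100; readers learned LAC=100 via piggyback.
print(can_read(100, piggybacked_lac=100, ledger_closed=False))  # True
print(can_read(101, piggybacked_lac=100, ledger_closed=False))  # False
```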
>>>> 
>>>> We have a third case, the 'recovery read'.
>>>> A reader starts a "recovery read" when it wants to recover a ledger
>>>> that has been abandoned by a dead writer, or when it is a new leader
>>>> (Pulsar Bundle Owner) and wants to fence out the old leader.
>>>> In this case the reader merges the current status of the ledger on ZK
>>>> with the result of a scan of the whole ledger.
>>>> Basically it reads the ledger from the beginning up to the tail, for
>>>> as long as it is able to "read" entries. This way it sets the
>>>> 'fenced' flag on the ledger on every bookie, and it is also able to
>>>> detect the actual tail of the ledger (because the writer died and was
>>>> not able to flush metadata to ZK).
>>>> 
>>>> The recovery read fails if it is not possible to read every entry
>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>> and it does not hazard to "repair" (truncate) the ledger if it does
>>>> not find enough bookies.
>>>> 
>>>> I hope that helps
>>>> Enrico
>>>> 
>> 
>> 
>> 
