Thanks for the feedback, Sijie:

> On 18 Jan 2021, at 19:53, Sijie Guo <guosi...@gmail.com> wrote:
> 
>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive.
>> 
> The current implementation doesn't retry replacing a bookie if an entry is
> already acknowledged (receiving AQ responses). It relies on inspection to
> repair the hole.

Ok, that's good information, let me think a bit more about it. I'd like to
understand why we can't keep a pending add op reference until the entry is
fully replicated, which, as I understand it, would enable bookie replacement
for entries with fewer than WQ acks.

> 
> So the memory pressure is not coming from retrying. It is simply that the
> bookkeeper client references the sendBuffers until it receives any
> responses from the slow bookie. The bookkeeper client allows enqueuing
> addEntry operations because the operations meet the AQ requirements.

I see, the entries queuing in the bookie client are inducing memory pressure in 
the presence of a slow bookie.
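For a rough sense of scale (the numbers below are my own assumptions, not from this thread): since send buffers are held until the slow bookie responds or the add times out, the memory referenced on behalf of one slow bookie is on the order of throughput times the timeout period.

```python
# Back-of-the-envelope arithmetic for memory held by a slow bookie:
# roughly throughput * timeout, since buffers are referenced until a
# response arrives or the add times out.

def referenced_bytes(throughput_bytes_per_s, timeout_s):
    return throughput_bytes_per_s * timeout_s

# e.g. 100 MB/s of adds with a 30 s add timeout -> ~3 GB held
mb = 1024 * 1024
held = referenced_bytes(100 * mb, 30)
```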

> Pulsar
> does add a `maxPendingPublishdRequestsPerConnection` mechanism to throttle
> the add operations. But this won't work, as bookkeeper will notify the
> callbacks once the operations meet the AQ requirements, and there is a huge
> amount of memory (throughput * timeout period) referenced by a slow bookie.
> Hence we have to add a memory-based throttling mechanism, as Matteo
> suggested.

Thanks for pointing to Matteo's message, I reviewed it again. He actually
makes two observations:

1- It is difficult to throttle from outside the bookkeeper client because the 
application using it does not have visibility into what has been fully 
replicated. A back pressure mechanism internal to the bookie (and possibly 
configurable) might be necessary.
2- There is some Pulsar work (PIP-74) that could be leveraged to throttle from 
outside the bookkeeper client based on memory limits.
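A memory-based throttle along those lines could look roughly like the following (a hedged sketch; the class and method names are my own assumptions, not the actual PIP-74 API): writers reserve bytes before sending and release them only when the operation is fully done, blocking when the limit is reached.

```python
# Sketch of a memory-based back-pressure primitive (assumed design, not
# the real PIP-74 implementation): callers reserve bytes up front and
# block when the configured limit would be exceeded.

import threading

class MemoryLimiter:
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0
        self.cond = threading.Condition()

    def reserve(self, nbytes):
        """Block the producer until nbytes fit under the limit."""
        with self.cond:
            while self.used + nbytes > self.limit:
                self.cond.wait()
            self.used += nbytes

    def try_reserve(self, nbytes):
        """Non-blocking variant: reserve if possible, else refuse."""
        with self.cond:
            if self.used + nbytes > self.limit:
                return False
            self.used += nbytes
            return True

    def release(self, nbytes):
        """Called when an op is fully done; wakes blocked producers."""
        with self.cond:
            self.used -= nbytes
            self.cond.notify_all()
```

The key design point either way is where `release` is called: releasing at AQ acks reproduces today's problem, while releasing only at full replication bounds the memory a slow bookie can pin.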

> 
> If we want to add the retry logic to replace a bookie, this will add more
> memory pressure. But it can still be solved by a memory-based
> back-pressure mechanism.
> 

I don't know much about (2), but I'll have a look to form an opinion. At a high 
level, it seems reasonable. We might still want to consider doing (1) to 
simplify the job of the application.

-Flavio  

> Thanks,
> Sijie
> 
> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira <f...@apache.org> wrote:
> 
>> In the scenario that WQ > AQ, a client acknowledges the add of an entry e
>> to the application once it receives AQ bookie acks. Say now that the client
>> is not able to write a copy of e to at least one bookie b, it could be
>> because:
>> 
>> 1- The client crashed before it is able to do it
>> 2- Bookie b crashed
>> 3- The client gave up trying
>> 
>> In case (1), the client crashed and the ledger will be recovered by some
>> reader. For all entries that have been acknowledged, including e, I'd
>> expect them to be readable from the closed ledger. Each one of these
>> entries that haven't been written to bookie b should be written there as
>> part of the recovery process.
>> 
>> In case (2), the client is not able to write entry e to the crashed bookie
>> b, so it will replace the bookie and write e to the new bookie. I see in
>> this discussion that there is an option to disable bookie replacement, I'm
>> ignoring that for this discussion.
>> 
>> In case (3), the client discards the entry after successfully adding it
>> to AQ bookies, and gives up at some point because it can't reach the
>> bookie. The client maybe replaces bookie b, or bookie b eventually comes
>> back and the client proceeds with the adds. In either case, there is a
>> hole that can only be fixed by inspecting the ledger.
>> 
>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive. But, that certainly leads to the memory pressure problem that
>> was also mentioned in the thread, for which one potential direction also
>> mentioned is to apply back pressure.
>> 
>> Thanks,
>> -Flavio
>> 
>>> On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanligh...@splunk.com.INVALID>
>> wrote:
>>> 
>>>> Did you guys see any issues with the ledger auditor?
>>> 
>>>> The active writer can't guarantee that it writes entries to WQ because
>>>> it can crash while retrying adds to the (WQ - AQ) bookies.
>>> 
>>> The need to repair AQ replicated entries is clear and the auditor is one
>>> such strategy. Ivan has also worked on a self-healing bookie strategy
>> where
>>> each bookie itself is able to detect these holes and is able to obtain
>> the
>>> missing entries itself. The detection of these holes using this strategy
>> is
>>> more efficient as it only requires network calls for the ledger metadata
>>> scanning (to zk) and the missing entry reads (to other bookies). The
>>> auditor, as I understand it, reads all entries of all ledgers from all
>>> bookies (of an entry's ensemble), meaning these entries cross the network.
>>> The auditor approach is therefore likely to be run less frequently due to
>>> the network cost.
>>> 
>>> I do also wonder if the writer, on performing an ensemble change, should
>>> replay "AQ but not WQ" entries; this would leave only writer failures as
>>> a cause of these AQ-replicated entries.
>>> 
>>>> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>> either including it or excluding it in the sealed ledger is correct
>>>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>>>> the entries in the sealed ledger can always be read and can be read
>>>> consistently.
>>> 
>>>> I am not sure it is a problem unless I misunderstand it.
>>> 
>>> It is true that it doesn't violate any safety property, but it is a
>> strange
>>> check to me. It looks like an implementation artefact rather than an
>>> explicit protocol design choice. But not a huge deal.
>>> 
>>> Jack
>>> 
>>> 
>>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
>>> 
>>>> [ External sender. Exercise caution. ]
>>>> 
>>>> Sorry for being late in this thread.
>>>> 
>>>> If I understand this correctly, the main topic is about the "hole"
>>>> when WQ > AQ.
>>>> 
>>>>> This leaves a "hole" as the entry is now replicated only to 2 bookies,
>>>> 
>>>> We do have one hole when ensemble change is enabled and WQ > AQ. That
>> was a
>>>> known behavior. But the hole will be repaired by the ledger auditor as
>> JV
>>>> said. Did you guys see any issues with the ledger auditor?
>>>> 
>>>>> I'd think that we guarantee that an entry that is acknowledged is
>>>> eventually written WQ ways and that it is observable by readers when the
>>>> ledger is closed.
>>>> 
>>>> To Flavio's question, we don't guarantee (and can't guarantee) that the
>>>> active writer will eventually write the entries to WQ. For the active
>>>> writers, we only guarantee entries are written to AQ. The ledger
>> auditor is
>>>> to ensure all the entries are written to WQ.
>>>> 
>>>> The active writer can't guarantee that it writes entries to WQ because
>>>> it can crash while retrying adds to the (WQ - AQ) bookies.
>>>> 
>>>>> A single successful read is enough. However
>>>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
>>>> explicit NoSuchEntry/Ledger, the read is considered failed and the
>> ledger
>>>> recovery process ends there. This means that given the responses
>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>>> considered successful is non-deterministic.
>>>> 
>>>> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
>>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>>> either including it or excluding it in the sealed ledger is correct
>>>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>>>> the entries in the sealed ledger can always be read and can be read
>>>> consistently.
>>>> 
>>>> I am not sure it is a problem unless I misunderstand it.
>>>> 
>>>> - Sijie
>>>> 
>>>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
>>>> <jvanligh...@splunk.com.invalid> wrote:
>>>> 
>>>>> Let's set up a call and create any issues from that. I have already
>>>> created
>>>>> the patches in our (Splunk) fork and it might be easiest or not to wait
>>>>> until we re-sync up with the open source repo. We can include the fixes
>>>> in
>>>>> the discussion.
>>>>> 
>>>>> Jack
>>>>> 
>>>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org>
>> wrote:
>>>>> 
>>>>>> 
>>>>>> Hi Jack,
>>>>>> 
>>>>>> Thanks for getting back.
>>>>>> 
>>>>>>> What's the best way to share the TLA+ findings?
>>>>>> 
>>>>>> Would you be able to share the spec? I'm ok with reading TLA+.
>>>>>> 
>>>>>> As for sharing your specific findings, I'd suggest one of the
>>>> following:
>>>>>> 
>>>>>> 1- Create an email thread describing the scenarios that trigger a bug.
>>>>>> 2- Create issues, one for each problem you found.
>>>>>> 3- Create a discussion on the project Slack, perhaps a channel
>> specific
>>>>>> for it.
>>>>>> 4- Set up a zoom call to present and discuss with the community.
>>>>>> 
>>>>>> Option 2 is ideal from a community perspective, but we can also set up
>>>> a
>>>>>> call inviting everyone and create issues out of that discussion. We
>> can
>>>>> in
>>>>>> fact set up a call even if we create the issues ahead of time.
>>>>>> 
>>>>>> Does it make sense?
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com
>>>>> .INVALID>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi Flavio,
>>>>>>> 
>>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>> is
>>>>> a
>>>>>>> bug introduced earlier, but Enrico is arguing that this is not the
>>>>>> intended
>>>>>>> behavior, and at this point, I agree.
>>>>>>> 
>>>>>>>>> By the time a successful callback is received, the client might
>>>> only
>>>>>>> have replicated AQ ways, so the guarantee can only be at that point
>>>> of
>>>>>>> being able to tolerate AQ - 1 crashes. The ledger configuration
>>>> states
>>>>>> that
>>>>>>> the application wants to have WQ copies >> of each entry, though. I'd
>>>>>>> expect a ledger to have WQ copies of each entry up to the final entry
>>>>>>> number when it is closed. Do you see it differently?
>>>>>>> 
>>>>>>> I also agree and was pretty surprised when I discovered the
>>>> behaviour.
>>>>> It
>>>>>>> is not something that users expect and I think we need to correct it.
>>>>> So
>>>>>>> I'm with you.
>>>>>>> 
>>>>>>> What's the best way to share the TLA+ findings?
>>>>>>> 
>>>>>>> Jack
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org>
>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>> the
>>>>>>>>> confirm callback to the client is called and the LAC is set to
>>>>> 100.Now
>>>>>>>> the
>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
>>>>> adds
>>>>>>>> that
>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
>>>> that
>>>>>> the
>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry to
>>>>>> meet
>>>>>>>>> the rep factor of only AQ.
>>>>>>>> 
>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>> is a
>>>>>> bug
>>>>>>>> introduced earlier, but Enrico is arguing that this is not the
>>>>> intended
>>>>>>>> behavior, and at this point, I agree.
>>>>>>>> 
>>>>>>>>> This is alluded to in the docs as they state
>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>>> 
>>>>>>>> By the time a successful callback is received, the client might only
>>>>>> have
>>>>>>>> replicated AQ ways, so the guarantee can only be at that point of
>>>>> being
>>>>>>>> able to tolerate AQ - 1 crashes. The ledger configuration states
>>>> that
>>>>>> the
>>>>>>>> application wants to have WQ copies of each entry, though. I'd
>>>> expect
>>>>> a
>>>>>>>> ledger to have WQ copies of each entry up to the final entry number
>>>>>> when it
>>>>>>>> is closed. Do you see it differently?
>>>>>>>> 
>>>>>>>>> I'd be happy to set up a meeting to discuss the spec and its
>>>>> findings.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> That'd be great, I'm interested.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com
>>>>>> .INVALID>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> No, you cannot miss data: if the client is not able to find a
>>>>>>>>>> bookie that is able to answer with the entry, it receives an error.
>>>>>>>>> 
>>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>> the
>>>>>>>>> confirm callback to the client is called and the LAC is set to 100.
>>>>> Now
>>>>>>>> the
>>>>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
>>>>> adds
>>>>>>>> that
>>>>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
>>>> that
>>>>>> the
>>>>>>>>> entry e100 is not replayed to another bookie, causing this entry to
>>>>>> meet
>>>>>>>>> the rep factor of only AQ. This is alluded to in the docs as they
>>>>> state
>>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>>>> 
>>>>>>>>>> The recovery read fails if it is not possible to read every entry
>>>>>>>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>>> and it does not attempt to "repair" (truncate) the ledger if it
>>>>>>>>>> does not find enough bookies.
>>>>>>>>> 
>>>>>>>>> This is not quite accurate. A single successful read is enough.
>>>>> However
>>>>>>>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
>>>>> with
>>>>>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and the
>>>>>> ledger
>>>>>>>>> recovery process ends there. This means that given the responses
>>>>>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
>>>> is
>>>>>>>>> considered successful is non-deterministic. If the response from b1
>>>>> is
>>>>>>>>> received last, then the read is already considered failed,
>>>> otherwise
>>>>>> the
>>>>>>>>> read succeeds.
>>>>>>>>> 
>>>>>>>>> I have come to the above conclusions through my reverse engineering
>>>>>>>> process
>>>>>>>>> for creating the TLA+ specification. I still have pending to
>>>>>>>>> reproduce the AQ rep factor behaviour via some tests, but have
>>>>> verified
>>>>>>>> via
>>>>>>>>> tests the conclusion about ledger recovery reads.
>>>>>>>>> 
>>>>>>>>> Note that I have found two defects with the BookKeeper protocol,
>>>> most
>>>>>>>>> notably data loss due to that fencing does not prevent further
>>>>>> successful
>>>>>>>>> adds. Currently the specification and associated documentation is
>>>> on
>>>>> a
>>>>>>>>> private Splunk repo, but I'd be happy to set up a meeting to
>>>> discuss
>>>>>> the
>>>>>>>>> spec and its findings.
>>>>>>>>> 
>>>>>>>>> Best
>>>>>>>>> Jack
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
>>>>> eolive...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Jonathan,
>>>>>>>>>> 
>>>>>>>>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
>>>>>>>>>> jbel...@apache.org>
>>>>>>>>>> ha scritto:
>>>>>>>>>> 
>>>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
>>>>>>>>>>>> confirm that once confirmed, an entry is not replayed to another
>>>>>>>>>>>> bookie.
>>>>>> This
>>>>>>>>>>>> leaves a "hole" as the entry is now replicated only to 2
>>>> bookies,
>>>>>>>>>>> however,
>>>>>>>>>>>> the new data integrity check that Ivan worked on, when run
>>>>>>>> periodically
>>>>>>>>>>>> will be able to repair that hole.
>>>>>>>>>>> 
>>>>>>>>>>> Can I read from the bookie with a hole in the meantime, and
>>>>> silently
>>>>>>>> miss
>>>>>>>>>>> data that it doesn't know about?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> No, you cannot miss data: if the client is not able to find a
>>>>>>>>>> bookie that is able to answer with the entry, it receives an error.
>>>>>>>>>> 
>>>>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
>>>> value
>>>>> is
>>>>>>>>>> stored on ledger metadata once the ledger is "closed".
>>>>>>>>>> 
>>>>>>>>>> When the ledger is still open, that is when the writer is writing
>>>> to
>>>>>> it,
>>>>>>>>>> the reader is allowed to read only up to the LastAddConfirmed
>>>>>>>>>> entry. This LAC value is returned to the reader using a piggyback
>>>>>>>>>> mechanism, without reading from metadata.
>>>>>>>>>> The reader cannot read beyond the latest position that has been
>>>>>>>> confirmed
>>>>>>>>>> to the writer by AQ bookies.
>>>>>>>>>> 
>>>>>>>>>> We have a third case, the 'recovery read'.
>>>>>>>>>> A reader starts a "recovery read" when you want to recover a
>>>> ledger
>>>>>> that
>>>>>>>>>> has been abandoned by a dead writer
>>>>>>>>>> or when you are a new leader (Pulsar Bundle Owner) or you want to
>>>>>> fence
>>>>>>>> out
>>>>>>>>>> the old leader.
>>>>>>>>>> In this case the reader merges the current status of the ledger on
>>>>> ZK
>>>>>>>> with
>>>>>>>>>> the result of a scan of the whole ledger.
>>>>>>>>>> Basically it reads the ledger from the beginning up to the tail,
>>>>>>>>>> for as long as it is able to "read" entries. This way it sets the
>>>>>>>>>> 'fenced' flag on the ledger on every bookie, and it is also able
>>>>>>>>>> to detect the actual tail of the ledger (because the writer died
>>>>>>>>>> and was not able to flush metadata to ZK).
>>>>>>>>>> 
>>>>>>>>>> The recovery read fails if it is not possible to read every entry
>>>>>>>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>>> and it does not attempt to "repair" (truncate) the ledger if it
>>>>>>>>>> does not find enough bookies.
>>>>>>>>>> 
>>>>>>>>>> I hope that helps
>>>>>>>>>> Enrico
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 
