Hi, Calvin,

Thanks for the KIP. A few comments below.

10. "The High Watermark forwarding still requires a quorum within the ISR."
Should it say that it requires the full ISR?

11. "remove the duplicate member in both ISR and ELR from ELR." Hmm, not
sure that I follow since the KIP says that ELR doesn't overlap with ISR.

12. T1: In this case, min ISR is 3. Why would HWM advance to 2 with only 2
members in ISR?

13. The KIP says "Note that, if maximal ISR > ISR, the message should be
replicated to the maximal ISR before covering the message under HWM. The
proposal does not change this behavior." and "Currently, we would advance
HWM because it replicated to 2 brokers (the ones in Maximal ISR), but in
the new protocol we wait until the controller updates ISR=[0,2] to avoid
advancing HWM beyond what ELR=[1] has." They seem a bit inconsistent since
one says no change and the other describes a change.

14. "If there are no ELR members. If the
unclean.recovery.strategy=balanced, the controller will do the unclean
recovery. Otherwise, unclean.recovery.strategy=Manual, the controller will
not attempt to elect a leader. Waiting for the user operations." What
happens with unclean.recovery.strategy=Proactive?

15. "In Balance mode, all the LastKnownELR members have replied." In
Proactive, we wait for all replicas within a fixed amount of time. Balance
should do the same since it's designed to preserve more data, right?

16. "The URM will query all the replicas including the fenced replicas."
Why include the fenced replicas? Could a fenced replica be elected as the
leader?

17. Once unclean.recovery.strategy is enabled, new metadata records could
be written to the metadata log. At that point, is the broker downgradable?
It would be useful to document that.

18. Since LastKnownELR can have more than 1 member, should it be
LastKnownELRs?

19. BrokerRegistration.BrokerEpoch: "The broker's assigned epoch or the
epoch before a clean shutdown." How do we tell whether the value is for the
current or the previous epoch? Does it matter?

20. DescribeTopicRequest: Who issues that request? Who can serve that
request? Is it only the controller or any broker?

21. DesiredLeaders: Does the ordering matter?

22. GetReplicaLogInfo only uses topicId while DescribeTopicRequest uses
both topicId and name. Should they be consistent?

23. --election-type: The description mentions unclean, but that option
doesn't exist. Also, could we describe what DESIGNATION means?

24. kafka-leader-election.sh has minimalReplicas, but ElectLeadersRequest
doesn't seem to have a corresponding field?

25. kafka.replication.paused_partitions_count: paused doesn't seem to match
the meaning of the metric. Should this be leaderless_partitions_count?

26. kafka.replication.unclean_recovery_partitions_count: When is it set?
Does it ever get unset?

27. "min.insync.replicas now applies to the replication of all kinds of
messages." Not sure that I follow. Could you explain a bit more?

28. "But later, the current leader will put the follower into the pending
ISR" : It would be useful to clarify this is after the network partitioning
is gone.

29. "last known leader" behavior. Our current behavior is to preserve the
last known ISR, right?

30. For all new requests, it would be useful to document the corresponding
ACL.

Jun

On Wed, Sep 6, 2023 at 11:21 AM Calvin Liu <ca...@confluent.io.invalid>
wrote:

> Hi Jack
> Thanks for the comment.
> I have updated the reassignment part. Now the reassignment can only be
> completed/canceled if the final ISR size is at least the min ISR.
> Thanks for your efforts on the TLA+ specification! It has been a great
> help to the KIP!
>
> On Wed, Sep 6, 2023 at 6:32 AM Jack Vanlightly <vanligh...@apache.org>
> wrote:
>
> > Hi Calvin,
> >
> > Regarding partition reassignment, I have two comments.
> >
> > I notice the KIP says "The AlterPartitionReassignments will not change
> the
> > ELR" however, when a reassignment completes (or reverts) any replicas
> > removed from the replica set would be removed from the ELR. Sounds
> obvious
> > but I figured we should be explicit about that.
> >
> > Reassignment should also respect min.insync.replicas because currently a
> > reassignment can complete as long as the ISR is not empty and all added
> > replicas are members. However, my TLA+ specification, which now includes
> > reassignment, finds single broker failures that can cause committed data
> > loss - despite the added protection of the ELR and min.insync.replicas=2.
> > These scenarios are limited to shrinking the size of the replica set. If
> we
> > modify the PartitionChangeBuilder to add the completion condition that
> the
> > target ISR >= min.insync.replicas, then that closes this last
> > single-broker-failure data loss case.
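> >
> > (For illustration, a minimal sketch of that extra completion check in
> > Java; the names targetIsr, addingReplicas and minInsyncReplicas are my
> > assumptions, not the actual PartitionChangeBuilder fields:)
> >
> >     boolean canCompleteReassignment(Set<Integer> targetIsr,
> >                                     Set<Integer> addingReplicas,
> >                                     int minInsyncReplicas) {
> >         // Existing condition: every added replica has joined the ISR.
> >         boolean addedCaughtUp = targetIsr.containsAll(addingReplicas);
> >         // Proposed condition: the resulting ISR must also satisfy min
> >         // ISR, closing the single-broker-failure data loss scenario.
> >         return addedCaughtUp && targetIsr.size() >= minInsyncReplicas;
> >     }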
> >
> > With the above modification, the TLA+ specification of the ELR part of
> the
> > design is standing up to all safety and liveness checks. The only thing
> > > that is not modeled is the unclean recovery, though I may leave it out,
> > > as the specification is already very large.
> >
> > Jack
> >
> > On 2023/09/01 22:27:10 Calvin Liu wrote:
> > > Hi Justine
> > > 1. With the new proposal, in order to let the consumer consume a
> message
> > > when only 1 replica commits it to its log, the min ISR has to be set to
> > 1.
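> > >
> > > (A sketch of that topic-level override with the standard Admin client,
> > > assuming the usual org.apache.kafka.clients.admin imports; "my-topic"
> > > and the bootstrap address are placeholders:)
> > >
> > >     Properties props = new Properties();
> > >     props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
> > >     try (Admin admin = Admin.create(props)) {
> > >         ConfigResource topic =
> > >             new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
> > >         AlterConfigOp setMinIsr = new AlterConfigOp(
> > >             new ConfigEntry("min.insync.replicas", "1"),
> > >             AlterConfigOp.OpType.SET);
> > >         // Lower min ISR to 1 so a message is consumable once a single
> > >         // replica has it, matching the old acks=1 visibility.
> > >         admin.incrementalAlterConfigs(Map.of(topic, List.of(setMinIsr)))
> > >              .all().get();
> > >     }
> > >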
> > > 2. Yes, we will need an unclean recovery if the leader has an unclean
> > > shutdown.
> > > 3. If the min ISR config is changed to a larger value, the ISR and ELR
> > will
> > > not be updated. ISR members are always valid no matter how min ISR
> > changes.
> > > If ELR is not empty, then the HWM can't advance either after the min
> ISR
> > > increase, so the ELR members are safe to stay.
> > > 4. I will highlight the explanation. Thanks.
> > >
> > > On Thu, Aug 31, 2023 at 4:35 PM Justine Olshan
> > <jols...@confluent.io.invalid>
> > > wrote:
> > >
> > > > Hey Calvin,
> > > >
> > > > Thanks for the responses. I think I understood most of it, but had a
> > few
> > > > follow up questions
> > > >
> > > > 1. For the acks=1 case, I was wondering if there is any way to
> continue
> > > > with the current behavior (ie -- we only need one ack to produce to
> > the log
> > > > and consider the request complete.) My understanding is that we can
> > also
> > > > consume from such topics at that point.
> > > > If users wanted this lower durability could they set
> > min.insync.replicas to
> > > > 1?
> > > >
> > > > 2. For the case where we elect a leader that was unknowingly offline.
> > Say
> > > > this replica was the only one in ELR. My understanding is that we
> would
> > > > promote it to ISR and remove it from ELR when it is the leader, but
> > then we
> > > > would remove it from ISR and have no brokers in ISR or ELR. From
> there
> > we
> > > > would need to do unclean recovery right?
> > > >
> > > > 3. Did we address the case where dynamically min isr is increased?
> > > >
> > > > 4. I think my comment was more about confusion on the KIP. It was not
> > clear
> > > > to me that the section was describing points if one was done before
> the
> > > > other. But I now see the sentence explaining that. I think I skipped
> > from
> > > > "delivery plan" to the bullet points.
> > > >
> > > > Justine
> > > >
> > > > On Thu, Aug 31, 2023 at 4:04 PM Calvin Liu
> <ca...@confluent.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Justine
> > > > > Thanks for the questions!
> > > > >   *a. For my understanding, will we block replication? Or just the
> > high
> > > > > watermark advancement?*
> > > > >   - The replication will not be blocked. The followers are free to
> > > > > replicate messages above the HWM. Only HWM advancement is blocked.
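> > > > >
> > > > > (Roughly, the advancement rule becomes something like the sketch
> > > > > below; the names are mine, not actual Kafka code:)
> > > > >
> > > > >     // Followers may still fetch and append records above the HWM;
> > > > >     // only advancing the HWM itself gains the extra size check.
> > > > >     boolean canAdvanceHwm(Set<Integer> isr, int minIsr,
> > > > >                           long smallestIsrLogEndOffset,
> > > > >                           long currentHwm) {
> > > > >         return isr.size() >= minIsr                  // new requirement
> > > > >             && smallestIsrLogEndOffset > currentHwm; // existing catch-up rule
> > > > >     }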
> > > > >
> > > > >   b. *Also in the acks=1 case, if folks want to continue the
> previous
> > > > > behavior, they also need to set min.insync.replicas to 1, correct?*
> > > > >   - If the clients only send acks=1 messages and minISR=2, the HWM
> > > > > behavior will only be different when there is 1 replica in the ISR. In this
> > case,
> > > > > the min ISR does not do much in the current system. It is kind of a
> > > > > trade-off but we think it is ok.
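> > > > >
> > > > > (For completeness, the client side of that trade-off is just the
> > > > > usual producer setting; a sketch with a placeholder address:)
> > > > >
> > > > >     Properties p = new Properties();
> > > > >     p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
> > > > >     // Leader-only acknowledgement, as before; combined with the
> > > > >     // topic's min.insync.replicas at 1, HWM behavior matches the
> > > > >     // old protocol.
> > > > >     p.put(ProducerConfig.ACKS_CONFIG, "1");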
> > > > >
> > > > >   c. *The KIP seems to suggest that we remove from ELR when we
> start
> > up
> > > > > again and notice we do not have the clean shutdown file. Is there a
> > > > chance
> > > > > we have an offline broker in ELR that had an unclean shutdown that
> we
> > > > elect
> > > > > as a leader before we get the chance to realize the shutdown was
> > > > unclean?*
> > > > > *  - *The controller will only elect an unfenced (online) replica as
> > the
> > > > > leader. If a broker has an unclean shutdown, it should register with
> > the
> > > > > controller first (where it has to declare whether it is a
> > clean/unclean
> > > > > shutdown) and then start to serve broker requests. So
> > > > >      1. If the broker has an unclean shutdown before the controller
> > is
> > > > > aware that the replica is offline, then the broker can become the
> > leader
> > > > > temporarily. But it can't serve any Fetch requests before it
> > registers
> > > > > again, and that's when the controller will re-elect a leader.
> > > > >      2. If the controller knows the replica is offline(missing
> > heartbeats
> > > > > from the broker for a while) before the broker re-registers, the
> > broker
> > > > > can't be elected as a leader.
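> > > > >
> > > > > (Schematically, with invented helper names, the registration-time
> > > > > handling could look like the sketch below; the actual controller
> > > > > code will differ:)
> > > > >
> > > > >     void onBrokerRegistration(int brokerId, boolean cleanShutdown) {
> > > > >         if (!cleanShutdown) {
> > > > >             // An unclean shutdown disqualifies the replica, so it is
> > > > >             // dropped from the ELR before it can serve Fetch requests.
> > > > >             removeFromElr(brokerId);   // hypothetical helper
> > > > >         }
> > > > >         // Any partition it was temporarily leading gets re-elected
> > > > >         // here, covering case 1 above.
> > > > >         maybeReelectLeaders(brokerId); // hypothetical helper
> > > > >     }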
> > > > >
> > > > > d. *Would this be the case for strictly a smaller min ISR?*
> > > > > - Yes, only when we have a smaller min ISR. Once the leader is
> aware
> > of
> > > > the
> > > > > minISR change, the HWM can advance and make the current ELR
> > obsolete. So
> > > > > the controller should clear the ELR if the ISR >= the new min ISR.
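> > > > >
> > > > > (In sketch form, with assumed names:)
> > > > >
> > > > >     void onMinIsrDecrease(Set<Integer> isr, Set<Integer> elr,
> > > > >                           int newMinIsr) {
> > > > >         // Once the ISR alone satisfies the new min ISR, the HWM can
> > > > >         // advance again, so any retained ELR members would be stale
> > > > >         // and must be dropped.
> > > > >         if (isr.size() >= newMinIsr) {
> > > > >             elr.clear();
> > > > >         }
> > > > >     }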
> > > > >
> > > > > e. *I thought we said the above "Last Leader” behavior can’t be
> > > > maintained
> > > > > with an empty ISR and it should be removed."*
> > > > > -  As the KIP is a big one, we have to consider delivering it in
> > phases.
> > > > If
> > > > > only the Unclean Recovery is delivered, we do not touch the ISR, so
> > > > > the ISR behavior will be the same as it is today. I am open to the
> > proposal
> > > > > of directly starting unclean recovery if the last leader fails.
> > Let's
> > > > see
> > > > if other folks want more if Unclean Recovery is delivered first.
> > > > >
> > > > > On Tue, Aug 29, 2023 at 4:53 PM Justine Olshan
> > > > > <jols...@confluent.io.invalid>
> > > > > wrote:
> > > > >
> > > > > > Hey Calvin,
> > > > > >
> > > > > > Thanks for the KIP. This will close some of the gaps in leader
> > > > election!
> > > > > I
> > > > > > have a few questions:
> > > > > >
> > > > > > *>* *High Watermark can only advance if the ISR size is larger or
> > equal
> > > > > > to min.insync.replicas*.
> > > > > >
> > > > > > For my understanding, will we block replication? Or just the high
> > > > > watermark
> > > > > > advancement?
> > > > > > Also in the acks=1 case, if folks want to continue the previous
> > > > behavior,
> > > > > > they also need to set min.insync.replicas to 1, correct? It seems
> > like
> > > > > this
> > > > > > change takes some control away from clients when it comes to
> > durability
> > > > > vs
> > > > > > availability.
> > > > > >
> > > > > > *> *
> > > > > > *ELR + ISR size will not be dropped below the min ISR unless the
> > > > > controller
> > > > > > discovers an ELR member has an unclean shutdown. *
> > > > > > The KIP seems to suggest that we remove from ELR when we start up
> > again
> > > > > and
> > > > > > notice we do not have the clean shutdown file. Is there a chance
> we
> > > > have
> > > > > an
> > > > > > offline broker in ELR that had an unclean shutdown that we elect
> > as a
> > > > > > leader before we get the chance to realize the shutdown was
> > unclean?
> > > > > > This seems like it could cause some problems. I may have missed
> > how we
> > > > > > avoid this scenario though.
> > > > > >
> > > > > > *> When updating the config **min.insync.replicas, *
> > > > > > *if the new min ISR <= current ISR, the ELR will be removed.*
> > > > > > Would this be the case for strictly a smaller min ISR? I suppose
> > > > > > if we increase
> > the
> > > > > ISR,
> > > > > > we can't reason about ELR. Can we reason about high water mark in
> > this
> > > > > > case--seems like we will have the broker out of ISR not in ISR or
> > ELR?
> > > > > > (Forgive me if we can't increase min ISR if the increase will put
> > us
> > > > > under
> > > > > > it)
> > > > > >
> > > > > > *> Unclean recovery. *
> > > > > >
> > > > > >    - *The unclean leader election will be replaced by the unclean
> > > > > > recovery.*
> > > > > >    - *unclean.leader.election.enable will only be replaced by
> > > > > >    the unclean.recovery.strategy after ELR is delivered.*
> > > > > >    - *As there is no change to the ISR, the "last known leader"
> > > > behavior
> > > > > is
> > > > > >    maintained.*
> > > > > >
> > > > > > What does "last known leader behavior maintained" mean here? I
> > thought
> > > > we
> > > > > > said *"*The above “*Last Leader” behavior can’t be maintained
> with
> > an
> > > > > empty
> > > > > > ISR and it should be removed." *My understanding is once metadata
> > > > version
> > > > > > is updated we will always take the more thoughtful unclean
> election
> > > > > process
> > > > > > (ie, inspect the logs)
> > > > > >
> > > > > > Overall though, the general KIP is pretty solid. Looking at the
> > > > rejected
> > > > > > alternatives, it looks like a lot was considered, so it's nice to
> > see
> > > > the
> > > > > > final proposal.
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Mon, Aug 14, 2023 at 8:50 AM Calvin Liu
> > <ca...@confluent.io.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > >    1. Yes, the new protocol requires 2 things to advance the
> > HWM. a)
> > > > > The
> > > > > > >    messages have been replicated to the controller-committed
> ISR
> > > > > members.
> > > > > > > b)
> > > > > > >    The number of ISR members should be at least the min ISR.
> > > > > > >    2. With the current protocol, we are not able to select
> > broker 1
> > > > as
> > > > > > the
> > > > > > >    leader. If we first assume we have the new HWM requirement in
> > > > place,
> > > > > > then
> > > > > > >    broker 1 is a good candidate to choose. The following part
> of
> > the
> > > > > KIP
> > > > > > > (ELR)
> > > > > > >    will explain a new mechanism to enable us to choose
> > broker 1.
> > > > > > > Note, if
> > > > > > >    both HWM and ELR are in place, broker 1 will be actually
> > elected
> > > > in
> > > > > > T3.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 11, 2023 at 10:05 AM Jeff Kim
> > > > > <jeff....@confluent.io.invalid
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Calvin,
> > > > > > > >
> > > > > > > > Thanks for the KIP! I'm still digesting it but I have two
> > > > questions:
> > > > > > > >
> > > > > > > > > In the scenario raised in the motivation section, the
> server
> > may
> > > > > > > receive
> > > > > > > > ack=1 messages during T1 and advance High Watermark when the
> > leader
> > > > > > > > is the only one in ISR.
> > > > > > > >
> > > > > > > > To confirm, the current protocol allows advancing the HWM if
> > all
> > > > > > brokers
> > > > > > > in
> > > > > > > > the ISR append to their logs (in this case only the leader).
> > And
> > > > > we're
> > > > > > > > proposing
> > > > > > > > to advance the HWM only when <at least min.insync.replicas>
> > brokers
> > > > > > > > replicate. Is this correct?
> > > > > > > >
> > > > > > > > > Then, if we elect broker 1 as the leader at T4, though we
> can
> > > > > > guarantee
> > > > > > > > the safety of ack=all messages, the High Watermark may move
> > > > backward
> > > > > > > > which causes further impacts on the consumers.
> > > > > > > >
> > > > > > > > How can broker 1 become the leader if it was ineligible in
> T3?
> > Or
> > > > are
> > > > > > > > you referring to broker 2?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jeff
> > > > > > > >
> > > > > > > > On Thu, Aug 10, 2023 at 6:48 PM Calvin Liu
> > > > > <ca...@confluent.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > > I'd like to discuss a series of enhancements to the
> > replication
> > > > > > > protocol.
> > > > > > > > >
> > > > > > > > > A partition replica can experience local data loss in
> unclean
> > > > > > shutdown
> > > > > > > > > scenarios where unflushed data in the OS page cache is
> lost -
> > > > such
> > > > > as
> > > > > > > an
> > > > > > > > > availability zone power outage or a server error. The Kafka
> > > > > > replication
> > > > > > > > > protocol is designed to handle these situations by removing
> > such
> > > > > > > replicas
> > > > > > > > > from the ISR and only re-adding them once they have caught
> > up and
> > > > > > > > therefore
> > > > > > > > > recovered any lost data. This prevents replicas that lost
> an
> > > > > > arbitrary
> > > > > > > > log
> > > > > > > > > suffix, which included committed data, from being elected
> > leader.
> > > > > > > > > However, there is a "last replica standing" state which
> when
> > > > > combined
> > > > > > > > with
> > > > > > > > > a data loss unclean shutdown event can turn a local data
> loss
> > > > > > scenario
> > > > > > > > into
> > > > > > > > > a global data loss scenario, i.e., committed data can be
> > removed
> > > > > from
> > > > > > > all
> > > > > > > > > replicas. When the last replica in the ISR experiences an
> > unclean
> > > > > > > > shutdown
> > > > > > > > > and loses committed data, it will be reelected leader after
> > > > > starting
> > > > > > up
> > > > > > > > > again, causing rejoining followers to truncate their logs
> and
> > > > > thereby
> > > > > > > > > removing the last copies of the committed records which the
> > > > leader
> > > > > > lost
> > > > > > > > > initially.
> > > > > > > > >
> > > > > > > > > The new KIP will maximize the protection and provide
> > MinISR-1
> > > > > > > tolerance
> > > > > > > > to
> > > > > > > > > data loss unclean shutdown events.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
