Hi Alexandre,

Thanks for the thoughts. I've thought about it, and I think I would choose a new
leader election method to fix the problem we encountered, rather than this
"backup-only" replica solution. But it is still an interesting idea. As you said,
this solution could bring many benefits, so maybe you can create a proposal for it?
Thank you.
Luke

On Fri, May 12, 2023 at 4:21 PM Luke Chen <show...@gmail.com> wrote:

> Hi Haruki,
>
> Yes, this scenario could happen.
> I'm thinking we can fix it in step 6: when the controller tries to get the LEO
> from replicas B and C, those replicas should stop their fetchers for this
> partition immediately, before returning the LEO.
> As for whether we need a quorum-based approach or not, we can discuss that in
> another KIP. I'm still thinking about it.
>
> Thank you.
> Luke
>
> On Fri, May 12, 2023 at 3:59 PM Luke Chen <show...@gmail.com> wrote:
>
>> Hi David,
>>
>> > It can't be in another KIP as it is required for your proposal to work.
>> This is also an important part to discuss as it requires the controller to
>> do more operations on leader changes.
>>
>> Yes, I know this is a requirement for this KIP to work and needs a lot of
>> discussion.
>> That's why I think it'd be better to have a separate KIP for the content
>> and discussion.
>> I've put the status of this KIP as "pending" and added a note at the top
>> of this KIP:
>>
>> Note: This KIP requires a leader election change, which will be proposed in
>> another KIP.
>>
>> Thanks.
>> Luke
>>
>> On Thu, May 11, 2023 at 11:43 PM Alexandre Dupriez <
>> alexandre.dupr...@gmail.com> wrote:
>>
>>> Hi, Luke,
>>>
>>> Thanks for your reply.
>>>
>>> 102. Whether such a replica could become a leader depends on what the
>>> end-user wants to use it for and what tradeoffs they wish to make down
>>> the line.
>>>
>>> There are cases, for instance with heterogeneous or interregional
>>> networks, where the difference in latency between subsets of brokers
>>> can be high enough for the "slow replicas" to have a detrimental
>>> impact on the ISR traffic they take part in. This can justify
>>> permanently segregating them from ISR traffic by design.
>>> And an end-user could still prefer to have these "slow replicas" over
>>> alternative approaches such as mirroring for the benefits they can
>>> bring, for instance: a) they belong to the same cluster, with no added
>>> admin and ops; b) they benefit from a direct, simpler replication path;
>>> c) they require less infrastructure than a mirrored solution; d) they
>>> could become unclean leaders for failover under disaster scenarios such
>>> as a regional service outage.
>>>
>>> Thanks,
>>> Alexandre
>>>
>>> On Thu, May 11, 2023 at 14:57, Haruki Okada <ocadar...@gmail.com> wrote:
>>> >
>>> > Hi, Luke.
>>> >
>>> > Though this proposal definitely looks interesting, as others pointed
>>> > out, the leader election implementation would be the hard part.
>>> >
>>> > And I think even LEO-based election is not safe; it could easily cause
>>> > silent loss of committed data.
>>> >
>>> > Let's say we have replicas A, B, C, where A is the leader initially, and
>>> > min.insync.replicas = 2.
>>> >
>>> > - 1. Initial state
>>> >     * A(leo=0), B(leo=0), C(leo=0)
>>> > - 2. A producer produces a message to A
>>> >     * A(leo=1), B(leo=0), C(leo=0)
>>> > - 3. Another producer produces a message to A (i.e. as a different batch)
>>> >     * A(leo=2), B(leo=0), C(leo=0)
>>> > - 4. C replicates the first batch. offset=1 is committed (by
>>> > acks=min.insync.replicas)
>>> >     * A(leo=2), B(leo=0), C(leo=1)
>>> > - 5. A loses its ZK session (or its broker session times out in KRaft)
>>> > - 6. The controller (regardless of ZK/KRaft) doesn't store LEOs itself,
>>> > so it needs to interact with each replica. It detects that C has the
>>> > largest LEO and decides to elect C as the new leader
>>> > - 7. Before the leader election is performed, B replicates offset=1,2
>>> > from A. offset=2 is committed
>>> >     * This is possible because even if A lost its ZK session, A could
>>> > still handle fetch requests for a while
>>> > - 8. The controller elects C as the new leader. B truncates its log.
>>> > offset=2 is lost silently.
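Haruki's scenario above can be sketched as a small simulation. This is a
hypothetical model (plain lists as logs, not Kafka code), using the same replica
names and min.insync.replicas value as the example:

```python
# Sketch of the scenario above: LEO-based election losing a committed record.
# Hypothetical model, not Kafka code. min.insync.replicas = 2, replicas A, B, C.

MIN_ISR = 2
logs = {"A": [], "B": [], "C": []}  # each replica's log; LEO = len(log)

def produce(msg):
    logs["A"].append(msg)                  # leader A appends; not yet committed

def replicate(follower, upto):
    logs[follower] = logs["A"][:upto]      # follower copies the leader's prefix

def committed_upto():
    # highest offset replicated on at least MIN_ISR replicas
    leos = sorted((len(log) for log in logs.values()), reverse=True)
    return leos[MIN_ISR - 1]

produce("m1")                              # step 2: A(leo=1)
produce("m2")                              # step 3: A(leo=2)
replicate("C", 1)                          # step 4: C(leo=1) -> offset 1 committed
assert committed_upto() == 1
# Step 5: A loses its session. Step 6: the controller reads LEOs from B and C
# and picks the replica with the largest LEO, i.e. C (leo=1):
new_leader = max(("B", "C"), key=lambda r: len(logs[r]))
# Step 7: before the election takes effect, B still fetches from A:
replicate("B", 2)                          # B(leo=2) -> offset 2 now committed
assert committed_upto() == 2
# Step 8: C becomes leader; B truncates to C's log end.
logs["B"] = logs["B"][:len(logs[new_leader])]
assert committed_upto() == 1               # offset 2 was committed, now lost
```

The last assertion is the silent loss: offset 2 satisfied
acks=min.insync.replicas at step 7, yet no surviving in-sync copy holds it after
the truncation in step 8.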
>>> >
>>> > I have a feeling that we need quorum-based data replication, as Divij
>>> > pointed out.
>>> >
>>> > On Thu, May 11, 2023 at 22:33, David Jacot
>>> > <dja...@confluent.io.invalid> wrote:
>>> >
>>> > > Hi Luke,
>>> > >
>>> > > > Yes, on second thought, I think the new leader election is required
>>> > > to work for this new acks option. I'll think about it and open another
>>> > > KIP for it.
>>> > >
>>> > > It can't be in another KIP as it is required for your proposal to
>>> > > work. This is also an important part to discuss as it requires the
>>> > > controller to do more operations on leader changes.
>>> > >
>>> > > Cheers,
>>> > > David
>>> > >
>>> > > On Thu, May 11, 2023 at 2:44 PM Luke Chen <show...@gmail.com> wrote:
>>> > >
>>> > > > Hi Ismael,
>>> > > > Yes, on second thought, I think the new leader election is required
>>> > > > to work for this new acks option. I'll think about it and open
>>> > > > another KIP for it.
>>> > > >
>>> > > > Hi Divij,
>>> > > > Yes, I agree with all of them. I'll think about it and let you know
>>> > > > how we can work together.
>>> > > >
>>> > > > Hi Alexandre,
>>> > > > > 100. The KIP makes one statement which may be considered critical:
>>> > > > "Note that in the acks=min.insync.replicas case, the slow follower
>>> > > > might fall out of sync more easily than with acks=all." Would you
>>> > > > have some data on that behaviour when using the new ack semantic?
>>> > > > It would be interesting to analyse, and especially to look at the
>>> > > > percentage of time the number of replicas in the ISR is reduced to
>>> > > > the configured min.insync.replicas.
>>> > > >
>>> > > > The comparison data would be interesting. I can run a test when
>>> > > > available. But this KIP will be deprioritized because there should
>>> > > > be a pre-requisite KIP for it.
>>> > > >
>>> > > > > A (perhaps naive) hypothesis would be that the new ack semantic
>>> > > > indeed provides better produce latency, but at the cost of
>>> > > > precipitating the slowest replica(s) out of the ISR?
>>> > > >
>>> > > > Yes, it could be.
>>> > > >
>>> > > > > 101. I understand the impact on produce latency, but I am not sure
>>> > > > about the impact on durability. Is your durability model built
>>> > > > against the replication factor or the number of min insync replicas?
>>> > > >
>>> > > > Yes, and also the new LEO-based leader election (not proposed yet).
>>> > > >
>>> > > > > 102. Could a new type of replica which would not be allowed to
>>> > > > enter the ISR be an alternative? Such a replica could attempt
>>> > > > replication on a best-effort basis and would provide the permanent
>>> > > > guarantee not to interfere with foreground traffic.
>>> > > >
>>> > > > You mean a backup replica, which will never become leader (in-sync),
>>> > > > right? That's an interesting thought and might become a workaround
>>> > > > with the existing leader election. Let me think about it.
>>> > > >
>>> > > > Hi qiangLiu,
>>> > > >
>>> > > > > It's a good point that adding this config gets better P99 latency,
>>> > > > but is this changing the meaning of "in-sync replicas"? Consider a
>>> > > > situation with "replica=3 acks=2": when two brokers fail and only
>>> > > > the broker that doesn't have the message is left, it is in sync, so
>>> > > > it will be elected as leader. Will it cause a NOT NOTICED loss of
>>> > > > acked messages?
>>> > > >
>>> > > > Yes, it will, so the `min.insync.replicas` config at the
>>> > > > broker/topic level should be set correctly. In your example, it
>>> > > > should be set to 2, so that when 2 replicas are down, no new message
>>> > > > write will succeed.
>>> > > >
>>> > > > Thank you.
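Luke's answer above can be sketched with a small model (hypothetical, not Kafka
code): with replication factor 3 and min.insync.replicas=2, the leader rejects
writes once the ISR shrinks below 2, so a lone surviving replica can never ack
new data that could later be lost unnoticed:

```python
# Sketch: min.insync.replicas guarding against the "replica=3 acks=2" scenario.
# Hypothetical model, not Kafka code.

MIN_ISR = 2

def try_produce(isr_size, min_isr=MIN_ISR):
    """The leader rejects produce requests when the ISR is too small."""
    if isr_size < min_isr:
        return "NOT_ENOUGH_REPLICAS"   # producer gets a retriable error
    return "ACKED"

# All 3 replicas healthy: writes are accepted.
assert try_produce(isr_size=3) == "ACKED"
# One broker fails, ISR shrinks to 2: still >= min.insync.replicas.
assert try_produce(isr_size=2) == "ACKED"
# Two brokers fail, ISR shrinks to 1: writes are rejected instead of being
# acked by a single replica that could later be lost.
assert try_produce(isr_size=1) == "NOT_ENOUGH_REPLICAS"
```

The point is that the loss qiangLiu describes can only happen to messages acked
while the ISR was below min.insync.replicas, which the setting prevents.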
>>> > > > Luke
>>> > > >
>>> > > > On Thu, May 11, 2023 at 12:16 PM 67 <6...@gd67.com> wrote:
>>> > > >
>>> > > > > Hi Luke,
>>> > > > >
>>> > > > > It's a good point that adding this config gets better P99 latency,
>>> > > > > but is this changing the meaning of "in-sync replicas"? Consider a
>>> > > > > situation with "replica=3 acks=2": when two brokers fail and only
>>> > > > > the broker that doesn't have the message is left, it is in sync,
>>> > > > > so it will be elected as leader. Will it cause a *NOT NOTICED*
>>> > > > > loss of acked messages?
>>> > > > >
>>> > > > > qiangLiu
>>> > > > >
>>> > > > > On May 10, 2023 at 12:58, "Ismael Juma" <ism...@juma.me.uk> wrote:
>>> > > > >
>>> > > > > Hi Luke,
>>> > > > >
>>> > > > > As discussed in the other KIP, there are some subtleties when it
>>> > > > > comes to the semantics of the system if we don't wait for all
>>> > > > > members of the ISR before we ack. I don't understand why you say
>>> > > > > the leader election question is out of scope - it seems to be a
>>> > > > > core aspect to me.
>>> > > > >
>>> > > > > Ismael
>>> > > > >
>>> > > > > On Wed, May 10, 2023, 8:50 AM Luke Chen <show...@gmail.com> wrote:
>>> > > > >
>>> > > > > > Hi Ismael,
>>> > > > > >
>>> > > > > > No, I didn't know about this similar KIP! I wish I had known, so
>>> > > > > > that I didn't need to spend time writing it again! :(
>>> > > > > > I checked the KIP and all the discussions (here
>>> > > > > > <https://lists.apache.org/list?dev@kafka.apache.org:gte=100d:KIP-250>).
>>> > > > > > I think the consensus is that adding a client config for
>>> > > > > > `acks=quorum` is fine.
>>> > > > > > This comment
>>> > > > > > <https://lists.apache.org/thread/p77pym5sxpn91r8j364kmmf3qp5g65rn>
>>> > > > > > from Guozhang pretty much concluded what I'm trying to do.
>>> > > > > >
>>> > > > > > *1. Add one more value to the client-side acks config:
>>> > > > > >    0: no acks needed at all.
>>> > > > > >    1: ack from the leader.
>>> > > > > >    all: ack from ALL the ISR replicas.
>>> > > > > >    quorum: this is the new value; it requires acks from a number
>>> > > > > > of ISR replicas no smaller than the majority of the replicas AND
>>> > > > > > no smaller than {min.isr}.
>>> > > > > > 2. Clarify in the docs that if a user wants to tolerate X
>>> > > > > > failures, she needs to set client acks=all or acks=quorum
>>> > > > > > (better tail latency than "all") with broker {min.isr} set to
>>> > > > > > X+1; however, "all" is not necessarily stronger than "quorum".*
>>> > > > > >
>>> > > > > > Concerns from KIP-250 are:
>>> > > > > > 1. Introducing a new leader-LEO-based election method. This is
>>> > > > > > not clear in KIP-250 and needs more discussion.
>>> > > > > > 2. KIP-250 also tried to optimize consumer latency by reading
>>> > > > > > messages beyond the high watermark; there was some discussion
>>> > > > > > about how to achieve that, with no conclusion.
>>> > > > > >
>>> > > > > > Both of the above 2 concerns are out of the scope of my current
>>> > > > > > KIP. So, I think it's good to provide this `acks=quorum` or
>>> > > > > > `acks=min.insync.replicas` option to users to give them another
>>> > > > > > choice.
>>> > > > > >
>>> > > > > > Thank you.
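Guozhang's `quorum` definition quoted above can be written out as a small
helper. This is a sketch of the proposed semantics only (the function and its
parameters are hypothetical, not actual Kafka code):

```python
# Sketch of the acks semantics quoted above. Hypothetical helper, not Kafka code.

def required_acks(acks, replication_factor, min_isr, isr_size):
    """Number of replicas that must ack before a produce request succeeds."""
    if acks == "0":
        return 0
    if acks == "1":
        return 1
    if acks == "all":
        return isr_size                      # every current ISR member
    if acks == "quorum":
        majority = replication_factor // 2 + 1
        return max(majority, min_isr)        # >= majority AND >= {min.isr}
    raise ValueError(f"unknown acks value: {acks}")

# With 3 replicas, min.insync.replicas=2 and a full ISR:
assert required_acks("all", 3, 2, isr_size=3) == 3     # waits for all 3
assert required_acks("quorum", 3, 2, isr_size=3) == 2  # better tail latency
# With 5 replicas, min.insync.replicas=2 and the ISR shrunk to 2, "all" needs
# only 2 acks while "quorum" still demands a majority of 3, which is why "all"
# is not necessarily stronger than "quorum":
assert required_acks("all", 5, 2, isr_size=2) == 2
assert required_acks("quorum", 5, 2, isr_size=2) == 3
```

The last two assertions illustrate Guozhang's closing caveat: once the ISR has
shrunk, `all` can be satisfied by fewer replicas than a majority quorum.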
>>> > > > > > Luke
>>> > > > > >
>>> > > > > > On Wed, May 10, 2023 at 8:54 AM Ismael Juma <ism...@juma.me.uk>
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > > > Hi Luke,
>>> > > > > > >
>>> > > > > > > Are you aware of
>>> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledgment
>>> > > > > > > ?
>>> > > > > > >
>>> > > > > > > Ismael
>>> > > > > > >
>>> > > > > > > On Tue, May 9, 2023 at 10:14 PM Luke Chen <show...@gmail.com>
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > Hi all,
>>> > > > > > > >
>>> > > > > > > > I'd like to start a discussion for KIP-926: introducing the
>>> > > > > > > > acks=min.insync.replicas config. This KIP introduces the
>>> > > > > > > > `acks=min.insync.replicas` config value in the producer, to
>>> > > > > > > > improve write throughput while still guaranteeing high
>>> > > > > > > > durability.
>>> > > > > > > >
>>> > > > > > > > Please check the link for more detail:
>>> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-926%3A+introducing+acks%3Dmin.insync.replicas+config
>>> > > > > > > >
>>> > > > > > > > Any feedback is welcome.
>>> > > > > > > >
>>> > > > > > > > Thank you.
>>> > > > > > > > Luke
>>> >
>>> > --
>>> > ========================
>>> > Okada Haruki
>>> > ocadar...@gmail.com
>>> > ========================