Hi Alexandre,

Thanks for the thoughts. I've thought about it, and I think I would choose a new
leader election method to fix the problem we encountered, rather than this
"backup-only" replica solution. But it is still an interesting idea. As you said,
this solution could bring many benefits, so maybe you can create a proposal for it?
Thank you.
Luke

On Fri, May 12, 2023 at 4:21 PM Luke Chen <show...@gmail.com> wrote:

> Hi Haruki,
>
> Yes, this scenario could happen.
> I'm thinking we can fix it in step 6: when the controller tries to get the LEO
> from replicas B and C, those replicas should stop their fetchers for this
> partition immediately, before returning the LEO.
> As for whether we need a quorum-based approach or not, we can discuss that in
> another KIP. I'm still thinking about it.
>
> Thank you.
> Luke
>
> On Fri, May 12, 2023 at 3:59 PM Luke Chen <show...@gmail.com> wrote:
>
>> Hi David,
>>
>> > It can't be in another KIP as it is required for your proposal to work.
>> This is also an important part to discuss as it requires the controller to
>> do more operations on leader changes.
>>
>> Yes, I know this is a requirement for this KIP to work and needs a lot of
>> discussion.
>> That's why I think it'd be better to have a separate KIP for the content
>> and discussion.
>> I've put the status of this KIP as "pending" and added a note at the top
>> of this KIP:
>>
>> Note: This KIP requires a leader election change, which will be proposed in
>> another KIP.
>>
>> Thanks.
>> Luke
>>
>> On Thu, May 11, 2023 at 11:43 PM Alexandre Dupriez <
>> alexandre.dupr...@gmail.com> wrote:
>>
>>> Hi, Luke,
>>>
>>> Thanks for your reply.
>>>
>>> 102. Whether such a replica could become a leader depends on what the
>>> end-user wants to use it for and what tradeoffs they wish to make down
>>> the line.
>>>
>>> There are cases, for instance with heterogeneous or interregional
>>> networks, where the difference in latency between subsets of brokers
>>> can be high enough for the "slow replicas" to have a detrimental
>>> impact on the ISR traffic they take part in. This can justify
>>> permanently segregating them from ISR traffic by design.
>>> And an end-user could still prefer to have these "slow replicas" over
>>> alternative approaches such as mirroring for the benefits they can
>>> bring, for instance: a) they belong to the same cluster, with no added
>>> admin and ops; b) they benefit from a direct, simpler replication path;
>>> c) they require less infrastructure than a mirrored solution; d) they
>>> could become unclean leaders for failover under disaster scenarios such
>>> as a regional service outage.
>>>
>>> Thanks,
>>> Alexandre
>>>
>>> On Thu, May 11, 2023 at 14:57, Haruki Okada <ocadar...@gmail.com> wrote:
>>> >
>>> > Hi, Luke.
>>> >
>>> > Though this proposal definitely looks interesting, as others pointed
>>> > out, the leader election implementation would be the hard part.
>>> >
>>> > And I think even LEO-based election is not safe; it could easily cause
>>> > silent loss of committed data.
>>> >
>>> > Let's say we have replicas A, B, C, where A is the leader initially, and
>>> > min.insync.replicas = 2.
>>> >
>>> > - 1. Initial state
>>> >     * A(leo=0), B(leo=0), C(leo=0)
>>> > - 2. A producer produces a message to A
>>> >     * A(leo=1), B(leo=0), C(leo=0)
>>> > - 3. Another producer produces a message to A (i.e. as a different batch)
>>> >     * A(leo=2), B(leo=0), C(leo=0)
>>> > - 4. C replicates the first batch. offset=1 is committed (by
>>> > acks=min.insync.replicas)
>>> >     * A(leo=2), B(leo=0), C(leo=1)
>>> > - 5. A loses its ZK session (or its broker session times out in KRaft)
>>> > - 6. The controller (regardless of ZK/KRaft) doesn't store LEOs itself,
>>> > so it needs to interact with each replica. It detects that C has the
>>> > largest LEO and decides to elect C as the new leader
>>> > - 7. Before the leader election is performed, B replicates offset=1,2
>>> > from A. offset=2 is committed
>>> >     * This is possible because even if A lost its ZK session, A could
>>> > still handle fetch requests for a while
>>> > - 8. The controller elects C as the new leader. B truncates its log.
>>> > offset=2 is lost silently.
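Haruki's scenario above can be sketched as a small simulation. This is a
hypothetical model (plain lists as logs, not Kafka code), using the same replica
names and min.insync.replicas value as the example:

```python
# Sketch of the scenario above: LEO-based election losing a committed record.
# Hypothetical model, not Kafka code. min.insync.replicas = 2, replicas A, B, C.

MIN_ISR = 2
logs = {"A": [], "B": [], "C": []}  # each replica's log; LEO = len(log)

def produce(msg):
    logs["A"].append(msg)                  # leader A appends; not yet committed

def replicate(follower, upto):
    logs[follower] = logs["A"][:upto]      # follower copies the leader's prefix

def committed_upto():
    # highest offset replicated on at least MIN_ISR replicas
    leos = sorted((len(log) for log in logs.values()), reverse=True)
    return leos[MIN_ISR - 1]

produce("m1")                              # step 2: A(leo=1)
produce("m2")                              # step 3: A(leo=2)
replicate("C", 1)                          # step 4: C(leo=1) -> offset 1 committed
assert committed_upto() == 1
# Step 5: A loses its session. Step 6: the controller reads LEOs from B and C
# and picks the replica with the largest LEO, i.e. C (leo=1):
new_leader = max(("B", "C"), key=lambda r: len(logs[r]))
# Step 7: before the election takes effect, B still fetches from A:
replicate("B", 2)                          # B(leo=2) -> offset 2 now committed
assert committed_upto() == 2
# Step 8: C becomes leader; B truncates to C's log end.
logs["B"] = logs["B"][:len(logs[new_leader])]
assert committed_upto() == 1               # offset 2 was committed, now lost
```

The last assertion is the silent loss: offset 2 satisfied
acks=min.insync.replicas at step 7, yet no surviving in-sync copy holds it after
the truncation in step 8.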
>>> >
>>> > I have a feeling that we need quorum-based data replication, as Divij
>>> > pointed out.
>>> >
>>> > On Thu, May 11, 2023 at 22:33, David Jacot
>>> > <dja...@confluent.io.invalid> wrote:
>>> >
>>> > > Hi Luke,
>>> > >
>>> > > > Yes, on second thought, I think the new leader election is required
>>> > > to work for this new acks option. I'll think about it and open another
>>> > > KIP for it.
>>> > >
>>> > > It can't be in another KIP as it is required for your proposal to
>>> > > work. This is also an important part to discuss as it requires the
>>> > > controller to do more operations on leader changes.
>>> > >
>>> > > Cheers,
>>> > > David
>>> > >
>>> > > On Thu, May 11, 2023 at 2:44 PM Luke Chen <show...@gmail.com> wrote:
>>> > >
>>> > > > Hi Ismael,
>>> > > > Yes, on second thought, I think the new leader election is required
>>> > > > to work for this new acks option. I'll think about it and open
>>> > > > another KIP for it.
>>> > > >
>>> > > > Hi Divij,
>>> > > > Yes, I agree with all of them. I'll think about it and let you know
>>> > > > how we can work together.
>>> > > >
>>> > > > Hi Alexandre,
>>> > > > > 100. The KIP makes one statement which may be considered critical:
>>> > > > "Note that in the acks=min.insync.replicas case, the slow follower
>>> > > > might fall out of sync more easily than with acks=all." Would you
>>> > > > have some data on that behaviour when using the new ack semantic?
>>> > > > It would be interesting to analyse, and especially to look at the
>>> > > > percentage of time the number of replicas in the ISR is reduced to
>>> > > > the configured min.insync.replicas.
>>> > > >
>>> > > > The comparison data would be interesting. I can run a test when
>>> > > > available. But this KIP will be deprioritized because there should
>>> > > > be a pre-requisite KIP for it.
>>> > > >
>>> > > > > A (perhaps naive) hypothesis would be that the new ack semantic
>>> > > > indeed provides better produce latency, but at the cost of
>>> > > > precipitating the slowest replica(s) out of the ISR?
>>> > > >
>>> > > > Yes, it could be.
>>> > > >
>>> > > > > 101. I understand the impact on produce latency, but I am not sure
>>> > > > about the impact on durability. Is your durability model built
>>> > > > against the replication factor or the number of min insync replicas?
>>> > > >
>>> > > > Yes, and also the new LEO-based leader election (not proposed yet).
>>> > > >
>>> > > > > 102. Could a new type of replica which would not be allowed to
>>> > > > enter the ISR be an alternative? Such a replica could attempt
>>> > > > replication on a best-effort basis and would provide the permanent
>>> > > > guarantee not to interfere with foreground traffic.
>>> > > >
>>> > > > You mean a backup replica, which will never become leader (in-sync),
>>> > > > right? That's an interesting thought and might become a workaround
>>> > > > with the existing leader election. Let me think about it.
>>> > > >
>>> > > > Hi qiangLiu,
>>> > > >
>>> > > > > It's a good point that adding this config gets better P99 latency,
>>> > > > but is this changing the meaning of "in-sync replicas"? Consider a
>>> > > > situation with "replica=3 acks=2": when two brokers fail and only
>>> > > > the broker that doesn't have the message is left, it is in sync, so
>>> > > > it will be elected as leader. Will it cause a NOT NOTICED loss of
>>> > > > acked messages?
>>> > > >
>>> > > > Yes, it will, so the `min.insync.replicas` config at the
>>> > > > broker/topic level should be set correctly. In your example, it
>>> > > > should be set to 2, so that when 2 replicas are down, no new message
>>> > > > write will succeed.
>>> > > >
>>> > > > Thank you.
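Luke's answer above can be sketched with a small model (hypothetical, not Kafka
code): with replication factor 3 and min.insync.replicas=2, the leader rejects
writes once the ISR shrinks below 2, so a lone surviving replica can never ack
new data that could later be lost unnoticed:

```python
# Sketch: min.insync.replicas guarding against the "replica=3 acks=2" scenario.
# Hypothetical model, not Kafka code.

MIN_ISR = 2

def try_produce(isr_size, min_isr=MIN_ISR):
    """The leader rejects produce requests when the ISR is too small."""
    if isr_size < min_isr:
        return "NOT_ENOUGH_REPLICAS"   # producer gets a retriable error
    return "ACKED"

# All 3 replicas healthy: writes are accepted.
assert try_produce(isr_size=3) == "ACKED"
# One broker fails, ISR shrinks to 2: still >= min.insync.replicas.
assert try_produce(isr_size=2) == "ACKED"
# Two brokers fail, ISR shrinks to 1: writes are rejected instead of being
# acked by a single replica that could later be lost.
assert try_produce(isr_size=1) == "NOT_ENOUGH_REPLICAS"
```

The point is that the loss qiangLiu describes can only happen to messages acked
while the ISR was below min.insync.replicas, which the setting prevents.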
>>> > > > Luke
>>> > > >
>>> > > > On Thu, May 11, 2023 at 12:16 PM 67 <6...@gd67.com> wrote:
>>> > > >
>>> > > > > Hi Luke,
>>> > > > >
>>> > > > > It's a good point that adding this config gets better P99 latency,
>>> > > > > but is this changing the meaning of "in-sync replicas"? Consider a
>>> > > > > situation with "replica=3 acks=2": when two brokers fail and only
>>> > > > > the broker that doesn't have the message is left, it is in sync,
>>> > > > > so it will be elected as leader. Will it cause a *NOT NOTICED*
>>> > > > > loss of acked messages?
>>> > > > >
>>> > > > > qiangLiu
>>> > > > >
>>> > > > > On May 10, 2023 at 12:58, "Ismael Juma" <ism...@juma.me.uk> wrote:
>>> > > > >
>>> > > > > Hi Luke,
>>> > > > >
>>> > > > > As discussed in the other KIP, there are some subtleties when it
>>> > > > > comes to the semantics of the system if we don't wait for all
>>> > > > > members of the ISR before we ack. I don't understand why you say
>>> > > > > the leader election question is out of scope - it seems to be a
>>> > > > > core aspect to me.
>>> > > > >
>>> > > > > Ismael
>>> > > > >
>>> > > > > On Wed, May 10, 2023, 8:50 AM Luke Chen <show...@gmail.com> wrote:
>>> > > > >
>>> > > > > > Hi Ismael,
>>> > > > > >
>>> > > > > > No, I didn't know about this similar KIP! I wish I had known, so
>>> > > > > > that I didn't need to spend time writing it again! :(
>>> > > > > > I checked the KIP and all the discussions (here
>>> > > > > > <https://lists.apache.org/list?dev@kafka.apache.org:gte=100d:KIP-250>).
>>> > > > > > I think the consensus is that adding a client config for
>>> > > > > > `acks=quorum` is fine.
>>> > > > > > This comment
>>> > > > > > <https://lists.apache.org/thread/p77pym5sxpn91r8j364kmmf3qp5g65rn>
>>> > > > > > from Guozhang pretty much concluded what I'm trying to do.
>>> > > > > >
>>> > > > > > *1. Add one more value to the client-side acks config:
>>> > > > > >    0: no acks needed at all.
>>> > > > > >    1: ack from the leader.
>>> > > > > >    all: ack from ALL the ISR replicas.
>>> > > > > >    quorum: this is the new value; it requires acks from a number
>>> > > > > > of ISR replicas no smaller than the majority of the replicas AND
>>> > > > > > no smaller than {min.isr}.
>>> > > > > > 2. Clarify in the docs that if a user wants to tolerate X
>>> > > > > > failures, she needs to set client acks=all or acks=quorum
>>> > > > > > (better tail latency than "all") with broker {min.isr} set to
>>> > > > > > X+1; however, "all" is not necessarily stronger than "quorum".*
>>> > > > > >
>>> > > > > > Concerns from KIP-250 are:
>>> > > > > > 1. Introducing a new leader-LEO-based election method. This is
>>> > > > > > not clear in KIP-250 and needs more discussion.
>>> > > > > > 2. KIP-250 also tried to optimize consumer latency by reading
>>> > > > > > messages beyond the high watermark; there was some discussion
>>> > > > > > about how to achieve that, with no conclusion.
>>> > > > > >
>>> > > > > > Both of the above 2 concerns are out of the scope of my current
>>> > > > > > KIP. So, I think it's good to provide this `acks=quorum` or
>>> > > > > > `acks=min.insync.replicas` option to users to give them another
>>> > > > > > choice.
>>> > > > > >
>>> > > > > > Thank you.
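Guozhang's `quorum` definition quoted above can be written out as a small
helper. This is a sketch of the proposed semantics only (the function and its
parameters are hypothetical, not actual Kafka code):

```python
# Sketch of the acks semantics quoted above. Hypothetical helper, not Kafka code.

def required_acks(acks, replication_factor, min_isr, isr_size):
    """Number of replicas that must ack before a produce request succeeds."""
    if acks == "0":
        return 0
    if acks == "1":
        return 1
    if acks == "all":
        return isr_size                      # every current ISR member
    if acks == "quorum":
        majority = replication_factor // 2 + 1
        return max(majority, min_isr)        # >= majority AND >= {min.isr}
    raise ValueError(f"unknown acks value: {acks}")

# With 3 replicas, min.insync.replicas=2 and a full ISR:
assert required_acks("all", 3, 2, isr_size=3) == 3     # waits for all 3
assert required_acks("quorum", 3, 2, isr_size=3) == 2  # better tail latency
# With 5 replicas, min.insync.replicas=2 and the ISR shrunk to 2, "all" needs
# only 2 acks while "quorum" still demands a majority of 3, which is why "all"
# is not necessarily stronger than "quorum":
assert required_acks("all", 5, 2, isr_size=2) == 2
assert required_acks("quorum", 5, 2, isr_size=2) == 3
```

The last two assertions illustrate Guozhang's closing caveat: once the ISR has
shrunk, `all` can be satisfied by fewer replicas than a majority quorum.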
>>> > > > > > Luke
>>> > > > > >
>>> > > > > > On Wed, May 10, 2023 at 8:54 AM Ismael Juma <ism...@juma.me.uk>
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > > > Hi Luke,
>>> > > > > > >
>>> > > > > > > Are you aware of
>>> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledgment
>>> > > > > > > ?
>>> > > > > > >
>>> > > > > > > Ismael
>>> > > > > > >
>>> > > > > > > On Tue, May 9, 2023 at 10:14 PM Luke Chen <show...@gmail.com>
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > Hi all,
>>> > > > > > > >
>>> > > > > > > > I'd like to start a discussion for KIP-926: introducing the
>>> > > > > > > > acks=min.insync.replicas config. This KIP introduces the
>>> > > > > > > > `acks=min.insync.replicas` config value in the producer, to
>>> > > > > > > > improve write throughput while still guaranteeing high
>>> > > > > > > > durability.
>>> > > > > > > >
>>> > > > > > > > Please check the link for more detail:
>>> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-926%3A+introducing+acks%3Dmin.insync.replicas+config
>>> > > > > > > >
>>> > > > > > > > Any feedback is welcome.
>>> > > > > > > >
>>> > > > > > > > Thank you.
>>> > > > > > > > Luke
>>> >
>>> > --
>>> > ========================
>>> > Okada Haruki
>>> > ocadar...@gmail.com
>>> > ========================