Hi Haruki,

Yes, this scenario could happen.
I'm thinking we can fix it in step 6: when the controller tries to get the
LEO from the B and C replicas, each replica should stop the fetcher for this
partition immediately, before returning its LEO (see the sketch below).
As for whether we need quorum-based replication or not, we can discuss that
in another KIP. I'm still thinking about it.
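
To make that concrete, here is a rough sketch of the step-6 fencing
(hypothetical handler and field names, not actual broker code):

// Hypothetical follower-side handler for the controller's LEO query.
// The fetcher must stop BEFORE the LEO is returned, so the follower cannot
// replicate past the offset the controller bases its election decision on.
long handleLeoQuery(TopicPartition tp) {
    // Fence first: stop replicating this partition from the (old) leader.
    replicaFetcherManager.removeFetcherForPartitions(Collections.singleton(tp));
    // Only now is the reported LEO stable for the election decision.
    return localLog(tp).logEndOffset();
}

With this ordering, step 7 of your sequence can't happen: B stops fetching
before the controller compares LEOs, so it can no longer catch up to
offset=2 from A in the meantime.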

Thank you.
Luke


On Fri, May 12, 2023 at 3:59 PM Luke Chen <show...@gmail.com> wrote:

> Hi David,
>
> > It can't be in another KIP as it is required for your proposal to work.
> This is also an important part to discuss as it requires the controller to
> do more operations on leader changes.
>
> Yes, I know this is a requirement for this KIP to work, and it needs a lot
> of discussion.
> That's why I think it'd be better to put that content and discussion in a
> separate KIP.
> I've put the status of this KIP as "pending" and added a note on the top
> of this KIP:
>
> Note: This KIP requires a leader election change, which will be proposed
> in another KIP.
>
> Thanks.
> Luke
>
> On Thu, May 11, 2023 at 11:43 PM Alexandre Dupriez <
> alexandre.dupr...@gmail.com> wrote:
>
>> Hi, Luke,
>>
>> Thanks for your reply.
>>
>> 102. Whether such a replica could become a leader depends on what the
>> end-user wants to use it for and what tradeoffs they wish to make down
>> the line.
>>
>> There are cases, for instance with heterogeneous or interregional
>> networks, where the difference in latency between subsets of brokers
>> can be high enough for the "slow replicas" to have a detrimental
>> impact on the ISR traffic they take part in. This can justify
>> permanently segregating them from ISR traffic by design. And, an
>> end-user could still prefer to have these "slow replicas" versus
>> alternative approaches such as mirroring for the benefits they can
>> bring, for instance: a) they belong to the same cluster with no added
>> admin and ops, b) benefit from a direct, simpler replication path, c)
>> require less infrastructure than a mirrored solution, d) could become
>> unclean leaders for failovers under disaster scenarios such as a
>> regional service outage.
>>
>> Thanks,
>> Alexandre
>>
>> On Thu, May 11, 2023 at 14:57, Haruki Okada <ocadar...@gmail.com> wrote:
>> >
>> > Hi, Luke.
>> >
>> > Though this proposal definitely looks interesting, as others pointed
>> > out, the leader election implementation would be the hard part.
>> >
>> > And I think even LEO-based election is not safe; it could easily cause
>> > silent committed-data loss.
>> >
>> > Let's say we have replicas A,B,C and A is the leader initially, and
>> > min.insync.replicas = 2.
>> >
>> > - 1. Initial
>> >     * A(leo=0), B(leo=0), C(leo=0)
>> > - 2. Produce a message to A
>> >     * A(leo=1), B(leo=0), C(leo=0)
>> > - 3. Another producer produces a message to A (i.e., as a different
>> > batch)
>> >     * A(leo=2), B(leo=0), C(leo=0)
>> > - 4. C replicates the first batch. offset=1 is committed (by
>> > acks=min.insync.replicas)
>> >     * A(leo=2), B(leo=0), C(leo=1)
>> > - 5. A loses ZK session (or broker session timeout in KRaft)
>> > - 6. The controller (regardless of ZK/KRaft) doesn't store LEOs itself,
>> > so it needs to interact with each replica. It detects that C has the
>> > largest LEO and decides to elect C as the new leader
>> > - 7. Before the leader election is performed, B replicates offset=1,2
>> > from A. offset=2 is committed
>> >     * This is possible because even if A lost its ZK session, it could
>> > still handle fetch requests for a while.
>> > - 8. The controller elects C as the new leader. B truncates its log;
>> > offset=2 is lost silently.
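>> >
>> > A self-contained toy model (plain Java; names are made up and it only
>> > tracks LEO bookkeeping) that replays steps 1-8 and prints the loss:
>> >
>> > import java.util.HashMap;
>> > import java.util.Map;
>> >
>> > public class LeoRaceDemo {
>> >     public static void main(String[] args) {
>> >         Map<String, Long> leo = new HashMap<>();
>> >         leo.put("A", 0L); leo.put("B", 0L); leo.put("C", 0L); // step 1
>> >         leo.put("A", 2L); // steps 2-3: two batches land on leader A
>> >         leo.put("C", 1L); // step 4: C fetches batch 1; offset=1 committed
>> >         // step 5: A loses its session
>> >         // step 6: controller reads follower LEOs (B=0, C=1) and picks C
>> >         String newLeader = leo.get("C") >= leo.get("B") ? "C" : "B";
>> >         leo.put("B", 2L); // step 7: B keeps fetching from A; offset=2 "committed"
>> >         // step 8: B truncates to the new leader's LEO; offset=2 is gone
>> >         long truncateTo = Math.min(leo.get("B"), leo.get(newLeader));
>> >         System.out.println("new leader=" + newLeader
>> >                 + ", B truncates 2 -> " + truncateTo); // prints: C, 2 -> 1
>> >     }
>> > }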
>> >
>> > I have a feeling that we need quorum-based data replication, as Divij
>> > pointed out.
>> >
>> >
>> > On Thu, May 11, 2023 at 22:33, David Jacot <dja...@confluent.io.invalid> wrote:
>> >
>> > > Hi Luke,
>> > >
>> > > > Yes, on second thought, I think the new leader election is required
>> > > > to work for this new acks option. I'll think about it and open another
>> > > > KIP for it.
>> > >
>> > > It can't be in another KIP as it is required for your proposal to
>> > > work. This is also an important part to discuss as it requires the
>> > > controller to do more operations on leader changes.
>> > >
>> > > Cheers,
>> > > David
>> > >
>> > > On Thu, May 11, 2023 at 2:44 PM Luke Chen <show...@gmail.com> wrote:
>> > >
>> > > > Hi Ismael,
>> > > > Yes, on second thought, I think the new leader election is required
>> > > > to work for this new acks option. I'll think about it and open another
>> > > > KIP for it.
>> > > >
>> > > > Hi Divij,
>> > > > Yes, I agree with all of them. I'll think about it and let you know
>> > > > how we can work together.
>> > > >
>> > > > Hi Alexandre,
>> > > > > 100. The KIP makes one statement which may be considered critical:
>> > > > > "Note that in the acks=min.insync.replicas case, the slow follower
>> > > > > might become out of sync more easily than with acks=all." Would you
>> > > > > have some data on that behaviour when using the new ack semantic? It
>> > > > > would be interesting to analyse, and especially to look at the
>> > > > > percentage of time the number of replicas in the ISR is reduced to
>> > > > > the configured min.insync.replicas.
>> > > >
>> > > > The comparison data would be interesting. I can run a test when it's
>> > > > available (a measurement sketch is below).
>> > > > But this KIP will be deprioritized because there needs to be a
>> > > > prerequisite KIP for it.
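>> > > >
>> > > > A hedged measurement sketch over JMX (the AtMinIsrPartitionCount gauge
>> > > > comes from KIP-427; the port and the 60-second window are placeholders):
>> > > >
>> > > > import javax.management.MBeanServerConnection;
>> > > > import javax.management.ObjectName;
>> > > > import javax.management.remote.JMXConnector;
>> > > > import javax.management.remote.JMXConnectorFactory;
>> > > > import javax.management.remote.JMXServiceURL;
>> > > >
>> > > > public class AtMinIsrSampler {
>> > > >     public static void main(String[] args) throws Exception {
>> > > >         JMXServiceURL url = new JMXServiceURL(
>> > > >             "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"); // placeholder
>> > > >         try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
>> > > >             MBeanServerConnection conn = jmxc.getMBeanServerConnection();
>> > > >             ObjectName gauge = new ObjectName(
>> > > >                 "kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount");
>> > > >             int samples = 60, hits = 0; // 1s resolution, 1-minute window
>> > > >             for (int i = 0; i < samples; i++) {
>> > > >                 Number v = (Number) conn.getAttribute(gauge, "Value");
>> > > >                 if (v.intValue() > 0) hits++; // some partition is at min ISR
>> > > >                 Thread.sleep(1000);
>> > > >             }
>> > > >             System.out.printf("at-min-ISR %.1f%% of the time%n",
>> > > >                 100.0 * hits / samples);
>> > > >         }
>> > > >     }
>> > > > }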
>> > > >
>> > > > > A (perhaps naive) hypothesis would be that the new ack semantic
>> > > > > indeed provides better produce latency, but at the cost of
>> > > > > precipitating the slowest replica(s) out of the ISR?
>> > > >
>> > > > Yes, it could be.
>> > > >
>> > > > > 101. I understand the impact on produce latency, but I am not sure
>> > > > > about the impact on durability. Is your durability model built
>> > > > > against the replication factor or the number of min insync replicas?
>> > > >
>> > > > Yes, and also the new LEO-based leader election (not proposed yet).
>> > > >
>> > > > > 102. Could a new type of replica which would not be allowed to
>> > > > > enter the ISR be an alternative? Such a replica could attempt
>> > > > > replication on a best-effort basis and would provide the permanent
>> > > > > guarantee not to interfere with foreground traffic.
>> > > >
>> > > > You mean a backup replica, which will never become the leader (never
>> > > > in-sync), right?
>> > > > That's an interesting thought, and it might work as a workaround with
>> > > > the existing leader election. Let me think about it.
>> > > >
>> > > > Hi qiangLiu,
>> > > >
>> > > > > It's a good point that adding this config gets a better P99 latency,
>> > > > > but is this changing the meaning of "in-sync replicas"? Consider a
>> > > > > situation with "replica=3, acks=2": when two brokers fail and leave
>> > > > > only the broker that doesn't have the message, that broker is in
>> > > > > sync, so it will be elected as leader. Will it cause a NOT NOTICED
>> > > > > loss of acked messages?
>> > > >
>> > > > Yes, it will, so the `min.insync.replicas` config at the broker/topic
>> > > > level should be set correctly. In your example, it should be set to 2,
>> > > > so that when 2 replicas are down, no new message writes will succeed
>> > > > (a hedged example follows below).
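>> > > >
>> > > > For example, a hedged AdminClient sketch (topic name and address are
>> > > > placeholders):
>> > > >
>> > > > import java.util.List;
>> > > > import java.util.Map;
>> > > > import java.util.Properties;
>> > > > import org.apache.kafka.clients.admin.Admin;
>> > > > import org.apache.kafka.clients.admin.NewTopic;
>> > > >
>> > > > public class CreateTopicExample {
>> > > >     public static void main(String[] args) throws Exception {
>> > > >         Properties props = new Properties();
>> > > >         props.put("bootstrap.servers", "localhost:9092"); // placeholder
>> > > >         try (Admin admin = Admin.create(props)) {
>> > > >             // RF=3 with min.insync.replicas=2
>> > > >             NewTopic topic = new NewTopic("my-topic", 1, (short) 3)
>> > > >                 .configs(Map.of("min.insync.replicas", "2"));
>> > > >             admin.createTopics(List.of(topic)).all().get();
>> > > >         }
>> > > >         // If 2 of the 3 replicas are down, ISR=1 < 2, so producers
>> > > >         // using acks=all get NotEnoughReplicasException instead of a
>> > > >         // silently under-replicated write.
>> > > >     }
>> > > > }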
>> > > >
>> > > >
>> > > > Thank you.
>> > > > Luke
>> > > >
>> > > >
>> > > > On Thu, May 11, 2023 at 12:16 PM 67 <6...@gd67.com> wrote:
>> > > >
>> > > > > Hi Luke,
>> > > > >
>> > > > >
>> > > > > It's a good point that adding this config gets a better P99 latency,
>> > > > > but is this changing the meaning of "in-sync replicas"? Consider a
>> > > > > situation with "replica=3, acks=2": when two brokers fail and leave
>> > > > > only the broker that doesn't have the message, that broker is in
>> > > > > sync, so it will be elected as leader. Will it cause a *NOT NOTICED*
>> > > > > loss of acked messages?
>> > > > >
>> > > > > qiangLiu
>> > > > >
>> > > > >
>> > > > > On May 10, 2023 at 12:58, "Ismael Juma" <ism...@juma.me.uk> wrote:
>> > > > >
>> > > > >
>> > > > > Hi Luke,
>> > > > >
>> > > > > As discussed in the other KIP, there are some subtleties when it
>> > > > > comes to the semantics of the system if we don't wait for all
>> > > > > members of the ISR before we ack. I don't understand why you say the
>> > > > > leader election question is out of scope - it seems to be a core
>> > > > > aspect to me.
>> > > > >
>> > > > > Ismael
>> > > > >
>> > > > >
>> > > > > On Wed, May 10, 2023, 8:50 AM Luke Chen <show...@gmail.com>
>> wrote:
>> > > > >
>> > > > > > Hi Ismael,
>> > > > > >
>> > > > > > No, I didn't know about this similar KIP! I wish I'd known about
>> > > > > > it, so that I didn't need to spend time writing it again! :(
>> > > > > > I checked the KIP and all the discussions (here:
>> > > > > > <https://lists.apache.org/list?dev@kafka.apache.org:gte=100d:KIP-250>).
>> > > > > > I think the consensus is that adding a client config to
>> > > > > > `acks=quorum` is fine.
>> > > > > > This comment
>> > > > > > <https://lists.apache.org/thread/p77pym5sxpn91r8j364kmmf3qp5g65rn>
>> > > > > > from Guozhang pretty much concluded what I'm trying to do.
>> > > > > >
>> > > > > > *1. Add one more value to client-side acks config:
>> > > > > >    0: no acks needed at all.
>> > > > > >    1: ack from the leader.
>> > > > > >    all: ack from ALL the ISR replicas.
>> > > > > >    quorum: this is the new value; it requires acks from a number of
>> > > > > > ISR replicas no smaller than the majority of the replicas AND no
>> > > > > > smaller than {min.isr}.
>> > > > > > 2. Clarify in the docs that if a user wants to tolerate X failures,
>> > > > > > she needs to set client acks=all or acks=quorum (better tail latency
>> > > > > > than "all") with broker {min.isr} set to X+1; however, "all" is not
>> > > > > > necessarily stronger than "quorum".*
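>> > > > > >
>> > > > > > In code form, the quorum rule boils down to (a toy fragment; the
>> > > > > > variables stand in for the topic's replication factor and min.isr):
>> > > > > >
>> > > > > > // Acks required under the proposed "quorum" value: at least a
>> > > > > > // majority of all replicas AND at least {min.isr}.
>> > > > > > int majority = replicationFactor / 2 + 1;
>> > > > > > int requiredAcks = Math.max(majority, minIsr);
>> > > > > > boolean committed = isrAcksReceived >= requiredAcks;
>> > > > > > // e.g. replicationFactor=5, minIsr=2 -> requiredAcks=3, versus
>> > > > > > // waiting on every ISR member for acks=all (hence the better tail
>> > > > > > // latency).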
>> > > > > >
>> > > > > > Concerns from KIP-250 are:
>> > > > > > 1. Introducing a new LEO-based leader election method. This is not
>> > > > > > clear in KIP-250 and needs more discussion.
>> > > > > > 2. KIP-250 also tried to optimize consumer latency by letting
>> > > > > > consumers read messages beyond the high watermark; there was some
>> > > > > > discussion about how to achieve that, but no conclusion.
>> > > > > >
>> > > > > > Both of the above concerns are out of the scope of my current KIP.
>> > > > > > So, I think it's good to provide this `acks=quorum` or
>> > > > > > `acks=min.insync.replicas` option to give users another choice.
>> > > > > >
>> > > > > >
>> > > > > > Thank you.
>> > > > > > Luke
>> > > > > >
>> > > > > >
>> > > > > > On Wed, May 10, 2023 at 8:54 AM Ismael Juma <ism...@juma.me.uk>
>> > > wrote:
>> > > > > >
>> > > > > > > Hi Luke,
>> > > > > > >
>> > > > > > > Are you aware of
>> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledgment>?
>> > > > > > >
>> > > > > > > Ismael
>> > > > > > >
>> > > > > > > On Tue, May 9, 2023 at 10:14 PM Luke Chen <show...@gmail.com>
>> > > wrote:
>> > > > > > >
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > I'd like to start a discussion for KIP-926: introducing the
>> > > > > > > > acks=min.insync.replicas config. This KIP introduces the
>> > > > > > > > `acks=min.insync.replicas` config value in the producer, to
>> > > > > > > > improve write throughput while still guaranteeing high
>> > > > > > > > durability (a quick sketch is below).
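>> > > > > > > >
>> > > > > > > > A minimal producer sketch with the proposed value (the string
>> > > > > > > > follows the KIP title; no released client supports it yet):
>> > > > > > > >
>> > > > > > > > import java.util.Properties;
>> > > > > > > > import org.apache.kafka.clients.producer.KafkaProducer;
>> > > > > > > > import org.apache.kafka.clients.producer.ProducerConfig;
>> > > > > > > > import org.apache.kafka.clients.producer.ProducerRecord;
>> > > > > > > > import org.apache.kafka.common.serialization.StringSerializer;
>> > > > > > > >
>> > > > > > > > Properties props = new Properties();
>> > > > > > > > props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
>> > > > > > > > props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
>> > > > > > > >     StringSerializer.class.getName());
>> > > > > > > > props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>> > > > > > > >     StringSerializer.class.getName());
>> > > > > > > > // Proposed new value; today's valid values are "0", "1", "all"/"-1".
>> > > > > > > > props.put(ProducerConfig.ACKS_CONFIG, "min.insync.replicas");
>> > > > > > > > try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>> > > > > > > >     producer.send(new ProducerRecord<>("my-topic", "key", "value"));
>> > > > > > > > }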
>> > > > > > > >
>> > > > > > > > Please check the link for more detail:
>> > > > > > > >
>> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-926%3A+introducing+acks%3Dmin.insync.replicas+config
>> > > > > > > >
>> > > > > > > > Any feedback is welcome.
>> > > > > > > >
>> > > > > > > > Thank you.
>> > > > > > > > Luke
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> > ========================
>> > Okada Haruki
>> > ocadar...@gmail.com
>> > ========================
>>
>
