Hi Justine,

Thanks for your reply.

> I think that for folks that want to prioritize availability over
durability, the aggressive recovery strategy from KIP-966 should be
preferable to the old unclean leader election configuration.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery

Yes, I'm aware that we're going to implement the new way of leader election
in KIP-966.
But KIP-966 is not included in v3.7.0.
What I'm worried about is the users who prioritize availability over
durability and enable unclean leader election in ZK mode.
Once they migrate to KRaft, there will be an availability impact whenever
unclean leader election is needed.
And like you said, they can run unclean leader election via the CLI, but by
then availability has already been impacted, which might be unacceptable in
some cases.
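For reference, the manual fallback looks something like this (the bootstrap
server, topic, and partition below are placeholders):

```shell
# Manually trigger unclean leader election for one partition.
# This elects a leader from outside the ISR, so data loss is possible,
# and by the time an operator runs it the partition is already offline.
bin/kafka-leader-election.sh \
  --bootstrap-server localhost:9092 \
  --election-type UNCLEAN \
  --topic my-topic \
  --partition 0
```

That operator-in-the-loop step is exactly the availability gap I'm
describing.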

IMO, we should prioritize this missing feature and include it in a 3.x
release.
Including it in a 3.x release means users can migrate to KRaft in
dual-write mode, and run it for a while to make sure everything works fine,
before they decide to upgrade to 4.0.

Does that make sense?

Thanks.
Luke

On Tue, Dec 19, 2023 at 12:15 AM Justine Olshan
<jols...@confluent.io.invalid> wrote:

> Hey Luke --
>
> There were some previous discussions on the mailing list about this but
> looks like we didn't file the ticket
> https://lists.apache.org/thread/sqsssos1d9whgmo92vdn81n9r5woy1wk
>
> When I asked some of the folks who worked on KRaft about this, they
> communicated to me that it was intentional to make unclean leader election
> a manual action.
>
> I think that for folks that want to prioritize availability over
> durability, the aggressive recovery strategy from KIP-966 should be
> preferable to the old unclean leader election configuration.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery
>
> Let me know if we don't think this is sufficient.
>
> Justine
>
> On Mon, Dec 18, 2023 at 4:39 AM Luke Chen <show...@gmail.com> wrote:
>
> > Hi all,
> >
> > We found that currently (on the latest trunk branch), unclean leader
> > election is not supported in KRaft mode.
> > That is, when users enable `unclean.leader.election.enable` in KRaft
> > mode, the config doesn't take effect, and the cluster just behaves as if
> > `unclean.leader.election.enable` were disabled.
> > KAFKA-12670 <https://issues.apache.org/jira/browse/KAFKA-12670> was
> > opened for this and is still not resolved.
> >
> > I think this is a regression in KRaft mode, and we should complete this
> > missing feature in a 3.x release, instead of adding it in 4.0.
> > Does anyone know the status of this issue?
> >
> > Thanks.
> > Luke
> >
> >
> >
> > On Mon, Nov 27, 2023 at 4:38 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Fri, Nov 24, 2023, at 03:47, Anton Agestam wrote:
> > > > In your last message you wrote:
> > > >
> > > > > But, on the KRaft side, I still maintain that nothing is missing
> > except
> > > > > JBOD, which we already have a plan for.
> > > >
> > > > But earlier in this thread you mentioned an issue with "torn writes",
> > > > possibly missing tests, as well as the fact that the recommended
> method
> > > of
> > > > replacing controller nodes is undocumented. Would you mind clarifying
> > > what
> > > > your stance is on these three issues? Do you think that they are
> > > important
> > > > enablers of upgrade paths or not?
> > >
> > > Hi Anton,
> > >
> > > There shouldn't be anything blocking controller disk replacement now.
> > From
> > > memory (not looking at the code now), we do log recovery on our single
> > log
> > > directory every time we start the controller, so it should handle
> partial
> > > records there. I do agree that a test would be good, and some
> > > documentation. I'll probably take a look at that this week if I get
> some
> > > time.
> > >
> > > > > Well, the line was drawn in KIP-833. If we redraw it, what is to
> stop
> > > us
> > > > > from redrawing it again and again?
> > > >
> > > > I'm fairly new to the Kafka community so please forgive me if I'm
> > missing
> > > > things that have been said in earlier discussions, but reading up on
> > that
> > > > KIP I see it has language like "Note: this timeline is very rough and
> > > > subject to change." in the section on versions, but it also says "As
> > > > outlined above, we expect to close these gaps soon" with relation to
> > the
> > > > outstanding features. From my perspective this doesn't really look
> like
> > > an
> > > > agreement that dynamic quorum membership changes shall not be a
> blocker
> > > for
> > > > 4.0.
> > >
> > > The timeline was rough because we wrote that in 2022, trying to look
> > > forward multiple releases. The gaps that were discussed have all been
> > > closed -- except for JBOD, which we are working on this quarter.
> > >
> > > The set of features needed for 4.0 is very clearly described in
> KIP-833.
> > > There's no uncertainty on that point.
> > >
> > > >
> > > > To answer the specific question you pose here, "what is to stop us
> from
> > > > redrawing it again and again?", wouldn't the suggestion of parallel
> > work
> > > > lanes brought up by Josep address this concern?
> > > >
> > >
> > > It's very important not to fragment the community by supporting
> multiple
> > > long-running branch lines. At the end of the day, once branch 3's time
> > has
> > > come, it needs to fade away, just like JDK 6 support or the old Scala
> > > producer.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > > BR,
> > > > Anton
> > > >
> > > > Den tors 23 nov. 2023 kl 05:48 skrev Colin McCabe <
> cmcc...@apache.org
> > >:
> > > >
> > > >> On Tue, Nov 21, 2023, at 19:30, Luke Chen wrote:
> > > >> > Yes, KIP-853 and disk failure support are both very important
> > missing
> > > >> > features. For the disk failure support, I don't think this is a
> > > >> > "good-to-have" feature; it should be a "must-have" IMO. We can't
> > > announce
> > > >> > the 4.0 release without a good solution for disk failure in KRaft.
> > > >>
> > > >> Hi Luke,
> > > >>
> > > >> Thanks for the reply.
> > > >>
> > > >> Controller disk failure support is not missing from KRaft. I
> described
> > > how
> > > >> to handle controller disk failures earlier in this thread.
> > > >>
> > > >> I should note here that the broker in ZooKeeper mode also requires
> > > manual
> > > >> handling of disk failures. Restarting a broker with the same ID, but
> > an
> > > >> empty disk, breaks the invariants of replication when in ZK mode.
> > > Consider:
> > > >>
> > > >> 1. Broker 1 goes down. A ZK state change notification for /brokers
> > fires
> > > >> and goes on the controller queue.
> > > >>
> > > >> 2. Broker 1 comes back up with an empty disk.
> > > >>
> > > >> 3. The controller processes the zk state change notification for
> > > /brokers.
> > > >> Since broker 1 is up no action is taken.
> > > >>
> > > >> 4. Now broker 1 is in the ISR for any partitions it was in
> > > >> previously, but has no data. If it is or becomes leader for any
> > > >> partitions, irreversible data loss will occur.
> > > >>
> > > >> This problem is more than theoretical. We at Confluent have observed
> > it
> > > in
> > > >> production and put in place special workarounds for the ZK clusters
> we
> > > >> still have.
> > > >>
> > > >> KRaft has never had this problem because brokers are removed from
> ISRs
> > > >> when a new incarnation of the broker registers.
> > > >>
> > > >> So perhaps ZK mode is not ready for production for Aiven? Since disk
> > > >> failures do in fact require special handling there. (And/or bringing
> > up
> > > new
> > > >> nodes with empty disks, which seems to be their main concern.)
> > > >>
> > > >> >
> > > >> > It’s also worth thinking about how Apache Kafka users who depend
> on
> > > JBOD
> > > >> > might look at the risks of not having a 3.8 release. JBOD support
> on
> > > >> KRaft
> > > >> > is planned to be added in 3.7, and is still in progress so far. So
> > > it’s
> > > >> > hard to say whether it’s a blocker or not. But in practice, even if the
> > > feature
> > > >> is
> > > >> > made into 3.7 in time, a lot of new code for this feature is
> > unlikely
> > > to
> > > >> be
> > > >> > entirely bug free. We need to maintain the confidence of those
> > users,
> > > and
> > > >> > forcing them to migrate through 3.7 where this new code is hardly
> > > >> > battle-tested doesn’t appear to do that.
> > > >> >
> > > >>
> > > >> As Ismael said, if there are JBOD bugs in 3.7, we will do follow-on
> > > point
> > > >> releases to address them.
> > > >>
> > > >> > Our goal for 4.0 should be that all the “main” features in KRaft
> are
> > > in
> > > >> > production ready state. To reach the goal, I think having one more
> > > >> release
> > > >> > makes sense. We can have different opinions about what the “main
> > > >> features”
> > > >> > in KRaft are, but we should all agree, JBOD is one of them.
> > > >>
> > > >> The current plan is for JBOD to be production-ready in the 3.7
> branch.
> > > >>
> > > >> The other features of KRaft have been in production-ready state
> since
> > > the
> > > >> 3.3 release. (Well, except for delegation tokens and SCRAM, which
> were
> > > >> implemented in 3.5 and 3.6)
> > > >>
> > > >> > I totally agree with you. We can keep delaying the 4.0 release
> > > forever.
> > > >> I'd
> > > >> > also like to draw a line to it. So, in my opinion, the 3.8 release
> > is
> > > the
> > > >> > line. No 3.9, 3.10 releases after that. If this is the decision,
> > will
> > > >> your
> > > >> > concern about this infinite loop disappear?
> > > >>
> > > >> Well, the line was drawn in KIP-833. If we redraw it, what is to
> stop
> > us
> > > >> from redrawing it again and again?
> > > >>
> > > >> >
> > > >> > Final note: Speaking of the missing features, I can always
> cooperate
> > > with
> > > >> > you and all other community contributors to make them happen, like
> > we
> > > >> have
> > > >> > discussed earlier. Just let me know.
> > > >> >
> > > >>
> > > >> Thanks, Luke. I appreciate the offer.
> > > >>
> > > >> But, on the KRaft side, I still maintain that nothing is missing
> > except
> > > >> JBOD, which we already have a plan for.
> > > >>
> > > >> best,
> > > >> Colin
> > > >>
> > > >>
> > > >> > Thank you.
> > > >> > Luke
> > > >> >
> > > >> > On Wed, Nov 22, 2023 at 2:54 AM Colin McCabe <cmcc...@apache.org>
> > > wrote:
> > > >> >
> > > >> >> On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote:
> > > >> >> > Hi Colin,
> > > >> >> >
> > > >> >> > I think it's great that Confluent runs KRaft clusters in
> > > production,
> > > >> >> > and it means that it is production ready for Confluent and its
> > > users.
> > > >> >> > But luckily for Kafka, the community is bigger than this (self
> > > managed
> > > >> >> > in the cloud or on-prem, or customers of other SaaS companies).
> > > >> >>
> > > >> >> Hi Josep,
> > > >> >>
> > > >> >> Confluent is not the only company using or developing KRaft. Most
> > of
> > > the
> > > >> >> big organizations developing Kafka are involved. I mentioned
> > > Confluent's
> > > >> >> deployments because I wanted to be clear that KRaft mode is not
> > > >> >> experimental or new. Talking about software in production is a
> good
> > > way
> > > >> to
> > > >> >> clear up these misconceptions.
> > > >> >>
> > > >> >> Indeed, KRaft mode is many years old. It started around 2020, and
> > > >> >> became production-ready in AK 3.3 in 2022. ZK mode was deprecated
> > > >> >> in AK 3.5, which
> > > >> >> was released June 2023. If we release AK 4.0 around April (or
> > maybe a
> > > >> month
> > > >> >> or two later) then that will be almost a full year between
> > > deprecation
> > > >> and
> > > >> >> removal of ZK mode. We've talked about this a lot, in KIPs, in
> > Apache
> > > >> blog
> > > >> >> posts, at conferences, and so forth.
> > > >> >>
> > > >> >> > We've heard at least from 1 SaaS company, Aiven (disclaimer, it
> > is
> > > my
> > > >> >> > employer) where the current feature set makes it not trivial to
> > > >> >> > migrate. This same issue might happen not only at Aiven but
> with
> > > any
> > > >> >> > user of Kafka who uses immutable infrastructure.
> > > >> >>
> > > >> >> Can you discuss why you feel it is "not trivial to migrate"? From
> > the
> > > >> >> discussion above, the main gap is that we should improve the
> > > >> documentation
> > > >> >> for handling failed disks.
> > > >> >>
> > > >> >> > Another case is for users that have hundreds (or more) of
> > > >> >> > clusters and more than 100k nodes, and experience node failures
> > > >> >> > multiple times during a single day. In this situation, not
> > > >> >> > having KIP-853 makes these power users unable to join the game,
> > > >> >> > as introducing a new error-prone manual operation (or one that
> > > >> >> > needs to be automated) is usually a huge no-go.
> > > >> >>
> > > >> >> We have thousands of KRaft clusters in production and haven't
> seen
> > > these
> > > >> >> problems, as I described above.
> > > >> >>
> > > >> >> best,
> > > >> >> Colin
> > > >> >>
> > > >> >> >
> > > >> >> > But I hear the concerns of delaying 4.0 for another 3 to 4
> > months.
> > > >> >> > Would it help if we aimed at shortening the timeline for 3.8.0
> > > >> >> > and started with 4.0.0 a bit earlier?
> > > >> >> > Maybe we could work on 3.8.0 almost in parallel with 4.0.0:
> > > >> >> > - Start with 3.8.0 release process
> > > >> >> > - After a small time (let's say a week) create the release
> branch
> > > >> >> > - Start with 4.0.0 release process as usual
> > > >> >> > - Cherry pick KRaft related issues to 3.8.0
> > > >> >> > - Release 3.8.0
> > > >> >> > I suspect 4.0.0 will need a bit more time than usual to ensure
> > the
> > > >> code
> > > >> >> > is cleaned up of deprecated classes and methods on top of the
> > usual
> > > >> >> > work we have. For this reason I think there would be enough
> time
> > > >> >> > between releasing 3.8.0 and 4.0.0.
> > > >> >> >
> > > >> >> > What do you all think?
> > > >> >> >
> > > >> >> > Best,
> > > >> >> > Josep Prat
> > > >> >> >
> > > >> >> > On 2023/11/20 20:03:18 Colin McCabe wrote:
> > > >> >> >> Hi Josep,
> > > >> >> >>
> > > >> >> >> I think there is some confusion here. Quorum reconfiguration
> is
> > > not
> > > >> >> needed for KRaft to become production ready. Confluent runs
> > > thousands of
> > > >> >> KRaft clusters without quorum reconfiguration, and has for years.
> > > While
> > > >> >> dynamic quorum reconfiguration is a nice feature, it doesn't
> block
> > > >> >> anything: not migration, not deployment. As best as I understand
> > it,
> > > the
> > > >> >> use-case Aiven has isn't even reconfiguration per se, just
> wiping a
> > > >> disk.
> > > >> >> There are ways to handle this -- I discussed some earlier in the
> > > >> thread. I
> > > >> >> think it would be productive to continue that discussion --
> > > especially
> > > >> the
> > > >> >> part around documentation and testing of these cases.
> > > >> >> >>
> > > >> >> >> A lot of people have done a lot of work to get Kafka 4.0
> ready.
> > I
> > > >> would
> > > >> >> not want to delay that because we want an additional feature. And
> > we
> > > >> will
> > > >> >> always want additional features. So I am concerned we will end up
> > in
> > > an
> > > >> >> infinite loop of people asking for "just one more feature" before
> > > they
> > > >> >> migrate.
> > > >> >> >>
> > > >> >> >> best,
> > > >> >> >> Colin
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote:
> > > >> >> >> > Hi all,
> > > >> >> >> >
> > > >> >> >> > I wanted to share my opinion regarding this topic. I know
> some
> > > >> >> >> > discussions happened some time ago (over a year) but I
> believe
> > > it's
> > > >> >> >> > wise to reflect and re-evaluate if those decisions are still
> > > valid.
> > > >> >> >> > KRaft, as of Kafka 3.6.x and 3.7.x, does not yet have
> > > >> >> >> > feature parity with ZooKeeper. By dropping ZooKeeper
> > > >> >> >> > altogether before achieving such parity, we are opening the
> > > >> >> >> > door to leaving a chunk of Apache Kafka users without an
> > > >> >> >> > easy way to upgrade to 4.0.
> > > >> >> >> > In favor of making upgrades as smooth as possible, I propose
> > > >> >> >> > to have a Kafka version where KIP-853 is merged and ZooKeeper
> > > >> >> >> > is still supported. This will enable community members who
> > > >> >> >> > can't migrate to KRaft yet to do so in a safe way (rolling
> > > >> >> >> > back if something goes wrong). Additionally, this will give
> > > >> >> >> > us more confidence that KRaft can replace ZooKeeper
> > > >> >> >> > successfully without any big problems, by discovering and
> > > >> >> >> > fixing bugs or by confirming that KRaft works as expected.
> > > >> >> >> > For this I strongly believe we should have a 3.8.x version
> > > before
> > > >> >> 4.0.x.
> > > >> >> >> >
> > > >> >> >> > What do others think in this regard?
> > > >> >> >> >
> > > >> >> >> > Best,
> > > >> >> >> >
> > > >> >> >> > On 2023/11/14 20:47:10 Colin McCabe wrote:
> > > >> >> >> >> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote:
> > > >> >> >> >> > Hi Colin,
> > > >> >> >> >> >
> > > >> >> >> >> > Thank you for your thoughtful and comprehensive response.
> > > >> >> >> >> >
> > > >> >> >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We
> > discussed
> > > >> this
> > > >> >> in
> > > >> >> >> >> >> several KIPs that happened this year and last year. The
> > most
> > > >> >> notable was
> > > >> >> >> >> >> probably KIP-866, which was approved in May 2022.
> > > >> >> >> >> >
> > > >> >> >> >> > I understand this is the case, I'm raising my concern
> > > because I
> > > >> was
> > > >> >> >> >> > foreseeing some major pain points as a consequence of
> this
> > > >> >> decision. Just
> > > >> >> >> >> > to make it clear though: I am not asking for anyone to do
> > > work
> > > >> for
> > > >> >> me, and
> > > >> >> >> >> > I understand the limitations of resources available to
> > > implement
> > > >> >> features.
> > > >> >> >> >> > What I was asking is rather to consider the implications
> of
> > > >> >> _removing_
> > > >> >> >> >> > features before there exists a replacement for them.
> > > >> >> >> >> >
> > > >> >> >> >> > I understand that the timeframe for 3.7 isn't feasible,
> and
> > > >> >> because of that
> > > >> >> >> >> > I think what I was asking is rather: can we make sure
> that
> > > there
> > > >> >> are more
> > > >> >> >> >> > 3.x releases until controller quorum online resizing is
> > > >> >> implemented?
> > > >> >> >> >> >
> > > >> >> >> >> > From your response, I gather that your stance is that
> it's
> > > >> >> important to
> > > >> >> >> >> > drop ZK support sooner rather than later and that the
> > > necessary
> > > >> >> pieces for
> > > >> >> >> >> > doing so are already in place.
> > > >> >> >> >>
> > > >> >> >> >> Hi Anton,
> > > >> >> >> >>
> > > >> >> >> >> Yes. I'm basically just repeating what we agreed upon in
> 2022
> > > as
> > > >> >> part of KIP-833.
> > > >> >> >> >>
> > > >> >> >> >> >
> > > >> >> >> >> > ---
> > > >> >> >> >> >
> > > >> >> >> >> > I want to make sure I've understood your suggested
> sequence
> > > for
> > > >> >> controller
> > > >> >> >> >> > node replacement. I hope the mentions of Kubernetes are
> > > rather
> > > >> for
> > > >> >> examples
> > > >> >> >> >> > of how to carry things out, rather than saying "this is
> > only
> > > >> >> supported on
> > > >> >> >> >> > Kubernetes"?
> > > >> >> >> >>
> > > >> >> >> >> Apache Kafka is supported in lots of environments,
> including
> > > >> non-k8s
> > > >> >> ones. I was just pointing out that using k8s means that you
> control
> > > your
> > > >> >> own DNS resolution, which simplifies matters. If you don't
> control
> > > DNS
> > > >> >> there are some extra steps for changing the quorum voters.
> > > >> >> >> >>
> > > >> >> >> >> >
> > > >> >> >> >> > Given we have three existing nodes as such:
> > > >> >> >> >> >
> > > >> >> >> >> > - a.local -> 192.168.0.100
> > > >> >> >> >> > - b.local -> 192.168.0.101
> > > >> >> >> >> > - c.local -> 192.168.0.102
> > > >> >> >> >> >
> > > >> >> >> >> > As well as a candidate node 192.168.0.103 that we want to
> > > >> replace
> > > >> >> for the
> > > >> >> >> >> > role of c.local.
> > > >> >> >> >> >
> > > >> >> >> >> > 1. Shut down controller process on node .102 (to make
> sure
> > we
> > > >> >> don't "go
> > > >> >> >> >> > back in time").
> > > >> >> >> >> > 2. rsync state from leader to .103.
> > > >> >> >> >> > 3. Start controller process on .103.
> > > >> >> >> >> > 4. Point the c.local entry at .103.
> > > >> >> >> >> >
> > > >> >> >> >> > I have a few questions about this sequence:
> > > >> >> >> >> >
> > > >> >> >> >> > 1. Would this sequence be safe against leadership
> changes?
> > > >> >> >> >> >
> > > >> >> >> >>
> > > >> >> >> >> If the leader changes, the new leader should have all of
> the
> > > >> >> committed entries that the old leader had.
> > > >> >> >> >>
> > > >> >> >> >> > 2. Does it work
> > > >> >> >> >>
> > > >> >> >> >> Probably the biggest issue is dealing with "torn writes"
> that
> > > >> happen
> > > >> >> because you're copying the current log segment while it's being
> > > written
> > > >> to.
> > > >> >> The system should be robust against this. However, we don't
> > > regularly do
> > > >> >> this, so there hasn't been a lot of testing.
> > > >> >> >> >>
> > > >> >> >> >> I think Jose had a PR for improving the handling of this
> > which
> > > we
> > > >> >> might want to dig up. We'd want the system to auto-truncate the
> > > partial
> > > >> >> record at the end of the log, if there is one.
> > > >> >> >> >>
> > > >> >> >> >> > 3. By "state", do we mean `metadata.log.dir`? Something
> > else?
> > > >> >> >> >>
> > > >> >> >> >> Yes, the state of the metadata.log.dir. Keep in mind you
> will
> > > need
> > > >> >> to change the node ID in meta.properties after copying, of
> course.
> > > >> >> >> >>
> > > >> >> >> >> > 4. What are the effects on cluster availability? (I think
> > > this
> > > >> is
> > > >> >> the same
> > > >> >> >> >> > as asking what happens if a or b crashes during the
> > process,
> > > or
> > > >> if
> > > >> >> network
> > > >> >> >> >> > partitions occur).
> > > >> >> >> >>
> > > >> >> >> >> Cluster metadata state tends to be pretty small. typically
> a
> > > >> hundred
> > > >> >> megabytes or so. Therefore, I do not think it will take more
> than a
> > > >> second
> > > >> >> or two to copy from one node to another. However, if you do
> > > experience a
> > > >> >> crash when one node out of three is down, then you will be
> > > unavailable
> > > >> >> until you can bring up a second node to regain a majority.
> > > >> >> >> >>
> > > >> >> >> >> >
> > > >> >> >> >> > ---
> > > >> >> >> >> >
> > > >> >> >> >> > If this is considered the official way of handling
> > controller
> > > >> node
> > > >> >> >> >> > replacements, does it make sense to improve documentation
> > in
> > > >> this
> > > >> >> area? Is
> > > >> >> >> >> > there already a plan for this documentation laid out in
> > some
> > > >> >> KIPs? This is
> > > >> >> >> >> > something I'd be happy to contribute to.
> > > >> >> >> >> >
> > > >> >> >> >>
> > > >> >> >> >> Yes, I think we should have official documentation about
> > this.
> > > >> We'd
> > > >> >> be happy to review anything in that area.
> > > >> >> >> >>
> > > >> >> >> >> >> To circle back to KIP-853, I think it stands a good
> chance
> > > of
> > > >> >> making it
> > > >> >> >> >> >> into AK 4.0.
> > > >> >> >> >> >
> > > >> >> >> >> > This sounds good, but the point I was making was if we
> > could
> > > >> have
> > > >> >> a release
> > > >> >> >> >> > with both KRaft and ZK supporting this feature to ease
> the
> > > >> >> migration out of
> > > >> >> >> >> > ZK.
> > > >> >> >> >> >
> > > >> >> >> >>
> > > >> >> >> >> The problem is, supporting multiple controller
> > implementations
> > > is
> > > >> a
> > > >> >> huge burden. So we don't want to extend the 3.x release past the
> > > point
> > > >> >> that's needed to complete all the must-dos (SCRAM, delegation
> > tokens,
> > > >> JBOD).
> > > >> >> >> >>
> > > >> >> >> >> best,
> > > >> >> >> >> Colin
> > > >> >> >> >>
> > > >> >> >> >>
> > > >> >> >> >> > BR,
> > > >> >> >> >> > Anton
> > > >> >> >> >> >
> > > >> >> >> >> > Den tors 9 nov. 2023 kl 23:04 skrev Colin McCabe <
> > > >> >> cmcc...@apache.org>:
> > > >> >> >> >> >
> > > >> >> >> >> >> Hi Anton,
> > > >> >> >> >> >>
> > > >> >> >> >> >> It rarely makes sense to scale up and down the number of
> > > >> >> controller nodes
> > > >> >> >> >> >> in the cluster. Only one controller node will be active
> at
> > > any
> > > >> >> given time.
> > > >> >> >> >> >> The main reason to use 5 nodes would be to be able to
> > > tolerate
> > > >> 2
> > > >> >> failures
> > > >> >> >> >> >> instead of 1.
> > > >> >> >> >> >>
> > > >> >> >> >> >> At Confluent, we generally run KRaft with 3 controllers.
> > We
> > > >> have
> > > >> >> not seen
> > > >> >> >> >> >> problems with this setup, even with thousands of
> clusters.
> > > We
> > > >> have
> > > >> >> >> >> >> discussed using 5 node controller clusters on certain
> very
> > > big
> > > >> >> clusters,
> > > >> >> >> >> >> but we haven't done that yet. This is all very similar
> to
> > > ZK,
> > > >> >> where most
> > > >> >> >> >> >> deployments were 3 nodes as well.
> > > >> >> >> >> >>
> > > >> >> >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We
> > discussed
> > > >> this
> > > >> >> in
> > > >> >> >> >> >> several KIPs that happened this year and last year. The
> > most
> > > >> >> notable was
> > > >> >> >> >> >> probably KIP-866, which was approved in May 2022.
> > > >> >> >> >> >>
> > > >> >> >> >> >> Many users these days run in a Kubernetes environment
> > where
> > > >> >> Kubernetes
> > > >> >> >> >> >> actually controls the DNS. This makes changing the set
> of
> > > >> voters
> > > >> >> less
> > > >> >> >> >> >> important than it was historically.
> > > >> >> >> >> >>
> > > >> >> >> >> >> For example, in a world with static DNS, you might have
> to
> > > >> change
> > > >> >> the
> > > >> >> >> >> >> controller.quorum.voters setting from:
> > > >> >> >> >> >>
> > > >> >> >> >> >> 100@a.local:9073,101@b.local:9073,102@c.local:9073
> > > >> >> >> >> >>
> > > >> >> >> >> >> to:
> > > >> >> >> >> >>
> > > >> >> >> >> >> 100@a.local:9073,101@b.local:9073,102@d.local:9073
> > > >> >> >> >> >>
> > > >> >> >> >> >> In a world with k8s controlling the DNS, you simply remap
> > > >> >> >> >> >> c.local to point to the IP address of your new pod for
> > > >> >> >> >> >> controller 102, and you're done. No need to update
> > > >> >> >> >> >> controller.quorum.voters.
> > > >> >> >> >> >>
> > > >> >> >> >> >> Another question is whether you re-create the pod data
> > from
> > > >> >> scratch every
> > > >> >> >> >> >> time you add a new node. If you store the controller
> data
> > > on an
> > > >> >> EBS volume
> > > >> >> >> >> >> (or cloud-specific equivalent), you really only have to
> > > detach
> > > >> it
> > > >> >> from the
> > > >> >> >> >> >> previous pod and re-attach it to the new pod. k8s also
> > > handles
> > > >> >> this
> > > >> >> >> >> >> automatically, of course.
> > > >> >> >> >> >>
> > > >> >> >> >> >> If you want to reconstruct the full controller pod state
> > > each
> > > >> >> time you
> > > >> >> >> >> >> create a new pod (for example, so that you can use only
> > > >> instance
> > > >> >> storage),
> > > >> >> >> >> >> you should be able to rsync that state from the leader.
> In
> > > >> >> general, the
> > > >> >> >> >> >> invariant that we want to maintain is that the state
> > should
> > > not
> > > >> >> "go back in
> > > >> >> >> >> >> time" -- if controller 102 promised to hold all log data
> > > >> >> >> >> >> up to offset X, it should come back with committed data
> > > >> >> >> >> >> at at least that offset.
> > > >> >> >> >> >>
> > > >> >> >> >> >> There are lots of new features we'd like to implement
> for
> > > >> KRaft,
> > > >> >> and Kafka
> > > >> >> >> >> >> in general. If you have some you really would like to
> > see, I
> > > >> >> think everyone
> > > >> >> >> >> >> in the community would be happy to work with you. The
> flip
> > > >> side,
> > > >> >> of course,
> > > >> >> >> >> >> is that since there are an unlimited number of features
> we
> > > >> could
> > > >> >> do, we
> > > >> >> >> >> >> can't really block the release for any one feature.
> > > >> >> >> >> >>
> > > >> >> >> >> >> To circle back to KIP-853, I think it stands a good
> chance
> > > of
> > > >> >> making it
> > > >> >> >> >> >> into AK 4.0. Jose, Alyssa, and some other people have
> > > worked on
> > > >> >> it. It
> > > >> >> >> >> >> definitely won't make it into 3.7, since we have only a
> > few
> > > >> weeks
> > > >> >> left
> > > >> >> >> >> >> before that release happens.
> > > >> >> >> >> >>
> > > >> >> >> >> >> best,
> > > >> >> >> >> >> Colin
> > > >> >> >> >> >>
> > > >> >> >> >> >>
> > > >> >> >> >> >> On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote:
> > > >> >> >> >> >> > Hi Luke,
> > > >> >> >> >> >> >
> > > >> >> >> >> >> > We have been looking into what switching from ZK to
> > KRaft
> > > >> will
> > > >> >> mean for
> > > >> >> >> >> >> > Aiven.
> > > >> >> >> >> >> >
> > > >> >> >> >> >> > We heavily depend on an “immutable infrastructure”
> model
> > > for
> > > >> >> deployments.
> > > >> >> >> >> >> > This means that, when we perform upgrades, we
> introduce
> > > new
> > > >> >> nodes to our
> > > >> >> >> >> >> > clusters, scale the cluster up to incorporate the new
> > > nodes,
> > > >> >> and then
> > > >> >> >> >> >> phase
> > > >> >> >> >> >> > the old ones out once all partitions are moved to the
> > new
> > > >> >> generation.
> > > >> >> >> >> >> This
> > > >> >> >> >> >> > allows us, and anyone else using a similar model, to
> do
> > > >> >> upgrades as well
> > > >> >> >> >> >> as
> > > >> >> >> >> >> > cluster resizing with zero downtime.
>
> Reading up on KRaft and the ZK-to-KRaft migration path, this is somewhat
> worrying for us. It seems like, if KIP-853 is not included prior to
> dropping support for ZK, we will essentially have no satisfying upgrade
> path. Even if KIP-853 is included in 4.0, I'm unsure if that would allow
> a migration path for us, since a new cluster generation would not be
> able to use ZK during the migration step.
> On the other hand, if KIP-853 was released in a version prior to
> dropping ZK support, because it allows online resizing of KRaft
> clusters, this would allow us and others that use an immutable
> infrastructure deployment model to provide a zero downtime migration
> path.
>
> For that reason, we'd like to raise awareness around this issue and
> encourage considering the implementation of KIP-853 or equivalent a
> blocker not only for 4.0, but for the last version prior to 4.0.
>
> BR,
> Anton
>
> On 2023/10/11 12:17:23 Luke Chen wrote:
>> Hi all,
>>
>> While Kafka 3.6.0 is released, I’d like to start the discussion for the
>> “road to Kafka 4.0”. Based on the plan in KIP-833
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7>,
>> the next release 3.7 will be the final release before moving to Kafka
>> 4.0 to remove Zookeeper from Kafka. Before making this major change,
>> I'd like to get consensus on the "must-have features/fixes for Kafka
>> 4.0", to avoid some users being surprised when upgrading to Kafka 4.0.
>> The intent is to have clear communication about what to expect in the
>> following months. In particular, we should be signaling what features
>> and configurations are not supported, or at risk (if no one is able to
>> add support or fix known bugs).
>>
>> Here is the JIRA tickets list
>> <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker>
>> I labeled as "4.0-blocker". The criteria I used for "4.0-blocker" are:
>> 1. The feature is supported in Zookeeper mode, but not yet supported in
>> KRaft mode (ex: KIP-858: JBOD in KRaft)
>> 2. Critical bugs in KRaft (ex: KAFKA-15489: split brain in the KRaft
>> controller quorum)
>>
>> If you disagree with my current list, you're welcome to start a
>> discussion in the specific JIRA ticket. Or, if you think there are some
>> tickets I missed, you're welcome to start a discussion in the JIRA
>> ticket and ping me or other people. After we reach consensus, we can
>> label/unlabel tickets accordingly. Again, the goal is to have open
>> communication with the community about what will be coming in 4.0.
>>
>> Below are the high-level categories of the list content:
>>
>> 1. Recovery from disk failure
>> KIP-856
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery>:
>> KRaft Disk Failure Recovery
>>
>> 2. Pre-vote to support more than 3 controllers
>> KIP-650
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics>:
>> Enhance Kafkaesque Raft semantics
>>
>> 3. JBOD support
>> KIP-858
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft>:
>> Handle JBOD broker disk failure in KRaft
>>
>> 4. Scale up/down controllers
>> KIP-853
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes>:
>> KRaft Controller Membership Changes
>>
>> 5. Modifying dynamic configurations on the KRaft controller
>>
>> 6. Critical bugs in KRaft
>>
>> Does this make sense?
>> Any feedback is welcome.
>>
>> Thank you.
>> Luke
