Hi Colin
Thanks for the comments!

I made the following changes:

   1. Simplified the API spec section to only include the diff.
   2. Reordered the HWM requirement section.
   3. Removed the URM implementation details, keeping only the
      characteristics needed to perform the unclean recovery:
      1. when to perform the unclean recovery,
      2. how the unclean recovery finds the leader under the different
         configs, and
      3. how the configs unclean.leader.election.enable and
         unclean.recovery.strategy are converted when users enable/disable
         the unclean recovery.
   4. Added more details about how we change the admin client.
   5. Added API limits on the GetReplicaLogInfoRequest and
      DescribeTopicRequest.
   6. Added two metrics:
      1. kafka.controller.global_under_min_isr_partition_count
      2. kafka.controller.unclean_recovery_finished_count
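
For illustration, here is a rough Java sketch of how the two new gauges could
be registered, assuming plain Yammer metrics; the group/type strings, class
name, and update hooks are placeholders rather than the final wiring in the
KIP:

    import com.yammer.metrics.Metrics;
    import com.yammer.metrics.core.Gauge;
    import com.yammer.metrics.core.MetricName;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch only: real registration would go through the controller's
    // existing metrics plumbing.
    public class Kip966MetricsSketch {
        // Updated by the controller as partition registrations change.
        private final AtomicInteger globalUnderMinIsrPartitionCount = new AtomicInteger(0);
        // Incremented whenever an unclean recovery finishes; resets on restart.
        private final AtomicLong uncleanRecoveryFinishedCount = new AtomicLong(0);

        public Kip966MetricsSketch() {
            Metrics.newGauge(
                new MetricName("kafka.controller", "ControllerStats",
                               "global_under_min_isr_partition_count"),
                new Gauge<Integer>() {
                    @Override
                    public Integer value() {
                        return globalUnderMinIsrPartitionCount.get();
                    }
                });
            Metrics.newGauge(
                new MetricName("kafka.controller", "ControllerStats",
                               "unclean_recovery_finished_count"),
                new Gauge<Long>() {
                    @Override
                    public Long value() {
                        return uncleanRecoveryFinishedCount.get();
                    }
                });
        }

        public void onPartitionDroppedBelowMinIsr() { globalUnderMinIsrPartitionCount.incrementAndGet(); }
        public void onPartitionBackAboveMinIsr()    { globalUnderMinIsrPartitionCount.decrementAndGet(); }
        public void onUncleanRecoveryFinished()     { uncleanRecoveryFinishedCount.incrementAndGet(); }
    }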


On Wed, Sep 13, 2023 at 10:46 AM Colin McCabe <cmcc...@apache.org> wrote:

> On Tue, Sep 12, 2023, at 17:21, Calvin Liu wrote:
> > Hi Colin
> > Thanks for the comments!
> >
>
> Hi Calvin,
>
> Thanks again for the KIP.
>
> One meta-comment: it's usually better to just do a diff on a message spec
> file or java file if you're including changes to it in the KIP. This is
> easier to read than looking for "new fields begin" etc. in the text, and
> gracefully handles the case where existing fields were changed.
>
> > Rewrite the Additional High Watermark advancement requirement
> > There was feedback on this section that some readers may not be familiar
> > with HWM and Ack=0,1,all requests. This can help them understand the
> > proposal. I will rewrite this part for more readability.
> >
>
> To be clear, I wasn't suggesting dropping either section. I agree that
> they add useful background. I was just suggesting that we should discuss
> the "acks" setting AFTER discussing the new high watermark advancement
> conditions. We also should discuss acks=0. While it isn't conceptually much
> different than acks=1 here, its omission from this section is confusing.
>
> > Unclean recovery
> >
> > The plan is to replace the unclean.leader.election.enable with
> > unclean.recovery.strategy. If the Unclean Recovery is enabled then it deals
> > with the three options in the unclean.recovery.strategy.
> >
> >
> > Let’s refine the Unclean Recovery. We have already taken a lot of
> > suggestions and I hope to enhance the durability of Kafka to the next level
> > with this KIP.
>
> I am OK with doing the unclean leader recovery improvements in this KIP.
> However, I think we need to really work on the configuration settings.
>
> Configuration overrides are often quite messy. For example, the cases
> where we have log.roll.hours and log.roll.ms, the user has to
> remember which one takes precedence, and it is not obvious. So, rather than
> creating a new configuration, why not add additional values to
> "unclean.leader.election.enable"? I think this will be simpler for people
> to understand, and simpler in the code as well.
>
> What if we continued to use "unclean.leader.election.enable" but extended
> it so that it took a string? Then the string could have these values:
>
> never
>     never automatically do an unclean leader election under any conditions
>
> false / default
>     only do an unclean leader election if there may be possible data loss
>
> true / always
>     always do an unclean leader election if we can't immediately elect a
> leader
>
> It's a bit awkward that false maps to default rather than to never. But
> this awkwardness exists if we use two different configuration keys as well.
> The reason for the awkwardness is that we simply don't want most of the
> people currently setting unclean.leader.election.enable=false to get the
> "never" behavior. We have to bite that bullet. Better to be clear and
> explicit than hide it.
>
> Another thing that's a bit awkward is having two different ways to do
> unclean leader election specified in the KIP. You describe two methods: the
> simple "choose the last leader" method, and the "unclean recovery manager"
> method. I understand why you did it this way -- "choose the last leader" is
> simple, and will help us deliver an implementation quickly, while the URM
> is preferable in the long term. My suggestion here is to separate the
> decision of HOW to do unclean leader election from the decision of WHEN to
> do it.
>
> So in other words, have "unclean.leader.election.enable" specify when we
> do unclean leader election, and have a new configuration like
> "unclean.recovery.manager.enable" to determine if we use the URM.
> Presumably the URM will take some time to get fully stable, so this can
> default to false for a while, and we can flip the default to true when we
> feel ready.
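
(For illustration: a minimal Java sketch of how the extended
unclean.leader.election.enable values and a separate
unclean.recovery.manager.enable switch could be modeled, following the
suggestion above. The class/enum names, alias handling, and defaults are
assumptions for discussion, not the KIP's final design.)

    // Hypothetical modeling of the proposal above; not from the KIP.
    final class UncleanElectionConfigSketch {
        enum Mode {
            NEVER,    // never automatically elect an unclean leader
            DEFAULT,  // what today's "false" maps to (see the mapping above)
            ALWAYS;   // elect an unclean leader whenever no leader can be elected

            static Mode parse(String value) {
                switch (value.toLowerCase(java.util.Locale.ROOT)) {
                    case "never":                 return NEVER;
                    case "false": case "default": return DEFAULT;
                    case "true":  case "always":  return ALWAYS;
                    default: throw new IllegalArgumentException(
                        "unclean.leader.election.enable: " + value);
                }
            }
        }

        // WHEN to elect is governed by Mode; HOW (last-known-leader vs. URM)
        // by a separate, hypothetical unclean.recovery.manager.enable flag
        // that can default to false until the URM is considered stable.
        static boolean useUncleanRecoveryManager(java.util.Properties props) {
            return Boolean.parseBoolean(
                props.getProperty("unclean.recovery.manager.enable", "false"));
        }
    }
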
>
> The URM is somewhat under-described here. I think we need a few
> configurations here for it. For example, we need a configuration to specify
> how long it should wait for a broker to respond to its RPCs before moving
> on. We also need to understand how the URM interacts with
> unclean.leader.election.enable=always. I assume that with "always" we will
> just unconditionally use the URM rather than choosing randomly. But this
> should be spelled out in the KIP.
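
(Sketch only: one way such an RPC timeout could be declared with Kafka's
ConfigDef. The config name and default below are invented for discussion, not
part of the KIP.)

    import org.apache.kafka.common.config.ConfigDef;

    final class UncleanRecoveryConfigSketch {
        // Hypothetical: a per-RPC timeout for the unclean recovery manager.
        // The name and default value are placeholders.
        static ConfigDef defs() {
            return new ConfigDef()
                .define("unclean.recovery.manager.rpc.timeout.ms",
                        ConfigDef.Type.LONG,
                        30_000L,
                        ConfigDef.Importance.MEDIUM,
                        "How long the unclean recovery manager waits for a broker's "
                            + "log-info response before moving on without it.");
        }
    }
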
>
> >
> > DescribeTopicRequest
> >
> >    1.
> >    Yes, the plan is to replace the MetadataRequest with the
> >    DescribeTopicRequest for the admin clients. Will check the details.
>
> Sounds good. But as I said, you need to specify how AdminClient interacts
> with the new request. This will involve adding some fields to
> TopicDescription.java. And you need to specify the changes to the
> kafka-topics.sh command line tool. Otherwise we cannot use the tool to see
> the new information.
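
(For illustration: a rough Java sketch of the kind of per-partition data
Admin#describeTopics would need to surface once it is backed by
DescribeTopicRequest, e.g. via new fields on TopicPartitionInfo or
TopicDescription. The wrapper and accessor names below are guesses for
discussion, not the KIP's final API; kafka-topics.sh --describe would then
print these alongside the ISR.)

    import java.util.List;
    import org.apache.kafka.common.Node;
    import org.apache.kafka.common.TopicPartitionInfo;

    // Hypothetical wrapper showing the extra data the admin client would need
    // to expose for each partition.
    final class PartitionInfoWithElrSketch {
        private final TopicPartitionInfo info;           // existing leader/replicas/isr
        private final List<Node> eligibleLeaderReplicas; // new: the ELR
        private final List<Node> lastKnownElr;           // new: last-known ELR members

        PartitionInfoWithElrSketch(TopicPartitionInfo info,
                                   List<Node> eligibleLeaderReplicas,
                                   List<Node> lastKnownElr) {
            this.info = info;
            this.eligibleLeaderReplicas = eligibleLeaderReplicas;
            this.lastKnownElr = lastKnownElr;
        }

        TopicPartitionInfo info() { return info; }
        List<Node> eligibleLeaderReplicas() { return eligibleLeaderReplicas; }
        List<Node> lastKnownElr() { return lastKnownElr; }
    }
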
>
> The new requests, DescribeTopicRequest and GetReplicaLogInfoRequest, need
> to have limits placed on them so that their size can't be infinite. We
> don't want to propagate the current problems of MetadataRequest, where
> clients can request massive responses that can mess up the JVM when handled.
>
> Adding limits is simple for GetReplicaLogInfoRequest -- we can just say
> that only 2000 partitions at a time can be requested. For
> DescribeTopicRequest we can probably just limit to 20 topics or something
> like that, to avoid the complexity of doing pagination in this KIP.
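
(Sketch only: how broker-side handlers might enforce these caps. The constants
mirror the numbers above; the class/method names and the choice of
InvalidRequestException are assumptions.)

    import org.apache.kafka.common.errors.InvalidRequestException;

    final class RequestLimitSketch {
        // Caps discussed above; exact values to be settled in the KIP.
        static final int MAX_PARTITIONS_PER_GET_REPLICA_LOG_INFO = 2000;
        static final int MAX_TOPICS_PER_DESCRIBE_TOPIC = 20;

        static void checkGetReplicaLogInfo(int requestedPartitions) {
            if (requestedPartitions > MAX_PARTITIONS_PER_GET_REPLICA_LOG_INFO) {
                throw new InvalidRequestException("GetReplicaLogInfoRequest may name at most "
                    + MAX_PARTITIONS_PER_GET_REPLICA_LOG_INFO + " partitions per request");
            }
        }

        static void checkDescribeTopic(int requestedTopics) {
            if (requestedTopics > MAX_TOPICS_PER_DESCRIBE_TOPIC) {
                throw new InvalidRequestException("DescribeTopicRequest may name at most "
                    + MAX_TOPICS_PER_DESCRIBE_TOPIC + " topics per request");
            }
        }
    }
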
>
> >    2.
> >    I can let the brokers load the ELR info so that they can serve the
> >    DescribeTopicRequest as well.
>
> Yes, it's fine to add to MetadataCache. In fact, you'll be loading it
> anyway once it's added to PartitionImage.
>
> >    3.
> >    Yeah, it does not make sense to have the topic id if
> >    DescribeTopicRequest is only used by the admin client.
>
> OK. That makes things simpler. We can always create a new API later
> (hopefully not in this KIP!) to query by topic ID.
>
> >
> >
> > Metrics
> >
> > As for overall cluster health metrics, I think under-min-ISR is still a
> > useful one. ELR is more like a safety belt. When the ELR is used, the
> > cluster availability has already been impacted.
> >
> > Maybe we can have a metric to count the partitions where sum(ISR, ELR) <
> > min ISR. What do you think?
>
> How about:
>
> A. a metric for the total number of under-min-isr partitions? We don't
> have that in Apache Kafka at the moment.
>
> B. a metric for the number of unclean leader elections we did (for
> simplicity, it can reset to 0 on controller restart: we expect people to
> monitor the change over time anyway)
>
> best,
> Colin
>
>
> >
> > Yeah, for the ongoing unclean recoveries, the controller can keep an
> > accurate count through failover because partition registration can indicate
> > whether a recovery is needed. However, for the ones that have already
> > happened, unless we want to persist the number somewhere, we can only
> > figure it out from the log.
> >
> > On Tue, Sep 12, 2023 at 3:16 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> >> Also, we should have metrics that show what is going on with regard to the
> >> eligible replica set. I'm not sure exactly what to suggest, but something
> >> that could identify when things are going wrong in the cluster.
> >>
> >> For example, maybe a metric for partitions containing replicas that are
> >> ineligible to be leader? That would show a spike when a broker had an
> >> unclean restart.
> >>
> >> Ideally, we'd also have a metric that indicates when an unclean leader
> >> election or a recovery happened. It's a bit tricky because the simple
> >> thing, of tracking it per controller, may be a bit confusing during
> >> failovers.
> >>
> >> best,
> >> Colin
> >>
> >>
> >> On Tue, Sep 12, 2023, at 14:25, Colin McCabe wrote:
> >> > Hi Calvin,
> >> >
> >> > Thanks for the KIP. I think this is a great improvement.
> >> >
> >> >> Additional High Watermark advance requirement
> >> >
> >> > Typo: change "advance" to "advancement"
> >> >
> >> >> A bit recap of some key concepts.
> >> >
> >> > Typo: change "bit" to "quick"
> >> >
> >> >> Ack=1/all produce request. It defines when the Kafka server should
> >> >> respond to the produce request
> >> >
> >> > I think this section would be clearer if we talked about the new high
> >> > watermark advancement requirement first, and THEN talked about its
> >> > impact on acks=0, acks=1, and acks=all. acks=all is of course the
> >> > main case we care about here, so it would be good to lead with that,
> >> > rather than delving into the technicalities of acks=0/1 first.
> >> >
> >> >> Unclean recovery
> >> >
> >> > So, here you are introducing a new configuration,
> >> > unclean.recovery.strategy. The difficult thing here is that there is a
> >> > lot of overlap with unclean.leader.election.enable. So we have 3
> >> > different settings for unclean.recovery.strategy, plus 2 different
> >> > settings for unclean.leader.election.enable, giving a cross product of
> >> > 6 different options. The following "unclean recovery manager" section
> >> > only applies to one of those 6 different possibilities (I think?)
> >> >
> >> > I simply don't think we need so many different election types. Really
> >> > the use-cases we need are people who want NO unclean elections, people
> >> > who want "the reasonable thing" and people who want availability at all
> >> > costs.
> >> >
> >> > Overall, I feel like the first half of the KIP is about the ELR, and
> >> > the second half is about reworking unclean leader election. It might be
> >> > better to move that second half to a separate KIP so that we can figure
> >> > it out fully. It should be fine to punt this until later and just have
> >> > the current behavior on empty ELR be waiting for the last known leader
> >> > to return. After all, that's what we do today.
> >> >
> >> >> DescribeTopicRequest
> >> >
> >> > Is the intention for AdminClient to use this RPC for
> >> > Admin#describeTopics? If so, we need to describe all of the changes to
> >> > the admin client API, as well as changes to command-line tools like
> >> > kafka-topics.sh (if there are any). For example, you will probably need
> >> > changes to TopicDescription.java. You will also need to provide all of
> >> > the things that admin client needs -- for example,
> >> > TopicAuthorizedOperations.
> >> >
> >> > I also don't think the controller should serve this request. We want to
> >> > minimize load on the controller. Just like with the other metadata
> >> > requests like MetadataRequest, this should be served by brokers.
> >> >
> >> > It's a bit confusing why both topic ID and topic name are provided to
> >> > this API. Is the intention that callers should set one but not the
> >> > other? Or both? This needs to be clarified. Also, if we do want to
> >> > support lookups by UUID, that is another thing that needs to be added
> >> > to adminclient.
> >> >
> >> > In general, I feel like this should also probably be its own KIP since
> >> > it's fairly complex.
> >> >
> >> > best,
> >> > Colin
> >> >
> >> >
> >> > On Thu, Aug 10, 2023, at 15:46, Calvin Liu wrote:
> >> >> Hi everyone,
> >> >> I'd like to discuss a series of enhancements to the replication protocol.
> >> >>
> >> >> A partition replica can experience local data loss in unclean shutdown
> >> >> scenarios where unflushed data in the OS page cache is lost - such as an
> >> >> availability zone power outage or a server error. The Kafka replication
> >> >> protocol is designed to handle these situations by removing such replicas
> >> >> from the ISR and only re-adding them once they have caught up and therefore
> >> >> recovered any lost data. This prevents replicas that lost an arbitrary log
> >> >> suffix, which included committed data, from being elected leader.
> >> >> However, there is a "last replica standing" state which, when combined with
> >> >> a data loss unclean shutdown event, can turn a local data loss scenario into
> >> >> a global data loss scenario, i.e., committed data can be removed from all
> >> >> replicas. When the last replica in the ISR experiences an unclean shutdown
> >> >> and loses committed data, it will be reelected leader after starting up
> >> >> again, causing rejoining followers to truncate their logs and thereby
> >> >> removing the last copies of the committed records which the leader lost
> >> >> initially.
> >> >>
> >> >> The new KIP will maximize the protection and provide MinISR-1 tolerance to
> >> >> data loss unclean shutdown events.
> >> >>
> >> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
> >>
>
