On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> Hi Colin,
>           Can you give us more details on why you don't want this to be
> part of the Kafka core? You are proposing KIP-500, which will take away
> ZooKeeper, and writing these interim tools to change the ZooKeeper
> metadata doesn't make sense to me.

Hi Harsha,

The reassignment API described in KIP-455, which will be part of Kafka 2.4, 
doesn't rely on ZooKeeper.  This API will stay the same after KIP-500 is 
implemented.

> As George pointed out there are
> several benefits to having it in the system itself instead of asking users
> to hack a bunch of JSON files to deal with an outage scenario.

In both cases, the user just has to run a shell command, right?  In both cases, 
the user has to remember to undo the command later when they want the broker to 
be treated normally again.  And in both cases, the user should probably be 
running an external rebalancing tool to avoid having to run these commands 
manually. :)

best,
Colin

> 
> Thanks,
> Harsha
> 
> On Fri, Sep 6, 2019 at 4:36 PM George Li <sql_consult...@yahoo.com.invalid>
> wrote:
> 
> >  Hi Colin,
> >
> > Thanks for the feedback.  The "separate set of metadata about blacklists"
> > in KIP-491 is just a list of broker ids, usually only one or two in the
> > cluster.  Shouldn't that be easier than keeping JSON files?  e.g. what if we
> > first blacklist broker_id_1, then another broker, broker_id_2, has issues,
> > and we need to write out another JSON file to restore later (and in which
> > order)?  Using the blacklist, we can just add broker_id_2 to the existing
> > list, and remove whichever broker_id has returned to a good state, without
> > worrying about how to restore it (i.e. the order in which brokers were
> > blacklisted).
> >
> > For the topic-level config, the blacklist will be tied to the
> > topic/partition (e.g.  Configs:
> > topic.preferred.leader.blacklist=0:101,102;1:103    where 0 & 1 are the
> > partition numbers and 101, 102, 103 are the blacklisted broker_ids), and it
> > is easier to update/remove, with no need for external JSON files.
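> >
> > As a rough illustration only (the config key below is the one proposed in
> > this KIP, not an existing Kafka config, so the exact name/format may
> > change), setting it could look something like:
> >
> >   bin/kafka-configs.sh --zookeeper zk1:2181 --entity-type topics \
> >     --entity-name my-topic --alter \
> >     --add-config 'topic.preferred.leader.blacklist=[0:101,102;1:103]'
> >
> > and removing it later would be a matching --delete-config call.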
> >
> >
> > Thanks,
> > George
> >
> >     On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <
> > cmcc...@apache.org> wrote:
> >
> >  One possibility would be writing a new command-line tool that would
> > deprioritize a given replica using the new KIP-455 API.  Then it could
> > write out a JSON file containing the old priorities, which could be
> > restored when (or if) we needed to do so.  This seems like it might be
> > simpler and easier to maintain than a separate set of metadata about
> > blacklists.
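> >
> > As a sketch, the saved file could just be the current assignment in the
> > familiar reassignment JSON format (illustrative values):
> >
> >   {"version":1,"partitions":[{"topic":"my-topic","partition":0,"replicas":[101,102,103]}]}
> >
> > Re-submitting that file with --execute would restore the old ordering,
> > much like the rollback JSON the reassignment tool already prints when you
> > run it.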
> >
> > best,
> > Colin
> >
> >
> > On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> > >  Hi,
> > >
> > > Just want to ping and bubble up the discussion of KIP-491.
> > >
> > > At a large scale, with thousands of brokers across many clusters,
> > > hardware failures are frequent.  Although reassignments to change the
> > > preferred leaders are a workaround, they incur unnecessary additional
> > > work compared to the proposed preferred leader blacklist in KIP-491, and
> > > are hard to scale.
> > >
> > > I am wondering whether others using Kafka at a large scale are running
> > > into the same problem.
> > >
> > >
> > > Satish,
> > >
> > > Regarding your previous question about whether there is a use case for a
> > > topic-level preferred leader "blacklist", I thought of one
> > > use case: improving rebalance/reassignment.  A large partition will
> > > usually cause performance/stability issues, so the plan is to have the
> > > new replica start from the leader's latest offset (this way the
> > > replica is almost instantly in the ISR and the reassignment completes),
> > > and to put this partition's new replica into the preferred leader
> > > "blacklist" in the topic-level config for that partition. After some time
> > > (the retention time), once this new replica has caught up and is ready to
> > > serve traffic, update/remove the topic config for this partition's
> > > preferred leader blacklist.
> > >
> > > I will update the KIP-491 later for this use case of Topic Level config
> > > for Preferred Leader Blacklist.
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li
> > > <sql_consult...@yahoo.com> wrote:
> > >
> > >  Hi Colin,
> > >
> > > > In your example, I think we're comparing apples and oranges.  You
> > > > started by outlining a scenario where "an empty broker... comes up...
> > > > [without] any leadership[s]."  But then you criticize using reassignment
> > > > to switch the order of preferred replicas because it "would not actually
> > > > switch the leader automatically."  If the empty broker doesn't have any
> > > > leaderships, there is nothing to be switched, right?
> > >
> > > Let me explain the details of this particular use case example, so we
> > > can compare apples to apples.
> > >
> > > Let's say a healthy broker hosts 3000 partitions, of which 1000 have it
> > > as the preferred leader (leader count is 1000). There is a hardware
> > > failure (disk/memory, etc.), and the Kafka process crashes. We swap this
> > > host with another host but keep the same broker.id. When this new broker
> > > comes up, it has no historical data, and we make sure the current last
> > > offsets of all partitions are set in
> > > the replication-offset-checkpoint file (if we don't set them, the
> > > ReplicaFetchers will pull huge amounts of historical data from other
> > > brokers and cause high cluster latency and other instabilities), so when
> > > Kafka is brought up, it quickly catches up as a follower in the ISR.
> > > Note, we have auto.leader.rebalance.enable disabled, so it is not serving
> > > any traffic as a leader (leader count = 0), even though there are 1000
> > > partitions for which this broker is the preferred leader.
> > >
> > > We need to keep this broker from serving traffic for a few hours or
> > > days, depending on the SLA of the topic retention requirement, until it
> > > has enough historical data.
> > >
> > >
> > > * The traditional way is to use reassignments to move this broker to the
> > > end of the assignment for the 1000 partitions where it is the preferred
> > > leader; this is an O(N) operation, and from my experience, we can't
> > > submit all 1000 at the same time, otherwise it causes higher latencies,
> > > even though the reassignment in this case completes almost instantly.
> > > After a few hours/days, when this broker is ready to serve traffic, we
> > > have to run reassignments again to restore the preferred leaders of those
> > > 1000 partitions for this broker: another O(N) operation.  Then run
> > > preferred leader election, O(N) again.  So in total, 3 x O(N) operations.
> > > The point is that since the new empty broker is expected to be the same
> > > as the old one in terms of hosting partitions/leaders, it seems
> > > unnecessary to do reassignments (replica reordering) while the broker is
> > > catching up.
> > >
> > >
> > >
> > > * The new preferred leader "blacklist" feature: just put a dynamic
> > > config in place to indicate that this broker should be given the lowest
> > > priority when deciding leadership (preferred leader election, broker
> > > failover, or unclean leader election). NO need to run any reassignments.
> > > After a few hours/days, when this broker is ready, remove the dynamic
> > > config and run preferred leader election, and this broker will serve
> > > traffic again for the 1000 original partitions where it was the preferred
> > > leader.  So in total, 1 x O(N) operation.
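> > >
> > > (Purely as a sketch of what such a dynamic config might look like -- the
> > > config key and scope below are only what this KIP could introduce, not an
> > > existing Kafka config:
> > >
> > >   bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
> > >     --entity-type brokers --entity-default \
> > >     --add-config 'preferred.leader.blacklist=101'
> > >
> > > with the corresponding --delete-config once the broker is healthy again.)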
> > >
> > >
> > > If auto.leader.rebalance.enable is enabled, the preferred leader
> > > "blacklist" can be put in place before Kafka is started to prevent this
> > > broker from serving traffic.  In the traditional way of running
> > > reassignments, once the broker is up,
> > > with auto.leader.rebalance.enable, if leadership starts going to this
> > > new empty broker, we might have to run preferred leader election after
> > > the reassignments to remove its leaderships. e.g. a (1,2,3) => (2,3,1)
> > > reassignment only changes the ordering; 1 remains the current leader
> > > and needs a preferred leader election to change it to 2 after the
> > > reassignment. So potentially one more O(N) operation.
> > >
> > > I hope the above example shows how easy it is to "blacklist" a broker
> > > from serving leadership.  For someone managing a production Kafka
> > > cluster, it's important to react fast to certain alerts and
> > > mitigate/resolve issues. Given the other use cases I listed in KIP-491, I
> > > think this feature can make Kafka easier to manage/operate.
> > >
> > > > In general, using an external rebalancing tool like Cruise Control is
> > > > a good idea to keep things balanced without having to deal with manual
> > > > rebalancing.  We expect more and more people who have a complex or large
> > > > cluster will start using tools like this.
> > > >
> > > > However, if you choose to do manual rebalancing, it shouldn't be that
> > > > bad.  You would save the existing partition ordering before making your
> > > > changes, then make your changes (perhaps by running a simple command line
> > > > tool that switches the order of the replicas).  Then, once you felt like
> > > > the broker was ready to serve traffic, you could just re-apply the old
> > > > ordering which you had saved.
> > >
> > >
> > > We do have our own rebalancing tool, which has its own criteria like
> > > rack diversity, disk usage, spreading partitions/leaders across all
> > > brokers in the cluster per topic, leadership bytes/BytesIn served per
> > > broker, etc.  We can run reassignments. The point is whether that is
> > > really necessary, and whether there is a more effective, easier, safer
> > > way to do it.
> > >
> > > Take another use case: moving leadership off of a busy controller to
> > > give it more headroom to serve metadata requests and other work.  The
> > > controller can fail over; with the preferred leader "blacklist", we do
> > > not have to run reassignments again when the controller fails over, just
> > > change the blacklisted broker_id.
> > >
> > >
> > > > I was thinking about a PlacementPolicy filling the role of preventing
> > > > people from creating single-replica partitions on a node that we didn't
> > > > want to ever be the leader.  I thought that it could also prevent people
> > > > from designating those nodes as preferred leaders during topic creation,
> > > > or Kafka from doing it during random topic creation.  I was assuming that
> > > > the PlacementPolicy would determine which nodes were which through static
> > > > configuration keys.  I agree static configuration keys are somewhat less
> > > > flexible than dynamic configuration.
> > >
> > >
> > > I think the single-replica partition might not be a good example.  There
> > > should not be any single-replica partitions at all; if there are, it's
> > > probably because someone is trying to save disk space with fewer
> > > replicas.  I think the minimum should be at least 2. A user purposely
> > > creating a single-replica partition takes full responsibility for data
> > > loss and unavailability when a broker fails or is under maintenance.
> > >
> > >
> > > I think it would be better to use a dynamic instead of a static config.
> > > I also think it would be better to have the topic creation policy
> > > enforced in the Kafka server OR in an external service. We have an
> > > external/central service managing topic creation/partition expansion
> > > which takes into account rack diversity, replication factor (2, 3 or 4
> > > depending on cluster/topic type), policies for replicating the topic
> > > between Kafka clusters, etc.
> > >
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >
> > >    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe
> > > <cmcc...@apache.org> wrote:
> > >
> > >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > > >  Hi Colin,
> > > >
> > > > Thanks for your feedbacks.  Comments below:
> > > > > Even if you have a way of blacklisting an entire broker all at once,
> > > > > you still would need to run a leader election for each partition where
> > > > > you want to move the leader off of the blacklisted broker.  So the
> > > > > operation is still O(N) in that sense-- you have to do something per
> > > > > partition.
> > > >
> > > > For a failed broker swapped with an empty broker, when it comes up,
> > > > it will not have any leadership, and we would like it to remain without
> > > > leaderships for a couple of hours or days. So there is no
> > > > preferred leader election needed, which would incur an O(N) operation in
> > > > this case.  Putting the preferred leader blacklist in place would
> > > > safeguard this broker from serving traffic during that time; otherwise,
> > > > if another broker fails (when this broker is 1st or 2nd in the
> > > > assignment), or someone runs preferred leader election, this new "empty"
> > > > broker can still get leaderships.
> > > >
> > > > Also, running a reassignment to change the ordering of the preferred
> > > > leader would not actually switch the leader automatically, e.g. (1,2,3)
> > > > => (2,3,1), unless preferred leader election is run to switch the
> > > > current leader from 1 to 2.  So the operation is at least 2 x O(N), and
> > > > then, after the broker is back to normal, another 2 x O(N) to roll back.
> > >
> > > Hi George,
> > >
> > > Hmm.  I guess I'm still on the fence about this feature.
> > >
> > > In your example, I think we're comparing apples and oranges.  You
> > > started by outlining a scenario where "an empty broker... comes up...
> > > [without] any leadership[s]."  But then you criticize using
> > > reassignment to switch the order of preferred replicas because it
> > > "would not actually switch the leader automatically."  If the empty
> > > broker doesn't have any leaderships, there is nothing to be switched,
> > > right?
> > >
> > > >
> > > >
> > > > > In general, reassignment will get a lot easier and quicker once
> > > > > KIP-455 is implemented.  Reassignments that just change the order of
> > > > > preferred replicas for a specific partition should complete pretty much
> > > > > instantly.
> > > > >
> > > > > I think it's simpler and easier just to have one source of truth
> > > > > for what the preferred replica is for a partition, rather than two.  So
> > > > > for me, the fact that the replica assignment ordering isn't changed is
> > > > > actually a big disadvantage of this KIP.  If you are a new user (or just
> > > > > an existing user that didn't read all of the documentation) and you just
> > > > > look at the replica assignment, you might be confused by why a particular
> > > > > broker wasn't getting any leaderships, even though it appeared like it
> > > > > should.  More mechanisms mean more complexity for users and developers
> > > > > most of the time.
> > > >
> > > >
> > > > I would like to stress the point that running a reassignment to change
> > > > the ordering of the replicas (putting a broker at the end of the
> > > > partition assignment) is unnecessary, because after some time the broker
> > > > is caught up and can start serving traffic, and then we need to run
> > > > reassignments again to "roll back" to the previous state. As I mentioned
> > > > in KIP-491, this is just tedious work.
> > >
> > > In general, using an external rebalancing tool like Cruise Control is a
> > > good idea to keep things balanced without having to deal with manual
> > > rebalancing.  We expect more and more people who have a complex or
> > > large cluster will start using tools like this.
> > >
> > > However, if you choose to do manual rebalancing, it shouldn't be that
> > > bad.  You would save the existing partition ordering before making your
> > > changes, then make your changes (perhaps by running a simple command
> > > line tool that switches the order of the replicas).  Then, once you
> > > felt like the broker was ready to serve traffic, you could just
> > > re-apply the old ordering which you had saved.
> > >
> > > >
> > > > I agree this might introduce some complexity for users/developers.
> > > > But if this feature is good, and well documented, it is good for the
> > > > Kafka product/community.  Just like KIP-460 enables unclean leader
> > > > election to override the topic-level/broker-level config
> > > > `unclean.leader.election.enable`.
> > > >
> > > > > I agree that it would be nice if we could treat some brokers
> > > > > differently for the purposes of placing replicas, selecting leaders, etc.
> > > > > Right now, we don't have any way of implementing that without forking the
> > > > > broker.  I would support a new PlacementPolicy class that would close this
> > > > > gap.  But I don't think this KIP is flexible enough to fill this role.  For
> > > > > example, it can't prevent users from creating new single-replica topics
> > > > > that get put on the "bad" replica.  Perhaps we should reopen the discussion
> > > > > about
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > >
> > > > Creating topics with a single replica is beyond what KIP-491 is trying
> > > > to achieve.  The user needs to take responsibility for doing that. I do
> > > > see some Samza clients notoriously creating single-replica topics, and
> > > > that gets flagged by alerts, because a single broker being down or under
> > > > maintenance will cause offline partitions. With the KIP-491 preferred
> > > > leader "blacklist", the single replica will still serve as leader,
> > > > because there is no alternative replica to be chosen as leader.
> > > >
> > > > Even with a new PlacementPolicy for topic creation/partition expansion,
> > > > it would still need the blacklist info (e.g. a ZK path node, or a broker
> > > > level/topic level config) to "blacklist" the broker from being the
> > > > preferred leader, right? Wouldn't that be the same as what KIP-491 is
> > > > introducing?
> > >
> > > I was thinking about a PlacementPolicy filling the role of preventing
> > > people from creating single-replica partitions on a node that we didn't
> > > want to ever be the leader.  I thought that it could also prevent
> > > people from designating those nodes as preferred leaders during topic
> > > creation, or Kafka from doing it during random topic creation.  I was
> > > assuming that the PlacementPolicy would determine which nodes were
> > > which through static configuration keys.  I agree static configuration
> > > keys are somewhat less flexible than dynamic configuration.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > > <cmcc...@apache.org> wrote:
> > > >
> > > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > > >  Hi Colin,
> > > > > Thanks for looking into this KIP.  Sorry for the late response; I
> > > > > have been busy.
> > > > >
> > > > > If a cluster has MANY topic partitions, moving this "blacklisted"
> > > > > broker to the end of the replica list is still a rather "big"
> > > > > operation, involving submitting reassignments.  The KIP-491 way of
> > > > > blacklisting is much simpler/easier and can be undone easily without
> > > > > changing the replica assignment ordering.
> > > >
> > > > Hi George,
> > > >
> > > > Even if you have a way of blacklisting an entire broker all at once,
> > > > you still would need to run a leader election for each partition where
> > > > you want to move the leader off of the blacklisted broker.  So the
> > > > operation is still O(N) in that sense-- you have to do something per
> > > > partition.
> > > >
> > > > In general, reassignment will get a lot easier and quicker once
> > KIP-455
> > > > is implemented.  Reassignments that just change the order of preferred
> > > > replicas for a specific partition should complete pretty much
> > instantly.
> > > >
> > > > I think it's simpler and easier just to have one source of truth for
> > > > what the preferred replica is for a partition, rather than two.  So
> > for
> > > > me, the fact that the replica assignment ordering isn't changed is
> > > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > > just an existing user that didn't read all of the documentation) and
> > > > you just look at the replica assignment, you might be confused by why
> > a
> > > > particular broker wasn't getting any leaderships, even  though it
> > > > appeared like it should.  More mechanisms mean more complexity for
> > > > users and developers most of the time.
> > > >
> > > > > The major use case for me: a failed broker gets swapped with new
> > > > > hardware and starts up empty (with the latest offset of all
> > > > > partitions). The SLA of retention is 1 day, so before this broker has
> > > > > been in-sync for 1 day, we would like to blacklist this broker from
> > > > > serving traffic.  After 1 day, the blacklist is removed and we run
> > > > > preferred leader election.  This way, there is no need to run
> > > > > reassignments before/after.  This is the "temporary" use case.
> > > >
> > > > What if we just add an option to the reassignment tool to generate a
> > > > plan to move all the leaders off of a specific broker?  The tool could
> > > > also run a leader election as well.  That would be a simple way of
> > > > doing this without adding new mechanisms or broker-side configurations,
> > > > etc.
> > > >
> > > > >
> > > > > There are use cases where this preferred leader "blacklist" can be
> > > > > somewhat permanent, as I explained for the case of AWS data center
> > > > > instances vs. on-premises bare metal machines (heterogeneous
> > > > > hardware), where the AWS broker_ids would be blacklisted.  So new
> > > > > topics created, or existing topic expansions, would not make them
> > > > > serve traffic even if they could be the preferred leader.
> > > >
> > > > I agree that it would be nice if we could treat some brokers
> > > > differently for the purposes of placing replicas, selecting leaders,
> > > > etc.  Right now, we don't have any way of implementing that without
> > > > forking the broker.  I would support a new PlacementPolicy class that
> > > > would close this gap.  But I don't think this KIP is flexible enough
> > to
> > > > fill this role.  For example, it can't prevent users from creating new
> > > > single-replica topics that get put on the "bad" replica.  Perhaps we
> > > > should reopen the discussion about
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > >
> > > > regards,
> > > > Colin
> > > >
> > > > >
> > > > > Please let me know there are more question.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > > <cmcc...@apache.org> wrote:
> > > > >
> > > > >  We still want to give the "blacklisted" broker the leadership if
> > > > > nobody else is available.  Therefore, isn't putting a broker on the
> > > > > blacklist pretty much the same as moving it to the last entry in the
> > > > > replicas list and then triggering a preferred leader election?
> > > > >
> > > > > If we want this to be undone after a certain amount of time, or
> > under
> > > > > certain conditions, that seems like something that would be more
> > > > > effectively done by an external system, rather than putting all
> > these
> > > > > policies into Kafka.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > >
> > > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > > >  Hi Satish,
> > > > > > Thanks for the reviews and feedbacks.
> > > > > >
> > > > > > > > The following is the requirements this KIP is trying to
> > > > > > > > accomplish:
> > > > > > > This can be moved to the "Proposed changes" section.
> > > > > >
> > > > > > Updated the KIP-491.
> > > > > >
> > > > > > > >>The logic to determine the priority/order of which broker should be
> > > > > > > preferred leader should be modified.  The broker in the preferred leader
> > > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > > determining leadership.
> > > > > > >
> > > > > > > I believe there is no change required in the ordering of the preferred
> > > > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > > > until other brokers in the list are unavailable.
> > > > > >
> > > > > > Yes, the partition assignment remains the same, both the replicas
> > > > > > and the ordering.  The blacklist logic can be optimized during
> > > > > > implementation.
> > > > > >
> > > > > > > >>The blacklist can be at the broker level. However, there might be
> > > > > > > use cases where a specific topic should blacklist particular brokers,
> > > > > > > which would be at the Topic level Config. For the use cases of this
> > > > > > > KIP, it seems that broker level blacklist would suffice.  Topic level
> > > > > > > preferred leader blacklist might be future enhancement work.
> > > > > > >
> > > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > > preferred blacklist?
> > > > > >
> > > > > >
> > > > > >
> > > > > > I don't have any concrete use cases for a topic-level preferred
> > > > > > leader blacklist.  One scenario I can think of is when a broker has
> > > > > > high CPU usage: we try to identify the big topics (high MsgIn, high
> > > > > > BytesIn, etc.) and move their leaders away from this broker.  Before
> > > > > > doing an actual reassignment to change the preferred leader, we could
> > > > > > put this preferred_leader_blacklist in the topic-level config, run
> > > > > > preferred leader election, and see whether CPU decreases for this
> > > > > > broker.  If yes, then do the reassignments to change the preferred
> > > > > > leaders to be "permanent" (the topic may have many partitions, like
> > > > > > 256, quite a few of which have this broker as preferred leader).  So
> > > > > > this topic-level config is an easy way of running a trial and
> > > > > > checking the result.
> > > > > >
> > > > > >
> > > > > > > You can add the below workaround as an item in the rejected
> > > > > > > alternatives section
> > > > > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > > > > replica for."
> > > > > >
> > > > > > Updated the KIP-491.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > > <satish.dugg...@gmail.com> wrote:
> > > > > >
> > > > > >  Thanks for the KIP. I have put my comments below.
> > > > > >
> > > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > > >
> > > > > > >> The following is the requirements this KIP is trying to accomplish:
> > > > > >   The ability to add and remove the preferred leader deprioritized
> > > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > > >
> > > > > > This can be moved to the "Proposed changes" section.
> > > > > >
> > > > > > >>The logic to determine the priority/order of which broker should be
> > > > > > preferred leader should be modified.  The broker in the preferred leader
> > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > determining leadership.
> > > > > >
> > > > > > I believe there is no change required in the ordering of the preferred
> > > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > > until other brokers in the list are unavailable.
> > > > > >
> > > > > > >>The blacklist can be at the broker level. However, there might be
> > > > > > use cases where a specific topic should blacklist particular brokers,
> > > > > > which would be at the Topic level Config. For the use cases of this
> > > > > > KIP, it seems that broker level blacklist would suffice.  Topic level
> > > > > > preferred leader blacklist might be future enhancement work.
> > > > > >
> > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > preferred blacklist?
> > > > > >
> > > > > > You can add the below workaround as an item in the rejected
> > > > > > alternatives section
> > > > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > > > replica for."
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > > <stanis...@confluent.io> wrote:
> > > > > > >
> > > > > > > Hey George,
> > > > > > >
> > > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > > >
> > > > > > > I was wondering whether we could achieve the same thing via the
> > > > > > > kafka-reassign-partitions tool. As you had also said in the JIRA, it is
> > > > > > > true that this is currently very tedious with the tool. My thoughts are
> > > > > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > > > > preferred leader".
> > > > > > > This would have some benefits like:
> > > > > > > - more fine-grained control over the blacklist. we may not want to
> > > > > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > > > > broker a follower of last resort which is not very useful. In the cases of
> > > > > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > > > > the broker very underutilized if you completely make it leaderless.
> > > > > > > - it is not permanent. If we are to have a blacklisted leaders config,
> > > > > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > > > > it to achieve a fair balance.
> > > > > > > It seems like both problems are tied to balancing partitions, it's just
> > > > > > > that KIP-491's use case wants to balance them against other factors in a
> > > > > > > more nuanced way. It makes sense to have both be done from the same place
> > > > > > >
> > > > > > > To make note of the motivation section:
> > > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > > The recommended way to make a broker lose its leadership is to run a
> > > > > > > reassignment on its partitions
> > > > > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > > > > computing power
> > > > > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > > > > system supported more flexibility in that regard but that is more nuanced
> > > > > > > and a preferred leader blacklist may not be the best first approach to the
> > > > > > > issue
> > > > > > >
> > > > > > > Adding a new config which can fundamentally change the way replication is
> > > > > > > done is complex, both for the system (the replication code is complex
> > > > > > > enough) and the user. Users would have another potential config that could
> > > > > > > backfire on them - e.g. if left forgotten.
> > > > > > >
> > > > > > > Could you think of any downsides to implementing this functionality (or a
> > > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > > One downside I can see is that we would not have it handle new partitions
> > > > > > > created after the "blacklist operation". As a first iteration I think that
> > > > > > > may be acceptable
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Stanislav
> > > > > > >
> > > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> > sql_consult...@yahoo.com.invalid>
> > > > > > > wrote:
> > > > > > >
> > > > > > > >  Hi,
> > > > > > > >
> > > > > > > > Pinging the list for feedback on this KIP-491  (
> > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > > )
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > George
> > > > > > > >
> > > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > > sql_consult...@yahoo.com.INVALID> wrote:
> > > > > > > >
> > > > > > > >  Hi,
> > > > > > > >
> > > > > > > > I have created KIP-491 (
> > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > > )
> > > > > > > > for putting a broker on the preferred leader blacklist or deprioritized
> > > > > > > > list, so that when determining leadership, it is moved to the lowest
> > > > > > > > priority, for some of the listed use cases.
> > > > > > > >
> > > > > > > > Please provide your comments/feedbacks.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > George
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >  ----- Forwarded Message -----
> > > > > > > >  From: Jose Armando Garcia Sancio (JIRA) <j...@apache.org>
> > > > > > > >  To: "sql_consult...@yahoo.com" <sql_consult...@yahoo.com>
> > > > > > > >  Sent: Tuesday, July 9, 2019, 01:06:05 PM PDT
> > > > > > > >  Subject: [jira] [Commented] (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > > > >
> > > > > > > > [ https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511 ]
> > > > > > > >
> > > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > > ---------------------------------------------------
> > > > > > > >
> > > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > > >
> > > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > > -----------------------------------------------
> > > > > > > > >
> > > > > > > > >                Key: KAFKA-8638
> > > > > > > > >                URL:
> > https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > > >            Project: Kafka
> > > > > > > > >          Issue Type: Improvement
> > > > > > > > >          Components: config, controller, core
> > > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > > >            Reporter: GEORGE LI
> > > > > > > > >            Assignee: GEORGE LI
> > > > > > > > >            Priority: Major
> > > > > > > > >
> > > > > > > > > Currently, Kafka preferred leader election will pick the broker_id
> > > > > > > > > in the topic/partition replica assignment in priority order when the
> > > > > > > > > broker is in the ISR. The preferred leader is the broker id in the
> > > > > > > > > first position of the replica list. There are use cases where, even
> > > > > > > > > though the first broker in the replica assignment is in the ISR, it
> > > > > > > > > needs to be moved to the end of the ordering (lowest priority) when
> > > > > > > > > deciding leadership during preferred leader election.
> > > > > > > > > Let's use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > > > > > preferred leader.  When preferred leader election is run, it will
> > > > > > > > > pick 1 as the leader if it is in the ISR; if 1 is not online and in
> > > > > > > > > the ISR, then it picks 2; if 2 is not in the ISR, then it picks 3 as
> > > > > > > > > the leader. There are use cases where, even though 1 is in the ISR,
> > > > > > > > > we would like it to be moved to the end of the ordering (lowest
> > > > > > > > > priority) when deciding leadership during preferred leader election.
> > > > > > > > > Below is a list of use cases:
> > > > > > > > > * If broker_id 1 is a swapped failed host and is brought up with
> > > > > > > > > the last segments or latest offset without historical data (there is
> > > > > > > > > another effort on this), it's better for it not to serve leadership
> > > > > > > > > until it has caught up.
> > > > > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > > > > > computing power than the on-prem bare metal machines.  We could put
> > > > > > > > > the AWS broker_ids in the preferred leader blacklist, so on-prem
> > > > > > > > > brokers can be elected leaders, without changing the assignment
> > > > > > > > > ordering of the replicas.
> > > > > > > > > * If broker_id 1 is constantly losing leadership after some time
> > > > > > > > > ("flapping"), we would want to exclude 1 from being a leader unless
> > > > > > > > > all other brokers of this topic/partition are offline.  The
> > > > > > > > > "flapping" effect was seen in the past when 2 or more brokers were
> > > > > > > > > bad; when they lost leadership constantly/quickly, the sets of
> > > > > > > > > partition replicas they belong to would see leadership constantly
> > > > > > > > > changing.  The ultimate solution is to swap these bad hosts.  But for
> > > > > > > > > quick mitigation, we can also put the bad hosts in the preferred
> > > > > > > > > leader blacklist to move their priority of being elected leader to
> > > > > > > > > the lowest.
> > > > > > > > > * If the controller is busy serving an extra load of metadata
> > > > > > > > > requests and other tasks, we would like to move the controller's
> > > > > > > > > leaderships to other brokers to lower its CPU load.  Currently,
> > > > > > > > > bouncing it to lose leadership would not work for the controller,
> > > > > > > > > because after the bounce the controller fails over to another broker.
> > > > > > > > > * Avoid bouncing a broker in order to lose its leadership: it would
> > > > > > > > > be good if we had a way to specify which broker should be excluded
> > > > > > > > > from serving traffic/leadership (without changing the replica
> > > > > > > > > assignment ordering via reassignments, even though that's quick), and
> > > > > > > > > then run preferred leader election.  A bouncing broker will cause
> > > > > > > > > temporary URPs, and sometimes other issues.  Also, bouncing a broker
> > > > > > > > > (e.g. broker_id 1) makes it temporarily lose all its leadership, but
> > > > > > > > > if another broker (e.g. broker_id 2) fails or gets bounced, some of
> > > > > > > > > its leaderships will likely fail over to broker_id 1 on a replica
> > > > > > > > > with 3 brokers.  If broker_id 1 is in the blacklist, then in such a
> > > > > > > > > scenario, even with broker_id 2 offline, the 3rd broker can take
> > > > > > > > > leadership.
> > > > > > > > > The current workaround for the above is to change the
> > > > > > > > > topic/partition's replica assignment to move broker_id 1 from the
> > > > > > > > > first position to the last position and run preferred leader
> > > > > > > > > election, e.g. (1, 2, 3) => (2, 3, 1). This changes the replica
> > > > > > > > > assignment, and we need to keep track of the original one and restore
> > > > > > > > > it if things change (e.g. the controller fails over to another
> > > > > > > > > broker, or the swapped empty broker catches up). That's a rather
> > > > > > > > > tedious task.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > This message was sent by Atlassian JIRA
> > > > > > > > (v7.6.3#76005)
>
