Hari Sekhon created AMBARI-24719:
------------------------------------
Summary: Kafka Rolling Restart causes outage(s)
Key: AMBARI-24719
URL: https://issues.apache.org/jira/browse/AMBARI-24719
Project: Ambari
Issue Type: Improvement
Components: ambari-server
Affects Versions: 2.6.2
Reporter: Hari Sekhon
Ambari causes Kafka topic partition outages during rolling restarts because it
only waits a fixed 2 minutes between brokers and doesn't check the state of
partition replicas before taking another broker down.
On busy Kafka clusters with lots of topics / partitions / data it might take a
while before in-sync replicas recover.
Ambari should therefore check for any under-replicated partitions and wait as
long as it takes for them to recover before proceeding to the next broker.
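The wait described above could be sketched roughly as follows. This is a
minimal illustration, not Ambari's implementation: rolling_restart,
wait_for_full_replication, restart_broker and count_under_replicated are all
hypothetical names, and the caller is assumed to supply a function that returns
the current number of under-replicated partitions (for example by counting the
lines printed by kafka-topics.sh --describe --under-replicated-partitions).

```python
import time


def wait_for_full_replication(count_under_replicated, poll_interval_s=10.0,
                              sleep=time.sleep):
    """Poll until no partitions are under-replicated; return the poll count.

    count_under_replicated is a caller-supplied (hypothetical) callable that
    returns how many partitions currently have fewer in-sync replicas than
    assigned replicas. There is deliberately no timeout here.
    """
    polls = 0
    while True:
        polls += 1
        if count_under_replicated() == 0:
            return polls
        sleep(poll_interval_s)


def rolling_restart(broker_ids, restart_broker, count_under_replicated,
                    poll_interval_s=10.0, sleep=time.sleep):
    """Restart brokers one at a time, waiting for replicas to catch up."""
    for broker_id in broker_ids:
        restart_broker(broker_id)
        # Replaces a fixed sleep: only move on once the ISRs have recovered.
        wait_for_full_replication(count_under_replicated, poll_interval_s, sleep)
```

Because this loop has no timeout, a partition that can never recover would
block the restart indefinitely, so it needs to be combined with some handling
for that case.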
There is however an issue in doing so: if there is a topic partition with a
replica assigned to a broker that no longer exists (eg. for
ambari_kafka_service_check) then it will never recover, so there needs to be
some thoughtful handling around that.
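One possible way to handle this, sketched under assumptions: split the
under-replicated partitions into those whose missing replicas are on live
brokers (worth waiting for) and those with a missing replica on a broker that
is not registered (will never recover). The line format matched below is that
of the per-partition lines printed by kafka-topics.sh --describe. The function
name classify_under_replicated is hypothetical, and the set of live broker ids
is assumed to be obtained elsewhere (eg. from /brokers/ids in ZooKeeper).

```python
import re

# Matches per-partition lines of `kafka-topics.sh --describe`, e.g.
#   "\tTopic: t\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1"
# The topic summary line ("PartitionCount:...") does not match.
_PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+.*?"
    r"Replicas:\s*([\d,]*)\s+Isr:\s*([\d,]*)"
)


def _ids(csv):
    """Turn '1001,1002' into {1001, 1002}; an empty string gives an empty set."""
    return {int(x) for x in csv.split(",") if x}


def classify_under_replicated(describe_output, live_broker_ids):
    """Return (recoverable, stuck) lists of (topic, partition) tuples.

    A partition is 'stuck' if any out-of-sync replica is assigned to a broker
    id that is not in live_broker_ids, since it can never rejoin the ISR.
    """
    live = set(live_broker_ids)
    recoverable, stuck = [], []
    for line in describe_output.splitlines():
        match = _PARTITION_LINE.search(line)
        if not match:
            continue
        topic, partition = match.group(1), int(match.group(2))
        missing = _ids(match.group(3)) - _ids(match.group(4))
        if not missing:
            continue  # fully replicated, nothing to wait for
        if missing - live:
            stuck.append((topic, partition))
        else:
            recoverable.append((topic, partition))
    return recoverable, stuck
```

A rolling restart could then wait only while the recoverable list is non-empty
and report the stuck partitions as a warning rather than blocking on them.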
This might be solved by AMBARI-24203 but I'm not sure whether it is tied in
properly to the rolling restarts, what its timeout policy or time interval is,
or whether it takes the above paragraph into account.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)