[
https://issues.apache.org/jira/browse/KAFKA-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925503#comment-15925503
]
Onur Karaman commented on KAFKA-4900:
-------------------------------------
We hit the same re-election loop issue in production at LinkedIn today:
ZookeeperLeaderElector threw java.lang.IllegalArgumentException while
attempting to register "connection-close-rate".
A similar issue of duplicate metric registration came up recently in KAFKA-4811
as well. Strange.
I don't think we saw the ClassCastException.
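For anyone who wants to see the failure mode in isolation, here is a minimal
stand-alone sketch against the org.apache.kafka.common.metrics API (the class
and sensor names below are mine, not taken from the broker code): registering a
second metric under the same MetricName throws exactly the
IllegalArgumentException quoted in the description.

{code}
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class DuplicateMetricRepro {
    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        Map<String, String> tags = Collections.singletonMap("broker-id", "10");
        MetricName name = new MetricName(
                "connection-close-rate", "controller-channel-metrics",
                "Connections closed per second in the window.", tags);

        // First registration succeeds, mirroring what Selector$SelectorMetrics
        // does when the controller channel manager builds a Selector per broker.
        Sensor first = metrics.sensor("connections-closed-a");
        first.add(name, new Rate());

        // Registering the same MetricName again (same name, group, and tags)
        // throws IllegalArgumentException: "A metric named ... already exists,
        // can't register another one."
        Sensor second = metrics.sensor("connections-closed-b");
        second.add(name, new Rate());
    }
}
{code}

The symptom is consistent with an earlier ControllerChannelManager's Selector
metrics (tagged with the same broker-id) not being closed before the next
election attempt re-registers them, though that is an inference from the stack
trace rather than something confirmed here.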
> Brokers stuck in controller re-election loop after failing to register metrics
> ------------------------------------------------------------------------------
>
> Key: KAFKA-4900
> URL: https://issues.apache.org/jira/browse/KAFKA-4900
> Project: Kafka
> Issue Type: Bug
> Components: controller, core
> Affects Versions: 0.10.1.1
> Reporter: Nick Travers
>
> We hit this today in one of our three-node staging clusters. The exception
> continues to occur on all three nodes.
> {code}
> 2017-03-15 02:17:30,677 ERROR [ZkClient-EventThread-35-samsa-zkserver.stage.sjc1.square:26101/samsa] server.ZookeeperLeaderElector - Error while electing or becoming leader on broker 9
> java.lang.IllegalArgumentException: A metric named 'MetricName [name=connection-close-rate, group=controller-channel-metrics, description=Connections closed per second in the window., tags={broker-id=10}]' already exists, can't register another one.
> at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:380)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:179)
> at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:164)
> at org.apache.kafka.common.network.Selector$SelectorMetrics.<init>(Selector.java:617)
> at org.apache.kafka.common.network.Selector.<init>(Selector.java:138)
> at kafka.controller.ControllerChannelManager.kafka$controller$ControllerChannelManager$$addNewBroker(ControllerChannelManager.scala:101)
> at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at kafka.controller.ControllerChannelManager$$anonfun$1.apply(ControllerChannelManager.scala:45)
> at scala.collection.immutable.Set$Set3.foreach(Set.scala:163)
> at kafka.controller.ControllerChannelManager.<init>(ControllerChannelManager.scala:45)
> at kafka.controller.KafkaController.startChannelManager(KafkaController.scala:814)
> at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:742)
> at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:334)
> at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:167)
> at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
> at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
> at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
> at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:824)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
> We observe a tight loop of controller (re-)election, i.e. one node hits this
> exception, leadership transitions to the next node, which then hits the
> exception, and so on ad infinitum.
> Producers and consumers appear to be connecting OK, and are able to produce
> and consume messages.
> Relevant data points:
> - prior to this cluster restart, a repartition of a number of topics was
> attempted, which appeared to get stuck in the "in progress" state (on the
> order of days)
> - these topics were subsequently deleted
> - a rolling restart of the cluster was performed to turn on broker-to-broker
> SSL communication
> - the SSL change has subsequently been _rolled back_ after we observed these
> exceptions
> - the entire cluster was shut down, and nodes brought back one at a time in
> an attempt to clear the exception. We were able to restart the cluster, but
> we continue to see the exceptions
> We also observed, during the same time as the exception above, the following
> exception on all hosts:
> {code}
> 2017-03-15 01:44:04,572 ERROR [ZkClient-EventThread-36-samsa-zkserver.stage.sjc1.square:26101/samsa] controller.ReplicaStateMachine$BrokerChangeListener - [BrokerChangeListener on Controller 10]: Error while handling broker changes
> java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
> at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
> at kafka.controller.KafkaController$$anonfun$8$$anonfun$apply$2.apply(KafkaController.scala:436)
> at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
> at scala.collection.immutable.List.exists(List.scala:84)
> at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:436)
> at kafka.controller.KafkaController$$anonfun$8.apply(KafkaController.scala:435)
> at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
> at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
> at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
> at kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:435)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:374)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:358)
> at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:357)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:356)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:355)
> at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:843)
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {code}
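On the ClassCastException in the second trace: BoxesRunTime.unboxToInt fails
this way when the value it is handed is actually a String rather than a boxed
Int. Whether the controller's broker-id data really contained strings is not
something the trace confirms, so the stand-alone sketch below (hypothetical
ids and class name, mine rather than the reporter's) only illustrates the
failure mode and the message it produces:

{code}
import java.util.Arrays;
import java.util.List;

public class UnboxFailure {
    public static void main(String[] args) {
        // Hypothetical data, purely for illustration: one broker id slipped in
        // as a String instead of an Integer.
        List<Object> brokerIds = Arrays.asList(9, 10, "11");

        for (Object id : brokerIds) {
            // Casting the String element to Integer (what Scala's
            // BoxesRunTime.unboxToInt effectively does before unboxing) throws:
            // java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
            int unboxed = (Integer) id;
            System.out.println(unboxed);
        }
    }
}
{code}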