[ https://issues.apache.org/jira/browse/KAFKA-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626247#comment-14626247 ]

Flavio Junqueira edited comment on KAFKA-2300 at 7/14/15 1:13 PM:
------------------------------------------------------------------

Attaching a preliminary patch in case anyone is willing to give a hand. As I 
described before, one problem is that calls like sendRequest can throw an 
exception; if one is thrown, the ControllerBrokerRequestBatch object can be 
left in a broken state (requests are not sent and subsequent newBatch calls 
keep throwing exceptions).

The attached patch catches exceptions that calls like sendRequest might throw, 
cleans the state, and throws an IllegalStateException. Cleaning the state can 
be problematic if we don't handle the IllegalStateException appropriately. For 
now, at least in the topic deletion call path, I'm suggesting that we make the 
controller resign, but this could be overkill. If anyone is willing to chime 
in, I'd appreciate suggestions on the best way of dealing with a controller in 
such an illegal state.
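
To illustrate the idea (this is a simplified, hypothetical sketch, not the 
attached patch itself; RequestBatch, pendingRequests, and the String request 
payloads are stand-ins for the real controller types):

{code:scala}
import scala.collection.mutable

class RequestBatch {
  // Stand-in for the controller's per-broker queues of pending requests.
  private val pendingRequests = mutable.Map.empty[Int, String]

  // Mirrors the check that fails in the report: a new batch must start empty.
  def newBatch(): Unit = {
    if (pendingRequests.nonEmpty)
      throw new IllegalStateException(
        s"Batch is not empty while creating a new one: $pendingRequests")
  }

  def addRequest(brokerId: Int, request: String): Unit =
    pendingRequests(brokerId) = request

  // The idea in the patch: if sending fails, clear the batch before
  // rethrowing, so later newBatch() calls don't keep failing on stale state.
  def sendRequestsToBrokers(send: (Int, String) => Unit): Unit = {
    try {
      pendingRequests.foreach { case (brokerId, request) => send(brokerId, request) }
      pendingRequests.clear()
    } catch {
      case e: Throwable =>
        pendingRequests.clear()
        throw new IllegalStateException(
          "Failed to send controller requests; batch state has been cleared", e)
    }
  }
}
{code}

The caller (e.g. the topic deletion path) would then catch the 
IllegalStateException and decide how to react, for example by making the 
controller resign.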


> Error in controller log when broker tries to rejoin cluster
> -----------------------------------------------------------
>
>                 Key: KAFKA-2300
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2300
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Johnny Brown
>            Assignee: Flavio Junqueira
>         Attachments: KAFKA-2300.patch
>
>
> Hello Kafka folks,
> We are having an issue where a broker attempts to join the cluster after 
> being restarted, but is never added to the ISR for its assigned partitions. 
> This is a three-node cluster, and the controller is broker 2.
> When broker 1 starts, we see the following message in broker 2's 
> controller.log.
> {{
> [2015-06-23 13:57:16,535] ERROR [BrokerChangeListener on Controller 2]: Error 
> while handling broker changes 
> (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
> java.lang.IllegalStateException: Controller to broker state change requests 
> batch is not empty while creating a new one. Some UpdateMetadata state 
> changes Map(2 -> Map([prod-sver-end,1] -> 
> (LeaderAndIsrInfo:(Leader:-2,ISR:1,LeaderEpoch:0,ControllerEpoch:165),ReplicationFactor:1),AllReplicas:1)),
>  1 -> Map([prod-sver-end,1] -> 
> (LeaderAndIsrInfo:(Leader:-2,ISR:1,LeaderEpoch:0,ControllerEpoch:165),ReplicationFactor:1),AllReplicas:1)),
>  3 -> Map([prod-sver-end,1] -> 
> (LeaderAndIsrInfo:(Leader:-2,ISR:1,LeaderEpoch:0,ControllerEpoch:165),ReplicationFactor:1),AllReplicas:1)))
>  might be lost 
>   at 
> kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:202)
>   at 
> kafka.controller.KafkaController.sendUpdateMetadataRequest(KafkaController.scala:974)
>   at 
> kafka.controller.KafkaController.onBrokerStartup(KafkaController.scala:399)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:371)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>   at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>   at kafka.utils.Utils$.inLock(Utils.scala:535)
>   at 
> kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
>   at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
>   at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> }}
> {{prod-sver-end}} is a topic we previously deleted. It seems some remnant of 
> it persists in the controller's memory, causing an exception which interrupts 
> the state change triggered by the broker startup.
> Has anyone seen something like this? Any idea what's happening here? Any 
> information would be greatly appreciated.
> Thanks,
> Johnny



