[ https://issues.apache.org/jira/browse/KAFKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840449#comment-13840449 ]
Guozhang Wang commented on KAFKA-1134: -------------------------------------- After checking the stack trace again, now I think the problem is that 1) In KafkaController.handleNewSession controllerContext.controllerLock synchronized { Utils.unregisterMBean(KafkaController.MBeanName) partitionStateMachine.shutdown() replicaStateMachine.shutdown() if(controllerContext.controllerChannelManager != null) { controllerContext.controllerChannelManager.shutdown() controllerContext.controllerChannelManager = null } controllerElector.elect } elect function is called directly after controllerChannelManager.shutdown and is lock covered by controllerContext.controllerLock, however from the logs. elect is not immediately called since addpartition listener gets triggered due to ZK expiration (known issue similar as KAFKA-1143) and which are covered by the same lock: 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread], Stopped 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-587-send-thread], Shutdown completed 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Shutting down 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Stopped 2013/11/14 00:00:24.596 [RequestSendThread] [Controller-583-to-broker-579-send-thread], Shutdown completed 2013/11/14 00:00:24.603 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583]: Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589 2013/11/14 00:00:24.605 [ReplicaStateMachine$BrokerChangeListener] [BrokerChangeListener on Controller 583]: Broker change listener fired for path /brokers/ids with children 583,575,585,587,579,589 2013/11/14 00:00:24.614 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1 } for path /brokers/topics/databus2-relay-log_event 2013/11/14 00:00:24.616 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: New partitions to be added [Map()] 2013/11/14 00:00:24.616 [KafkaController] [Controller 583]: New partition creation callback for 2013/11/14 00:00:24.618 [PartitionStateMachine$AddPartitionsListener] [AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[ 577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1 } for path /brokers/topics/databus2-relay-log_event ---------------- Without other logging info I cannot deduce any further, so I propose in this jira we just improve the logging info for better debugging if this issue comes up in the future. > onControllerFailover function should be synchronized with other functions > ------------------------------------------------------------------------- > > Key: KAFKA-1134 > URL: https://issues.apache.org/jira/browse/KAFKA-1134 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.8.0, 0.8.1 > Reporter: Guozhang Wang > Attachments: KAFKA-1134.patch, KAFKA-1134_2013-12-05_11:13:33.patch > > > Otherwise race conditions could happen. For example, handleNewSession will > close all sockets with brokers while the handleStateChange in > onControllerFailover tries to send requests to them. -- This message was sent by Atlassian JIRA (v6.1#6144)