[ 
https://issues.apache.org/jira/browse/KAFKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840449#comment-13840449
 ] 

Guozhang Wang commented on KAFKA-1134:
--------------------------------------

After checking the stack trace again, I now think the problem is the following:

1) In KafkaController.handleNewSession

controllerContext.controllerLock synchronized {
  Utils.unregisterMBean(KafkaController.MBeanName)
  partitionStateMachine.shutdown()
  replicaStateMachine.shutdown()
  if(controllerContext.controllerChannelManager != null) {
    controllerContext.controllerChannelManager.shutdown()
    controllerContext.controllerChannelManager = null
  }
  controllerElector.elect
}

The elect function is called directly after controllerChannelManager.shutdown 
and is covered by controllerContext.controllerLock. However, the logs show 
that elect is not called immediately: the AddPartitionsListener gets triggered 
due to the ZK session expiration (a known issue similar to KAFKA-1143), and 
its callbacks are covered by the same lock:

2013/11/14 00:00:24.596 [RequestSendThread] 
[Controller-583-to-broker-587-send-thread], Stopped 
2013/11/14 00:00:24.596 [RequestSendThread] 
[Controller-583-to-broker-587-send-thread], Shutdown completed
2013/11/14 00:00:24.596 [RequestSendThread] 
[Controller-583-to-broker-579-send-thread], Shutting down
2013/11/14 00:00:24.596 [RequestSendThread] 
[Controller-583-to-broker-579-send-thread], Stopped 
2013/11/14 00:00:24.596 [RequestSendThread] 
[Controller-583-to-broker-579-send-thread], Shutdown completed
2013/11/14 00:00:24.603 [ReplicaStateMachine$BrokerChangeListener] 
[BrokerChangeListener on Controller 583]: Broker change listener fired for path 
/brokers/ids with children 583,575,585,587,579,589
2013/11/14 00:00:24.605 [ReplicaStateMachine$BrokerChangeListener] 
[BrokerChangeListener on Controller 583]: Broker change listener fired for path 
/brokers/ids with children 583,575,585,587,579,589
2013/11/14 00:00:24.614 [PartitionStateMachine$AddPartitionsListener] 
[AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[ 
577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1 
} for path /brokers/topics/databus2-relay-log_event
2013/11/14 00:00:24.616 [PartitionStateMachine$AddPartitionsListener] 
[AddPartitionsListener on 583]: New partitions to be added [Map()]
2013/11/14 00:00:24.616 [KafkaController] [Controller 583]: New partition 
creation callback for 
2013/11/14 00:00:24.618 [PartitionStateMachine$AddPartitionsListener] 
[AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[ 
577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1 
} for path /brokers/topics/databus2-relay-log_event

----------------

Without more logging info I cannot deduce anything further, so I propose that 
in this jira we just improve the logging, to make debugging easier if this 
issue comes up again in the future.
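
As a rough illustration of the kind of logging that could help here, below is 
a minimal sketch of a wrapper that records how long a caller waited for a lock 
and how long it then held it. This is not from the attached patch: the names 
LockLogging and timedSynchronized and the log callback are all hypothetical, 
and it assumes the lock is a plain object used with synchronized, as in the 
snippet above.

```scala
object LockLogging {
  // Hypothetical helper, not part of Kafka: runs `body` while holding `lock`
  // and reports, through `log`, how long the caller waited to acquire the
  // lock and how long it held the lock before releasing it.
  def timedSynchronized[T](lock: AnyRef, name: String, log: String => Unit)(body: => T): T = {
    val requestedNs = System.nanoTime()
    lock.synchronized {
      val acquiredNs = System.nanoTime()
      log(s"$name acquired lock after waiting ${(acquiredNs - requestedNs) / 1000000} ms")
      try {
        body
      } finally {
        val releasedNs = System.nanoTime()
        log(s"$name releasing lock after holding it ${(releasedNs - acquiredNs) / 1000000} ms")
      }
    }
  }
}
```

If handleNewSession and the listener callbacks each went through something 
like this with their own name, the log would show which of them was holding 
the lock and which was waiting at the time the send threads were shut down.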

> onControllerFailover function should be synchronized with other functions
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-1134
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1134
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.8.1
>            Reporter: Guozhang Wang
>         Attachments: KAFKA-1134.patch, KAFKA-1134_2013-12-05_11:13:33.patch
>
>
> Otherwise race conditions could happen. For example, handleNewSession will 
> close all sockets with brokers while the handleStateChange in 
> onControllerFailover tries to send requests to them.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
