[ 
https://issues.apache.org/jira/browse/KAFKA-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15700633#comment-15700633
 ] 

Jason Gustafson commented on KAFKA-4447:
----------------------------------------

[~Json Tu] Thanks for the report. I think adding guards to those listeners to 
verify that the controller is still active when they are fired makes sense. If 
you can reliably reproduce the problem, it would be good to know if that solves 
it. I'm not too familiar with this code, but it does seem like the listener 
removal logic could be vulnerable to some race conditions. In particular, the 
deregistration of the ReassignedPartitionsIsrChangeListeners happens while 
holding the controller lock, yet the listener itself must also acquire the same 
lock when it is executed, so I'm not sure how we can prevent it from running 
after resignation completes unless we have a guard like the one you're 
suggesting. It's a little more puzzling why you're seeing the 
IsrChangeNotificationListener (for example) also execute after the controller 
has resigned, since it is explicitly deregistered without the controller lock. 
There may also be a race condition in zkclient, I guess.
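
To make the race concrete, here is a minimal sketch of the kind of guard being 
discussed. It is written in Java rather than the controller's Scala, and every 
name in it (GuardedListenerSketch, onIsrChange, actionsAfterResignation) is 
hypothetical; it is not the actual Kafka or zkclient code. It only models a 
callback that blocks on the controller lock while resignation is in progress, 
and then either acts anyway or checks an "active" flag first.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the controller; not Kafka's real classes.
class GuardedListenerSketch {
    private final ReentrantLock controllerLock = new ReentrantLock();
    private boolean active = true; // only read/written under controllerLock
    private final AtomicInteger actionsTaken = new AtomicInteger();

    // Models a listener callback fired on the ZooKeeper client's event thread.
    void onIsrChange(boolean guarded) {
        controllerLock.lock();
        try {
            if (guarded && !active) {
                return; // controller resigned while we waited for the lock
            }
            actionsTaken.incrementAndGet(); // stands in for controller work
        } finally {
            controllerLock.unlock();
        }
    }

    // Replays the race: the callback is already queued on the lock when
    // resignation finishes. Returns how many controller actions ran afterwards.
    static int actionsAfterResignation(boolean guarded) {
        GuardedListenerSketch c = new GuardedListenerSketch();
        c.controllerLock.lock(); // resignation starts and holds the lock
        Thread callback = new Thread(() -> c.onIsrChange(guarded));
        try {
            callback.start();
            // Wait until the callback is really blocked on the lock.
            while (!c.controllerLock.hasQueuedThread(callback)) {
                Thread.sleep(1);
            }
            c.active = false; // listener deregistration would also happen here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            c.controllerLock.unlock(); // resignation completes
        }
        try {
            callback.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c.actionsTaken.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + actionsAfterResignation(false));
        System.out.println("guarded:   " + actionsAfterResignation(true));
    }
}
```

In this model the unguarded callback still performs one action after 
resignation, while the guarded one performs none. Deregistering under the lock 
cannot stop a callback that has already been dispatched, which is why the 
check has to live inside the listener itself.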

> Controller resigned but it also acts as a controller for a long time 
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-4447
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4447
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller
>    Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>         Environment: Linux Os
>            Reporter: Json Tu
>         Attachments: log.tar.gz
>
>
> We have a cluster with 10 nodes, and we performed the following operations:
> 1. We reassigned some topic partitions from one node to the other 9 nodes in 
> the cluster, which triggered the controller.
> 2. The controller invoked PartitionsReassignedListener's handleDataChange, 
> read all partition reassignment rules from the zk path, and executed 
> onPartitionReassignment for every partition that matched the conditions.
> 3. But the controller's zk session then expired, after which some of the 9 
> nodes also expired from zk.
> 5. The controller then invoked onControllerResignation to resign as the 
> controller.
> We found that after the controller resigned, it kept acting as the 
> controller for about 3 minutes, as can be seen in my attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
