[jira] [Issue Comment Deleted] (KAFKA-4447) Controller resigned but it also acts as a controller for a long time

2016-11-28 Thread Json Tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Json Tu updated KAFKA-4447:
---
Comment: was deleted

(was: [~wangg...@gmail.com] Thank you for such a detailed analysis.
As you mentioned: "Before that happens I think a simple check as Jason 
mentioned before may not be sufficient, since it could happen that when the 
listener thread does the check it is still not resigned, but while it is 
executing the resignation happens."
On this point my opinion differs slightly from yours. Even if another listener 
fires while the resignation is in progress but not yet complete, it does not 
matter: zk's callback processing is single-threaded, so a simple check 
performed after that will still take effect.
As you say, even if this does happen, I very much agree that it does no harm 
to the cluster, because the epoch number is obsolete. The check would still 
reduce the noise in the old controller's log, so from that point of view it 
may have some value.
I have also heard that the controller will be rewritten. Could you share the 
planned release time for that change? Thanks.

)
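
The simple check discussed in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Kafka's actual code: `ControllerState`, `GuardedListener`, and `GuardDemo` are hypothetical stand-ins, the path strings are merely example inputs, and the sketch assumes (as the comment argues) that every callback runs on one thread, so the check cannot race with the resignation.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the controller's "am I still the controller?" state.
final class ControllerState {
  @volatile private var active = true
  def isActive: Boolean = active
  def resign(): Unit = active = false
}

// Hypothetical listener that skips its work once the controller has resigned.
final class GuardedListener(state: ControllerState) {
  val handled = ArrayBuffer.empty[String]
  val skipped = ArrayBuffer.empty[String]

  def handleDataChange(path: String): Unit =
    if (state.isActive) handled += path // still controller: process the change
    else skipped += path                // resigned: ignore it and keep the log quiet
}

object GuardDemo {
  // Fires one event before the resignation and one after it.
  def run(): (List[String], List[String]) = {
    val state = new ControllerState
    val listener = new GuardedListener(state)
    listener.handleDataChange("/admin/reassign_partitions")
    state.resign()
    listener.handleDataChange("/isr_change_notification")
    (listener.handled.toList, listener.skipped.toList)
  }
}
```

Calling `GuardDemo.run()` returns the events handled before the resignation and the events skipped after it; the guard changes nothing for the cluster, it only keeps the old controller from logging work it no longer owns.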

> Controller resigned but it also acts as a controller for a long time 
> -
>
> Key: KAFKA-4447
> URL: https://issues.apache.org/jira/browse/KAFKA-4447
> Project: Kafka
> Issue Type: Improvement
> Components: controller
> Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
> Environment: Linux OS
> Reporter: Json Tu
> Attachments: log.tar.gz
>
>
> We have a cluster with 10 nodes, and we performed the following operations.
> 1. We reassigned some topic partitions from one node to the other 9 nodes in 
> the cluster, which triggered the controller.
> 2. The controller invoked PartitionsReassignedListener's handleDataChange, 
> read all partition reassignment rules from the zk path, and executed 
> onPartitionReassignment for every partition that matched the conditions.
> 3. But the controller's session expired from zk, after which some of the 9 
> nodes also expired from zk.
> 5. Then the controller invoked onControllerResignation to resign as the 
> controller.
> We found that after the controller resigned, it kept acting as controller 
> for about 3 minutes, which can be seen in my attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-4447) Controller resigned but it also acts as a controller for a long time

2016-11-28 Thread Json Tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Json Tu updated KAFKA-4447:
---
Comment: was deleted

(was: [~skarface] Thanks for your reply.
In the latest release, 0.10.1.0, handleNewSession() is implemented as 
follows:

  def handleNewSession() {
    info("ZK expired; shut down all controller components and try to re-elect")
    inLock(controllerContext.controllerLock) {
      onControllerResignation()
      controllerElector.elect
    }
  }

So deregisterIsrChangeNotificationListener() also runs while holding the 
controllerLock, because the lock is taken outside onControllerResignation(). 
This is a bug that was reported at 
https://issues.apache.org/jira/browse/KAFKA-4360.

My version is 0.9.0.1, which does not include that fix, so we can picture the 
sequence as follows.
1. The ZK session-expired callback fires and acquires controllerLock first, 
then starts executing onControllerResignation().
2. At around the same time, IsrChangeNotificationListener, 
PartitionsReassignedListener, and the other listeners all fire in quick 
succession.
3. onControllerResignation() then starts de-registering the listeners.

As we know, the zkclient callback thread is single-threaded, so listeners 
fired after the zk expiry can only run after handleNewSession() has finished. 
Maybe this makes sense.)
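
The ordering argument in the comment above can be illustrated with a small sketch. It assumes that a single-thread executor is a fair stand-in for the zkclient callback thread; `CallbackOrderDemo` and the log strings are hypothetical, and the bodies of the callbacks are placeholders rather than Kafka's real logic.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object CallbackOrderDemo {
  def run(): List[String] = {
    val log = new ConcurrentLinkedQueue[String]()
    // Assumption: one thread models the zkclient event thread, so queued
    // callbacks run strictly one after another in submission order.
    val eventThread = Executors.newSingleThreadExecutor()
    val isController = new AtomicBoolean(true)

    // Session-expiry callback is queued first; it resigns the controller.
    eventThread.execute { () =>
      log.add("handleNewSession: start")
      isController.set(false) // placeholder for onControllerResignation()
      log.add("handleNewSession: end")
    }

    // Listener events queued behind it cannot start until it has finished,
    // so a simple "still controller?" check sees the resigned state.
    for (name <- List("IsrChangeNotificationListener", "PartitionsReassignedListener"))
      eventThread.execute { () =>
        if (isController.get) log.add(s"$name: handled")
        else log.add(s"$name: skipped")
      }

    eventThread.shutdown()
    eventThread.awaitTermination(5, TimeUnit.SECONDS)
    log.toArray(new Array[String](0)).toList
  }
}
```

Because the executor has exactly one worker, the two listener entries always appear after "handleNewSession: end" in the returned list, which is the property the comment relies on.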
