[ https://issues.apache.org/jira/browse/KAFKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pengwei updated KAFKA-4229:
---------------------------
Description:
We found that the controller failed to start after several zk session
expiration events in our test environment. From the logs, the controller
handles the ephemeral node's data-deleted event first and the zk session
expiration event second, after which no controller is active.
I can reproduce it in my development environment:
1. Set up an environment with one broker and one zk node, and specify a very
large zk timeout (20s).
2. Stop the broker and remove the /brokers/ids/0 node from zk.
3. Restart the broker and set a breakpoint in the zk client's event thread so
that the delete event stays queued.
4. After the /controller node is gone, the breakpoint is hit.
5. Expire the current session (by suspending the send thread) and create a new
session, s2.
6. Resume the event thread. The controller then handles
LeaderChangeListener.handleDataDeleted and becomes the leader.
7. The controller then handles SessionExpirationListener.handleNewSession: it
resigns as controller and runs the election again, but during the election it
finds that the /controller node already exists, so it does not become the
leader. Because the /controller node was created by the current session s2, it
is never removed. As a result, no controller is active.
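The event ordering in steps 4 to 7 can be modelled with a small in-memory
simulation. This is only a sketch, not Kafka or ZkClient code: FakeZk,
Controller and reproduce are hypothetical names, and the model assumes that
resigning only clears the controller's local leader flag and does not delete
/controller, as the report describes.

```python
class FakeZk:
    """Hypothetical in-memory stand-in for the ephemeral /controller node."""

    def __init__(self):
        self.session = "s1"
        self.controller_owner = None  # session that owns /controller, or None

    def create_ephemeral_controller(self):
        """Return True if /controller was created, False if it already exists."""
        if self.controller_owner is not None:
            return False
        self.controller_owner = self.session
        return True

    def expire_session(self, new_session):
        """Drop /controller if the old session owned it, then switch sessions."""
        if self.controller_owner == self.session:
            self.controller_owner = None
        self.session = new_session


class Controller:
    """Hypothetical model of the two listeners named in the report."""

    def __init__(self, zk):
        self.zk = zk
        self.is_leader = False

    def elect(self):
        # The broker becomes leader only if this call creates /controller.
        self.is_leader = self.zk.create_ephemeral_controller()

    def handle_data_deleted(self):
        # Stands in for LeaderChangeListener.handleDataDeleted.
        self.elect()

    def handle_new_session(self):
        # Stands in for SessionExpirationListener.handleNewSession.
        # Assumption: resigning clears local state only, then re-elects.
        self.is_leader = False
        self.elect()


def reproduce():
    zk = FakeZk()
    controller = Controller(zk)
    # Step 4: /controller is gone and the delete event is still queued.
    # Step 5: the old session expires and session s2 is created.
    zk.expire_session("s2")
    # Step 6: the stale delete event runs first and creates /controller in s2.
    controller.handle_data_deleted()
    # Step 7: the new-session handler resigns, then fails to re-elect
    # because /controller already exists.
    controller.handle_new_session()
    return controller.is_leader, zk.controller_owner


if __name__ == "__main__":
    print(reproduce())
```

In this model the broker ends up not being the leader while /controller is
still owned by its own live session s2, so nothing ever deletes the node and
no new election is triggered.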
> Controller can't start after several zk expired event
> -----------------------------------------------------
>
> Key: KAFKA-4229
> URL: https://issues.apache.org/jira/browse/KAFKA-4229
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
> Reporter: Pengwei
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)