[ https://issues.apache.org/jira/browse/KAFKA-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15698987#comment-15698987 ]
ASF GitHub Bot commented on KAFKA-4229:
---------------------------------------

GitHub user pengwei-li opened a pull request:

    https://github.com/apache/kafka/pull/2175

    KAFKA-4229:Controller can't start after several zk expired event

    Author: pengwei <pengwei...@huawei.com>

    Reviewers: wangguoz.gmail.com

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pengwei-li/kafka trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2175.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2175

----
commit a920d4e9807add634cc44e4b7cf9e156edd515cf
Author: pengwei-li <pengwei...@huawei.com>
Date:   2016-07-10T00:31:56Z

    KAFKA-1429: Yet another deadlock in controller shutdown

    Author: pengwei <pengwei...@huawei.com>

    Reviewers: NA

commit 2a5a4322c8ac359587f05b459588cd2b5843a2ac
Author: pengwei-li <pengwei...@huawei.com>
Date:   2016-11-20T11:31:21Z

    Merge branch 'trunk' of https://github.com/apache/kafka into trunk

commit b827a8b4f249050ca40db9f14e8e10b01650a6b8
Author: pengwei-li <pengwei...@huawei.com>
Date:   2016-11-20T12:18:49Z

    Merge branch 'trunk' of https://github.com/apache/kafka into trunk

commit 43e186f223dee1e24177a87ee6888eaae91547d9
Author: pengwei-li <pengwei...@huawei.com>
Date:   2016-11-27T01:54:00Z

    Merge branch 'trunk' of https://github.com/apache/kafka into trunk

commit febe4f433452a2ad8849a329bc5c9f4d1507a317
Author: pengwei-li <pengwei...@huawei.com>
Date:   2016-11-27T03:31:26Z

    issue:KAFKA-4229 reason: controoler can't start afeter several zk expired event

----

> Controller can't start after several zk expired event
> -----------------------------------------------------
>
>                 Key: KAFKA-4229
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4229
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Pengwei
>            Assignee: Pengwei
>
> We found that the controller failed to start after several zk session
> expiration events in our test environment. By analysing the log, I found
> that the controller handles the ephemeral node data-deleted event first and
> the zk session expiration event second, after which no active controller
> remains.
> I can reproduce it in my development environment:
> 1. Set up an environment with one broker and one zk, and specify a very
> large zk timeout (20s).
> 2. Stop the broker and remove the zk's /broker/ids/0 directory.
> 3. Restart the broker and set a breakpoint in the zk client's event thread
> so that the delete event stays queued.
> 4. Once the /controller node is gone, the breakpoint is hit.
> 5. Expire the current session (suspend the send thread) and create a new
> session s2.
> 6. Resume the event thread; the controller then handles
> LeaderChangeListener.handleDataDeleted and becomes leader.
> 7. The controller then handles SessionExpirationListener.handleNewSession:
> it resigns as controller and runs the election again, but during the
> election it finds that the /controller node already exists, so it does not
> become the leader. Because the /controller node was created by the current
> session s2, it will not be removed. So the controller is gone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
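
The event ordering in the quoted reproduction steps can be replayed with a small toy model. This is only a sketch of the sequence as the report describes it, not Kafka's controller code: ToyZk, ToyBroker and replay are hypothetical names, the election rule (become leader only when the broker creates /controller itself) is a simplifying assumption taken from step 7, and the real ZooKeeper client is not involved.

```python
class ToyZk:
    """Hypothetical stand-in for ZooKeeper: tracks only the ephemeral /controller node."""

    def __init__(self):
        self.controller_owner = None  # session that created /controller, or None

    def create_controller(self, session):
        # Ephemeral create fails if the node already exists.
        if self.controller_owner is not None:
            return False
        self.controller_owner = session
        return True

    def expire(self, session):
        # Expiring a session removes the ephemeral node it owns, if any.
        if self.controller_owner == session:
            self.controller_owner = None


class ToyBroker:
    """Hypothetical broker holding the two listeners named in the report."""

    def __init__(self, zk, session):
        self.zk = zk
        self.session = session
        self.is_controller = False

    def elect(self):
        # Assumption from step 7: leadership is taken only when the create succeeds.
        if self.zk.create_controller(self.session):
            self.is_controller = True

    def handle_data_deleted(self):
        # Models LeaderChangeListener.handleDataDeleted (step 6).
        self.elect()

    def handle_new_session(self):
        # Models SessionExpirationListener.handleNewSession (step 7):
        # resign first, then run the election again.
        self.is_controller = False
        self.elect()


def replay():
    zk = ToyZk()
    broker = ToyBroker(zk, "s1")
    queued = [broker.handle_data_deleted]      # steps 3-4: delete event is held in the queue
    zk.expire("s1")                            # step 5: old session expires
    broker.session = "s2"                      # step 5: new session s2
    queued.append(broker.handle_new_session)   # expiration callback queued behind the delete
    for handler in queued:                     # step 6: event thread resumes, in order
        handler()
    return broker.is_controller, zk.controller_owner


if __name__ == "__main__":
    print(replay())
```

Under these assumptions, replay() ends with the broker not acting as controller while /controller is still owned by the live session s2, which is the stuck state described in step 7.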