[ https://issues.apache.org/jira/browse/KAFKA-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Candice Wan updated KAFKA-8188:
-------------------------------
Description:

We recently upgraded to 2.1.1 and saw below zookeeper connection issues which took down the whole cluster. We've got 3 nodes in the cluster, 2 of which threw below exceptions at the same second.

2019-04-03 08:25:19.603 [main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] WARN org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x10071ff9baf0001 has expired
2019-04-03 08:25:19.603 [main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x10071ff9baf0001 has expired, closing socket connection
2019-04-03 08:25:19.605 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x10071ff9baf0001
2019-04-03 08:25:19.605 [zk-session-expiry-handler0] INFO kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Session expired.
2019-04-03 08:25:19.609 [zk-session-expiry-handler0] INFO kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Initializing a new session to vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100.
2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100 sessionTimeout=6000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@12f8b1d8
2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO o.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: 'file:/app0/common/config/ldap-auth.config'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-EventThread] ERROR kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Auth failed.
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /169.20.222.18:56876, server: vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
2019-04-03 08:25:19.612 [controller-event-thread] INFO k.controller.PartitionStateMachine - [PartitionStateMachine controllerId=3] Stopped partition state machine
2019-04-03 08:25:19.613 [controller-event-thread] INFO kafka.controller.ReplicaStateMachine - [ReplicaStateMachine controllerId=3] Stopped replica state machine
2019-04-03 08:25:19.614 [controller-event-thread] INFO kafka.controller.KafkaController - [Controller id=3] Resigned
2019-04-03 08:25:19.615 [controller-event-thread] INFO kafka.zk.KafkaZkClient - Creating /brokers/ids/3 (is it secure? false)
2019-04-03 08:25:19.628 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100, sessionid = 0x1007f4d2b810000, negotiated timeout = 6000
2019-04-03 08:25:19.631 [/config/changes-event-process-thread] INFO k.c.ZkNodeChangeNotificationListener - Processing notification(s) to /config/changes
2019-04-03 08:25:19.637 [controller-event-thread] ERROR k.zk.KafkaZkClient$CheckedEphemeral - Error while creating ephemeral at /brokers/ids/3, node already exists and owner '72182936680464385' does not match current session '72197563457011712'
2019-04-03 08:25:19.637 [controller-event-thread] INFO kafka.zk.KafkaZkClient - Result of znode creation at /brokers/ids/3 is: NODEEXISTS
2019-04-03 08:25:19.644 [controller-event-thread] ERROR k.c.ControllerEventManager$ControllerEventThread - [ControllerEventThread controllerId=3] Error processing event RegisterBrokerAndReelect
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
    at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1631)
    at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:87)
    at kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1516)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:89)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:89)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

Thread dump attached

was:

We recently upgraded to 2.1.1 and we saw below zookeeper connection issues which took down the whole cluster. We've got 3 nodes in the cluster, 2 of which had issues.

2019-04-03 08:25:19.603 [main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] WARN org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x10071ff9baf0001 has expired
2019-04-03 08:25:19.603 [main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x10071ff9baf0001 has expired, closing socket connection
2019-04-03 08:25:19.605 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x10071ff9baf0001
2019-04-03 08:25:19.605 [zk-session-expiry-handler0] INFO kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Session expired.
2019-04-03 08:25:19.609 [zk-session-expiry-handler0] INFO kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Initializing a new session to vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100.
2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100 sessionTimeout=6000 watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@12f8b1d8
2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO o.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: 'file:/app0/common/config/ldap-auth.config'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-EventThread] ERROR kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Auth failed.
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /169.20.222.18:56876, server: vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
2019-04-03 08:25:19.612 [controller-event-thread] INFO k.controller.PartitionStateMachine - [PartitionStateMachine controllerId=3] Stopped partition state machine
2019-04-03 08:25:19.613 [controller-event-thread] INFO kafka.controller.ReplicaStateMachine - [ReplicaStateMachine controllerId=3] Stopped replica state machine
2019-04-03 08:25:19.614 [controller-event-thread] INFO kafka.controller.KafkaController - [Controller id=3] Resigned
2019-04-03 08:25:19.615 [controller-event-thread] INFO kafka.zk.KafkaZkClient - Creating /brokers/ids/3 (is it secure? false)
2019-04-03 08:25:19.628 [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100, sessionid = 0x1007f4d2b810000, negotiated timeout = 6000
2019-04-03 08:25:19.631 [/config/changes-event-process-thread] INFO k.c.ZkNodeChangeNotificationListener - Processing notification(s) to /config/changes
2019-04-03 08:25:19.637 [controller-event-thread] ERROR k.zk.KafkaZkClient$CheckedEphemeral - Error while creating ephemeral at /brokers/ids/3, node already exists and owner '72182936680464385' does not match current session '72197563457011712'
2019-04-03 08:25:19.637 [controller-event-thread] INFO kafka.zk.KafkaZkClient - Result of znode creation at /brokers/ids/3 is: NODEEXISTS
2019-04-03 08:25:19.644 [controller-event-thread] ERROR k.c.ControllerEventManager$ControllerEventThread - [ControllerEventThread controllerId=3] Error processing event RegisterBrokerAndReelect
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
    at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1631)
    at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:87)
    at kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1516)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:89)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:89)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

Thread dump attached

> Zookeeper Connection Issue Take Down the Whole kafka cluster
> ------------------------------------------------------------
>
> Key: KAFKA-8188
> URL: https://issues.apache.org/jira/browse/KAFKA-8188
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: Candice Wan
> Priority: Critical
> Attachments: thread_dump.log
>
>
> We recently upgraded to 2.1.1 and saw below zookeeper connection issues which
> took down the whole cluster. We've got 3 nodes in the cluster, 2 of which
> threw below exceptions at the same second.
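As a quick check of the NodeExists error in the log, the decimal owner and session IDs from the CheckedEphemeral message can be compared with the hexadecimal session IDs that the ZooKeeper client logs. The following is a minimal Python sketch, not part of Kafka or ZooKeeper; the variable names are illustrative and the values are copied from the log lines in this report.

```python
# Session IDs as logged by the ZooKeeper client (hexadecimal).
expired_session = 0x10071ff9baf0001  # "session 0x10071ff9baf0001 has expired"
new_session = 0x1007f4d2b810000      # "sessionid = 0x1007f4d2b810000"

# IDs as logged by KafkaZkClient$CheckedEphemeral (decimal).
znode_owner = 72182936680464385      # owner of the existing /brokers/ids/3 znode
current_session = 72197563457011712  # session attempting to create the znode

# Compare the two notations of each ID.
owner_is_expired_session = znode_owner == expired_session
current_is_new_session = current_session == new_session

print(hex(znode_owner), owner_is_expired_session)
print(hex(current_session), current_is_new_session)
```

Both comparisons hold, which suggests that the existing /brokers/ids/3 znode was still owned by the expired session when the broker tried to re-register under the new one.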