[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983015#comment-16983015 ]
ASF GitHub Bot commented on KAFKA-7987: --------------------------------------- parafiend commented on pull request #7751: KAFKA-7987: Reinitialize ZookeeperClient after auth failures URL: https://github.com/apache/kafka/pull/7751 Schedule a client reinitialization after encountering a Zookeeper auth failure. This allows for reconnections when transient network errors are encountered during connection establishment. The Zookeeper client doesn't expose details of the auth failure so we can't determine whether an error is retriable or not, so all auth failures are retried. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) Tested in a local docker environment by using iptables to block communication with ZK and Kerberos nodes. With this change, after auth failure due to Kerberos timeout, client continued retrying until communication was reopened and connection successfully re-established. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > a broker's ZK session may die on transient auth failure > ------------------------------------------------------- > > Key: KAFKA-7987 > URL: https://issues.apache.org/jira/browse/KAFKA-7987 > Project: Kafka > Issue Type: Improvement > Reporter: Jun Rao > Priority: Major > > After a transient network issue, we saw the following log in a broker. > {code:java} > [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: > javax.security.sasl.SaslException: An error: > (java.security.PrivilegedActionException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Server not found in Kerberos database (7))]) occurred when > evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client > will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn) > [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. > (kafka.zookeeper.ZooKeeperClient) > {code} > The network issue prevented the broker from communicating to ZK. The broker's > ZK session then expired, but the broker didn't know that yet since it > couldn't establish a connection to ZK. When the network was back, the broker > tried to establish a connection to ZK, but failed due to auth failure (likely > due to a transient KDC issue). The current logic just ignores the auth > failure without trying to create a new ZK session. Then the broker will be > permanently in a state that it's alive, but not registered in ZK. > -- This message was sent by Atlassian Jira (v8.3.4#803005)