ZhenyuLi created ZOOKEEPER-5033:
-----------------------------------

             Summary: Quorum SASL authentication fails permanently after Login TGT refresh thread exits
                 Key: ZOOKEEPER-5033
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5033
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum, server
    Affects Versions: 3.9.3
            Reporter: ZhenyuLi


When quorum SASL authentication is enabled ({{quorum.auth.enableSasl=true}})
with Kerberos, the {{Login}} class runs a background daemon thread to
periodically refresh the TGT. This thread can silently exit in several
scenarios:

1. Clock skew between the host and KDC ({{Login.java:185-193}})
2. TGT is not renewable ({{Login.java:148-160}})
3. {{kinit -R}} fails after retry ({{Login.java:236-242}})
4. {{reLogin()}} fails after retry ({{Login.java:268-270}})
5. {{nextRefresh}} is in the past ({{Login.java:207-214}})
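
The common shape of these scenarios can be illustrated with a minimal sketch. It is not the actual {{Login}} code: {{RefreshThreadSketch}}, {{startRefresher}} and {{exitedWithin}} are hypothetical names, and the refresh step is reduced to a boolean callback. The point is only that a loop which returns on failure leaves nothing behind to restart it.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration; not ZooKeeper's Login implementation.
class RefreshThreadSketch {

    // Starts a daemon thread that stops for good the first time a refresh fails.
    static Thread startRefresher(BooleanSupplier refreshOnce, long sleepMs) {
        Thread t = new Thread(() -> {
            while (true) {
                if (!refreshOnce.getAsBoolean()) {
                    // The loop simply returns: nothing restarts the thread
                    // and no caller is notified.
                    return;
                }
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "sketch-tgt-refresh");
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Waits up to timeoutMs and reports whether the thread has terminated.
    static boolean exitedWithin(Thread t, long timeoutMs) {
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}
```

With a callback that always fails, for example {{startRefresher(() -> false, 10L)}}, the thread terminates after its first iteration and the process carries on without any refresher.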

After the thread exits, the TGT cached in the {{Subject}} eventually expires.
When a Follower/Observer later needs to reconnect to the Leader (e.g., after a
network partition or leader switch), {{SaslQuorumAuthLearner.authenticate()}}
uses the stale credentials from {{learnerLogin.getSubject()}} and fails with
{{SaslException}}.

The caller ({{QuorumPeer}} main loop) retries by going through
{{LOOKING → FOLLOWING → connectToLeader → authenticate}}, but the
{{authLearner}} object is created once in {{QuorumPeer.initialize()}} and never
recreated. The same stale {{Login}} and {{Subject}} are reused, causing every
retry to fail indefinitely.
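
The retry behavior described above can be modeled with a small sketch. All names here ({{StaleLoginRetrySketch}}, {{FrozenLogin}}, {{AuthLearner}}, {{countFailures}}) are hypothetical, and ticket validity is reduced to a timestamp comparison; the real code paths go through JAAS and SASL.

```java
// Hypothetical model of the retry loop; names are illustrative, not ZooKeeper's.
class StaleLoginRetrySketch {

    // Stand-in for a Login whose refresh thread has exited:
    // the ticket expiry never moves forward again.
    static final class FrozenLogin {
        private final long ticketExpiresAtMs;

        FrozenLogin(long ticketExpiresAtMs) {
            this.ticketExpiresAtMs = ticketExpiresAtMs;
        }

        boolean ticketValidAt(long nowMs) {
            return nowMs < ticketExpiresAtMs;
        }
    }

    // Stand-in for the learner-side authenticator, built once at startup.
    static final class AuthLearner {
        private final FrozenLogin login;

        AuthLearner(FrozenLogin login) {
            this.login = login;
        }

        boolean authenticate(long nowMs) {
            return login.ticketValidAt(nowMs);
        }
    }

    // Counts failed attempts when the same AuthLearner is reused for every retry.
    static int countFailures(AuthLearner authLearner, long startMs, long stepMs, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            if (!authLearner.authenticate(startMs + i * stepMs)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        AuthLearner authLearner = new AuthLearner(new FrozenLogin(1_000L));
        // Every attempt happens after the expiry, so all five fail.
        System.out.println(countFailures(authLearner, 2_000L, 500L, 5));
    }
}
```

Because the expiry is fixed when the object is constructed, no number of additional attempts changes the outcome once the ticket has expired.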

*Trigger conditions (all must be met):*
 - {{quorum.auth.enableSasl=true}}
 - {{quorum.auth.learnerRequireSasl=true}} and 
{{quorum.auth.serverRequireSasl=true}}
 - Kerberos authentication (not DIGEST-MD5)
 - Login refresh thread exits due to one of the above scenarios
 - A reconnection event occurs after TGT expires

*Impact:* 
The affected server permanently loses the ability to join the quorum. If 
multiple servers are affected, quorum may be lost. The only 
recovery is to restart the process.

*Affected code paths:*
 - {{Learner.connectToLeader()}} → {{self.authLearner.authenticate()}} 
(Learner.java:354)
 - {{QuorumCnxManager.initiateConnection()}} → {{authLearner.authenticate()}} 
(QuorumCnxManager.java:506)
 - {{QuorumCnxManager.handleConnection()}} → {{authServer.authenticate()}} 
(QuorumCnxManager.java:633)
 - {{LearnerHandler}} constructor → {{authServer.authenticate()}} 
(LearnerHandler.java:291)

*Fix:* 
Add a {{forceReLogin()}} method to {{Login}} that logs in again immediately
(bypassing the minimum time check), and call it from
{{SaslQuorumAuthLearner.authenticate()}} and
{{SaslQuorumAuthServer.authenticate()}} when authentication fails. This ensures
that the next authentication attempt uses fresh credentials.
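
A minimal sketch of the proposed shape follows, under stated assumptions: {{SketchLogin}} and {{SketchAuthLearner}} are hypothetical stand-ins rather than the real classes, the one-minute throttle value is assumed, and the actual JAAS logout/login is replaced by a counter so the behavior can be observed.

```java
import java.util.function.LongSupplier;
import javax.security.sasl.SaslException;

// Hypothetical stand-in for Login; not the real implementation.
class SketchLogin {
    // Assumed throttle between ordinary re-logins, in milliseconds.
    static final long MIN_TIME_BEFORE_RELOGIN_MS = 60_000L;

    private final LongSupplier clock;
    private long lastLoginMs;
    private int loginCount;

    SketchLogin(LongSupplier clock) {
        this.clock = clock;
        this.lastLoginMs = clock.getAsLong();
        this.loginCount = 1; // the initial login
    }

    // Ordinary path: skipped when the previous login is too recent.
    synchronized boolean reLogin() {
        long now = clock.getAsLong();
        if (now - lastLoginMs < MIN_TIME_BEFORE_RELOGIN_MS) {
            return false;
        }
        doLogin(now);
        return true;
    }

    // Proposed path: always logs in again, bypassing the minimum time check.
    synchronized void forceReLogin() {
        doLogin(clock.getAsLong());
    }

    private void doLogin(long now) {
        // A real implementation would log out and log in via JAAS here.
        lastLoginMs = now;
        loginCount++;
    }

    synchronized int loginCount() {
        return loginCount;
    }
}

// Hypothetical stand-in for the learner-side authenticator.
class SketchAuthLearner {
    interface Handshake {
        void run() throws SaslException;
    }

    private final SketchLogin login;

    SketchAuthLearner(SketchLogin login) {
        this.login = login;
    }

    // On failure, refresh credentials so the caller's next retry starts fresh.
    void authenticate(Handshake handshake) throws SaslException {
        try {
            handshake.run();
        } catch (SaslException e) {
            login.forceReLogin();
            throw e;
        }
    }
}
```

The exception is still propagated in this sketch, so the existing retry loop in the caller stays in charge of reconnecting; only the credentials it will use on the next attempt change.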



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
