ZhenyuLi created ZOOKEEPER-5033:
-----------------------------------
Summary: Quorum SASL authentication fails permanently after Login
TGT refresh thread exits
Key: ZOOKEEPER-5033
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5033
Project: ZooKeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.9.3
Reporter: ZhenyuLi
When quorum SASL authentication is enabled ({{quorum.auth.enableSasl=true}})
with Kerberos, the {{Login}} class runs a background daemon thread to
periodically refresh the TGT. This thread can silently exit in several
scenarios:
1. Clock skew between the host and KDC ({{Login.java:185-193}})
2. TGT is not renewable ({{Login.java:148-160}})
3. {{kinit -R}} fails after retry ({{Login.java:236-242}})
4. {{reLogin()}} fails after retry ({{Login.java:268-270}})
5. {{nextRefresh}} is in the past ({{Login.java:207-214}})
After the thread exits, the TGT cached in the {{Subject}} eventually expires.
When a Follower/Observer later needs to reconnect to the Leader (e.g., after a
network partition or leader switch), {{SaslQuorumAuthLearner.authenticate()}}
uses the stale credentials from {{learnerLogin.getSubject()}} and fails with
{{SaslException}}.
The caller ({{QuorumPeer}} main loop) retries by going through
{{LOOKING → FOLLOWING → connectToLeader → authenticate}}, but the
{{authLearner}} object is created once in {{QuorumPeer.initialize()}} and never
recreated. The same stale {{Login}} and {{Subject}} are reused, causing every
retry to fail indefinitely.
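The retry behavior described above can be illustrated with a small, self-contained model. This is only a sketch and not ZooKeeper code: {{FakeLogin}}, {{FakeAuthLearner}} and {{countFailedRetries}} are hypothetical stand-ins for {{Login}}, {{SaslQuorumAuthLearner}} and the {{QuorumPeer}} retry loop. It assumes the ticket is never refreshed once the refresh thread has exited.

```java
import java.time.Instant;

public class StaleLoginSketch {
    /** Hypothetical stand-in for Login: the ticket expiry is fixed because nothing refreshes it. */
    static final class FakeLogin {
        private final Instant ticketExpiry;

        FakeLogin(Instant ticketExpiry) {
            this.ticketExpiry = ticketExpiry;
        }

        boolean ticketValid(Instant now) {
            return now.isBefore(ticketExpiry);
        }
    }

    /** Hypothetical stand-in for the learner-side authenticator, created once and then reused. */
    static final class FakeAuthLearner {
        private final FakeLogin login;

        FakeAuthLearner(FakeLogin login) {
            this.login = login;
        }

        boolean authenticate(Instant now) {
            // Authentication succeeds only while the cached ticket is still valid.
            return login.ticketValid(now);
        }
    }

    /** Retries with the same authLearner object and returns how many attempts failed. */
    static int countFailedRetries(FakeAuthLearner authLearner, Instant now, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            if (!authLearner.authenticate(now)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Instant expiry = Instant.parse("2024-01-01T00:00:00Z");
        FakeAuthLearner authLearner = new FakeAuthLearner(new FakeLogin(expiry));
        // Every retry after expiry fails, because the same stale login is reused.
        System.out.println(countFailedRetries(authLearner, expiry.plusSeconds(60), 5));
    }
}
```

In this model no number of retries can succeed, because nothing in the loop replaces or refreshes the login held by the authenticator.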
*Trigger conditions (all must be met):*
- {{quorum.auth.enableSasl=true}}
- {{quorum.auth.learnerRequireSasl=true}} and
{{quorum.auth.serverRequireSasl=true}}
- Kerberos authentication (not DIGEST-MD5)
- Login refresh thread exits due to one of the above scenarios
- A reconnection event occurs after TGT expires
*Impact:*
The affected server permanently loses the ability to join the quorum. If
multiple servers are affected, quorum may be lost. The only recovery is to
restart the process.
*Affected code paths:*
- {{Learner.connectToLeader()}} → {{self.authLearner.authenticate()}}
(Learner.java:354)
- {{QuorumCnxManager.initiateConnection()}} → {{authLearner.authenticate()}}
(QuorumCnxManager.java:506)
- {{QuorumCnxManager.handleConnection()}} → {{authServer.authenticate()}}
(QuorumCnxManager.java:633)
- {{LearnerHandler}} constructor → {{authServer.authenticate()}}
(LearnerHandler.java:291)
*Fix:*
Add a {{forceReLogin()}} method to {{Login}} that logs in again immediately
(bypassing the minimum time check), and call it from
{{SaslQuorumAuthLearner.authenticate()}} and
{{SaslQuorumAuthServer.authenticate()}} when authentication fails. This ensures
that the next authentication attempt uses fresh credentials.
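A minimal sketch of the proposed behavior, under stated assumptions: {{FakeLogin}} and the static {{authenticate}} helper are hypothetical models, not the actual patch, and the 60-second {{MIN_RELOGIN_INTERVAL_MS}} is an assumed value standing in for the minimum time check. Time is passed in explicitly so the example is deterministic.

```java
public class ForceReLoginSketch {
    /** Hypothetical stand-in for Login with a throttled reLogin() and an unthrottled forceReLogin(). */
    static final class FakeLogin {
        // Assumed value for the minimum time between logins; not taken from the report.
        static final long MIN_RELOGIN_INTERVAL_MS = 60_000L;

        private final long ticketLifetimeMs;
        private long lastLoginMs;
        private long ticketExpiryMs;

        FakeLogin(long nowMs, long ticketLifetimeMs) {
            this.ticketLifetimeMs = ticketLifetimeMs;
            login(nowMs);
        }

        private void login(long nowMs) {
            lastLoginMs = nowMs;
            ticketExpiryMs = nowMs + ticketLifetimeMs;
        }

        boolean ticketValid(long nowMs) {
            return nowMs < ticketExpiryMs;
        }

        /** Throttled re-login: refused if the minimum interval has not elapsed. */
        boolean reLogin(long nowMs) {
            if (nowMs - lastLoginMs < MIN_RELOGIN_INTERVAL_MS) {
                return false;
            }
            login(nowMs);
            return true;
        }

        /** Proposed behavior: log in again immediately, bypassing the minimum time check. */
        void forceReLogin(long nowMs) {
            login(nowMs);
        }
    }

    /** Models authenticate(): on failure, force a re-login so the next attempt has fresh credentials. */
    static boolean authenticate(FakeLogin login, long nowMs) {
        if (login.ticketValid(nowMs)) {
            return true;
        }
        login.forceReLogin(nowMs);
        return false;
    }

    public static void main(String[] args) {
        FakeLogin login = new FakeLogin(0L, 1_000L); // ticket expires at t=1000 ms
        long now = 5_000L;                           // ticket is already stale
        boolean first = authenticate(login, now);    // fails, then refreshes credentials
        boolean second = authenticate(login, now);   // retry uses the refreshed ticket
        System.out.println(first + " " + second);
    }
}
```

The first attempt still fails in this model. The point of the design is that the failure repairs the credentials, so the existing retry loop recovers without a process restart.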