Ramachandran Krishnan created RANGER-5654:
---------------------------------------------
Summary: Solr audit dispatcher fails to index after Kerberos TGT
relogin (No key to store) with default useTicketCache=true
Key: RANGER-5654
URL: https://issues.apache.org/jira/browse/RANGER-5654
Project: Ranger
Issue Type: Task
Components: Ranger
Reporter: Ramachandran Krishnan
Assignee: Ramachandran Krishnan
Fix For: 3.0.0
The Solr audit dispatcher ({{ranger-audit-dispatcher-solr}}) can stop
indexing audits into Kerberos-protected Solr after a TGT refresh/relogin. Logs
show repeated batch failures:
Error sending message to Solr
Login failed due to: Unable to login with rangerauditserver/<host>@<REALM> due to: No key to store
Kafka consumption continues, but Solr doc counts do not increase and Ranger
Admin Audit → Access does not show new events.
h3. Affected components (master)
* {{audit-server/audit-dispatcher/dispatcher-solr}} → {{SolrAuditDestination}} / {{KerberosAction}}
* {{agents-audit/core}} → {{AbstractKerberosUser.checkTGTAndRelogin()}}
* Shipped config templates:
** {{audit-server/audit-dispatcher/dispatcher-solr/src/main/resources/conf/ranger-audit-dispatcher-solr-site.xml}}
** {{dev-support/ranger-docker/scripts/audit-dispatcher/ranger-audit-dispatcher-solr-site.xml}}
Both ship:
xasecure.audit.jaas.Client.option.useTicketCache = true
with {{useKeyTab=true}} and {{storeKey=true}}.
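For orientation, the shipped combination maps onto {{Krb5LoginModule}} options roughly as in the sketch below. It builds an in-memory JAAS entry in Java instead of reading the site XML. The class name {{KeytabJaasConfig}} and the principal and keytab values passed to it are hypothetical, and this is not necessarily how Ranger assembles its JAAS configuration.

```java
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.AppConfigurationEntry.LoginModuleControlFlag;
import javax.security.auth.login.Configuration;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: an in-memory JAAS configuration with a single
// Krb5LoginModule entry, mirroring the three options named in this issue.
final class KeytabJaasConfig extends Configuration {
    private final Map<String, String> options = new HashMap<>();

    KeytabJaasConfig(String principal, String keytabPath, boolean useTicketCache) {
        options.put("useKeyTab", "true");   // log in from the keytab
        options.put("storeKey", "true");    // keep the key in the Subject
        options.put("useTicketCache", Boolean.toString(useTicketCache));
        options.put("principal", principal);
        options.put("keyTab", keytabPath);
    }

    @Override
    public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
        // The same entry is returned for every application name in this sketch.
        return new AppConfigurationEntry[] {
            new AppConfigurationEntry(
                "com.sun.security.auth.module.Krb5LoginModule",
                LoginModuleControlFlag.REQUIRED,
                options)
        };
    }
}
```

Passing {{true}} as the last argument reproduces the shipped combination; passing {{false}} yields a keytab-only entry.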
h3. Root cause
# Config: {{useTicketCache=true}} encourages reuse of the default credential
cache. In Docker or other long-running deployments, cached tickets (e.g. from
{{kinit}} / {{KRB5CCNAME}}) can mix with the keytab-based JAAS login. On
relogin, {{Krb5LoginModule}} may not have a storable key and fails with “No key
to store”.
# Code: {{AbstractKerberosUser.checkTGTAndRelogin()}} does {{logout();
login();}} with no recovery if the relogin fails (it does not create a fresh
{{Subject}} / {{LoginContext}}).
h2. The pipeline (normal path)
HDFS plugin → audit ingestor → Kafka (ranger_audits) → Solr dispatcher → Solr →
Ranger Admin UI
# HDFS plugin: A user (e.g. {{testuser1}}) does something audited, like {{hdfs dfs -ls}} on a path they are denied access to. The plugin sends the audit to the ingestor.
# Ingestor: Accepts the audit and writes it to Kafka topic {{ranger_audits}}.
# Kafka: Holds the audit record. You can see the topic’s end offset go up.
# Solr dispatcher: Reads from Kafka, then POSTs/indexes each batch into the Kerberos-protected Solr collection {{ranger_audits}}.
# Solr: Stores the document. A query like {{reqUser:testuser1}} returns more docs.
# Ranger Admin: Audit → Access reads from Solr and shows the new event.
So: a growing Kafka end offset only means steps 1–3 worked. Solr and Admin
updating means steps 4–6 worked.
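That distinction can be expressed as a small check over before/after counters. The following is a minimal sketch, assuming the Kafka end offset and the Solr document count have already been read by some other means; {{AuditPipelineCheck}}, its {{Stage}} values, and {{diagnose}} are hypothetical names, not Ranger code. It does not look at Ranger Admin (step 6).

```java
// Hypothetical helper: classify where the audit pipeline stalled, given
// counters sampled before and after triggering an audit.
final class AuditPipelineCheck {
    enum Stage { STALLED_BEFORE_KAFKA, STALLED_AT_SOLR_DISPATCH, SOLR_UPDATED }

    static Stage diagnose(long kafkaEndBefore, long kafkaEndAfter,
                          long solrDocsBefore, long solrDocsAfter) {
        if (kafkaEndAfter <= kafkaEndBefore) {
            // Nothing new on the topic: plugin or ingestor side (steps 1-3).
            return Stage.STALLED_BEFORE_KAFKA;
        }
        if (solrDocsAfter <= solrDocsBefore) {
            // Kafka grew but Solr did not: dispatcher or Solr side (steps 4-5).
            return Stage.STALLED_AT_SOLR_DISPATCH;
        }
        return Stage.SOLR_UPDATED;
    }
}
```

With the numbers reported in this issue (Kafka end offset 4 → 5, Solr count 63 → 63), {{diagnose}} returns {{STALLED_AT_SOLR_DISPATCH}}.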
h3. Docker Tier 3 audit stack
||Container||Role||
|{{ranger-kdc}}|Kerberos for Kafka, plugins, ingestor, Solr|
|{{ranger}} + {{ranger-postgres}}|Ranger Admin + policies|
|{{ranger-solr}} + {{ranger-zk}}|Audit search backend ({{ranger_audits}} collection)|
|{{ranger-kafka}}|Topic {{ranger_audits}}|
|{{ranger-audit-ingestor}}|Plugins POST audits here ({{:7081}})|
|{{ranger-audit-dispatcher-solr}}|Kafka → Solr (Kerberos to Solr)|
|{{ranger-hadoop}}|HDFS + Ranger HDFS plugin ({{dev_hdfs}})|
h3. Reproduction steps (Docker Tier 3 audit stack)
h3. Step 1 — Run the stack with Kerberos + Solr dispatcher
Bring up the Tier 3 Docker audit stack: ingestor, Kafka,
ranger-audit-dispatcher-solr, Solr, HDFS, etc., all using Kerberos (keytabs,
not simple auth).
Solr is locked down; the dispatcher must log in as {{rangerauditserver/...}} to
write to Solr.
h3. Step 2 — Trigger real audits
Run something that produces audits end-to-end, e.g.:
* the HDFS deny-traverse flow ({{testuser1}} tries to traverse a path that Ranger denies, which generates an audited DENY).
At first this often works: plugin ✓, ingestor ✓, Kafka offset ✓.
h3. Step 3 — Stress Kerberos / login state
Do one or more of:
* Restart {{ranger-audit-ingestor}} (common during E2E {{--fresh-plan}} when
topics are recreated).
* Delete and recreate {{ranger_audits}} (dispatchers restart, consumers
rewind).
* Wait long enough for the dispatcher’s Kerberos TGT to need refresh/relogin (or hit the 80% TGT lifetime window in {{AbstractKerberosUser}}).
These don’t break Kafka itself; they change tickets, caches, and JVM login
state in the Solr dispatcher.
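To estimate how long to wait in the third option, the 80% window can be computed from the ticket's start and end times. This is a minimal sketch under the assumption that the window is measured as 80% of the ticket lifetime from its start time; the exact computation inside {{AbstractKerberosUser}} is not shown in this issue, and {{TgtRefreshWindow}} is a hypothetical name.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: when does a TGT cross 80% of its lifetime?
final class TgtRefreshWindow {
    // Assumed threshold, taken from the "80% TGT lifetime window" above.
    static final long RENEW_WINDOW_PERCENT = 80;

    static Instant refreshTime(Instant ticketStart, Instant ticketEnd) {
        long lifetimeMillis = Duration.between(ticketStart, ticketEnd).toMillis();
        // Integer arithmetic avoids floating-point rounding at the boundary.
        return ticketStart.plusMillis(lifetimeMillis * RENEW_WINDOW_PERCENT / 100);
    }

    static boolean needsRelogin(Instant ticketStart, Instant ticketEnd, Instant now) {
        // True once "now" has reached or passed the refresh time.
        return !now.isBefore(refreshTime(ticketStart, ticketEnd));
    }
}
```

For a ticket with a 10-hour lifetime, this places the relogin attempt 8 hours after the ticket's start time.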
h3. Step 4 — Trigger audits again
Run the same HDFS audit trigger again. Now watch each hop.
----
h3. What you observe when the bug hits
*Solr dispatcher logs (the smoking gun)*
ERROR - Error processing batch in worker 'solr-worker-0', batch size: 5
java.lang.Exception: Failure in sending audits into Solr
ERROR - Error sending message to Solr
Login failed due to: Unable to login with rangerauditserver/...@... due to: No key to store
h3. Meaning:
* The dispatcher still consumes from Kafka.
* When it tries to send the batch to Solr, Kerberos login/relogin fails.
* Every batch fails → nothing new in Solr.
h3. Kafka — looks healthy (misleading)
end offset 4 → 5 ✓
The ingestor → Kafka path is fine. New audits land on the topic. That’s why the
bug is easy to miss if you only check Kafka.
h3. Solr — stuck
waiting for Solr docs (reqUser:testuser1)...
Solr count did not increase (before=63, after=63) ✗
Query {{reqUser:testuser1}} (via Kerberos from inside the dispatcher
container): count unchanged.
h3. Ranger Admin — often unchanged too
{{totalCount}} may not move; {{testuser1}} doesn’t appear in recent audits
because Admin reads Solr, not Kafka.
----
h2. Why it happens
||Piece||Role||
|{{useTicketCache=true}}|On relogin, JAAS tries the ticket cache instead of always using the keytab.|
|Mixed state|Container may have tickets from {{kinit}} / restarts while the JVM subject expects keytab credentials.|
|Relogin|After TGT refresh, {{checkTGTAndRelogin()}} runs {{logout(); login();}} and fails with “No key to store”; there is no recovery on master.|
|Result|Kafka fills up; Solr dispatcher can’t authenticate to Solr; pipeline stalls at step 4.|
*Proposed fix:*
Config (dispatcher Solr site XML):
xasecure.audit.jaas.Client.option.useTicketCache = false
Force keytab-based login for a keytab service principal.
Code ({{AbstractKerberosUser}}): On a relogin {{LoginException}}, reset
{{loginContext}}, create a new {{Subject}}, and retry {{login()}} from the
keytab.
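The code change could take roughly the following shape. This is a sketch of the recovery idea only, not the actual patch: {{ReloginWithRecovery}} and its {{KeytabLogin}} interface are hypothetical, and {{KeytabLogin}} stands in for creating a new {{LoginContext}} over the given {{Subject}} and calling {{login()}} on it, so that the retry logic can be exercised without a KDC.

```java
import javax.security.auth.Subject;
import javax.security.auth.login.LoginException;

// Hypothetical sketch: retry a failed relogin once with a fresh Subject.
final class ReloginWithRecovery {
    // Stand-in for "build a new LoginContext for this Subject and login()".
    interface KeytabLogin {
        void login(Subject subject) throws LoginException;
    }

    private final KeytabLogin keytabLogin;
    private Subject subject = new Subject();

    ReloginWithRecovery(KeytabLogin keytabLogin) {
        this.keytabLogin = keytabLogin;
    }

    Subject relogin() throws LoginException {
        try {
            keytabLogin.login(subject);
        } catch (LoginException firstFailure) {
            // Drop the possibly inconsistent Subject and start from a clean
            // one; a second failure propagates to the caller unchanged.
            subject = new Subject();
            keytabLogin.login(subject);
        }
        return subject;
    }
}
```

The retry is deliberately limited to one attempt so that a genuinely broken keytab still surfaces as a {{LoginException}} instead of looping.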
*Verification*
* HDFS audit pipeline E2E: plugin → ingestor → Kafka → Solr dispatcher → Solr → Admin API
* Solr {{numFound}} increases for {{reqUser:testuser1}}
* Dispatcher logs show {{Successful login for rangerauditserver/...}} without repeated {{No key to store}}
*Notes*
* Not specific to dynamic Kafka partition plan; reproduces on master with standard Solr dispatcher + Kerberos.
* {{AuditServerConstants.JAAS_USER_TICKET_CACHE}} already documents {{useTicketCache=false}} for some Kafka paths; Solr dispatcher template is inconsistent.
h3. How we proved the fix
* Set {{useTicketCache=false}} → login always from keytab.
* Harden {{AbstractKerberosUser}} relogin → recreate subject if relogin fails.
* Restart Solr dispatcher before pipeline checks in E2E.
After that: Solr {{63 → 64}}, Admin shows {{testuser1}}, logs show
{{Successful login for rangerauditserver/...}}.