Ramachandran Krishnan created RANGER-5654:
---------------------------------------------

             Summary: Solr audit dispatcher fails to index after Kerberos TGT 
relogin (No key to store) with default useTicketCache=true
                 Key: RANGER-5654
                 URL: https://issues.apache.org/jira/browse/RANGER-5654
             Project: Ranger
          Issue Type: Task
          Components: Ranger
            Reporter: Ramachandran Krishnan
            Assignee: Ramachandran Krishnan
             Fix For: 3.0.0


The Solr audit dispatcher ({{ranger-audit-dispatcher-solr}}) can stop
indexing audits into Kerberos-protected Solr after a TGT refresh/relogin. Logs
show repeated batch failures:
Error sending message to Solr
Login failed due to: Unable to login with rangerauditserver/<host>@<REALM> due to: No key to store
Kafka consumption continues, but Solr doc counts do not increase and Ranger
Admin Audit → Access does not show new events.
h3. Affected components (master)
 * {{audit-server/audit-dispatcher/dispatcher-solr}} → {{SolrAuditDestination}} 
/ {{KerberosAction}}
 * {{agents-audit/core}} → {{AbstractKerberosUser.checkTGTAndRelogin()}}
 * Shipped config templates:
 ** {{audit-server/audit-dispatcher/dispatcher-solr/src/main/resources/conf/ranger-audit-dispatcher-solr-site.xml}}
 ** {{dev-support/ranger-docker/scripts/audit-dispatcher/ranger-audit-dispatcher-solr-site.xml}}

Both ship:
xasecure.audit.jaas.Client.option.useTicketCache = true
with {{useKeyTab=true}} and {{storeKey=true}}.
h3. Root cause
 # Config: {{useTicketCache=true}} encourages reuse of the default credential
cache. In Docker or other long-running deployments, cached tickets (e.g. from
{{kinit}} / {{KRB5CCNAME}}) can mix with the keytab-based JAAS login. On
relogin, {{Krb5LoginModule}} may not have a storable key and fails with
“No key to store”.
 # Code: {{AbstractKerberosUser.checkTGTAndRelogin()}} calls
{{logout(); login();}} and has no recovery path if the relogin fails: it does
not create a fresh {{Subject}} / {{LoginContext}}.

h2. The pipeline (normal path)
HDFS plugin → audit ingestor → Kafka (ranger_audits) → Solr dispatcher → Solr → 
Ranger Admin UI
 # HDFS plugin: a user (e.g. {{testuser1}}) does something audited, like
{{hdfs dfs -ls}} on a path they are denied on. The plugin sends the audit to
the ingestor.
 # Ingestor: accepts the audit and writes it to the Kafka topic
{{ranger_audits}}.
 # Kafka: holds the audit record. You can see the topic’s end offset go up.
 # Solr dispatcher: reads from Kafka, then POSTs/indexes each batch into the
Kerberos-protected Solr collection {{ranger_audits}}.
 # Solr: stores the document. A query like {{reqUser:testuser1}} returns more
docs.
 # Ranger Admin: Audit → Access reads from Solr and shows the new event.

So a growing Kafka end offset only shows that steps 2–3 worked; Solr and Admin
updating shows that steps 4–6 worked.
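
The distinction above can be turned into a small triage check. The following is a minimal sketch, not Ranger code: {{AuditPipelineCheck}}, {{Stage}} and {{locate}} are hypothetical names, and the four counters are assumed to be sampled before and after an audit is triggered (Kafka end offset of {{ranger_audits}} and the Solr document count for the test user).

```java
/** Hypothetical triage helper; names are illustrative and not part of Ranger. */
final class AuditPipelineCheck {

    /** Where the pipeline appears to stop, judged only from two counters. */
    enum Stage { BEFORE_KAFKA, DISPATCHER_OR_SOLR, INDEXED }

    static Stage locate(long kafkaBefore, long kafkaAfter,
                        long solrBefore, long solrAfter) {
        if (solrAfter > solrBefore) {
            // New documents reached Solr, so the dispatcher authenticated and indexed.
            return Stage.INDEXED;
        }
        if (kafkaAfter > kafkaBefore) {
            // Audits reached Kafka but not Solr: look at the Solr dispatcher logs.
            return Stage.DISPATCHER_OR_SOLR;
        }
        // Nothing new on the topic: the plugin or ingestor side did not deliver.
        return Stage.BEFORE_KAFKA;
    }
}
```

With the numbers reported in this issue (Kafka end offset 4 → 5, Solr count 63 → 63) the sketch returns {{DISPATCHER_OR_SOLR}}, which is where the “No key to store” failure shows up.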
h3. Docker Tier 3 audit stack:
||Container||Role||
|{{ranger-kdc}}|Kerberos for Kafka, plugins, ingestor, Solr|
|{{ranger}} + {{ranger-postgres}}|Ranger Admin + policies|
|{{ranger-solr}} + {{ranger-zk}}|Audit search backend ({{ranger_audits}} collection)|
|{{ranger-kafka}}|Topic {{ranger_audits}}|
|{{ranger-audit-ingestor}}|Plugins POST audits here ({{:7081}})|
|{{ranger-audit-dispatcher-solr}}|Kafka → Solr (Kerberos to Solr)|
|{{ranger-hadoop}}|HDFS + Ranger HDFS plugin ({{dev_hdfs}})|
h3. Reproduction (Docker Tier 3 audit stack)

 
h3. What you do to reproduce the bug
h3. Step 1 — Run the stack with Kerberos + Solr dispatcher

Bring up the Tier 3 Docker audit stack: ingestor, Kafka, 
ranger-audit-dispatcher-solr, Solr, HDFS, etc., all using Kerberos (keytabs, 
not simple auth).

Solr is locked down; the dispatcher must log in as {{rangerauditserver/...}} to 
write to Solr.
h3. Step 2 — Trigger real audits

Run something that produces audits end-to-end, e.g.:
 * the HDFS deny-traverse flow ({{testuser1}} tries to traverse a path that
Ranger denies, which generates an audited DENY).

At first this often works: plugin ✓, ingestor ✓, Kafka offset ✓.
h3. Step 3 — Stress Kerberos / login state

Do one or more of:
 * Restart {{ranger-audit-ingestor}} (common during E2E {{--fresh-plan}} when 
topics are recreated).
 * Delete and recreate {{ranger_audits}} (dispatchers restart, consumers 
rewind).
 * Wait long enough for the dispatcher’s Kerberos TGT to need refresh/relogin
(or hit the 80% TGT lifetime window in {{AbstractKerberosUser}}).

These don’t break Kafka itself; they change tickets, caches, and JVM login 
state in the Solr dispatcher.
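
For a sense of when the relogin path is reached, here is a minimal sketch of an 80% lifetime window. It assumes the window is measured from the ticket start time to its end time; {{TgtRefreshWindow}} and its methods are hypothetical names, and the exact arithmetic in {{AbstractKerberosUser}} may differ.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical illustration of an 80% TGT lifetime window; not Ranger code. */
final class TgtRefreshWindow {

    // Assumption: the "80%" in the text is a fraction of (endTime - startTime).
    static final long RENEW_PERCENT = 80;

    /** Instant at which the ticket has used up 80% of its lifetime. */
    static Instant refreshTime(Instant ticketStart, Instant ticketEnd) {
        long lifetimeMillis = Duration.between(ticketStart, ticketEnd).toMillis();
        return ticketStart.plusMillis(lifetimeMillis * RENEW_PERCENT / 100);
    }

    /** True once 'now' is at or past the refresh time. */
    static boolean needsRelogin(Instant ticketStart, Instant ticketEnd, Instant now) {
        return !now.isBefore(refreshTime(ticketStart, ticketEnd));
    }
}
```

Under these assumptions a ticket with a 10 hour lifetime becomes due for relogin 8 hours after it was issued, so waiting for the failure can take most of a ticket lifetime; the restart-based triggers above are the faster route.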
h3. Step 4 — Trigger audits again

Run the same HDFS audit trigger again. Now watch each hop.
----
h3. What you observe when the bug hits

*Solr dispatcher logs (the smoking gun)*
ERROR - Error processing batch in worker 'solr-worker-0', batch size: 5
java.lang.Exception: Failure in sending audits into Solr
ERROR - Error sending message to Solr
Login failed due to: Unable to login with rangerauditserver/...@... due to: No key to store
h3. Meaning:
 * The dispatcher still consumes from Kafka.
 * When it tries to send the batch to Solr, Kerberos login/relogin fails.
 * Every batch fails → nothing new in Solr.

h3. Kafka — looks healthy (misleading)
end offset 4 → 5 ✓
The ingestor → Kafka path is fine. New audits land on the topic. That’s why the 
bug is easy to miss if you only check Kafka.
h3. Solr — stuck
waiting for Solr docs (reqUser:testuser1)...
Solr count did not increase (before=63, after=63) ✗
Query {{reqUser:testuser1}} (via Kerberos from inside the dispatcher 
container): count unchanged.
h3. Ranger Admin — often unchanged too

{{totalCount}} may not move; {{testuser1}} doesn’t appear in recent audits 
because Admin reads Solr, not Kafka.
----
h2. Why it happens 
||Piece||Role||
|{{useTicketCache=true}}|On relogin, JAAS tries the ticket cache instead of always using the keytab.|
|Mixed state|Container may have tickets from {{kinit}} / restarts while the JVM subject expects keytab credentials.|
|Relogin|After TGT refresh, {{checkTGTAndRelogin()}} runs {{logout(); login();}} and fails with “No key to store”; there is no recovery on master.|
|Result|Kafka fills up; Solr dispatcher can’t authenticate to Solr; pipeline stalls at step 4.|

 

*Proposed fix:*

Config (dispatcher Solr site XML):
xasecure.audit.jaas.Client.option.useTicketCache = false
Force keytab-based login for a keytab service principal.
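
To show what the option change amounts to at the JAAS level, here is a minimal sketch that builds the equivalent {{AppConfigurationEntry}} programmatically. {{KeytabOnlyJaasEntry}} is a hypothetical name, and the principal and keytab path are caller-supplied placeholders. The option names are standard {{Krb5LoginModule}} options; how Ranger maps {{xasecure.audit.jaas.Client.option.*}} onto them is not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.AppConfigurationEntry.LoginModuleControlFlag;

/** Hypothetical helper: a keytab-only Krb5LoginModule entry; not Ranger code. */
final class KeytabOnlyJaasEntry {

    static AppConfigurationEntry build(String principal, String keytabPath) {
        Map<String, String> options = new HashMap<>();
        options.put("useKeyTab", "true");        // log in from the keytab
        options.put("keyTab", keytabPath);       // placeholder supplied by the caller
        options.put("principal", principal);     // placeholder supplied by the caller
        options.put("storeKey", "true");         // keep the key in the Subject
        options.put("useTicketCache", "false");  // the proposed change: ignore the cache
        return new AppConfigurationEntry(
                "com.sun.security.auth.module.Krb5LoginModule",
                LoginModuleControlFlag.REQUIRED,
                options);
    }
}
```

The point of the sketch is only that {{storeKey=true}} and {{useTicketCache=false}} are set together, so a relogin has no cached ticket to pick up in place of the keytab.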

Code ({{AbstractKerberosUser}}): when relogin throws {{LoginException}}, reset
{{loginContext}}, create a new {{Subject}}, and retry {{login()}} from the
keytab.
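
The recovery idea can be sketched independently of Ranger. In the example below, {{ReloginWithRecovery}} and its {{Session}} interface are hypothetical stand-ins, not the actual {{AbstractKerberosUser}} change; in real code the factory would presumably build a new {{LoginContext}} around a new {{Subject}}.

```java
import java.util.function.Supplier;
import javax.security.auth.login.LoginException;

/** Hypothetical sketch of "relogin, and rebuild the login state on failure". */
final class ReloginWithRecovery {

    /** Stand-in for the login()/logout() pair of a JAAS LoginContext. */
    interface Session {
        void login() throws LoginException;
        void logout() throws LoginException;
    }

    private final Supplier<Session> freshSessionFactory;
    private Session session;

    ReloginWithRecovery(Session initial, Supplier<Session> freshSessionFactory) {
        this.session = initial;
        this.freshSessionFactory = freshSessionFactory;
    }

    /** Tries logout+login; on LoginException retries once with a fresh session. */
    void relogin() throws LoginException {
        try {
            session.logout();
            session.login();
        } catch (LoginException firstFailure) {
            // Drop the stale state (e.g. a Subject without a storable key).
            Session fresh = freshSessionFactory.get();
            try {
                fresh.login();
            } catch (LoginException retryFailure) {
                retryFailure.addSuppressed(firstFailure);
                throw retryFailure;
            }
            session = fresh;
        }
    }

    Session current() {
        return session;
    }
}
```

The retry is deliberately limited to one attempt: if a fresh login from the keytab also fails, the problem is unlikely to be stale state, and the original exception is kept as a suppressed cause for diagnosis.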

*Verification*
 * HDFS audit pipeline E2E: plugin → ingestor → Kafka → Solr dispatcher → Solr 
→ Admin API
 * Solr {{numFound}} increases for {{reqUser:testuser1}}
 * Dispatcher logs show {{Successful login for rangerauditserver/...}} without 
repeated {{No key to store}}
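
The log criterion in the last bullet can be checked mechanically. This is a minimal sketch with a hypothetical {{DispatcherLogCheck}} class; it assumes the two marker strings quoted in this issue appear verbatim in the dispatcher log and treats the log as healthy when a successful login is not followed by any further “No key to store” line.

```java
import java.util.List;

/** Hypothetical log check based on the two messages quoted in this issue. */
final class DispatcherLogCheck {

    static final String FAILURE_MARKER = "No key to store";
    static final String SUCCESS_MARKER = "Successful login for rangerauditserver/";

    /** True if a successful login was seen and no failure was logged after it. */
    static boolean reloginLooksHealthy(List<String> logLines) {
        boolean sawSuccess = false;
        boolean failedSinceLastSuccess = false;
        for (String line : logLines) {
            if (line.contains(SUCCESS_MARKER)) {
                sawSuccess = true;
                failedSinceLastSuccess = false;
            } else if (line.contains(FAILURE_MARKER)) {
                failedSinceLastSuccess = true;
            }
        }
        return sawSuccess && !failedSinceLastSuccess;
    }
}
```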

*Notes*
 * Not specific to dynamic Kafka partition plan; reproduces on master with 
standard Solr dispatcher + Kerberos.
 * {{AuditServerConstants.JAAS_USER_TICKET_CACHE}} already documents 
{{useTicketCache=false}} for some Kafka paths; Solr dispatcher template is 
inconsistent.

h3. How we proved the fix
 * Set {{useTicketCache=false}} → login always from keytab.
 * Harden {{AbstractKerberosUser}} relogin → recreate subject if relogin fails.
 * Restart Solr dispatcher before pipeline checks in E2E.

After that: Solr {{63 → 64}}, Admin shows {{testuser1}}, logs show
{{Successful login for rangerauditserver/...}}.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
