[ 
https://issues.apache.org/jira/browse/RANGER-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramachandran Krishnan updated RANGER-5637:
------------------------------------------
    Description: 
h2. 1. ozone-om — Java 17 mismatch (primary)

Evidence from CI:
{noformat}
UnsupportedClassVersionError: XmlConfigChanger ... class file version 61.0 ... only recognizes up to 55.0
UnsupportedClassVersionError: RangerOzoneAuthorizer ... class file version 61.0
STARTUP_MSG: java = 11.0.19
{noformat}

Ranger 3.0 is built with Java 17. The Ozone stack uses 
{{apache/ozone-runner:20230615-1}}, which ships Java 11.
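
The version numbers in the CI evidence map directly onto Java releases: for Java 5 and later, the class file major version is the Java feature release plus 44. So 61.0 means the classes were compiled for Java 17, and 55.0 is the newest format a Java 11 runtime accepts. A minimal sketch of that arithmetic follows; the function name is hypothetical and not part of the Ranger scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the Ranger repo: convert a class file
# major version (as printed in UnsupportedClassVersionError) to the Java
# feature release it corresponds to. Valid for Java 5 and later (major >= 49).
class_major_to_java() {
  local major="${1%%.*}"   # drop the minor part, e.g. "61.0" -> "61"
  if [ "$major" -lt 49 ]; then
    echo "unsupported class file version: $1" >&2
    return 1
  fi
  echo $(( major - 44 ))
}

class_major_to_java 61.0   # class files produced by the Ranger 3.0 build
class_major_to_java 55.0   # newest version the Java 11 runtime recognizes
```

The two calls print 17 and 11, which matches the build JDK and the {{STARTUP_MSG}} runtime reported above.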

The setup script hardcodes Java 11 for plugin enable:

{{ranger-ozone-setup.sh}}, lines 29-30:
{code:bash}
echo "export JAVA_HOME=${JAVA_HOME}" >> conf/ozone-env.sh
sudo JAVA_HOME=/usr/lib/jvm/jre/ ./enable-ozone-plugin.sh
{code}
 
Even if XmlConfigChanger were fixed, OM would still fail loading 
{{RangerOzoneAuthorizer}} at runtime on Java 11.

 

Changes:
 # {{dev-support/ranger-docker/.env}} — set 
{{OZONE_RUNNER_VERSION=20241022-jdk17-1}}
 # {{dev-support/ranger-docker/scripts/ozone/ranger-ozone-setup.sh}} — use 
{{${JAVA_HOME}}} instead of {{/usr/lib/jvm/jre/}}

Risk: Ozone 1.4.x officially targets JDK 11 for some CLI paths 
([HDDS-12153|https://github.com/apache/ozone-docker/pull/39]), but services run 
fine on JDK 17, which is what CI needs.
h2. 2. ozone-om — SCM startup ordering (secondary)

Evidence: {{{}Connection refused: scm:9863{}}}, then 
{{ServerNotLeaderException}} during OM {{{}--init{}}}. OM init did succeed 
once; SCM caught up shortly after.

In {{docker-compose.ranger-ozone.yml}} (lines 37-50), {{om}} has no dependency 
on {{scm}} or {{datanode}}:
{code:yaml}
depends_on:
  ranger:
    condition: service_started
  ranger-solr:
    condition: service_started
...
command: bash -c "/opt/hadoop/ranger-ozone-plugin/ranger-ozone-setup.sh && /opt/hadoop/bin/ozone om"
{code}

All three (scm, datanode, om) start in parallel.
h3. Fix
{code:yaml}
om:
  depends_on:
    scm:
      condition: service_started
    datanode:
      condition: service_started
    ranger:
      condition: service_started
    ranger-solr:
      condition: service_started
{code}

Optionally add a short {{wait-for-scm.sh}} (poll {{{}scm:9860{}}}) before 
{{ozone om}} for extra stability. This is a flake reducer, not the root cause — 
Java mismatch is what actually killed the container.
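
A {{wait-for-scm.sh}} along these lines could look like the sketch below. It is an illustration only: the function names, attempt count, and delay are assumptions, and the port probe relies on bash's {{/dev/tcp}} redirection, so it assumes bash is available in the runner image.

```shell
#!/usr/bin/env bash
# Hypothetical wait-for-scm.sh sketch; names and defaults are assumptions.

# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds; gives up after MAX_ATTEMPTS tries.
retry_until() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max" ]; then
      echo "gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# port_open HOST PORT
# Succeeds if a TCP connection can be opened (bash-only /dev/tcp feature).
port_open() {
  bash -c ': </dev/tcp/"$0"/"$1"' "$1" "$2" 2>/dev/null
}
```

The compose command could then call {{retry_until 30 2 port_open scm 9860}} before {{/opt/hadoop/bin/ozone om}}. Because {{service_started}} only waits for the container to start, not for SCM to accept connections, a poll like this covers the gap that {{depends_on}} leaves open.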
h2. 3. ranger-knox — Gateway failed to start

Evidence from CI: Plugin enable completed successfully (audit XML, topology 
updates, cred.jceks). LDAP started. Then:
{noformat}
Starting Gateway failed.
The Knox Gateway process probably exited, no process id found!
{noformat}
So this is not an XmlConfigChanger or plugin-enable failure. The gateway JVM exits 
immediately; gateway.log is not printed in CI, which makes diagnosis harder.
h3. Most likely cause: incomplete Knox plugin packaging after RANGER-5632

[RANGER-5632|https://github.com/apache/ranger/pull/999] removed Solr/HDFS audit 
destinations from plugin tarballs, leaving auditserver only. Docker install 
props enable auditserver:
{{ranger-knox-plugin-install.properties}}, lines 35-37:
{code}
XAAUDIT.AUDITSERVER.ENABLE=true
XAAUDIT.AUDITSERVER.URL=http://ranger-audit-ingestor.rangernw:7081
{code}
{{ranger-audit-dest-auditserver}} depends on Jersey 2 + HK2 (which needs 
{{javax.inject}}). Compare assembly whitelists:
||Dependency||{{plugin-kafka.xml}} (passes CI)||{{knox-agent.xml}} (fails)||
|{{jersey-media-json-jackson}}|✅|❌|
|{{jersey-entity-filtering}}|✅|❌|
|{{jackson-jaxrs-json-provider}}|✅|❌|
|{{javax.inject}}|❌|❌|
|{{httpasyncclient}} / {{httpcore-nio}}|✅|✅|

Knox’s whitelist is much thinner than Kafka’s. Before RANGER-5632, Knox audits 
went through Solr ({{solrj}} is in the Knox whitelist). After switching to 
auditserver-only, the Jersey client stack may be missing from 
{{lib/ranger-knox-plugin-impl/}}, causing a gateway classpath failure at 
startup (the same class of bug as PDP’s {{javax.inject.Singleton}} issue).
h3. Fixes for Knox

A. Packaging (likely root fix) — align 
{{distro/src/main/assembly/knox-agent.xml}} with Kafka/Ozone:
 * Add {{javax.inject:javax.inject}}
 * Add {{org.glassfish.jersey.core:jersey-client}}
 * Add {{org.glassfish.jersey.inject:jersey-hk2}}
 * Add {{org.glassfish.jersey.media:jersey-media-json-jackson}}
 * Add {{com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider}}
 * Add HK2 deps if needed (as in {{{}pdp.xml{}}})
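
Whether the packaging fix took effect can be checked against the unpacked plugin before starting the gateway. The sketch below is a hypothetical helper, not part of the Ranger build; it assumes jars follow the usual {{<artifactId>-<version>.jar}} naming.

```shell
#!/usr/bin/env bash
# Hypothetical check, not part of the Ranger build: print every artifact ID
# that has no "<artifactId>-<version>.jar" in the given lib directory.
missing_jars() {
  local dir="$1" name jar found
  shift
  for name in "$@"; do
    found=no
    for jar in "$dir/$name"-*.jar; do
      # An unmatched glob stays literal, so confirm the file really exists.
      if [ -e "$jar" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      echo "$name"
    fi
  done
}
```

For example, {{missing_jars lib/ranger-knox-plugin-impl javax.inject jersey-client jersey-hk2 jersey-media-json-jackson jackson-jaxrs-json-provider}} prints whichever of the five artifacts is absent, and prints nothing when all are present.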

B. Diagnostics — improve {{ranger-knox.sh}} so CI captures the real error:
{code:bash}
if [ -z "$KNOX_GATEWAY_PID" ]; then
  echo "Gateway logs:"
  tail -n 100 "${KNOX_HOME}/logs/gateway.log" 2>/dev/null || true
  tail -n 100 "${KNOX_HOME}/logs/gateway-${HOSTNAME}.log" 2>/dev/null || true
fi
{code}
C. Compose ordering (optional) — Knox {{depends_on}} only {{ranger}} + 
{{{}ranger-zk{}}}, but sandbox topology references {{{}ranger-hadoop{}}}, 
{{{}ranger-hive{}}}, {{{}ranger-hbase{}}}. Adding {{ranger-hadoop: 
service_healthy}} may help stability; it is unlikely to be the immediate 
gateway crash cause.

D. Audit ingestor not in {{plugins-docker-build}} — {{ranger-audit-ingestor}} 
is not started in that CI job. That should not block gateway startup (audits 
are async), but audits will not flow until ingestor is added to the compose 
stack or auditserver is disabled for the smoke test only.
h2. Recommended fix order
!image-2026-06-10-10-26-27-904.png!
 # Ozone Java 17 runner: highest confidence, small diff
 # Knox assembly deps: likely fixes gateway; mirrors PDP/Kafka pattern
 # Ozone compose ordering: reduces SCM flakes
 # Knox log dump in CI: confirms root cause if gateway still fails

 

  was:
[CI run 
27213972490|https://github.com/apache/ranger/actions/runs/27213972490/job/80356301472]
 ({{{}plugins-docker-build{}}} on PR #980) was cancelled at the 60-minute job 
limit:
 * GitHub Actions cache miss — all plugin distros downloaded cold.
 * {{download-archives.sh}} spent ~59 minutes pulling {{hadoop-3.4.2.tar.gz}} 
(~1 GB) at ~200 KB/s, reaching only 72% before timeout.
 * Subsequent steps (HBase, Hive, Tez, Kafka, Knox, Ozone) and Docker build 
never ran.
 * {{build-17}} and {{services-docker-build}} passed; only the plugin docker 
job was affected.

This is a CI infrastructure / mirror slowness issue, not a product code defect.

 


> Ranger CI: fix plugins-docker-build (download timeouts, Knox/Ozone smoke-test 
> failures)
> ---------------------------------------------------------------------------------------
>
>                 Key: RANGER-5637
>                 URL: https://issues.apache.org/jira/browse/RANGER-5637
>             Project: Ranger
>          Issue Type: Task
>          Components: Ranger
>            Reporter: Ramachandran Krishnan
>            Assignee: Ramachandran Krishnan
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: image-2026-06-10-10-26-27-904.png
>
>


