[
https://issues.apache.org/jira/browse/RANGER-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ramachandran Krishnan updated RANGER-5637:
------------------------------------------
Description:
h2. 1. ozone-om — Java 17 mismatch (primary)
Evidence from CI:
UnsupportedClassVersionError: XmlConfigChanger ... class file version 61.0 ...
only recognizes up to 55.0
UnsupportedClassVersionError: RangerOzoneAuthorizer ... class file version 61.0
STARTUP_MSG: java = 11.0.19
Ranger 3.0 is built with Java 17. The Ozone stack uses
{{{}apache/ozone-runner:20230615-1{}}}, which ships Java 11.
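The version numbers in the error map directly to JDK releases: class file major version 55 is Java 11 and 61 is Java 17. As a small illustration, the helper below (a hypothetical name, not part of the Ranger scripts) converts a class file major version to its Java release using the fixed offset of 44, which holds for Java 5 (major 49) and later.

```shell
# Hypothetical helper: map a class file major version to the Java release
# that produced it (major version = Java release + 44).
class_major_to_java() {
  echo $(( $1 - 44 ))
}

class_major_to_java 55   # highest version the Java 11 runtime accepts -> 11
class_major_to_java 61   # version of the Ranger 3.0 classes -> 17
```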
The setup script hardcodes Java 11 for plugin enable:
{{ranger-ozone-setup.sh}}, lines 29-30:
{code:java}
echo "export JAVA_HOME=${JAVA_HOME}" >> conf/ozone-env.sh
sudo JAVA_HOME=/usr/lib/jvm/jre/ ./enable-ozone-plugin.sh
{code}
Even if XmlConfigChanger were fixed, OM would still fail loading
{{RangerOzoneAuthorizer}} at runtime on Java 11.
Changes:
# {{dev-support/ranger-docker/.env}} — set
{{OZONE_RUNNER_VERSION=20241022-jdk17-1}}
# {{dev-support/ranger-docker/scripts/ozone/ranger-ozone-setup.sh}} — use
{{${JAVA_HOME}}} instead of {{/usr/lib/jvm/jre/}}
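To surface a mismatch like this before plugin enable runs, the setup script could check the runtime version up front. The following is only a sketch: {{java_major}} and {{java_at_least}} are hypothetical helpers that do not exist in {{ranger-ozone-setup.sh}}, and the version string is passed in rather than read from a real JVM.

```shell
# Hypothetical guard, not part of ranger-ozone-setup.sh.
# Extract the major release from a version string such as "11.0.19" or "1.8.0_292".
java_major() {
  version=$1
  case "$version" in
    1.*) version=${version#1.} ;;   # legacy scheme: 1.8.0_292 -> 8.0_292
  esac
  printf '%s\n' "${version%%[!0-9]*}"   # keep only the leading digits
}

# Succeed only if the given version is at least the required release.
java_at_least() {
  [ "$(java_major "$1")" -ge "$2" ]
}

# Example with the version reported in the CI log.
if ! java_at_least "11.0.19" 17; then
  echo "Java 17 or newer is required to load the Ranger 3.0 plugin classes" >&2
fi
```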
Risk: Ozone 1.4.x officially targets JDK 11 for some CLI paths
([HDDS-12153|https://github.com/apache/ozone-docker/pull/39]), but services run
fine on JDK 17, which is what CI needs.
h2. 2. ozone-om — SCM startup ordering (secondary)
Evidence: {{{}Connection refused: scm:9863{}}}, then
{{ServerNotLeaderException}} during OM {{{}--init{}}}. OM init did succeed
once; SCM caught up shortly after.
In {{{}docker-compose.ranger-ozone.yml{}}}, {{om}} has no dependency on {{scm}}
or {{{}datanode{}}}:
{{docker-compose.ranger-ozone.yml}}, lines 37-50:
{code:java}
depends_on:
  ranger:
    condition: service_started
  ranger-solr:
    condition: service_started
...
command: bash -c "/opt/hadoop/ranger-ozone-plugin/ranger-ozone-setup.sh && /opt/hadoop/bin/ozone om"
{code}
All three (scm, datanode, om) start in parallel.
h3. Fix
{code:java}
om:
  depends_on:
    scm:
      condition: service_started
    datanode:
      condition: service_started
    ranger:
      condition: service_started
    ranger-solr:
      condition: service_started
{code}
Optionally add a short {{wait-for-scm.sh}} (poll {{{}scm:9860{}}}) before
{{ozone om}} for extra stability. This is a flake reducer, not the root cause —
Java mismatch is what actually killed the container.
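A {{wait-for-scm.sh}} along these lines could be built on a generic retry loop. This is a minimal sketch under stated assumptions: {{wait_until}} is a hypothetical helper, and the probe shown in the comment assumes {{nc}} is available in the runner image, which has not been verified.

```shell
# Hypothetical retry helper: wait_until MAX_ATTEMPTS SLEEP_SECS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached.
wait_until() {
  max_attempts=$1
  sleep_secs=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0                      # probe succeeded
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_secs"           # pause before the next attempt
    fi
    attempt=$((attempt + 1))
  done
  return 1                          # gave up
}

# Possible use before starting OM (assumes nc exists in the image):
#   wait_until 30 2 nc -z scm 9860 && /opt/hadoop/bin/ozone om
```

Passing the probe as a command keeps the loop independent of any particular port-check tool.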
h2. 3. ranger-knox — Gateway failed to start
Evidence from CI: Plugin enable completed successfully (audit XML, topology
updates, cred.jceks). LDAP started. Then:
Starting Gateway failed.
The Knox Gateway process probably exited, no process id found!
So this is not XmlConfigChanger or plugin-enable failure. The gateway JVM exits
immediately; gateway.log is not printed in CI, which makes diagnosis harder.
h3. Most likely cause: incomplete Knox plugin packaging after RANGER-5632
[RANGER-5632|https://github.com/apache/ranger/pull/999] removed Solr/HDFS audit
destinations from plugin tarballs, leaving auditserver only. Docker install
props enable auditserver:
{{ranger-knox-plugin-install.properties}}, lines 35-37:
XAAUDIT.AUDITSERVER.ENABLE=true
XAAUDIT.AUDITSERVER.URL=http://ranger-audit-ingestor.rangernw:7081
{{ranger-audit-dest-auditserver}} depends on Jersey 2 + HK2 (which needs
{{{}javax.inject{}}}). Compare assembly whitelists:
||Dependency||{{plugin-kafka.xml}} (passes CI)||{{knox-agent.xml}} (fails)||
|{{jersey-media-json-jackson}}|✅|❌|
|{{jersey-entity-filtering}}|✅|❌|
|{{jackson-jaxrs-json-provider}}|✅|❌|
|{{javax.inject}}|❌|❌|
|{{httpasyncclient}} / {{httpcore-nio}}|✅|✅|
Knox’s whitelist is much thinner than Kafka’s. Before RANGER-5632, Knox audits
went through Solr ({{{}solrj{}}} is in the Knox whitelist). After switching to
auditserver-only, the Jersey client stack may be missing from
{{{}lib/ranger-knox-plugin-impl/{}}}, causing gateway classpath failure at
startup (same class of bug as PDP’s {{javax.inject.Singleton}} issue).
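Whether the jars are really absent can be checked against an unpacked plugin tarball. The sketch below is illustrative only: {{missing_jars}} is a hypothetical helper, and matching by file name prefix is an assumption about how the jars are named.

```shell
# Hypothetical check: print each jar name prefix that has no matching
# *.jar file in the given lib directory.
missing_jars() {
  lib_dir=$1
  shift
  for name in "$@"; do
    found=no
    for jar in "$lib_dir"/"$name"*.jar; do
      if [ -e "$jar" ]; then        # the glob matched a real file
        found=yes
        break
      fi
    done
    [ "$found" = yes ] || printf '%s\n' "$name"
  done
}

# Possible use against the Knox plugin directory:
#   missing_jars lib/ranger-knox-plugin-impl javax.inject jersey-client \
#     jersey-hk2 jersey-media-json-jackson jackson-jaxrs-json-provider
```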
h3. Fixes for Knox
A. Packaging (likely root fix) — align
{{distro/src/main/assembly/knox-agent.xml}} with Kafka/Ozone:
* Add {{javax.inject:javax.inject}}
* Add {{org.glassfish.jersey.core:jersey-client}}
* Add {{org.glassfish.jersey.inject:jersey-hk2}}
* Add {{org.glassfish.jersey.media:jersey-media-json-jackson}}
* Add {{com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider}}
* Add HK2 deps if needed (as in {{{}pdp.xml{}}})
B. Diagnostics — improve {{ranger-knox.sh}} so CI captures the real error:
if [ -z "$KNOX_GATEWAY_PID" ]; then
  echo "Gateway logs:"
  tail -n 100 "${KNOX_HOME}/logs/gateway.log" 2>/dev/null || true
  tail -n 100 "${KNOX_HOME}/logs/gateway-${HOSTNAME}.log" 2>/dev/null || true
fi
C. Compose ordering (optional) — Knox {{depends_on}} only {{ranger}} +
{{{}ranger-zk{}}}, but sandbox topology references {{{}ranger-hadoop{}}},
{{{}ranger-hive{}}}, {{{}ranger-hbase{}}}. Adding {{ranger-hadoop:
service_healthy}} may help stability; it is unlikely to be the immediate
gateway crash cause.
D. Audit ingestor not in {{plugins-docker-build}} — {{ranger-audit-ingestor}}
is not started in that CI job. That should not block gateway startup (audits
are async), but audits will not flow until ingestor is added to the compose
stack or auditserver is disabled for the smoke test only.
h2. Recommended fix order
!image-2026-06-10-10-26-27-904.png!
# Ozone Java 17 runner: highest confidence, small diff
# Knox assembly deps: likely fixes gateway; mirrors PDP/Kafka pattern
# Ozone compose ordering: reduces SCM flakes
# Knox log dump in CI: confirms root cause if gateway still fails
was:
[CI run
27213972490|https://github.com/apache/ranger/actions/runs/27213972490/job/80356301472]
({{{}plugins-docker-build{}}} on PR #980) was cancelled at the 60-minute job
limit:
* GitHub Actions cache miss — all plugin distros downloaded cold.
* {{download-archives.sh}} spent ~59 minutes pulling {{hadoop-3.4.2.tar.gz}}
(~1 GB) at ~200 KB/s, reaching only 72% before timeout.
* Subsequent steps (HBase, Hive, Tez, Kafka, Knox, Ozone) and Docker build
never ran.
* {{build-17}} and {{services-docker-build}} passed; only the plugin docker
job was affected.
This is a CI infrastructure / mirror slowness issue, not a product code defect.
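The timeout follows from simple arithmetic on the approximate figures above. As a rough sketch ({{download_minutes}} is a hypothetical helper, and 1 GB is taken as 1,000,000 KB), a full download at that rate needs about 83 minutes, which exceeds the 60-minute job limit and is broadly consistent with reaching roughly 70% after 59 minutes.

```shell
# Hypothetical estimate using the approximate size and rate quoted above.
download_minutes() {
  size_kb=$1
  rate_kb_per_s=$2
  echo $(( size_kb / rate_kb_per_s / 60 ))   # whole minutes, rounded down
}

download_minutes 1000000 200   # ~1 GB at ~200 KB/s -> 83
```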
> Ranger CI: fix plugins-docker-build (download timeouts, Knox/Ozone smoke-test
> failures)
> ---------------------------------------------------------------------------------------
>
> Key: RANGER-5637
> URL: https://issues.apache.org/jira/browse/RANGER-5637
> Project: Ranger
> Issue Type: Task
> Components: Ranger
> Reporter: Ramachandran Krishnan
> Assignee: Ramachandran Krishnan
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2026-06-10-10-26-27-904.png