[
https://issues.apache.org/jira/browse/RANGER-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ramachandran Krishnan resolved RANGER-5637.
-------------------------------------------
Resolution: Fixed
> Ranger CI: fix plugins-docker-build (download timeouts, Knox/Ozone smoke-test
> failures)
> ---------------------------------------------------------------------------------------
>
> Key: RANGER-5637
> URL: https://issues.apache.org/jira/browse/RANGER-5637
> Project: Ranger
> Issue Type: Task
> Components: Ranger
> Reporter: Ramachandran Krishnan
> Assignee: Ramachandran Krishnan
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2026-06-10-10-26-27-904.png
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h2. 1. ozone-om — Java 17 mismatch (primary)
> Evidence from CI:
> UnsupportedClassVersionError: XmlConfigChanger ... class file version 61.0
> ... only recognizes up to 55.0
> UnsupportedClassVersionError: RangerOzoneAuthorizer ... class file version
> 61.0
> STARTUP_MSG: java = 11.0.19
>
> Ranger 3.0 is built with Java 17. The Ozone stack uses
> {{apache/ozone-runner:20230615-1}}, which ships Java 11.
> The setup script hardcodes the Java 11 JRE path when enabling the plugin
> ({{ranger-ozone-setup.sh}}, lines 29-30):
>
>
> {code:bash}
> echo "export JAVA_HOME=${JAVA_HOME}" >> conf/ozone-env.sh
> sudo JAVA_HOME=/usr/lib/jvm/jre/ ./enable-ozone-plugin.sh
> {code}
>
> Even if XmlConfigChanger were fixed, OM would still fail loading
> {{RangerOzoneAuthorizer}} at runtime on Java 11.
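>
> The version numbers in the error map directly to Java releases: for Java 5
> and later, the class file major version is the release number plus 44, so
> 61.0 is Java 17 and 55.0 is Java 11. A minimal shell sketch of that
> arithmetic follows; the function names are hypothetical and are not part of
> the Ranger scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Ranger): relate class file major
# versions to Java releases.

# Class file major version -> Java release (holds for Java 5+, major >= 49).
class_major_to_java() {
  echo $(( $1 - 44 ))
}

# Succeeds if a runtime of the given Java release can load a class
# compiled to the given class file major version.
runtime_can_load() {
  local class_major="$1" runtime_release="$2"
  (( class_major <= runtime_release + 44 ))
}
```

> Under these definitions, {{class_major_to_java 61}} prints 17 and
> {{runtime_can_load 61 11}} fails, which matches the
> {{UnsupportedClassVersionError}} seen on the Java 11 runner.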
>
> Changes:
> # {{dev-support/ranger-docker/.env}} — set
> {{OZONE_RUNNER_VERSION=20241022-jdk17-1}}
> # {{dev-support/ranger-docker/scripts/ozone/ranger-ozone-setup.sh}} — use
> {{${JAVA_HOME}}} instead of {{/usr/lib/jvm/jre/}}
> Risk: Ozone 1.4.x officially targets JDK 11 for some CLI paths
> ([HDDS-12153|https://github.com/apache/ozone-docker/pull/39]), but services
> run fine on JDK 17, which is what CI needs.
> h2. 2. ozone-om — SCM startup ordering (secondary)
> Evidence: {{Connection refused: scm:9863}}, then
> {{ServerNotLeaderException}} during OM {{--init}}. OM init did succeed
> once; SCM caught up shortly after.
> In {{docker-compose.ranger-ozone.yml}}, {{om}} has no dependency on
> {{scm}} or {{datanode}}:
> {{docker-compose.ranger-ozone.yml}}, lines 37-50:
> {code:yaml}
> depends_on:
>   ranger:
>     condition: service_started
>   ranger-solr:
>     condition: service_started
> # ...
> command: bash -c "/opt/hadoop/ranger-ozone-plugin/ranger-ozone-setup.sh && /opt/hadoop/bin/ozone om"
> {code}
>
> All three (scm, datanode, om) start in parallel.
> h3. Fix
> {code:yaml}
> om:
>   depends_on:
>     scm:
>       condition: service_started
>     datanode:
>       condition: service_started
>     ranger:
>       condition: service_started
>     ranger-solr:
>       condition: service_started
> {code}
>
> Optionally add a short {{wait-for-scm.sh}} (poll {{scm:9860}}) before
> {{ozone om}} for extra stability. This is a flake reducer, not the root cause
> — the Java mismatch is what actually killed the container.
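>
> One possible shape for such a {{wait-for-scm.sh}} is sketched below. It is
> an assumption, not the script from the patch: the function name
> {{wait_for_port}} and the 60-second default are hypothetical, and it relies
> on bash's {{/dev/tcp}} redirection (the compose command already runs under
> {{bash -c}}).

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll host:port until a TCP connection succeeds
# or the timeout (in seconds) elapses.
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-60}"
  local deadline=$(( SECONDS + timeout_s ))
  while (( SECONDS < deadline )); do
    # bash-only /dev/tcp redirection; the subshell closes fd 3 on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

> In the {{om}} command this could be chained as
> {{wait_for_port scm 9860 120 && /opt/hadoop/bin/ozone om}}. An open port
> only shows that SCM is listening, not that it has finished leader election,
> so this narrows the race rather than removing it.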
> h2. 3. ranger-knox — Gateway failed to start
> Evidence from CI: Plugin enable completed successfully (audit XML, topology
> updates, cred.jceks). LDAP started. Then:
> Starting Gateway failed.
> The Knox Gateway process probably exited, no process id found!
> So this is not XmlConfigChanger or plugin-enable failure. The gateway JVM
> exits immediately; gateway.log is not printed in CI, which makes diagnosis
> harder.
> h3. Most likely cause: incomplete Knox plugin packaging after RANGER-5632
> [RANGER-5632|https://github.com/apache/ranger/pull/999] removed Solr/HDFS
> audit destinations from plugin tarballs, leaving auditserver only. Docker
> install props enable auditserver:
> {{ranger-knox-plugin-install.properties}}, lines 35-37:
> XAAUDIT.AUDITSERVER.ENABLE=true
> XAAUDIT.AUDITSERVER.URL=http://ranger-audit-ingestor.rangernw:7081
> {{ranger-audit-dest-auditserver}} depends on Jersey 2 + HK2 (which needs
> {{javax.inject}}). Compare assembly whitelists:
> ||Dependency||{{plugin-kafka.xml}} (passes CI)||{{knox-agent.xml}} (fails)||
> |{{jersey-media-json-jackson}}|✅|❌|
> |{{jersey-entity-filtering}}|✅|❌|
> |{{jackson-jaxrs-json-provider}}|✅|❌|
> |{{javax.inject}}|❌|❌|
> |{{httpasyncclient}} / {{httpcore-nio}}|✅|✅|
> Knox’s whitelist is much thinner than Kafka’s. Before RANGER-5632, Knox
> audits went through Solr ({{solrj}} is in the Knox whitelist). After
> switching to auditserver-only, the Jersey client stack may be missing from
> {{lib/ranger-knox-plugin-impl/}}, causing a gateway classpath failure at
> startup (the same class of bug as PDP’s {{javax.inject.Singleton}} issue).
> h3. Fixes for Knox
> A. Packaging (likely root fix) — align
> {{distro/src/main/assembly/knox-agent.xml}} with Kafka/Ozone:
> * Add {{javax.inject:javax.inject}}
> * Add {{org.glassfish.jersey.core:jersey-client}}
> * Add {{org.glassfish.jersey.inject:jersey-hk2}}
> * Add {{org.glassfish.jersey.media:jersey-media-json-jackson}}
> * Add {{com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider}}
> * Add HK2 deps if needed (as in {{pdp.xml}})
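>
> To confirm whether the packaged plugin actually lacks these jars, a check
> along the following lines could be run against the extracted tarball. This
> is a sketch under assumptions: {{missing_artifacts}} is a hypothetical
> helper, and it assumes jars follow the usual Maven
> {{artifactId-version.jar}} naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each artifactId that has no matching
# "<artifactId>-<version>.jar" in the given directory.
missing_artifacts() {
  local dir="$1"; shift
  local artifact
  for artifact in "$@"; do
    # compgen -G succeeds only if the glob matches at least one file.
    if ! compgen -G "${dir}/${artifact}-[0-9]*.jar" > /dev/null; then
      echo "${artifact}"
    fi
  done
}
```

> For example, {{missing_artifacts lib/ranger-knox-plugin-impl javax.inject
> jersey-client jersey-hk2 jersey-media-json-jackson
> jackson-jaxrs-json-provider}} prints one line per absent artifact and
> prints nothing when all are present.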
> B. Diagnostics — improve {{ranger-knox.sh}} so CI captures the real error:
> if [ -z "$KNOX_GATEWAY_PID" ]; then
>   echo "Gateway logs:"
>   tail -n 100 "${KNOX_HOME}/logs/gateway.log" 2>/dev/null || true
>   tail -n 100 "${KNOX_HOME}/logs/gateway-${HOSTNAME}.log" 2>/dev/null || true
> fi
> C. Compose ordering (optional) — Knox's {{depends_on}} lists only {{ranger}}
> and {{ranger-zk}}, but the sandbox topology references {{ranger-hadoop}},
> {{ranger-hive}}, and {{ranger-hbase}}. Adding
> {{ranger-hadoop: service_healthy}} may help stability; it is unlikely to be
> the immediate cause of the gateway crash.
> D. Audit ingestor not in {{plugins-docker-build}} — {{ranger-audit-ingestor}}
> is not started in that CI job. That should not block gateway startup (audits
> are async), but audits will not flow until ingestor is added to the compose
> stack or auditserver is disabled for the smoke test only.
> h2. Recommended fix order
> !image-2026-06-10-10-26-27-904.png!
> # Ozone Java 17 runner — highest confidence, small diff
> # Knox assembly deps — likely fixes gateway; mirrors PDP/Kafka pattern
> # Ozone compose ordering — reduces SCM flakes
> # Knox log dump in CI — confirms root cause if gateway still fails
>