[
https://issues.apache.org/jira/browse/RANGER-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ramachandran Krishnan updated RANGER-5655:
------------------------------------------
Issue Type: Improvement (was: Task)
> Implement dynamic unified ingestor registry for audit-ingestor: runtime Kafka
> partition routing and per-repo service allowlists via compacted topic + REST,
> without ingestor restarts. Feature flag default off.
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: RANGER-5655
> URL: https://issues.apache.org/jira/browse/RANGER-5655
> Project: Ranger
> Issue Type: Improvement
> Components: Ranger
> Reporter: Ramachandran Krishnan
> Assignee: Ramachandran Krishnan
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: Dynamic Ingestor Registry Guide (Ranger Audit
> Ingestor).pdf
>
>
> Implement a *dynamic unified ingestor registry* for Ranger audit-ingestor so
> operators can change Kafka partition routing and per-repo service allowlists
> at runtime — without restarting ingestor pods.
> The registry is stored in a compacted Kafka topic
> ({{ranger_audit_partition_plan}}) and managed via REST
> ({{/api/audit/partition-plan}}). All ingestor replicas converge on the
> same versioned plan through a background watcher; {{AuditPartitioner}} routes
> audit records on the hot path from in-memory state only.
> Feature flag (default off):
> {{ranger.audit.ingestor.kafka.partition.plan.dynamic.enabled=false}}
>
> ----
> h2. Problem
> Today (static mode), audit-ingestor loads two kinds of configuration from XML
> at startup only:
> ||Job||Question||Static behavior||
> |Service allowlist|May this Kerberos principal POST audits for repo
> {*}R{*}?|{{ranger.audit.ingestor.service.<repo>.allowed.users}} in site XML|
> |Partition routing|After accept, which {{ranger_audits}}
> partition?|{{kafka.configured.plugins}} + per-plugin overrides in site XML|
> Changing either requires editing XML and {*}restarting every ingestor
> replica{*}. Contiguous-range static allocation can also reshuffle later
> plugins when an early plugin's partition count changes.
> *Goals:*
> * Onboard new plugins/repos and scale hot plugins without ingestor restart
> * Append-only partition growth (no reshuffle of existing plugin assignments)
> * One shared source of truth across all ingestor pods
> * No new infra (no Postgres / ZooKeeper for the registry)
> ----
> h2. Solution
> Introduce a *unified partition plan document* (versioned JSON) in Kafka topic
> {{ranger_audit_partition_plan}} (1 partition, compacted). One document holds:
> * {{plugins}} — dedicated partition IDs per plugin id (Kafka record key /
> agent id)
> * {{buffer}} — partition pool for not-yet-promoted plugins (sticky hash)
> * {{services}} — per-repo {{allowedUsers}} for {{POST /api/audit/access}}
> * {{topicPartitionCount}} — must match live {{ranger_audits}} partition count
> * {{version}} — optimistic locking for REST mutations
> *Control plane:* REST + registry topic + {{PartitionPlanWatcher}}
> *Data plane:* plugins POST {{/api/audit/access}} → allowlist check →
> {{AuditPartitioner}} → {{ranger_audits}}
> Solr/HDFS dispatchers are unchanged; they consume all partitions of
> {{ranger_audits}}.
> ----
> h2. Key deliverables
> * *Kafka registry*: {{KafkaPartitionPlanRegistry}}, bootstrap from XML
> on greenfield, brownfield pre-seed support
> * *REST API*: {{GET/PATCH /partition-plan}}, {{POST
> /partition-plan/plugins}}, {{POST /partition-plan/services}}, {{PATCH
> /partition-plan/plugins/{pluginId}}} with {{expectedVersion}} (409 on stale)
> * *Dynamic partitioner*: {{AuditPartitioner}} reads
> {{PartitionPlanHolder}}; round-robin for promoted plugins; buffer sticky
> hash for unknown plugins; post-scale routing uses the {{max(cluster, plan)}}
> partition count when Kafka metadata lags the plan
> * *Unified allowlist*: {{services}} map in the same registry; {{auth_to_local}}
> rules recomposed from allowlists in dynamic mode
> * *Topic grow*: grow {{ranger_audits}} before the registry write when
> promoting/scaling
> * *E2E harness*: Docker Tier 3 scripts for partition-plan REST, plugin
> onboard/routing, auth_to_local, full plugin→Solr pipelines (HDFS, Ozone,
> Hive, etc.)
> * *Documentation*: implementation guide, ops runbook, brownfield migration,
> E2E test plan
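> The two routing rules named for the dynamic partitioner can be sketched as
> follows. This is an illustration under assumptions, not the
> {{AuditPartitioner}} code: the {{PlanRouter}} class is hypothetical, CRC32
> stands in for whatever hash the ingestor actually uses, the buffer is assumed
> to be non-empty, and the sketch is not thread-safe.

```python
import itertools
import zlib


class PlanRouter:
    """Hypothetical router: round-robin for promoted plugins, sticky hash otherwise."""

    def __init__(self, plugins, buffer):
        # plugins: {plugin_id: [partition ids]}, buffer: [partition ids]
        self._buffer = list(buffer)
        # One endless round-robin cursor per promoted plugin.
        self._cursors = {pid: itertools.cycle(list(parts)) for pid, parts in plugins.items()}

    def route(self, plugin_id):
        cursor = self._cursors.get(plugin_id)
        if cursor is not None:
            # Promoted plugin: spread records over its dedicated partitions.
            return next(cursor)
        # Unknown plugin: a stable hash keeps it on one buffer partition.
        index = zlib.crc32(plugin_id.encode("utf-8")) % len(self._buffer)
        return self._buffer[index]


router = PlanRouter({"hdfs": [0, 1, 2]}, buffer=[3, 4, 5])
print([router.route("hdfs") for _ in range(4)])  # prints [0, 1, 2, 0]
print(router.route("storm") == router.route("storm"))  # prints True
```

> Because the hash depends only on the plugin id, a not-yet-promoted plugin
> keeps landing on the same buffer partition until the plan changes.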
> ----
> h2. Configuration
> {code:xml}
> <property>
> <name>ranger.audit.ingestor.kafka.partition.plan.dynamic.enabled</name>
> <value>true</value>
> </property>
> <property>
> <name>ranger.audit.ingestor.kafka.partition.plan.topic</name>
> <value>ranger_audit_partition_plan</value>
> </property>
> {code}
> When dynamic mode is on and the registry is empty, the first ingestor pod
> bootstraps an initial plan from existing XML properties
> ({{kafka.configured.plugins}}, buffer/per-plugin counts, service
> allowlists).
> ----
> h2. Example plan JSON
> {code:json}
> {
> "topic": "ranger_audits",
> "version": 12,
> "topicPartitionCount": 48,
> "plugins": {
> "hdfs": { "partitions": [0, 1, 2, 3, 4, 5] },
> "hiveServer2": { "partitions": [6, 7, 8, 9, 10, 11] }
> },
> "buffer": { "partitions": [12, 13, "..."] },
> "services": {
> "dev_hive": { "allowedUsers": ["hive"] },
> "dev_ozone": { "allowedUsers": ["om", "ozone"] },
> "dev_hdfs": { "allowedUsers": ["hdfs", "nn"] }
> }
> }
> {code}
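> The invariants such a document has to satisfy can be checked mechanically.
> The sketch below is an illustration only, not the ingestor's validator:
> {{validate_plan}} and the sample plan are hypothetical, and it assumes every
> partition ID must fall in {{0..topicPartitionCount-1}} and belong to at most
> one owner (a plugin or the buffer).

```python
from __future__ import annotations

from typing import Any


def validate_plan(plan: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the plan passed."""
    errors: list[str] = []
    count = plan["topicPartitionCount"]
    # Collect every partition list under the name of its owner.
    groups = {pid: spec["partitions"] for pid, spec in plan.get("plugins", {}).items()}
    groups["<buffer>"] = plan.get("buffer", {}).get("partitions", [])

    owners: dict[int, str] = {}
    for owner, partitions in groups.items():
        for p in partitions:
            if not isinstance(p, int) or not 0 <= p < count:
                errors.append(f"{owner}: partition {p!r} is outside 0..{count - 1}")
            elif p in owners:
                errors.append(f"{owner}: partition {p} is already assigned to {owners[p]}")
            else:
                owners[p] = owner
    return errors


# Hypothetical, deliberately small plan (not the example above).
sample = {
    "topicPartitionCount": 6,
    "plugins": {"hdfs": {"partitions": [0, 1]}, "hiveServer2": {"partitions": [2, 3]}},
    "buffer": {"partitions": [4, 5]},
}
print(validate_plan(sample))  # prints []
```

> A REST handler could run a check of this kind before writing a new version
> and answer 400 when the list is non-empty.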
> ----
> h2. Testing
> h3. Validated scenarios
> h4. 1. Static vs dynamic mode
> What it proves: The feature flag works and existing deployments are not
> broken.
> ||Mode||Config||Expected behavior||
> |Static (default)|{{kafka.partition.plan.dynamic.enabled=false}}|Routing and
> allowlists come from XML at startup only. {{GET /api/audit/partition-plan}}
> returns 503. Ingestor health is 200. HDFS/plugin audits still flow to Solr.|
> |Dynamic|{{dynamic.enabled=true}}|Ingestor starts
> {{PartitionPlanWatcher}}, reads the plan from Kafka topic
> {{ranger_audit_partition_plan}}, and routes audits from the in-memory plan.
> {{GET /partition-plan}} returns JSON.|
>
> ----
> h4. 2. Greenfield bootstrap
> What it proves: On a new cluster with an empty plan topic, the first ingestor
> pod creates version 1 of the partition plan automatically — no manual REST
> call.
> Flow:
> # Enable dynamic mode; restart ingestor.
> # Plan topic {{ranger_audit_partition_plan}} is created (1 partition,
> compacted).
> # Ingestor publishes v1 from existing XML ({{kafka.configured.plugins}},
> buffer size, per-plugin overrides, service allowlists).
> # Plan’s {{topicPartitionCount}} matches live {{ranger_audits}} partition
> count (e.g. 48 = 13 plugins × 3 + 9 buffer on full lab layout).
> Success: {{GET /partition-plan}} shows v1, correct plugin list, partition
> count aligned with Kafka. Audits still work after enable.
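> The lab-layout arithmetic in the flow above (13 plugins × 3 + 9 buffer = 48)
> can be reproduced with a small sketch. It assumes the bootstrap hands out
> contiguous ranges in plugin order and places the buffer last;
> {{bootstrap_layout}} and the plugin names are hypothetical.

```python
def bootstrap_layout(plugin_ids, per_plugin, buffer_size):
    """Build a version-1 plan: contiguous ranges per plugin, buffer at the tail."""
    plan = {"version": 1, "plugins": {}, "buffer": {"partitions": []}}
    next_id = 0
    for plugin_id in plugin_ids:
        plan["plugins"][plugin_id] = {
            "partitions": list(range(next_id, next_id + per_plugin))
        }
        next_id += per_plugin
    plan["buffer"]["partitions"] = list(range(next_id, next_id + buffer_size))
    # The plan's count must equal the live ranger_audits partition count.
    plan["topicPartitionCount"] = next_id + buffer_size
    return plan


# Placeholder plugin ids; a real deployment would take them from XML.
layout = bootstrap_layout([f"plugin{i}" for i in range(13)], per_plugin=3, buffer_size=9)
print(layout["topicPartitionCount"])  # prints 48
```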
> ----
> h4. 3. REST promote / scale (200 / 409 / 400)
> What it proves: Operators can change routing at runtime via REST without
> restarting ingestor. The API rejects bad requests.
> ||Action||REST||Success||Failure cases||
> |Promote new plugin (e.g. {{storm}}) from buffer → dedicated
> partitions|{{POST /partition-plan/plugins}}|200, {{version}} increments,
> plugin in {{plugins}} map|400 if plugin already promoted (e.g. {{hdfs}}
> twice)|
> |Scale hot plugin (+N partitions)|{{PATCH
> /partition-plan/plugins/{pluginId}}}|200, new partition IDs appended at
> tail; {{ranger_audits}} grown first if needed|400 if scaling a buffer-only
> plugin|
> |Stale edit|Any mutation with wrong {{expectedVersion}}|—|409 + current plan
> body (forces refresh and retry)|
> |Invalid plan|Overlapping partitions, reshuffle existing IDs,
> {{partitionCount: 0}}|—|400|
> Key property: Changes are append-only — existing plugin partition lists are
> never reshuffled.
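> The version check and the append-only rule can be modelled in memory as
> below. This is a sketch, not the REST handler: {{scale_plugin}} and both
> exception classes are hypothetical names, and it assumes the topic has no
> spare partitions, so every scale grows {{topicPartitionCount}}.

```python
import copy


class StaleVersion(Exception):
    """Maps to HTTP 409: the caller edited an outdated plan."""


class InvalidRequest(Exception):
    """Maps to HTTP 400: the mutation is not allowed."""


def scale_plugin(plan, plugin_id, add_count, expected_version):
    """Return a new plan with add_count partitions appended to plugin_id."""
    if expected_version != plan["version"]:
        raise StaleVersion(f"expected {expected_version}, current {plan['version']}")
    if plugin_id not in plan["plugins"]:
        raise InvalidRequest(f"{plugin_id} is buffer-only; promote it first")
    if add_count <= 0:
        raise InvalidRequest("add_count must be positive")

    # Assumption: no spare partitions exist, so the topic grows by add_count
    # and the new IDs are the ones just past the old tail.
    start = plan["topicPartitionCount"]
    updated = copy.deepcopy(plan)
    updated["plugins"][plugin_id]["partitions"].extend(range(start, start + add_count))
    updated["topicPartitionCount"] = start + add_count
    updated["version"] = plan["version"] + 1
    return updated


# Hypothetical starting plan.
plan = {
    "version": 12,
    "topicPartitionCount": 6,
    "plugins": {"hdfs": {"partitions": [0, 1]}, "hiveServer2": {"partitions": [2, 3]}},
    "buffer": {"partitions": [4, 5]},
}
scaled = scale_plugin(plan, "hdfs", 2, expected_version=12)
print(scaled["plugins"]["hdfs"]["partitions"])  # prints [0, 1, 6, 7]
```

> Existing IDs keep their position and only the tail grows, so other plugins'
> assignments are untouched. A client that receives the 409 case would re-read
> the plan and retry with the current {{version}}.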
> ----
> h4. 4. Multi-pod convergence
> What it proves: In a real cluster with multiple ingestor replicas behind a
> load balancer, all pods see the same plan after an admin change.
> Flow:
> # Primary ingestor on :7081, second replica on :7082 (same Kafka plan topic,
> different Kerberos identity).
> # Both pods report the same {{version}} at startup.
> # Admin promotes a plugin on primary only.
> # Within one watcher cycle (~30s), replica on :7082 shows the same new
> version without any REST call on the replica.
> Success: No drift between pods; routing is consistent cluster-wide.
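> A toy model of this convergence is sketched below. It assumes each replica
> adopts the newest plan it reads only when the version is higher than the one
> it holds; {{PlanHolder}}, {{watcher_cycle}}, and the list standing in for the
> plan topic are hypothetical and involve no Kafka client.

```python
class PlanHolder:
    """In-memory plan of one replica; only moves forward in version."""

    def __init__(self):
        self.current = None

    def offer(self, plan):
        """Adopt plan if it is newer than the held one; report whether it was adopted."""
        if self.current is None or plan["version"] > self.current["version"]:
            self.current = plan
            return True
        return False


def watcher_cycle(holder, registry_log):
    """One background poll: read the latest record and offer it to the holder."""
    if registry_log:
        holder.offer(registry_log[-1])


registry_log = [{"version": 1}]        # stand-in for the compacted plan topic
primary, replica = PlanHolder(), PlanHolder()
for pod in (primary, replica):
    watcher_cycle(pod, registry_log)   # both pods start on version 1

registry_log.append({"version": 2})    # admin promotes a plugin via the primary
primary.offer(registry_log[-1])        # primary applies its own write at once
before = replica.current["version"]    # replica has not polled yet
watcher_cycle(replica, registry_log)   # next watcher cycle on the replica
print(before, replica.current["version"])  # prints 1 2
```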
> ----
> h4. 5. Brownfield pre-seed
> What it proves: Existing production clusters can cut over to dynamic mode
> safely by writing the plan into Kafka before enabling the feature — ingestor
> must not overwrite it with a fresh XML bootstrap.
> Flow (Path A migration):
> # Capture the current plan JSON while dynamic mode is briefly on.
> # Turn dynamic mode off; delete the plan topic (simulate “still on static”).
> # Operator pre-seeds the plan to Kafka with {{version=1}} and marker
> {{updatedBy=brownfield-e2e-seed}}.
> # Enable dynamic mode; restart ingestor.
> # Ingestor adopts the pre-seeded plan rather than auto-bootstrapping from XML.
> # Rollback: turn dynamic mode off → partition-plan API 503, health 200, audits
> still OK in static mode.
> ----
> h4. 6. Plugin onboard + Kafka partition routing
> What it proves: A real plugin can be onboarded via REST and its audit events
> land on the correct Kafka partitions defined in the plan.
> Flow:
> # Enable dynamic mode (greenfield buffer-only layout).
> # For each running plugin container (HDFS, Ozone, Hive, etc.): {{POST
> /partition-plan/services}} with {{serviceName}}, {{pluginId}},
> {{partitionCount}}, {{allowedUsers}}.
> # Plugin authenticates with Kerberos and sends {{POST /api/audit/access}}.
> # Read Kafka record for that event → partition number must be in the
> plugin’s assigned list in the plan (not buffer, after promote).
> ----
> h4. 7. auth_to_local recomposition
> What it proves: The unified registry {{services}} map controls who may POST
> audits, and Kerberos principals are mapped to short usernames correctly.
> Flow (per plugin repo):
> # Plugin calls {{POST /access}} with Kerberos principal (e.g.
> {{hdfs/[email protected]}}) → ingestor maps via
> {{auth_to_local}} → short name {{hdfs}} → 200 if in
> {{services[dev_hdfs].allowedUsers}}.
> # Remove user from allowlist via {{PATCH /partition-plan}} (services delta)
> → same principal gets 403.
> # Re-add allowlist → 200 again.
> # Cross-repo denial: HDFS principal posting to {{dev_kms}} repo → 403.
> Why it matters: In dynamic mode, allowlists live in the registry and change
> without an XML edit or restart, so {{auth_to_local}} rules must stay in sync
> with {{services[].allowedUsers}}.
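> The allow/deny decisions in the flow above can be sketched as follows. This
> is a simplification: real {{auth_to_local}} rules are pattern-based, whereas
> the hypothetical {{short_name}} below just keeps the first component of the
> principal. The host, realm, and the {{dev_kms}} entry are made-up values.

```python
def short_name(principal):
    """Simplified mapping: 'user/host@REALM' -> 'user' (not real auth_to_local)."""
    return principal.split("@", 1)[0].split("/", 1)[0]


def authorize(services, repo, principal):
    """Return the HTTP status the allowlist check would produce."""
    allowed = services.get(repo, {}).get("allowedUsers", [])
    return 200 if short_name(principal) in allowed else 403


services = {
    "dev_hdfs": {"allowedUsers": ["hdfs", "nn"]},
    "dev_kms": {"allowedUsers": ["kms"]},  # hypothetical entry
}
principal = "hdfs/nn1.example.com@EXAMPLE.COM"  # hypothetical principal

print(authorize(services, "dev_hdfs", principal))  # prints 200
print(authorize(services, "dev_kms", principal))   # prints 403

services["dev_hdfs"]["allowedUsers"].remove("hdfs")  # allowlist delta
print(authorize(services, "dev_hdfs", principal))  # prints 403
```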
> ----
> h4. 8. HDFS / Ozone / Hive full audit pipelines
> What it proves: Dynamic partition plan changes do not break the end-to-end
> audit path: plugin → ingestor → Kafka → dispatcher → Solr (and Admin Audit UI
> when {{audit_store=solr}}).
> ||Plugin||What is exercised||
> |HDFS|Real NameNode operation → plugin audit → ingestor → correct partition →
> Solr doc|
> |Ozone|OM principal ({{om}}, {{ozone}}) → {{dev_ozone}} repo → full
> pipeline|
> |Hive|HS2 principal ({{hive}}) → {{dev_hive}} repo → full pipeline|
> Validated after: promote/scale (partition plan changed live), not only at
> bootstrap.
> Verify Solr:
> curl -s 'http://localhost:8983/solr/ranger_audits/select?q=repo:dev_hdfs&rows=3&wt=json'
> ----
> h3. Summary table (for Jira)
> ||Scenario||Plain question||How verified||Pass criteria||
> |Static vs dynamic|Does default-off behavior still work?|Feature flag + REST
> 503/200|Static unchanged; dynamic enables plan API|
> |Greenfield bootstrap|Who creates v1 on empty topic?|First ingestor
> start|Plan v1 in Kafka; counts match|
> |REST promote/scale|Can ops change routing live?|REST mutations|200 on valid;
> 409 stale; 400 illegal|
> |Multi-pod convergence|Do all replicas agree?|2 ingestors, promote on
> one|Same version ≤35s on both|
> |Brownfield pre-seed|Safe production cutover?|Pre-write plan, then
> enable|Pre-seed preserved; rollback OK|
> |Plugin onboard + routing|Do audits hit right partitions?|POST access + Kafka
> inspect|Partition ∈ plan list|
> |auth_to_local|Does allowlist enforcement work?|Allow/deny via services
> map|200/403 per principal+repo|
> |Full pipelines|Is audit delivery intact?|HDFS/Ozone/Hive → Solr|Docs in
> {{ranger_audits}} Solr collection|
> h2. Acceptance criteria
> # With {{dynamic.enabled=false}}, behavior matches existing static XML
> partitioning; {{GET /partition-plan}} returns 503
> # With {{dynamic.enabled=true}}, the registry topic is created (1 partition,
> compacted); a bootstrap plan is published on greenfield
> # REST promote/scale updates plan version; stale {{expectedVersion}} returns
> 409
> # All ingestor replicas converge to same plan within watcher refresh interval
> # Audits for promoted plugin land only on assigned {{ranger_audits}}
> partitions
> # {{POST /partition-plan/services}} updates allowlist; unauthorized
> principal returns 403
> # Post-scale routing works when Kafka metadata lags the plan
> ({{AuditPartitioner}} bound logic)
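> One possible reading of the {{max(cluster, plan)}} bound in the last
> criterion is sketched below: right after a topic grow, the producer's cached
> metadata may still report the old partition count, so the larger of the two
> counts is used as the upper bound for a valid partition ID. Both function
> names are hypothetical and the interpretation itself is an assumption.

```python
def effective_partition_count(cluster_count, plan_count):
    """Upper bound for valid partition IDs when metadata may lag the plan."""
    return max(cluster_count, plan_count)


def is_routable(partition, cluster_count, plan_count):
    """True if partition is a valid ID under the max(cluster, plan) bound."""
    return 0 <= partition < effective_partition_count(cluster_count, plan_count)


# Plan already says 51 partitions, cached cluster metadata still says 48.
print(is_routable(49, cluster_count=48, plan_count=51))  # prints True
print(is_routable(51, cluster_count=48, plan_count=51))  # prints False
```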
--
This message was sent by Atlassian Jira
(v8.20.10#820010)