Hi Rishi,

Thanks for writing this up. I lean toward a global baseline sanitizer. Filtering sensitive attributes and pruning very large payloads are cross-listener concerns, so making each listener remember to opt in seems easy to get wrong.
Access to raw, unsanitized events should be explicit and reserved for a well-defined use case. It should not be the default behavior. Otherwise a future listener can accidentally bypass the same safety rules persistence is trying to enforce.

For configuration, I would start with a shared baseline policy and only add per-listener overrides where we have a concrete need. That keeps the default behavior predictable while still leaving room for specialized consumers.

I haven't looked at the PR yet, please correct me if I'm wrong.

Yufei

On Tue, 12 May 2026 21:19:05 -0500, Srinivas Rishindra wrote:

Hi All,

I’d like to discuss the architecture for event persistence and a related proposal for a global event sanitization layer, building on the feedback from PR #4225 (https://github.com/apache/polaris/pull/4225).

*Current State of PR #4225*

The PR introduces a universal mechanism for persisting PolarisEvents to the database. Rather than per-event-type handlers (which would require 150+ implementations), it uses:

1. A single PolarisPersistenceEventListener that deterministically maps any event to a persistence entity using the event’s PolarisEventType.Category for resource type classification and a fallback chain for resource identifier resolution.
2. An EventAttributeFilter (CDI bean): an allowlist-based gate that determines which AttributeKeys from the EventAttributeMap are safe to persist.
3. An EventPayloadPruner (CDI bean): reduces large attribute values (particularly TableMetadata) to bounded summaries suitable for storage. For example, TableMetadata is reduced to uuid, location, format-version, current-snapshot-id, schema, and last-updated-ms (~1KB vs potentially hundreds of MB).

Both the filter and pruner are operator-configurable and replaceable via CDI.
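
To make the filter and pruner roles concrete, here is a minimal sketch of the two steps. It is not the code from the PR: the class name AttributeSanitizingSketch, the use of plain String keys and Map values in place of AttributeKey, EventAttributeMap, and TableMetadata, and the example attribute names are all assumptions made for illustration. Only the list of summary fields is taken from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the filter and pruner beans. Plain maps and
// String keys are used here instead of the real Polaris types.
final class AttributeSanitizingSketch {

  // Summary fields named in the PR description for TableMetadata.
  static final List<String> METADATA_SUMMARY_FIELDS =
      List.of("uuid", "location", "format-version", "current-snapshot-id",
          "schema", "last-updated-ms");

  /** Allowlist gate: only attributes whose key is in allowedKeys survive. */
  static Map<String, Object> filter(Map<String, Object> attributes, Set<String> allowedKeys) {
    Map<String, Object> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : attributes.entrySet()) {
      if (allowedKeys.contains(entry.getKey())) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }

  /**
   * Pruner: if the value is a map-like metadata object, keep only the
   * summary fields; any other value is returned unchanged.
   */
  static Object pruneMetadata(Object value) {
    if (!(value instanceof Map)) {
      return value;
    }
    Map<?, ?> full = (Map<?, ?>) value;
    Map<String, Object> summary = new LinkedHashMap<>();
    for (String field : METADATA_SUMMARY_FIELDS) {
      if (full.containsKey(field)) {
        summary.put(field, full.get(field));
      }
    }
    return summary;
  }
}
```

The allowlist direction matters: an attribute that nobody has reviewed is dropped by default, whereas a denylist would let new attributes through until someone remembers to block them.
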
*The Architectural Question: Global Sanitization Pipeline*

During review, Alex Dutra raised an important point: the sanitization logic (filtering sensitive attributes, pruning large payloads) is currently applied only within the persistence listener. However, this concern applies to all event consumers: persistence, CloudWatch, and any future listeners.

The natural insertion point would be in PolarisEventListeners.deliverEvent(), which is the single choke point between the Vert.x event bus and all listener implementations. A global EventSanitizer (composed of the filter + pruner) could transform events before they reach any listener.

*Proposal for Global Sanitization*

The filter and pruner interfaces are already designed for this promotion:

- They live in the org.apache.polaris.service.events package (not within listeners)
- They have no coupling to persistence concerns
- They operate on AttributeKey + Object pairs from EventAttributeMap

*A potential implementation path:*

    // In PolarisEventListeners.deliverEvent():
    private void deliverEvent(PolarisEvent event, String listenerName, PolarisEventListener listener) {
      PolarisEvent sanitizedEvent = eventSanitizer.sanitize(event);
      listener.onEvent(sanitizedEvent);
    }

Where EventSanitizer wraps the filter and pruner, producing a new PolarisEvent with a sanitized EventAttributeMap. Individual listeners could still apply additional listener-specific transformations if needed.

*Open Questions for the Community*

1. Should sanitization be global (all listeners get pre-sanitized events) or should each listener opt in?
2. If global, should the original unsanitized event still be accessible for listeners that need raw data (e.g., a compliance auditor that needs full TableMetadata)?
3. Should the filter/pruner configuration be per-listener or shared? The current implementation allows per-listener configuration via the CDI bean model.

I’d appreciate input on the direction here.
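
One way the first two questions could fit together is sketched below: sanitize once at the delivery choke point, and hand the raw event only to listeners that explicitly ask for it. This is an illustration under stated assumptions, not the Polaris API. SketchEvent, SketchListener, SketchSanitizer, SketchDispatcher, and the needsRawEvents() opt-in are all hypothetical names, and attributes are modeled as a plain String-keyed map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical, simplified event type (stands in for PolarisEvent).
final class SketchEvent {
  final String type;
  final Map<String, Object> attributes;

  SketchEvent(String type, Map<String, Object> attributes) {
    this.type = type;
    this.attributes = attributes;
  }
}

// Hypothetical listener contract: sanitized events are the default,
// raw access requires overriding needsRawEvents().
interface SketchListener {
  void onEvent(SketchEvent event);

  default boolean needsRawEvents() {
    return false;
  }
}

// Composes an allowlist filter with a pruning function.
final class SketchSanitizer {
  private final Set<String> allowedKeys;
  private final UnaryOperator<Object> pruner;

  SketchSanitizer(Set<String> allowedKeys, UnaryOperator<Object> pruner) {
    this.allowedKeys = allowedKeys;
    this.pruner = pruner;
  }

  SketchEvent sanitize(SketchEvent event) {
    Map<String, Object> clean = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : event.attributes.entrySet()) {
      if (allowedKeys.contains(entry.getKey())) {
        clean.put(entry.getKey(), pruner.apply(entry.getValue()));
      }
    }
    // Read-only view, because the same sanitized event is shared by all listeners.
    return new SketchEvent(event.type, Collections.unmodifiableMap(clean));
  }
}

// Single delivery point: sanitizes once per event, then fans out.
final class SketchDispatcher {
  private final SketchSanitizer sanitizer;
  private final List<SketchListener> listeners = new ArrayList<>();

  SketchDispatcher(SketchSanitizer sanitizer) {
    this.sanitizer = sanitizer;
  }

  void register(SketchListener listener) {
    listeners.add(listener);
  }

  void deliver(SketchEvent event) {
    SketchEvent sanitized = sanitizer.sanitize(event);
    for (SketchListener listener : listeners) {
      listener.onEvent(listener.needsRawEvents() ? event : sanitized);
    }
  }
}
```

With this shape a new listener gets sanitized events without doing anything, and the set of listeners that see raw data can be found by searching for overrides of the opt-in method.
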
I think the PR is functional as-is (sanitization works within persistence), but I want to align on the long-term architecture before it becomes harder to refactor.

Thanks,
Rishi
