Hi Rishi,
Thanks for writing this up.

I lean toward a global baseline sanitizer. Filtering sensitive attributes
and pruning very large payloads are cross-listener concerns, so making each
listener remember to opt in seems easy to get wrong.

Access to raw, unsanitized events should be explicit and reserved for a
well-defined use case. It should not be the default behavior. Otherwise a
future listener can accidentally bypass the same safety rules persistence
is trying to enforce.

For configuration, I would start with a shared baseline policy and only add
per-listener overrides where we have a concrete need. That keeps the
default behavior predictable while still leaving room for specialized
consumers.
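
To make that concrete, here is a rough sketch of "shared baseline plus
explicit overrides". Everything in it is hypothetical and not taken from
the PR: ListenerPolicy, PolicyRegistry, the rawAccess flag, and the listener
and attribute names are placeholders for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical policy type: which attribute keys a listener may see, and
// whether it has been explicitly granted access to raw events.
record ListenerPolicy(Set<String> allowedKeys, boolean rawAccess) {
  ListenerPolicy {
    allowedKeys = Set.copyOf(allowedKeys); // defensive, immutable copy
  }
}

// Hypothetical registry: every listener gets the baseline unless an
// override has been registered for it by name.
final class PolicyRegistry {
  private final ListenerPolicy baseline;
  private final Map<String, ListenerPolicy> overrides = new HashMap<>();

  PolicyRegistry(ListenerPolicy baseline) {
    this.baseline = baseline;
  }

  // Overrides are opt-in and explicit, so raw access never happens by default.
  void override(String listenerName, ListenerPolicy policy) {
    overrides.put(listenerName, policy);
  }

  ListenerPolicy policyFor(String listenerName) {
    return overrides.getOrDefault(listenerName, baseline);
  }
}
```

With this shape, a listener that nobody configured falls back to the
baseline, and raw access only exists where someone registered it on purpose.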

I haven't looked at the PR yet, so please correct me if I'm wrong.

Yufei

On Tue, 12 May 2026 21:19:05 -0500, Srinivas Rishindra wrote:

Hi All,

I’d like to discuss the architecture for event persistence and a related
proposal for a global event sanitization layer, building on the feedback
from PR #4225 <https://github.com/apache/polaris/pull/4225>.

*Current State of PR #4225 <https://github.com/apache/polaris/pull/4225>*

The PR introduces a universal mechanism for persisting PolarisEvents to
the database. Rather than per-event-type handlers (which would require 150+
implementations), it uses:

   1. A single PolarisPersistenceEventListener that deterministically maps
   any event to a persistence entity using the event’s
   PolarisEventType.Category for resource type classification and a fallback
   chain for resource identifier resolution.
   2. An EventAttributeFilter (CDI bean), an allowlist-based gate that
   determines which AttributeKeys from the EventAttributeMap are safe to
   persist.
   3. An EventPayloadPruner (CDI bean), which reduces large attribute values
   (particularly TableMetadata) to bounded summaries suitable for storage.
   For example, TableMetadata is reduced to uuid, location, format-version,
   current-snapshot-id, schema, and last-updated-ms (~1KB vs potentially
   hundreds of MB).

Both the filter and pruner are operator-configurable and replaceable via
CDI.
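
A minimal sketch of the filter and pruner roles described above, under
simplifying assumptions: attribute keys are plain strings rather than
AttributeKey instances, table metadata is modeled as a Map rather than
Iceberg's TableMetadata, and the names SketchAttributeFilter,
SketchPayloadPruner, AllowlistFilter, SummaryPruner, and AttributeSanitizing
are hypothetical and do not come from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the PR's EventAttributeFilter.
interface SketchAttributeFilter {
  boolean isAllowed(String key);
}

// Hypothetical stand-in for the PR's EventPayloadPruner.
interface SketchPayloadPruner {
  Object prune(String key, Object value);
}

// Allowlist gate: any key that is not explicitly listed is dropped.
final class AllowlistFilter implements SketchAttributeFilter {
  private final Set<String> allowedKeys;

  AllowlistFilter(Set<String> allowedKeys) {
    this.allowedKeys = Set.copyOf(allowedKeys);
  }

  @Override
  public boolean isAllowed(String key) {
    return allowedKeys.contains(key);
  }
}

// Reduces map-shaped metadata to the summary fields named in the proposal.
final class SummaryPruner implements SketchPayloadPruner {
  private static final Set<String> SUMMARY_FIELDS = Set.of(
      "uuid", "location", "format-version",
      "current-snapshot-id", "schema", "last-updated-ms");

  @Override
  public Object prune(String key, Object value) {
    if (!(value instanceof Map<?, ?>)) {
      return value; // non-map values pass through unchanged in this sketch
    }
    Map<Object, Object> summary = new LinkedHashMap<>();
    for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
      if (e.getKey() != null && SUMMARY_FIELDS.contains(e.getKey())) {
        summary.put(e.getKey(), e.getValue());
      }
    }
    return summary;
  }
}

// Applies the filter first, then prunes whatever survives.
final class AttributeSanitizing {
  static Map<String, Object> apply(
      Map<String, Object> attributes,
      SketchAttributeFilter filter,
      SketchPayloadPruner pruner) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : attributes.entrySet()) {
      if (filter.isAllowed(e.getKey())) {
        out.put(e.getKey(), pruner.prune(e.getKey(), e.getValue()));
      }
    }
    return out;
  }
}
```

Filtering before pruning means the pruner never has to touch values that
would be discarded anyway.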

*The Architectural Question: Global Sanitization Pipeline*

During review, Alex Dutra raised an important point: the sanitization
logic (filtering sensitive attributes, pruning large payloads) is currently
applied only within the persistence listener. However, this concern
applies to all event consumers — persistence, CloudWatch, and any future
listeners.

The natural insertion point would be PolarisEventListeners.deliverEvent(),
which is the single choke point between the Vert.x event bus and all
listener implementations. A global EventSanitizer (composed of the filter
and pruner) could transform events before they reach any listener.

*Proposal for Global Sanitization*

The filter and pruner interfaces are already designed for this promotion:
- They live in the org.apache.polaris.service.events package (not within
listeners)
- They have no coupling to persistence concerns
- They operate on AttributeKey + Object pairs from EventAttributeMap

*A potential implementation path:*

    // In PolarisEventListeners.deliverEvent():
    private void deliverEvent(
        PolarisEvent event, String listenerName, PolarisEventListener listener) {
      PolarisEvent sanitizedEvent = eventSanitizer.sanitize(event);
      listener.onEvent(sanitizedEvent);
    }

Where EventSanitizer wraps the filter and pruner, producing a new
PolarisEvent with a sanitized EventAttributeMap. Individual listeners could
still apply additional listener-specific transformations if needed.
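
A self-contained sketch of such a wrapper, assuming heavily simplified
stand-ins: SketchEvent, SketchListener, SketchSanitizer, and SketchDispatcher
are hypothetical names, the filter is modeled as a Predicate and the pruner
as a BiFunction, and none of this reflects the actual PolarisEvent or
EventAttributeMap APIs. Unlike the per-listener deliverEvent() snippet above,
this variant sanitizes once per event and fans the result out to all
listeners.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Predicate;

// Hypothetical, simplified event carrying string-keyed attributes.
record SketchEvent(String type, Map<String, Object> attributes) {}

// Hypothetical listener contract.
interface SketchListener {
  void onEvent(SketchEvent event);
}

// Wraps a filter and a pruner; always returns a new event and never
// mutates the one it was given.
final class SketchSanitizer {
  private final Predicate<String> filter;
  private final BiFunction<String, Object, Object> pruner;

  SketchSanitizer(
      Predicate<String> filter, BiFunction<String, Object, Object> pruner) {
    this.filter = filter;
    this.pruner = pruner;
  }

  SketchEvent sanitize(SketchEvent event) {
    Map<String, Object> clean = new LinkedHashMap<>();
    event.attributes().forEach((key, value) -> {
      if (filter.test(key)) {
        clean.put(key, pruner.apply(key, value));
      }
    });
    return new SketchEvent(event.type(), Collections.unmodifiableMap(clean));
  }
}

// Single choke point: listeners only ever see the sanitized copy.
final class SketchDispatcher {
  private final SketchSanitizer sanitizer;
  private final List<SketchListener> listeners = new ArrayList<>();

  SketchDispatcher(SketchSanitizer sanitizer) {
    this.sanitizer = sanitizer;
  }

  void register(SketchListener listener) {
    listeners.add(listener);
  }

  void deliver(SketchEvent event) {
    SketchEvent sanitized = sanitizer.sanitize(event); // once per event
    for (SketchListener listener : listeners) {
      listener.onEvent(sanitized);
    }
  }
}
```

Because the sanitized event wraps an unmodifiable map, one listener cannot
alter what the next listener receives.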

*Open Questions for the Community*

   1. Should sanitization be global (all listeners get pre-sanitized events)
   or should each listener opt in?
   2. If global, should the original unsanitized event still be accessible
   for listeners that need raw data (e.g., a compliance auditor that needs
   full TableMetadata)?
   3. Should the filter/pruner configuration be per-listener or shared? The
   current implementation allows per-listener configuration via the CDI bean
   model.

I’d appreciate input on the direction here. I think the PR is functional
as-is (sanitization works within persistence), but I want to align on the
long-term architecture before it becomes harder to refactor.

Thanks,
Rishi
