Hi All, I'd like to discuss the architecture for event persistence and a
related proposal for a global event sanitization layer, building on the
feedback from PR #4225 <https://github.com/apache/polaris/pull/4225>.

  *Current State of PR #4225 <https://github.com/apache/polaris/pull/4225>*

  The PR introduces a universal mechanism for persisting PolarisEvents to
the database. Rather than per-event-type handlers (which would require 150+
implementations), it uses:

  1. A single PolarisPersistenceEventListener that deterministically maps
any event to a persistence entity, using the event's
PolarisEventType.Category for resource type classification and a fallback
chain for resource identifier resolution.
  2. An EventAttributeFilter (CDI bean): an allowlist-based gate that
determines which AttributeKeys from the EventAttributeMap are safe to
persist.
  3. An EventPayloadPruner (CDI bean): reduces large attribute values
(particularly TableMetadata) to bounded summaries suitable for storage. For
example, TableMetadata is reduced to uuid, location, format-version,
current-snapshot-id, schema, and last-updated-ms (~1KB vs potentially
hundreds of MB).

  Both the filter and pruner are operator-configurable and replaceable via
CDI.
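
  To make the shape of these two beans concrete, here is a rough sketch of
an allowlist filter and a summary pruner. This is not the code in the PR:
the interface and class names are hypothetical, plain String keys stand in
for AttributeKey, and a Map stands in for TableMetadata, so treat it as an
illustration of the idea only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins: the real beans work with AttributeKey and
// EventAttributeMap; String keys and plain Maps are assumptions made here.
interface AttributeFilterSketch {
    boolean isAllowed(String key);
}

interface PayloadPrunerSketch {
    Object prune(String key, Object value);
}

// Allowlist gate: any key that is not explicitly listed is rejected.
final class AllowlistFilterSketch implements AttributeFilterSketch {
    private final Set<String> allowed;

    AllowlistFilterSketch(Set<String> allowed) {
        this.allowed = Set.copyOf(allowed);
    }

    @Override
    public boolean isAllowed(String key) {
        return allowed.contains(key);
    }
}

// Reduces one large map-valued attribute to a fixed list of summary fields.
// Values under any other key are returned unchanged.
final class SummaryPrunerSketch implements PayloadPrunerSketch {
    private final String targetKey;
    private final List<String> keptFields;

    SummaryPrunerSketch(String targetKey, List<String> keptFields) {
        this.targetKey = targetKey;
        this.keptFields = List.copyOf(keptFields);
    }

    @Override
    public Object prune(String key, Object value) {
        if (!targetKey.equals(key) || !(value instanceof Map)) {
            return value;
        }
        Map<?, ?> full = (Map<?, ?>) value;
        Map<String, Object> summary = new LinkedHashMap<>();
        for (String field : keptFields) {
            if (full.containsKey(field)) {
                summary.put(field, full.get(field));
            }
        }
        return summary;
    }
}
```

  In this sketch an operator would swap behaviour by supplying a different
implementation of either interface, which is roughly what the CDI
replacement mechanism gives us in the PR.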

  *The Architectural Question: Global Sanitization Pipeline*

  During review, Alex Dutra raised an important point: the sanitization
logic (filtering sensitive attributes, pruning large payloads) is currently
applied only within the persistence listener. However, this concern applies
to all event consumers: persistence, CloudWatch, and any future listeners.

  The natural insertion point would be
PolarisEventListeners.deliverEvent(), which is the single choke point
between the Vert.x event bus and all listener implementations. A global
EventSanitizer (composed of the filter + pruner) could transform events
before they reach any listener.

  *Proposal for Global Sanitization*

  The filter and pruner interfaces are already positioned to be promoted to
a global layer:
  - They live in the org.apache.polaris.service.events package (not within
listeners)
  - They have no coupling to persistence concerns
  - They operate on AttributeKey + Object pairs from EventAttributeMap

  *A potential implementation path:*

  // In PolarisEventListeners.deliverEvent():
  private void deliverEvent(
      PolarisEvent event, String listenerName, PolarisEventListener listener) {
      // Sanitize before the event reaches the listener.
      PolarisEvent sanitizedEvent = eventSanitizer.sanitize(event);
      listener.onEvent(sanitizedEvent);
  }

  Where EventSanitizer wraps the filter and pruner, producing a new
PolarisEvent with a sanitized EventAttributeMap. Individual listeners could
still apply additional listener-specific transformations if needed.
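
  As a rough illustration of that wrapper, the sketch below composes a
filter and a pruner and returns a new event rather than mutating the
input. Everything in it is an assumption for discussion: EventSketch and
EventSanitizerSketch are hypothetical names, the filter and pruner are
modelled as java.util.function types, and String keys stand in for
AttributeKey.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Predicate;

// Hypothetical stand-in for PolarisEvent: a type name plus an attribute map.
final class EventSketch {
    final String type;
    final Map<String, Object> attributes;

    EventSketch(String type, Map<String, Object> attributes) {
        this.type = type;
        // Defensive copy so later changes to the caller's map are not visible.
        this.attributes = Collections.unmodifiableMap(new LinkedHashMap<>(attributes));
    }
}

// Applies the filter first, then prunes the values that survive.
final class EventSanitizerSketch {
    private final Predicate<String> filter;
    private final BiFunction<String, Object, Object> pruner;

    EventSanitizerSketch(
            Predicate<String> filter, BiFunction<String, Object, Object> pruner) {
        this.filter = filter;
        this.pruner = pruner;
    }

    // Returns a new event; the original event is left untouched.
    EventSketch sanitize(EventSketch event) {
        Map<String, Object> sanitized = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : event.attributes.entrySet()) {
            if (filter.test(entry.getKey())) {
                sanitized.put(
                        entry.getKey(), pruner.apply(entry.getKey(), entry.getValue()));
            }
        }
        return new EventSketch(event.type, sanitized);
    }
}
```

  Filtering before pruning means the pruner never touches values that will
be dropped anyway, which matters if pruning large payloads is expensive.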

  *Open Questions for the Community*

  1. Should sanitization be global (all listeners get pre-sanitized events)
or should each listener opt in?
  2. If global, should the original unsanitized event still be accessible
for listeners that need raw data (e.g., a compliance auditor that needs
full TableMetadata)?
  3. Should the filter/pruner configuration be per-listener or shared? The
current implementation allows per-listener configuration via the CDI bean
model.
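
  To make questions 1 and 2 easier to discuss, here is one possible shape
(among several) for "global by default, raw on request": listeners receive
sanitized attributes unless they implement a marker interface. The names
ListenerSketch, RawEventListenerSketch, and DispatcherSketch are
hypothetical and do not exist in Polaris, and a plain Map stands in for the
event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical listener contract; a plain Map stands in for PolarisEvent.
interface ListenerSketch {
    void onEvent(Map<String, Object> attributes);
}

// Marker interface: implementing it requests the unsanitized attributes.
interface RawEventListenerSketch extends ListenerSketch {}

final class DispatcherSketch {
    private final UnaryOperator<Map<String, Object>> sanitizer;
    private final List<ListenerSketch> listeners = new ArrayList<>();

    DispatcherSketch(UnaryOperator<Map<String, Object>> sanitizer) {
        this.sanitizer = sanitizer;
    }

    void register(ListenerSketch listener) {
        listeners.add(listener);
    }

    // Sanitizes once per event, then hands each listener the view it asked for.
    void deliver(Map<String, Object> attributes) {
        Map<String, Object> sanitized = sanitizer.apply(attributes);
        for (ListenerSketch listener : listeners) {
            boolean wantsRaw = listener instanceof RawEventListenerSketch;
            listener.onEvent(wantsRaw ? attributes : sanitized);
        }
    }
}
```

  The trade-off with this variant is that raw access becomes a compile-time
property of the listener class rather than operator configuration, so it
would not by itself answer question 3.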

  I'd appreciate input on the direction here. I think the PR is functional
as-is (sanitization works within persistence), but I want to align on the
long-term architecture before it becomes harder to refactor.

  Thanks,
  Rishi
