Thanks, Yufei, for the feedback. I completely agree that a "secure-by-default"
pipeline is the safest way to handle cross-listener concerns.

I have just updated PR #4225 <https://github.com/apache/polaris/pull/4225>
to fully implement this architecture. Here is a quick summary of how it
aligns with your suggestions:

   - *Global Secure-by-Default Pipeline:* Events are now intercepted and
   sanitized at the dispatcher level (PolarisEventListeners.deliverEvent()).
   All listeners receive a cloned, pre-sanitized event by default, with no
   opt-in required.
   - *Strict Denylist:* The pipeline enforces a shared baseline denylist
   that strips sensitive objects such as PRINCIPAL and CATALOG (which
   contain credentials/secrets), while still extracting the derived
   attributes that listeners actually need (such as CATALOG_NAME).
   - *Raw Access Escape Hatch:* I added a TODO at the dispatcher level to
   track building an explicit opt-in mechanism (such as a marker interface)
   for specialized listeners that might need raw data in the future.
   - *Universal Routing:* With events pre-sanitized globally, the persistence
   listener now routes and persists any event through a single path, without
   needing 150+ switch-case handlers.
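
To make the points above concrete, here is a minimal, self-contained sketch
of the dispatcher-level behavior. It is not the code in the PR: Event,
Catalog, Listener, RawEventListener, the string attribute keys, and the
sample event type are all hypothetical stand-ins chosen for illustration,
and the raw-access marker interface is only a TODO in the PR today.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SanitizingDispatcherSketch {

  // Hypothetical stand-in for PolarisEvent: a type name plus an attribute map.
  static final class Event {
    final String type;
    final Map<String, Object> attributes;

    Event(String type, Map<String, Object> attributes) {
      this.type = type;
      this.attributes = attributes;
    }
  }

  // Hypothetical catalog object that carries a secret alongside its name.
  static final class Catalog {
    final String name;
    final String storageSecret;

    Catalog(String name, String storageSecret) {
      this.name = name;
      this.storageSecret = storageSecret;
    }
  }

  interface Listener {
    void onEvent(Event event);
  }

  // Hypothetical marker interface: the explicit opt-in for raw events.
  interface RawEventListener extends Listener {}

  // Shared baseline denylist (keys are illustrative strings, not real AttributeKeys).
  static final Set<String> DENYLIST = Set.of("PRINCIPAL", "CATALOG");

  // Builds a new event: denylisted keys are dropped, safe derived values are added.
  static Event sanitize(Event event) {
    Map<String, Object> clean = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : event.attributes.entrySet()) {
      if (!DENYLIST.contains(entry.getKey())) {
        clean.put(entry.getKey(), entry.getValue());
      }
    }
    Object catalog = event.attributes.get("CATALOG");
    if (catalog instanceof Catalog) {
      clean.put("CATALOG_NAME", ((Catalog) catalog).name);
    }
    return new Event(event.type, Collections.unmodifiableMap(clean));
  }

  // Single choke point: listeners get the sanitized copy unless they opted in.
  static void deliverEvent(Event event, Listener listener) {
    Event toDeliver = (listener instanceof RawEventListener) ? event : sanitize(event);
    listener.onEvent(toDeliver);
  }

  // Illustrative sample data only.
  static Event sampleEvent() {
    Map<String, Object> attrs = new LinkedHashMap<>();
    attrs.put("PRINCIPAL", "alice:token-123");
    attrs.put("CATALOG", new Catalog("prod", "s3-secret"));
    attrs.put("TABLE_NAME", "orders");
    return new Event("AFTER_CREATE_TABLE", attrs);
  }

  public static void main(String[] args) {
    Listener persistence =
        e -> System.out.println("default listener sees: " + e.attributes.keySet());
    RawEventListener auditor =
        e -> System.out.println("raw listener sees: " + e.attributes.keySet());
    deliverEvent(sampleEvent(), persistence);
    deliverEvent(sampleEvent(), auditor);
  }
}
```

The design choice this illustrates is that a listener has to do something
explicit (implement the marker) to see raw data, so a newly added listener
cannot bypass the baseline by accident.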

Thanks again for the guidance on this. Let me know if you or anyone else
has further thoughts upon reviewing the updated PR!

Best,
Rishi

On Wed, May 13, 2026 at 8:00 PM Yufei Gu <[email protected]> wrote:

> Hi Rishi,
> Thanks for writing this up.
>
> I lean toward a global baseline sanitizer. Filtering sensitive attributes
> and pruning very large payloads are cross-listener concerns, so making each
> listener remember to opt in seems easy to get wrong.
>
> Access to raw, unsanitized events should be explicit and reserved for a
> well-defined use case. It should not be the default behavior. Otherwise a
> future listener can accidentally bypass the same safety rules persistence
> is trying to enforce.
>
> For configuration, I would start with a shared baseline policy and only add
> per-listener overrides where we have a concrete need. That keeps the
> default behavior predictable while still leaving room for specialized
> consumers.
>
> I haven't looked at the PR yet, so please correct me if I'm wrong.
>
> Yufei
>
> On Tue, 12 May 2026 21:19:05 -0500, Srinivas Rishindra
> [email protected] wrote:
>
> Hi All, I’d like to discuss the architecture for event persistence and a
> related proposal for a global event sanitization layer, building on the
> feedback from PR #4225 https://github.com/apache/polaris/pull/4225.
>
> *Current State of PR #4225 <https://github.com/apache/polaris/pull/4225>*
>
> The PR introduces a universal mechanism for persisting PolarisEvents to
> the database. Rather than per-event-type handlers (which would require 150+
> implementations), it uses:
>
>    1. A single PolarisPersistenceEventListener that deterministically maps
>    any event to a persistence entity using the event’s
>    PolarisEventType.Category for resource type classification and a
>    fallback chain for resource identifier resolution.
>    2. An EventAttributeFilter (CDI bean): an allowlist-based gate that
>    determines which AttributeKeys from the EventAttributeMap are safe to
>    persist.
>    3. An EventPayloadPruner (CDI bean): reduces large attribute values
>    (particularly TableMetadata) to bounded summaries suitable for storage.
>    For example, TableMetadata is reduced to uuid, location, format-version,
>    current-snapshot-id, schema, and last-updated-ms (~1KB vs potentially
>    hundreds of MB).
>
> Both the filter and pruner are operator-configurable and replaceable via
> CDI.
>
> *The Architectural Question: Global Sanitization Pipeline*
>
> During review, Alex Dutra raised an important point: the sanitization
> logic (filtering sensitive attributes, pruning large payloads) is currently
> applied only within the persistence listener. However, this concern
> applies to all event consumers — persistence, CloudWatch, and any future
> listeners.
>
> The natural insertion point would be in
> PolarisEventListeners.deliverEvent(), which is the single choke point
> between the Vert.x event bus and all listener implementations. A global
> EventSanitizer (composed of the
> filter + pruner) could transform events before they reach any listener.
>
> *Proposal for Global Sanitization*
>
> The filter and pruner interfaces are already designed for this promotion:
> - They live in the org.apache.polaris.service.events package (not within
> listeners)
> - They have no coupling to persistence concerns
> - They operate on AttributeKey + Object pairs from EventAttributeMap
>
> *A potential implementation path:*
>
> // In PolarisEventListeners.deliverEvent():
> private void deliverEvent(
>     PolarisEvent event, String listenerName, PolarisEventListener listener) {
>   PolarisEvent sanitizedEvent = eventSanitizer.sanitize(event);
>   listener.onEvent(sanitizedEvent);
> }
>
> Where EventSanitizer wraps the filter and pruner, producing a new
> PolarisEvent with a sanitized EventAttributeMap. Individual listeners could
> still apply additional listener-specific transformations if needed.
>
> *Open Questions for the Community*
>
>    1. Should sanitization be global (all listeners get pre-sanitized
>    events) or should each listener opt in?
>    2. If global, should the original unsanitized event still be accessible
>    for listeners that need raw data (e.g., a compliance auditor that needs
>    full TableMetadata)?
>    3. Should the filter/pruner configuration be per-listener or shared? The
>    current implementation allows per-listener configuration via the CDI
>    bean model.
>
> I’d appreciate input on the direction here. I think the PR is functional
> as-is (sanitization works within persistence), but I want to align on the
> long-term architecture before it becomes harder to refactor.
>
> Thanks,
> Rishi
>
