adutra commented on PR #2962: URL: https://github.com/apache/polaris/pull/2962#issuecomment-3497464019
> However, I don't think you fully considered all angles of why the `PropertyMapEventListener` was introduced. There are many event fields that likely should not ever be pushed to an external system - for security reasons (such as table credentials that could be exposed in a `LoadTableResponse`) or for functional reasons (such as when a `TableMetadata` object is larger than 1MB, the maximum size for a record in CloudWatch Logs). I'm sure there are many other reasons here that are not coming top of mind. `PropertyMapEventListener` is supposed to be the transformation layer where only relevant information can be parsed out of the event and then sent onwards. I have some reservations regarding a common transformation layer for all data sinks. While the idea is sound in principle, I'm skeptical about reaching a consensus on the specific transformations to apply and information to redact. We'll likely agree on redacting sensitive data. However, other transformations could lead to endless debate. For instance, CloudWatch rejects payloads over 1MB, but a Kafka cluster might accept up to 100MB. Similarly, if a use case doesn't require a specific field, another use case might. And finally, transformations should be vendor-agnostic, of course. This leads me to believe that a "transformation layer," despite good intentions, will never be generic enough to serve all purposes. While we could introduce parameters to influence transformations, this would be a significant and time-consuming effort. Instead, I propose applying the least possible logic to events, ideally only security-related logic. Events should be passed as-is to the sinks, allowing each sink to perform further transformations and redactions on the event payload. I think we should absolutely avoid specific logic per event type, as this would imply having to special-case each one of the 150+ event types. JSON mixins should also be as generic as possible. If a sink cannot use them due to them not producing the desired JSON, it should use a different object mapper. Alternatively, a sink could use mixins and the default object mapper to produce not a String, but a JsonNode instance, apply further transformations to it, and then send it to the target remote system. Finally, I don't believe an intermediary Map representation offers much benefit. Anyways, I think the transformation layer topic deserves a discussion on the ML. > Maybe I could've done a better job to add a comment to the class stating this - but I can't approve this PR without all these considerations addressed. This is the exact reason why it has been taking me a while to complete the functionality through as well. > I am still quite interesting in seeing if MixIns can make things better - but I'm not familiar with what other options MixIns/Jackson provide to do automated parsing and/or redacting of the sensitive/larger/etc objects. I am hoping you can help direct this conversation. I would also recommend you to post these findings not only here on the PR but on the Dev Mailing list, as it may help gain more eyes who can help with this issue. I understand and share the security concerns, but we can't delay this PR indefinitely. Manually addressing each event for transformations is incredibly time-consuming, as you've noted. Asking the original contributor to implement logic for sensitive or larger objects falls outside the PR's original scope and will take too long (we have 150+ events!). I propose merging this PR now and tackling the transformation layer as a follow-up. Let's not forget: the events feature is still in beta for a good reason. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
