adutra commented on PR #2962:
URL: https://github.com/apache/polaris/pull/2962#issuecomment-3497464019

   > However, I don't think you fully considered all angles of why the 
`PropertyMapEventListener` was introduced. There are many event fields that 
likely should not ever be pushed to an external system - for security reasons 
(such as table credentials that could be exposed in a `LoadTableResponse`) or 
for functional reasons (such as when a `TableMetadata` object is larger than 
1MB, the maximum size for a record in CloudWatch Logs). I'm sure there are many 
other reasons here that are not coming top of mind. `PropertyMapEventListener` 
is supposed to be the transformation layer where only relevant information can 
be parsed out of the event and then sent onwards.
   
   I have some reservations regarding a common transformation layer for all 
data sinks. While the idea is sound in principle, I'm skeptical about reaching 
a consensus on the specific transformations to apply and information to redact.
   
   We'll likely agree on redacting sensitive data. However, other transformations could lead to endless debate. For instance, CloudWatch rejects payloads over 1MB, but a Kafka cluster might accept up to 100MB. Similarly, a field that one use case doesn't need may be essential to another. And of course, any such transformations would have to be vendor-agnostic.
   
   This leads me to believe that a "transformation layer," despite good 
intentions, will never be generic enough to serve all purposes. While we could 
introduce parameters to influence transformations, this would be a significant 
and time-consuming effort.
   
   Instead, I propose applying the least possible logic to events, ideally only 
security-related logic. Events should be passed as-is to the sinks, allowing 
each sink to perform further transformations and redactions on the event 
payload. I think we should absolutely avoid specific logic per event type, as 
this would imply having to special-case each one of the 150+ event types.
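   
   To make that concrete, here is a minimal, purely hypothetical sketch (the names below are made up, not the actual Polaris listener API): the listener forwards every event untouched, and any redaction or trimming happens inside each sink.
   
   ```java
   import java.util.List;
   
   // Hypothetical sketch (names are made up, not the actual Polaris event API):
   // the listener forwards every event untouched; each sink decides for itself
   // what to redact or drop.
   public final class ForwardingEventListener {
   
     /** A data sink receives the event as-is, whatever its type. */
     public interface EventSink {
       void accept(Object event);
     }
   
     private final List<EventSink> sinks;
   
     public ForwardingEventListener(List<EventSink> sinks) {
       this.sinks = List.copyOf(sinks);
     }
   
     // One generic entry point, no special case per event type.
     public void onEvent(Object event) {
       for (EventSink sink : sinks) {
         sink.accept(event);
       }
     }
   }
   ```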
   
   JSON mixins should also be as generic as possible. If a sink cannot use them because they don't produce the desired JSON, it should use a different object mapper. Alternatively, a sink could use the mixins and the default object mapper to produce a JsonNode instance rather than a String, apply further transformations to it, and only then send the result to the target remote system.
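   
   As a rough illustration of that last option, here is a minimal sketch using only standard Jackson calls (`addMixIn`, `valueToTree`, `writeValueAsString`); the event type, its fields, and the size limit are placeholders I made up, not Polaris classes:
   
   ```java
   import java.nio.charset.StandardCharsets;
   
   import com.fasterxml.jackson.annotation.JsonIgnore;
   import com.fasterxml.jackson.core.JsonProcessingException;
   import com.fasterxml.jackson.databind.ObjectMapper;
   import com.fasterxml.jackson.databind.node.ObjectNode;
   
   // Sketch only: TableEvent, its fields, and the size limit are made-up
   // placeholders; only the Jackson calls themselves are real.
   public final class JsonNodeSinkSketch {
   
     // Placeholder event type standing in for a real event class.
     public static class TableEvent {
       public String tableName;
       public Object credentials;
       public Object metadata;
     }
   
     // Generic mixin: hide fields that should never leave the process.
     abstract static class RedactingMixin {
       @JsonIgnore Object credentials;
     }
   
     private static final int MAX_PAYLOAD_BYTES = 1_000_000; // sink-specific limit
   
     private final ObjectMapper mapper =
         new ObjectMapper().addMixIn(TableEvent.class, RedactingMixin.class);
   
     String toPayload(TableEvent event) throws JsonProcessingException {
       // Serialize to a tree, not straight to a String, so this sink can keep
       // transforming the payload before it is shipped.
       ObjectNode node = mapper.valueToTree(event);
       if (node.toString().getBytes(StandardCharsets.UTF_8).length > MAX_PAYLOAD_BYTES) {
         node.remove("metadata"); // sink-specific choice: drop the bulky field
       }
       return mapper.writeValueAsString(node);
     }
   }
   ```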
   
   Finally, I don't believe an intermediary Map representation offers much 
benefit.
   
   Anyway, I think the transformation layer topic deserves a discussion on the dev mailing list.
   
   > Maybe I could've done a better job to add a comment to the class stating 
this - but I can't approve this PR without all these considerations addressed. 
This is the exact reason why it has been taking me a while to complete the 
functionality through as well.
   > I am still quite interested in seeing if MixIns can make things better - 
but I'm not familiar with what other options MixIns/Jackson provide to do 
automated parsing and/or redacting of the sensitive/larger/etc objects. I am 
hoping you can help direct this conversation. I would also recommend you to 
post these findings not only here on the PR but on the Dev Mailing list, as it 
may help gain more eyes who can help with this issue.
   
   I understand and share the security concerns, but we can't delay this PR 
indefinitely. Manually addressing each event for transformations is incredibly 
time-consuming, as you've noted.
   
   Asking the original contributor to implement logic for sensitive or larger 
objects falls outside the PR's original scope and will take too long (we have 
150+ events!). I propose merging this PR now and tackling the transformation 
layer as a follow-up.
   
   Let's not forget: the events feature is still in beta for a good reason.
   

