andaag opened a new issue #2135:
URL: https://github.com/apache/hudi/issues/2135


   **Describe the problem you faced**
   
   I'm trying to come up with a consistent and understandable way to handle GDPR deletes.
   
   What I'd like to do:
   1. Stream realtime data into bucket A
   2. Collect GDPR delete signals into B (or delete records one by one in A?)
   3. Every 24h, delete from A where userid is in B
   4. After that, ack back to the GDPR controller, confirming that records 1, 2, 3 are 100% deleted.
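   A minimal sketch of step 3, assuming the Hudi Spark datasource is used for the write-back. The option keys are standard Hudi write configs; the table name (`user_events`), the key field (`userid`), and the DataFrame names in the usage comment are placeholders, not anything from this issue:

```python
# Hedged sketch: issue hard deletes through the Hudi Spark datasource.
# "hoodie.datasource.write.operation" = "delete" removes the given record
# keys from the table; the table/field names below are made-up placeholders.
hudi_delete_opts = {
    "hoodie.table.name": "user_events",              # placeholder table name
    "hoodie.datasource.write.operation": "delete",   # hard-delete the keys
    "hoodie.datasource.write.recordkey.field": "userid",
    "hoodie.cleaner.commits.retained": "1",          # shrink the point-in-time window
}

# Usage (not executed here): read the GDPR keys from B, narrow A's rows to
# the matching keys, and write them back with the delete operation:
# (spark.read.format("hudi").load(path_a)
#      .join(gdpr_keys_df, "userid", "leftsemi")
#      .write.format("hudi").options(**hudi_delete_opts)
#      .mode("append").save(path_a))
```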
   
   My problem is step 4. I'd like to ack only the records that are physically deleted on disk, i.e. after compaction. I understand that I need to set 'hoodie.cleaner.commits.retained': 1 and 'hoodie.cleaner.fileversions.retained': 1 to narrow the window for point-in-time queries, and I also saw some code references to manually running compaction, but this is complex.
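   As an alternative to driving compaction by hand, Hudi's inline-compaction configs ask the writer itself to compact. A sketch of the relevant options, with illustrative values rather than recommendations:

```python
# Hedged sketch: let the writer compact inline instead of scheduling
# compaction manually. Keys are standard Hudi write configs; compacting
# after every delta commit (value "1") trades write latency for freshness.
inline_compaction_opts = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",                 # compact as part of the write
    "hoodie.compact.inline.max.delta.commits": "1",  # after every delta commit
}
```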
   
   I really like delta.io's "vacuum 7 days" command, as it makes it explicit and easy to understand that I can ack any records older than 7 days. How can I tell which records have been compacted and which haven't? Can I easily pull the last compaction time from the table? (And are these time-ordered, so I can know that records older than X are compacted?)
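   One way to approximate the last question: Hudi keeps its timeline as files under `.hoodie/` in the table path. Assuming the usual layout, where a completed compaction on a merge-on-read table lands as an `<instant>.commit` file (delta writes are `<instant>.deltacommit`) and instant timestamps sort lexically by time, a minimal sketch is:

```python
import os
import re

def latest_completed_compaction(table_path):
    """Return the newest completed compaction instant, or None.

    Assumption: completed compactions on a merge-on-read table appear as
    `<yyyyMMddHHmmss>.commit` files under `.hoodie/`, and those instant
    timestamps sort lexically in time order.
    """
    timeline_dir = os.path.join(table_path, ".hoodie")
    instants = [
        m.group(1)
        for f in os.listdir(timeline_dir)
        # `.deltacommit` and `.compaction.requested` files do not match
        if (m := re.fullmatch(r"(\d+)\.commit", f))
    ]
    return max(instants, default=None)
```

   Records written at or before that instant should have been rewritten into base files; whether point-in-time queries can still see the deleted rows additionally depends on the cleaner retention settings mentioned above.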


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
