andaag opened a new issue #2135: URL: https://github.com/apache/hudi/issues/2135
**Describe the problem you faced**

I'm trying to come up with a consistent and understandable way to deal with GDPR deletes. What I'd like to do:

1. Stream realtime data into bucket A.
2. Collect GDPR delete signals into B (or delete records one by one in A?).
3. Every 24h, delete from A where userid is in B.
4. After that, ack back to the GDPR controller, saying records 1, 2, 3 are 100% deleted.

My problem is step 4. I'd like to ack only the records that are deleted on disk, i.e. after compaction. I understand that I need to set `hoodie.cleaner.commits.retained: 1` and `hoodie.cleaner.fileversions.retained: 1` to shrink the time window for point-in-time queries, and I also saw some code references to manually running compaction, but this is complex. I really like delta.io's "vacuum 7 days" command, because it makes it understandable and explicit that I can ack any record older than 7 days.

How can I know which records have been compacted, and which haven't? Is there an easy way to pull the last compaction time from the table? (And are these time-ordered, so that I can know that all records older than X are compacted?)
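One way to approximate "last compaction time" without the Java client is to inspect the table's `.hoodie` timeline folder directly. The sketch below assumes the standard Hudi timeline layout, where each completed instant is a file named `<instant_time>.<action>` (on a merge-on-read table, a completed compaction surfaces as a `commit` instant, while regular ingests are `deltacommit`s). This relies on an implementation detail of the timeline; in production the `HoodieTableMetaClient` API or the Hudi CLI is the supported route.

```python
import os
import re

def last_completed_instant(hoodie_dir, actions=("commit",)):
    """Return the latest completed instant time in a Hudi .hoodie directory.

    Completed instants are files named <instant_time>.<action>; inflight and
    requested instants carry extra suffixes (.inflight, .requested) and are
    deliberately excluded by anchoring the pattern at end-of-name.
    Instant times are timestamp strings (yyyyMMddHHmmss, possibly with
    milliseconds appended in newer Hudi versions), so lexicographic order
    matches time order and max() gives the most recent one.
    Returns None if no completed instant of the given actions exists.
    """
    pat = re.compile(r"^(\d{14,})\.(%s)$" % "|".join(actions))
    instants = [m.group(1)
                for name in os.listdir(hoodie_dir)
                if (m := pat.match(name))]
    return max(instants) if instants else None
```

Given that instant times are ordered, a record deleted in a commit with an instant time strictly older than `last_completed_instant(...)` should have gone through compaction, which is the ack condition step 4 needs, subject to the cleaner having also removed older file versions.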
