Hi All,

I have hundreds of Hudi tables (on AWS S3), each populated via Spark structured streaming from Kafka streams. I now need to delete all records for a given user (userId) from every table that contains data for that user, i.e. every table with a reference to that specific userId. Republishing all of that user's events/records to Kafka to perform the delete is not an option: each user has around 10-15 years' worth of data, so replaying it would be very costly and time-consuming. So I am wondering: how is everybody performing GDPR deletes on their Hudi tables?
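Here is roughly what I am planning to run per table. This is a minimal sketch: the table path, the userId column, and the record key / precombine fields are placeholders for whatever each table actually uses:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("gdpr-user-delete")
  .getOrCreate()

// Placeholders: one of the tables, plus the user from the delete event.
val basePath = "s3://my-bucket/hudi/events_table"
val userIdToDelete = "some-user-id"

// Read only the records that belong to the user being erased.
val toDelete = spark.read
  .format("hudi")
  .load(basePath)
  .where(col("userId") === userIdToDelete)

// Write the same records back with the 'delete' operation so Hudi
// issues deletes for their record keys.
toDelete.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.datasource.write.recordkey.field", "recordId") // placeholder
  .option("hoodie.datasource.write.precombine.field", "ts")      // placeholder
  .option("hoodie.table.name", "events_table")
  .mode(SaveMode.Append)
  .save(basePath)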
How do I receive a delete request? A delete event arrives on a dedicated Kafka topic [it contains just the userId of the user to delete], so the plan is to use that userId as a filter condition, read the matching records from each Hudi table, and write them back with the datasource operation set to 'delete', as in the sketch above. But if the streaming job keeps ingesting newly arriving data while this delete Spark job runs on the same table, what are the side effects? Will it even work, given that multiple writers do not seem to be currently supported? Could you help me with a solution?

Regards,
Felix K Jose
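PS: I noticed that newer Hudi releases (0.8.0+) document optimistic concurrency control for multiple writers. If that is the intended route, am I right that both the streaming ingest job and the batch delete job would need lock settings along these lines? Again just a sketch; I picked the Zookeeper lock provider as an example, and the Zookeeper endpoint, lock key, and base path are placeholders:

// Options both writers would share (assuming Hudi 0.8.0+ with OCC enabled).
val multiWriterOpts = Map(
  "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
  // Keep files from failed/conflicting commits until the cleaner runs.
  "hoodie.cleaner.policy.failed.writes" -> "LAZY",
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
  "hoodie.write.lock.zookeeper.url" -> "zk-host",           // placeholder
  "hoodie.write.lock.zookeeper.port" -> "2181",
  "hoodie.write.lock.zookeeper.lock_key" -> "events_table", // placeholder
  "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"  // placeholder
)

// e.g. toDelete.write.format("hudi").options(multiWriterOpts)... in the job above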