Hi All,

I have hundreds of Hudi tables (on AWS S3), each populated via Spark 
structured streaming from Kafka topics. Now I have to delete the records for a 
given user (userId) from all the tables that have data for that user, i.e., 
all tables that reference that specific userId. I cannot republish all the 
events/records for that user to Kafka to perform the delete, since it's around 
10-15 years' worth of data per user, which would be very costly and 
time-consuming. So I am wondering: how is everybody performing GDPR deletes on 
their Hudi tables?


How do we receive delete requests?
On a delete Kafka topic we receive a delete event (which contains just the 
userId of the user to delete), so we have to use that userId as a filter 
condition, read the matching records from each Hudi table, and write them back 
with the datasource write operation set to 'delete'. But if the streaming job 
continues to ingest newly arriving data while this delete Spark job runs on 
the same table, what will the side effects be? Will it even work, given that 
multiple writers do not seem to be supported currently?
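
For concreteness, here is a minimal sketch of the delete job I have in mind, 
using the Hudi Spark datasource. The table path, table name, record key, 
partition path, and precombine fields below are illustrative placeholders, not 
our real schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("gdpr-user-delete")
    # Hudi recommends Kryo serialization for Spark jobs
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

TABLE_PATH = "s3://my-bucket/hudi/some_table"  # illustrative path
USER_ID = "some-user-id"                       # taken from the delete event

# Read only this user's records from the Hudi table.
to_delete = (
    spark.read.format("hudi")
    .load(TABLE_PATH)
    .filter(col("userId") == USER_ID)
)

# Write the same records back with operation=delete so Hudi
# issues deletes for those record keys.
(
    to_delete.write.format("hudi")
    .option("hoodie.datasource.write.operation", "delete")
    .option("hoodie.table.name", "some_table")                           # illustrative
    .option("hoodie.datasource.write.recordkey.field", "recordId")       # illustrative
    .option("hoodie.datasource.write.partitionpath.field", "eventDate")  # illustrative
    .option("hoodie.datasource.write.precombine.field", "ts")            # illustrative
    .mode("append")
    .save(TABLE_PATH)
)

My worry is what happens when a job like this runs concurrently with the 
always-on streaming writer on the same table path.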

Could you help me with a solution?

Regards,
Felix K Jose

