Felix,

Happy to help you with trying out and rolling out multi-writer on your Hudi tables. Do you have a test environment where you can try the feature by following the doc Vinoth pointed to above?
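To make it a bit more concrete, here is a rough, untested sketch of what such a background delete job could look like with the 0.8.0 multi-writer settings turned on. The table path, ZooKeeper endpoint, lock key, and column names are just placeholders, and the exact option keys are worth double-checking against the concurrency control page Vinoth linked:

```scala
// Sketch (untested) of a one-off delete job that can run concurrently with the
// streaming ingest job, using Hudi 0.8.0 optimistic concurrency control.
// All paths, the ZooKeeper endpoint, and the column names are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("gdpr-delete-job").getOrCreate()

val basePath = "s3://bucket/hudi/table_x"   // placeholder table path

// Pull only the records belonging to the user being erased.
val toDelete = spark.read
  .format("hudi")
  .load(basePath)
  .filter("userId = 'user-to-forget'")      // placeholder column / value

toDelete.write
  .format("hudi")
  .option("hoodie.table.name", "table_x")                                 // placeholder
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.datasource.write.recordkey.field", "recordKey")         // placeholder
  .option("hoodie.datasource.write.partitionpath.field", "partitionPath") // placeholder
  .option("hoodie.datasource.write.precombine.field", "ts")               // placeholder
  // Multi-writer settings (the streaming writer needs the same options).
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk-host")                   // placeholder
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .option("hoodie.write.lock.zookeeper.lock_key", "table_x")
  .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
  .mode(SaveMode.Append)
  .save(basePath)
```

With these options on both writers, a delete commit that conflicts with a concurrent streaming commit on the same file groups fails cleanly at commit time and can simply be retried.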
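For completeness, the other route Vinoth mentioned (applying the deletes from within the streaming job itself, so the table only ever has a single writer) could look roughly like the untested sketch below, assuming the delete events carry the table's record keys. The topic, schema, paths, and column names are again placeholders:

```scala
// Sketch (untested) of handling deletes inside the existing ingest job itself,
// so there is never a second writer. Assumes the events on the topic carry an
// "op" marker and the table's record key; all names here are placeholders.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("ingest-with-deletes").getOrCreate()

val schema = new StructType()              // placeholder event schema
  .add("op", StringType)                   // "upsert" or "delete"
  .add("recordKey", StringType)
  .add("userId", StringType)
  .add("payload", StringType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "events-topic")                // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

def writeHudi(df: DataFrame, operation: String): Unit =
  df.write
    .format("hudi")
    .option("hoodie.table.name", "table_x")                          // placeholder
    .option("hoodie.datasource.write.operation", operation)
    .option("hoodie.datasource.write.recordkey.field", "recordKey")  // placeholder
    .option("hoodie.datasource.write.precombine.field", "recordKey") // placeholder
    .mode("append")
    .save("s3://bucket/hudi/table_x")                                // placeholder

events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Apply inserts/updates and deletes from the same micro-batch, so the
    // streaming job remains the single writer on the table.
    writeHudi(batch.filter(col("op") =!= "delete"), "upsert")
    writeHudi(batch.filter(col("op") === "delete"), "delete")
  }
  .option("checkpointLocation", "s3://bucket/checkpoints/table_x")   // placeholder
  .start()
```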
Thanks,
Nishith

On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Felix,
>
> Most people, I think, are publishing this data into Kafka and applying the
> deletes as part of the streaming job itself. The reason this works is that
> typically only a small fraction of users leave the service (say << 0.1%
> weekly is what I have heard), so the cost of storage on Kafka is not much.
> Is that not the case for you? Are you looking for a one-time scrubbing of
> data, for example? The benefit of this approach is that you eliminate any
> concurrency issues that arise from the streaming job producing data for a
> user while deletes are also being issued for that user.
>
> On concurrency control, Hudi now supports multiple writers, if you want to
> write a background job that will perform these deletes for you. It's in
> 0.8.0; see https://hudi.apache.org/docs/concurrency_control.html. One of us
> can help you out with trying this and rolling it out (Nishith is the
> feature author). Here, if the delete job touches the same files that the
> streaming job is writing to, then only one of them will succeed.
>
> We are working on a design for true lock-free concurrency control, which
> provides the benefits of both models. But it won't be there for another
> month or two.
>
> Thanks
> Vinoth
>
>
> On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> <felix.j...@philips.com.invalid> wrote:
>
> > Hi All,
> >
> > I have 100s of Hudi tables (AWS S3), each of which is populated via
> > Spark structured streaming from Kafka streams. Now I have to delete
> > records for a given user (userId) from all the tables which have data
> > for that user, meaning all tables where we have a reference to that
> > specific userId. I cannot republish all the events/records for that
> > user to Kafka to perform the delete, since it is around 10-15 years'
> > worth of data per user and would be very costly and time consuming.
> > So I am wondering how everybody is performing GDPR deletes on their
> > Hudi tables?
> >
> > How do we receive a delete request?
> > On a delete Kafka topic we get a delete event [which just contains the
> > userId of the user to delete], so we have to use that as a filter
> > condition, read all the records from the Hudi tables, and write them
> > back with the data source operation set to 'delete'. But while running
> > this delete Spark job on a table, if the streaming job continues to
> > ingest newly arriving data, what will be the side effect? Will it work,
> > since it seems like multiple writers are not currently supported?
> >
> > Could you help me with a solution?
> >
> > Regards,
> > Felix K Jose