Felix,

Happy to help you try out and roll out multi-writer on your Hudi tables. Do
you have a test environment where you can try the feature by following the
doc that Vinoth pointed to above?

Thanks,
Nishith

On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Felix,
>
> Most people, I think, are publishing this data into Kafka and applying the
> deletes as part of the streaming job itself. The reason this works is that,
> typically, only a small fraction of users leave the service (<< 0.1% weekly
> is what I have heard), so the cost of storage on Kafka is not much. Is that
> not the case for you? Are you looking for a one-time scrubbing of data, for
> example? The benefit of this approach is that you eliminate any concurrency
> issues that arise from the streaming job producing data for a user while
> deletes are also being issued for that user.
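>
> Roughly, the streaming side could look like the (untested) sketch below,
> assuming userId is the record key and delete events arrive on the same
> stream flagged by a hypothetical _op column; table name, path and field
> names are placeholders:
>
>   import org.apache.spark.sql.DataFrame
>
>   // Shared Hudi write options (table/field names are placeholders).
>   val hudiOpts = Map(
>     "hoodie.table.name" -> "user_events",
>     "hoodie.datasource.write.recordkey.field" -> "userId",
>     "hoodie.datasource.write.partitionpath.field" -> "event_date",
>     "hoodie.datasource.write.precombine.field" -> "ts"
>   )
>   val basePath = "s3://bucket/hudi/user_events"
>
>   // foreachBatch sink: split each micro-batch into upserts and deletes
>   // and write both to the same table, deletes via the "delete" operation.
>   def writeBatch(batch: DataFrame, batchId: Long): Unit = {
>     batch.filter("_op <> 'delete'").write.format("hudi")
>       .options(hudiOpts)
>       .option("hoodie.datasource.write.operation", "upsert")
>       .mode("append").save(basePath)
>
>     batch.filter("_op = 'delete'").write.format("hudi")
>       .options(hudiOpts)
>       .option("hoodie.datasource.write.operation", "delete")
>       .mode("append").save(basePath)
>   }
>   // Wired up via: stream.writeStream.foreachBatch(writeBatch _).start()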
>
> On concurrency control, Hudi now supports multiple writers if you want to
> write a background job that performs these deletes for you. It's in 0.8.0;
> see https://hudi.apache.org/docs/concurrency_control.html. One of us can
> help you out with trying this and rolling it out (Nishith is the feature
> author). Here, if the delete job touches the same files that the streaming
> job is writing to, then only one of them will succeed.
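>
> For reference, the knobs from that page boil down to something like the
> sketch below (ZooKeeper shown as the lock provider; hosts and key names
> are placeholders), which both the streaming writer and the delete job
> would need to carry:
>
>   // Extra write options to enable optimistic concurrency control (0.8.0+).
>   val occOptions = Map(
>     "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
>     "hoodie.cleaner.policy.failed.writes" -> "LAZY",
>     "hoodie.write.lock.provider" ->
>       "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
>     "hoodie.write.lock.zookeeper.url" -> "zk-host",           // placeholder
>     "hoodie.write.lock.zookeeper.port" -> "2181",
>     "hoodie.write.lock.zookeeper.lock_key" -> "user_events",  // placeholder
>     "hoodie.write.lock.zookeeper.base_path" -> "/hudi_locks"  // placeholder
>   )
>   // Passed on every writer: df.write.format("hudi").options(occOptions)...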
>
> We are working on a design for true lock-free concurrency control, which
> provides the benefits of both models, but it won't be there for another
> month or two.
>
> Thanks
> Vinoth
>
>
> On Tue, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
> <felix.j...@philips.com.invalid> wrote:
>
> > Hi All,
> >
> > I have 100s of HUDI tables (AWS S3), each of which is populated via Spark
> > structured streaming from Kafka streams. Now I have to delete the records
> > for a given user (userId) from all the tables that have data for that
> > user, meaning every table that references that specific userId. I cannot
> > republish all the events/records for that user to Kafka to perform the
> > delete, since it is around 10-15 years' worth of data per user and would
> > be very costly and time consuming. So I am wondering how everybody is
> > handling GDPR deletes on their HUDI tables?
> >
> >
> > How do I get the delete request?
> > On a delete Kafka topic we get a delete event [which just contains the
> > userId of the user to delete], so we have to use that as a filter
> > condition, read all the records for that user from the HUDI tables, and
> > write them back with the datasource write operation set to 'delete'
> > (rough sketch below). But if the streaming job continues to ingest newly
> > arriving data while this delete Spark job is running on the table, what
> > will the side effects be? Will it work, since it seems like multiple
> > writers are not currently supported?
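> >
> > To make it concrete, the kind of delete Spark job I have in mind is
> > sketched below (table paths and field names are placeholders; userId is
> > our record key and ts our precombine field):
> >
> >   import org.apache.spark.sql.SparkSession
> >   import org.apache.spark.sql.functions.col
> >
> >   val spark = SparkSession.builder().appName("gdpr-delete").getOrCreate()
> >   val basePath = "s3://bucket/hudi/user_events"  // one of the 100s of tables
> >   val userIdToDelete = "some-user-id"            // taken from the delete topic
> >
> >   // Snapshot-read the table and keep only that user's records (older Hudi
> >   // versions may need a partition glob such as basePath + "/*/*").
> >   val toDelete = spark.read.format("hudi").load(basePath)
> >     .filter(col("userId") === userIdToDelete)
> >
> >   toDelete.write.format("hudi")
> >     .option("hoodie.table.name", "user_events")
> >     .option("hoodie.datasource.write.recordkey.field", "userId")
> >     .option("hoodie.datasource.write.partitionpath.field", "event_date")
> >     .option("hoodie.datasource.write.precombine.field", "ts")
> >     .option("hoodie.datasource.write.operation", "delete")
> >     .mode("append")
> >     .save(basePath)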
> >
> > Could you help me with a solution?
> >
> > Regards,
> > Felix K Jose
> >
>
