Re: Incremental query on partition column

Balaji Varadarajan Fri, 21 Aug 2020 13:03:01 -0700

Thanks for the detailed email, David. We discussed this in last week's 
community meeting, and Vinoth had ideas on how to implement it. This is 
something that Hudi's timeline layout can support. It would be a new feature 
(a new write operation) that basically appends the delete marker to all 
versions of the data instead of just the latest. 
Opened a Jira : https://issues.apache.org/jira/browse/HUDI-1212
Balaji.V
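
To make the difference between the two delete behaviours concrete, here is a toy, in-memory sketch. It is not Hudi code and not the design tracked in HUDI-1212: the ToyTimeline class and its method names are hypothetical and only model "a delete marker on the latest version" versus "a delete marker on all versions" on a list of commits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    """One entry on a toy timeline: the records written at `instant`."""
    instant: str
    # record key -> payload; None stands in for a delete marker
    records: Dict[str, Optional[dict]] = field(default_factory=dict)


class ToyTimeline:
    """Hypothetical model of a commit timeline (commits appended in instant order)."""

    def __init__(self) -> None:
        self.commits: List[Commit] = []

    def upsert(self, instant: str, rows: Dict[str, dict]) -> None:
        self.commits.append(Commit(instant, dict(rows)))

    def delete_latest(self, instant: str, key: str) -> None:
        """Ordinary delete: a new commit carries the marker, history is untouched."""
        self.commits.append(Commit(instant, {key: None}))

    def delete_all_versions(self, instant: str, key: str) -> None:
        """Assumed new operation: also mark the key as deleted in every earlier commit."""
        for commit in self.commits:
            if key in commit.records:
                commit.records[key] = None
        self.commits.append(Commit(instant, {key: None}))

    def as_of(self, instant: str) -> Dict[str, dict]:
        """Time-travel read: replay commits up to and including `instant`."""
        state: Dict[str, dict] = {}
        for commit in self.commits:
            if commit.instant > instant:
                break
            for key, payload in commit.records.items():
                if payload is None:
                    state.pop(key, None)
                else:
                    state[key] = payload
        return state


def build_history() -> ToyTimeline:
    """Two commits of made-up sample data used for illustration only."""
    timeline = ToyTimeline()
    timeline.upsert("001", {"alice": {"city": "Paris"}, "bob": {"city": "Lyon"}})
    timeline.upsert("002", {"alice": {"city": "Nice"}})
    return timeline
```

With delete_latest, a read as of commit "002" still returns alice; with delete_all_versions, alice disappears from every point in time while bob's history is retained, which is the strict RTBF behaviour asked for below.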

    On Friday, August 14, 2020, 06:12:26 AM PDT, David Rosalia 
<davidrosa...@hotmail.com> wrote:  

 Hello,

I am Siva's colleague and I am working on the problem below as well.

I would like to describe what we are trying to achieve with Hudi, as well as 
our current way of working and our GDPR and "Right To Be Forgotten" (RTBF) 
compliance policies.

Our requirements :
- We wish to apply a strict interpretation of the RTBF.  In other words, when 
we remove a person's data, it should be throughout the historical data and not 
just the latest snapshot.
- We wish to use Hudi to reduce our storage requirements using upserts and 
don't want to have duplicates between commits.
- We wish to retain history for persons who have not requested to be forgotten 
and therefore we do not want to delete commit files from the history as some 
have proposed.

We have tried a couple of solutions, but so far without success :
- replay the data omitting the data of the persons who have requested to be 
forgotten.  We wanted to manipulate the commit times to rebuild the history.
We found that we couldn't manipulate the commit times and retain the history.

- replay the data omitting the data of the persons who have requested to be 
forgotten, but writing to a date-based partition folder using the 
"partitionpath" parameter.
We found that upserts written to different partitionpath folders do not 
ignore data that is unchanged between two commit dates, as they do when using 
the default commit file system, so this technique will neither save storage 
nor speed up our processing.

So basically we would like to find a way to apply a strict RTBF, comply with 
GDPR, maintain history and time travel (large history), and save storage space 
using Hudi.

Can anyone see a way to achieve this?

Kind Regards,
David Rosalia


________________________________
From: Vinoth Chandar <vin...@apache.org>
Sent: Friday, August 14, 2020 8:26:22 AM
To: dev@hudi.apache.org <dev@hudi.apache.org>
Subject: Re: Incremental query on partition column

Hi,

On re-ingesting, do you mean that you want to overwrite the table without
those changes showing up in the incremental query?  This has not come up
before.
As you can imagine, it'd be a tricky scenario, where we would need some
special handling/action type introduced.

Yes, yes on the next two questions.
Commit time can be controlled if using the HoodieWriteClient API, but not via
the datasource/deltastreamer atm.
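
For the two "yes" answers, a minimal sketch of the read options involved, written as plain Python dictionaries so it runs without Spark. The option keys are the Hudi datasource read options; the helper names (to_instant, snapshot_options, incremental_options) are made up for this example, and the 14-digit instant format is assumed to be the one used by Hudi releases of this period.

```python
from datetime import datetime
from typing import Dict, Optional


def to_instant(ts: datetime) -> str:
    """Format a timestamp as a 14-digit instant string (yyyyMMddHHmmss, assumed format)."""
    return ts.strftime("%Y%m%d%H%M%S")


def snapshot_options() -> Dict[str, str]:
    """Options for a snapshot query; partitions are then narrowed with a normal filter."""
    return {"hoodie.datasource.query.type": "snapshot"}


def incremental_options(begin: datetime, end: Optional[datetime] = None) -> Dict[str, str]:
    """Options for an incremental query over commits after `begin` (optionally up to `end`)."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": to_instant(begin),
    }
    if end is not None:
        opts["hoodie.datasource.read.end.instanttime"] = to_instant(end)
    return opts


# Usage with PySpark (not executed here; `spark`, `base_path` and the `date`
# partition column are placeholders for your own session, table path and schema):
#   df = (spark.read.format("org.apache.hudi")
#         .options(**incremental_options(datetime(2020, 8, 13)))
#         .load(base_path))
#   df.filter("date = '2020-08-13'").show()
```

The partition restriction is expressed as an ordinary DataFrame filter on the partition column rather than as a Hudi option, which keeps the sketch limited to options that are certain to exist.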

On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash <sivaprakashshanmu...@gmail.com>
wrote:

> Hi,
>
>
> What is the design that can be used/implemented when we re-ingest the data
> without affecting incremental query?
>
>
>
>    - Is it possible to maintain a delta dataset across partitions (
>    hoodie.datasource.write.partitionpath.field) ? In my case it is a date.
>    - Can I do a snapshot query across all partitions and on specific partitions?
>    - Or, possible to control Hudi's commit time?
>
>
> Thanks
>
