Re: Recommendation to load HUDI data across partitions

2020-08-14 Thread tanu dua
Thanks, Vinoth, for the detailed explanation. I was about to reply that it
worked; I followed most of the steps that you mentioned below.
I used foreachBatch() on the stream to process the batch data from Kafka,
found the affected partitions using aggregate functions on the Kafka
Dataset, and then fed those partitions to Hudi as a glob pattern to get
hudiDs. Then I performed a join on both Datasets. I had some complex logic
to derive results from both kafkaDs and hudiDs and was hence using flatMap,
but I am now able to remove the flatMap and use Dataset joins instead.
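
Roughly, the flow looks like this (a sketch with placeholder names, the
"year"/"month" partition columns, the join key "id", and the table path are
all illustrative, not our exact schema):

    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PartitionPrunedJoin {
        // Load only the Hudi partitions touched by this Kafka batch.
        static Dataset<Row> loadAffectedPartitions(SparkSession spark,
                Dataset<Row> kafkaDf, String basePath) {
            // Collect only the small distinct set of (year, month) pairs
            // on the driver, not the full Kafka payload.
            List<String> paths = kafkaDf.select("year", "month").distinct()
                    .collectAsList().stream()
                    .map(r -> basePath + "/" + r.getAs("year")
                            + "/" + r.getAs("month"))
                    .collect(Collectors.toList());
            // Read each affected partition and union them, instead of
            // scanning the whole table with basePath + "/*/*".
            return paths.stream()
                    .map(p -> spark.read().format("org.apache.hudi").load(p))
                    .reduce(Dataset::union)
                    .orElseThrow(() -> new IllegalStateException("empty batch"));
        }
    }

    // Inside foreachBatch(), a plain Dataset join then replaces the
    // per-record flatMap:
    //   Dataset<Row> hudiDf =
    //       PartitionPrunedJoin.loadAffectedPartitions(spark, kafkaDf, basePath);
    //   Dataset<Row> resultDf = kafkaDf.join(hudiDf, "id");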

Thanks again for all your help, as always!




On Thu, Aug 13, 2020 at 1:42 PM Vinoth Chandar  wrote:

> Hi Tanuj,
>
> From this example, it appears as if you are trying to use sparkSession from
> within the executor? This will be problematic. Can you please open a
> support ticket with the full stack trace?
>
> I think what you are describing is a join between Kafka and Hudi tables. So
> I'd read from Kafka first, cache the 2K messages in memory, find out what
> partitions they belong to, and only load those affected partitions instead
> of the entire table.
> At this point, you will have two datasets: kafkaDF and hudiDF (or RDD or
> DataSet... my suggestion remains valid).
> And instead of hand-crafting the join at the record level, like you have,
> you can just use RDD/DataSet-level join operations and then get a resultDF.
>
> Then you do a resultDF.write.format("hudi") and you are done?
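>
> A minimal sketch of that final write (illustrative only: the table name,
> key field, and partition fields below are placeholders, your actual
> configs will differ):
>
>     // assumes the usual Spark SQL imports (Dataset, Row, SaveMode)
>     resultDF.write()
>         .format("org.apache.hudi")
>         .option("hoodie.table.name", "company_status")
>         .option("hoodie.datasource.write.recordkey.field", "id")
>         .option("hoodie.datasource.write.partitionpath.field", "year,month")
>         // a multi-field partition path needs the complex key generator
>         .option("hoodie.datasource.write.keygenerator.class",
>                 "org.apache.hudi.keygen.ComplexKeyGenerator")
>         .option("hoodie.datasource.write.operation", "upsert")
>         .mode(SaveMode.Append)
>         .save(basePath);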
>
> On Tue, Aug 11, 2020 at 2:33 AM Tanuj  wrote:
>
> > Hi,
> > I have a problem statement where I am consuming messages from Kafka, and
> > then, depending on those Kafka messages (2K records), I need to query the
> > Hudi table, create a dataset (with both updates and inserts), and push
> > them back to the Hudi table.
> >
> > I tried the following, but it threw a NullPointerException from the
> > sparkSession Scala code, and rightly so, as sparkSession was used in the
> > executor.
> >
> >  Dataset<CompanyStatusGoldenRecord> hudiDs = companyStatusDf.flatMap(new
> > FlatMapFunction<KafkaRecord, CompanyStatusGoldenRecord>() {
> >     @Override
> >     public Iterator<CompanyStatusGoldenRecord> call(KafkaRecord kafkaRecord)
> >             throws Exception {
> >         String prop1 = kafkaRecord.getProp1();
> >         String prop2 = kafkaRecord.getProp2();
> >         // NPE here: sparkSession only exists on the driver and cannot
> >         // be used inside a flatMap running on executors
> >         HudiRecord hudiRecord = sparkSession.read()
> >                 .format(HUDI_DATASOURCE)
> >                 .schema(/* schema */)
> >                 .load(/* path */)
> >                 .as(Encoders.bean(HudiRecord.class))
> >                 .filter(/* say, a predicate on prop1 */);
> >         hudiRecord = transform(hudiRecord);
> >         // modification of the Hudi record
> >         return Arrays.asList(kafkaRecord, hudiRecord).iterator();
> >     }
> > }, Encoders.bean(CompanyStatusGoldenRecord.class));
> >
> > In Hudi, I have 2 levels of partitions (year and month). So, for example,
> > if I get 2K records from Kafka spanning multiple partitions, what is
> > advisable: load the full table first, like "/*/*/*", or first read the
> > Kafka records, find out which partitions need to be hit, and then load
> > only those Hudi partitions? I believe the 2nd option would be faster,
> > i.e. loading the specific partitions, and that's what I was trying in the
> > above snippet of code. So, to leverage partitions, is a collect() on the
> > Kafka Dataset to get the list of partitions, and then supplying it to
> > Hudi, the only option, or can I do it just with Spark Datasets?
> >
>


Re: [DISCUSS] Release 0.6.0 timelines

2020-08-14 Thread Vinoth Chandar
Thanks, Sudha! This means master is now open for regular PRs. Thanks for
your patience, everyone.

On Fri, Aug 14, 2020 at 3:51 PM Bhavani Sudha 
wrote:

> Hello all,
>
> We have cut the release branch -
> https://github.com/apache/hudi/tree/release-0.6.0 . Since it is already
> Friday, we will be sending the release candidate early next week (after
> some testing).
>
> Happy Friday!
>
> Thanks,
> Sudha
>
> On Wed, Aug 12, 2020 at 3:56 PM vbal...@apache.org 
> wrote:
>
> >
> > Hi Folks,
> > We are continuing to work on CI stabilization and will cut the release
> > once we stabilize the builds hopefully tonight/tomorrow.
> > Thanks, Balaji.V
> > On Tuesday, August 11, 2020, 09:15:05 PM PDT, Vinoth Chandar <
> > vin...@apache.org> wrote:
> >
> >  Hello all,
> >
> > Update on this. We have landed most of the blockers for the 0.6.0 release
> > and I am currently working on the last major blocker, HUDI-1013.
> > We are working through some unexpected CI flakiness. We hope to stabilize
> > master, cut the RC, and then open up master for regular PR merges.
> > ETA for this is tomorrow night (Aug 12, PST).
> >
> > We will keep this thread posted!
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Aug 4, 2020 at 9:47 PM Vinoth Chandar  wrote:
> >
> > > Small correction:
> > >
> > > >> Vinoth working on code review, tests for PR 1876,
> > > This has landed!
> > >
> > >
> > > On Tue, Aug 4, 2020 at 9:44 PM Bhavani Sudha 
> > > wrote:
> > >
> > >> Hello all,
> > >>
> > >> We are targeting the end of this week to cut the RC. Here is an update
> > >> on where we are with the release blockers.
> > >>
> > >> 0.6.0 Release blocker status (board:
> > >> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=397&projectKey=HUDI&view=detail&selectedIssue=HUDI-69
> > >> ):
> > >>
> > >>    - Spark Datasource/MOR https://github.com/apache/hudi/pull/1848 needs
> > >>      to be tested by gary/balaji (About to land)
> > >>    - Hive Sync restructuring (Review done, about to land)
> > >>    - Bootstrap
> > >>      - Vinoth working on code review, tests for PR 1876
> > >>      - then Udit will rework PR 1702 (In code review)
> > >>      - then we will review and land PRs 1870, 1869
> > >>    - Bulk insert V2, PR 1834: lower risk, independent PR, well tested
> > >>      already
> > >>      - Dependent PR 1149 to be landed
> > >>      - and modes to be respected in the V2 impl as well (At risk)
> > >>    - Upgrade/Downgrade hooks, PR 1858 (In code review)
> > >>    - HUDI-1054: Marker list perf improvement, Udit has a PR out
> > >>    - HUDI-115: Overwrite with... ordering issue, Sudha has a PR nearing
> > >>      landing
> > >>    - HUDI-1098: Marker file issue with non-existent files (In code
> > >>      review)
> > >>    - Spark Streaming + Async Compaction: tests complete; address code
> > >>      review comments and land PR 1752 (About to land)
> > >>    - Spark DataSource/Hive MOR incremental query, HUDI-920 (At risk)
> > >>    - Flink/multi-engine refactor: will need a large rebase and rework,
> > >>      review, land (At risk for 0.6.0)
> > >>    - BloomIndex V2: global index implementation (At risk)
> > >>    - HUDI-845: Parallel writing, i.e. allow multiple writers (Pushed out
> > >>      of 0.6.0)
> > >>    - HUDI-860: Small file handling without memory caching (Pushed out of
> > >>      0.6.0)
> > >>
> > >>
> > >> Thanks,
> > >> Sudha
> > >>
> > >> On Mon, Aug 3, 2020 at 3:41 PM Vinoth Chandar 
> > wrote:
> > >>
> > >> > +1 (we need to formalize this well)
> > >> > But having just the blockers land first would help not just with
> > >> > rebasing, but also with winding down towards cutting an RC by end of
> > >> > week.
> > >> >
> > >> >
> > >> > On Mon, Aug 3, 2020 at 2:53 PM Bhavani Sudha <
> bhavanisud...@gmail.com
> > >
> > >> > wrote:
> > >> >
> > >> > > Hello all,
> > >> > >
> > >> > > As we are all hustling towards getting the blockers in, I wanted to
> > >> > > propose a code/merge freeze until we cut a release for 0.6.0,
> > >> > > restricted to only merging blockers identified for this release. It
> > >> > > would reduce rebasing time for blockers in progress. If we feel some
> > >> > > issue is a serious blocker, we can discuss it here and bump its
> > >> > > priority.
> > >> > >
> > >> > > Please share your thoughts or concerns.
> > >> > >
> > >> > > Thanks,
> > >> > > Sudha
> > >> > >
> > >> > >
> > >> > > On Mon, Aug 3, 2020 at 8:19 AM Vinoth Chandar 
> > >> wrote:
> > >> > >
> > >> > > > Given enough time has passed, Sudha can be our RM for 0.6.0.
> > >> > > >
> > >> > > > On the release blocker progress, we landed a few blockers over the
> > >> > > > weekend, with some almost ready for landing.
> > >> > > >
> > >> > > > Will send out a status update again tomorrow night PST!
> > >> > > >
> > >> > > > On Mon, Aug 3, 2020 at 8:17 AM Vinoth Chandar <
> vin...@apache.org>
> > >> > wrote:
> > >> > > >
> > >> > > > > Hi Anton.
> > >> > > > >
> > >> > > > > We were

Re: [DISCUSS] Release 0.6.0 timelines

2020-08-14 Thread Bhavani Sudha
Hello all,

We have cut the release branch -
https://github.com/apache/hudi/tree/release-0.6.0 . Since it is already
Friday, we will be sending the release candidate early next week (after
some testing).

Happy Friday!

Thanks,
Sudha

On Wed, Aug 12, 2020 at 3:56 PM vbal...@apache.org 
wrote:

>
> Hi Folks,
> We are continuing to work on CI stabilization and will cut the release
> once we stabilize the builds hopefully tonight/tomorrow.
> Thanks, Balaji.V
> On Tuesday, August 11, 2020, 09:15:05 PM PDT, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hello all,
>
> Update on this. We have landed most of the blockers for the 0.6.0 release
> and I am currently working on the last major blocker, HUDI-1013.
> We are working through some unexpected CI flakiness. We hope to stabilize
> master, cut the RC, and then open up master for regular PR merges.
> ETA for this is tomorrow night (Aug 12, PST).
>
> We will keep this thread posted!
>
> Thanks
> Vinoth
>
> On Tue, Aug 4, 2020 at 9:47 PM Vinoth Chandar  wrote:
>
> > Small correction:
> >
> > >> Vinoth working on code review, tests for PR 1876,
> > This has landed!
> >
> >
> > On Tue, Aug 4, 2020 at 9:44 PM Bhavani Sudha 
> > wrote:
> >
> >> Hello all,
> >>
> >> We are targeting the end of this week to cut the RC. Here is an update
> >> on where we are with the release blockers.
> >>
> >> 0.6.0 Release blocker status (board:
> >> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=397&projectKey=HUDI&view=detail&selectedIssue=HUDI-69
> >> ):
> >>
> >>    - Spark Datasource/MOR https://github.com/apache/hudi/pull/1848 needs
> >>      to be tested by gary/balaji (About to land)
> >>    - Hive Sync restructuring (Review done, about to land)
> >>    - Bootstrap
> >>      - Vinoth working on code review, tests for PR 1876
> >>      - then Udit will rework PR 1702 (In code review)
> >>      - then we will review and land PRs 1870, 1869
> >>    - Bulk insert V2, PR 1834: lower risk, independent PR, well tested
> >>      already
> >>      - Dependent PR 1149 to be landed
> >>      - and modes to be respected in the V2 impl as well (At risk)
> >>    - Upgrade/Downgrade hooks, PR 1858 (In code review)
> >>    - HUDI-1054: Marker list perf improvement, Udit has a PR out
> >>    - HUDI-115: Overwrite with... ordering issue, Sudha has a PR nearing
> >>      landing
> >>    - HUDI-1098: Marker file issue with non-existent files (In code
> >>      review)
> >>    - Spark Streaming + Async Compaction: tests complete; address code
> >>      review comments and land PR 1752 (About to land)
> >>    - Spark DataSource/Hive MOR incremental query, HUDI-920 (At risk)
> >>    - Flink/multi-engine refactor: will need a large rebase and rework,
> >>      review, land (At risk for 0.6.0)
> >>    - BloomIndex V2: global index implementation (At risk)
> >>    - HUDI-845: Parallel writing, i.e. allow multiple writers (Pushed out
> >>      of 0.6.0)
> >>    - HUDI-860: Small file handling without memory caching (Pushed out of
> >>      0.6.0)
> >>
> >>
> >> Thanks,
> >> Sudha
> >>
> >> On Mon, Aug 3, 2020 at 3:41 PM Vinoth Chandar 
> wrote:
> >>
> >> > +1 (we need to formalize this well)
> >> > But having just the blockers land first would help not just with
> >> > rebasing, but also with winding down towards cutting an RC by end of
> >> > week.
> >> >
> >> >
> >> > On Mon, Aug 3, 2020 at 2:53 PM Bhavani Sudha  >
> >> > wrote:
> >> >
> >> > > Hello all,
> >> > >
> >> > > As we are all hustling towards getting the blockers in, I wanted to
> >> > > propose a code/merge freeze until we cut a release for 0.6.0,
> >> > > restricted to only merging blockers identified for this release. It
> >> > > would reduce rebasing time for blockers in progress. If we feel some
> >> > > issue is a serious blocker, we can discuss it here and bump its
> >> > > priority.
> >> > >
> >> > > Please share your thoughts or concerns.
> >> > >
> >> > > Thanks,
> >> > > Sudha
> >> > >
> >> > >
> >> > > On Mon, Aug 3, 2020 at 8:19 AM Vinoth Chandar 
> >> wrote:
> >> > >
> >> > > > Given enough time has passed, Sudha can be our RM for 0.6.0.
> >> > > >
> >> > > > On the release blocker progress, we landed a few blockers over the
> >> > > > weekend, with some almost ready for landing.
> >> > > >
> >> > > > Will send out a status update again tomorrow night PST!
> >> > > >
> >> > > > On Mon, Aug 3, 2020 at 8:17 AM Vinoth Chandar 
> >> > wrote:
> >> > > >
> >> > > > > Hi Anton.
> >> > > > >
> >> > > > > We were hoping to cut a release by last weekend. The new target is
> >> > > > > this weekend! (tbh, we were thrown off a bit due to COVID in Q2,
> >> > > > > given a lot of PMC members/committers had additional kid-care
> >> > > > > duties. Now we are back to normal cadence.)
> >> > > > >
> >> > > > > Going forward, I plan to start a discussion around planning,
> >> > > > > prioritizing, and other release processes after 0.6.0. It would be
> >> > > > > great to have the community

Re: Incremental query on partition column

2020-08-14 Thread David Rosalia
Hello,

I am Siva's colleague and I am working on the problem below as well.

I would like to describe what we are trying to achieve with Hudi, as well as
our current way of working and our GDPR and "Right To Be Forgotten"
compliance policies.

Our requirements:
- We wish to apply a strict interpretation of the RTBF. In other words, when
we remove a person's data, it should be removed throughout the historical
data and not just from the latest snapshot.
- We wish to use Hudi to reduce our storage requirements using upserts, and
we don't want duplicates between commits.
- We wish to retain history for persons who have not requested to be
forgotten, and therefore we do not want to delete commit files from the
history, as some have proposed.

We have tried a couple of solutions, but so far without success:
- Replay the data, omitting the data of the persons who have requested to be
forgotten. We wanted to manipulate the commit times to rebuild the history,
but found that we couldn't manipulate the commit times and retain the
history.

- Replay the data, omitting the data of the persons who have requested to be
forgotten, but writing to date-based partition folders using the
"partitionpath" parameter.
We found that upsert commits across the partitionpath folders do not ignore
data that is unchanged between two commit dates, as they do when using the
default commit file layout, so we will not save on storage or speed up our
processing using this technique.
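
For reference, the replay write in that second attempt looked roughly like
this (a sketch; the table and field names below are placeholders, not our
actual schema):

    // assumes the usual Spark SQL imports (Dataset, Row, SaveMode)
    filteredHistoryDf            // full history minus the forgotten persons
        .write()
        .format("org.apache.hudi")
        .option("hoodie.table.name", "person_history")
        .option("hoodie.datasource.write.recordkey.field", "personId")
        // date-based partition folders, as described above
        .option("hoodie.datasource.write.partitionpath.field", "commitDate")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(SaveMode.Append)
        .save(basePath);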

So, basically, we would like to find a way with Hudi to apply a strict RTBF
under the GDPR, maintain history and time travel (over a large history), and
save storage space.

Can anyone see a way to achieve this?

Kind Regards,
David Rosalia




From: Vinoth Chandar 
Sent: Friday, August 14, 2020 8:26:22 AM
To: dev@hudi.apache.org 
Subject: Re: Incremental query on partition column

Hi,

On re-ingesting, do you mean to say you want to overwrite the table, while
not getting the changes in the incremental query?  This has not come up
before.
As you can imagine, it'd be a tricky scenario, where we'd need some special
handling/action type introduced.

Yes and yes on the next two questions.
Commit time can be controlled if using the HoodieWriteClient API, but not
via the datasource/deltastreamer atm.
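
Roughly, both look like this (a sketch; the option keys are the standard
Hudi datasource configs, while the instant times and paths below are
made-up values):

    // incremental query: records changed after a given instant time
    Dataset<Row> incDf = spark.read()
        .format("org.apache.hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20200801000000")
        .load(basePath);

    // commit time control is only exposed on the write client, e.g.:
    //   writeClient.startCommitWithTime("20200814101530"); // caller-chosen
    //   writeClient.upsert(recordsRDD, "20200814101530");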

On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash 
wrote:

> Hi,
>
>
> What design can be used/implemented so that we can re-ingest the data
> without affecting the incremental query?
>
>
>
>    - Is it possible to maintain a delta dataset across partitions
>    (hoodie.datasource.write.partitionpath.field)? In my case, it is a date.
>    - Can I do a snapshot query across all partitions and on specific
>    partitions?
>    - Or, is it possible to control Hudi's commit time?
>
>
> Thanks
>