Re: [VOTE] Move content off cWiki

2021-07-19 Thread vbal...@apache.org
 
+1 - Approve the move


On Monday, July 19, 2021, 04:04:39 PM PDT, Prashant Wason 
 wrote:  
 
 +1 - Approve the move

On Mon, Jul 19, 2021 at 3:44 PM Vinoth Chandar  wrote:

> Hi all,
>
> Starting a vote based on the DISCUSS thread here [1], to consolidate
> content from cWiki into the GitHub wiki and the project's master branch (for
> design docs).
>
> Please chime in with a
>
> +1 - Approve the move
> -1  - Disapprove the move (please state your reasoning)
>
> The vote will use lazy consensus, needing three +1s to pass, remaining open
> for 72 hours.
>
> Thanks
> Vinoth
>
> [1]
>
> https://lists.apache.org/thread.html/rb0a96bc10788c9635cc1a35ade7d5d42997a5c9591a5ec5d5a99adf0%40%3Cdev.hudi.apache.org%3E
>
  

Re: Welcome our PMC Member, Raymond Xu

2021-07-18 Thread vbal...@apache.org
 Hearty Congratulations Raymond !! 

Balaji.V

On Sunday, July 18, 2021, 12:48:25 AM PDT, Dianjin Wang 
 wrote:  
 
 Congratulations!

Best,
Dianjin Wang


On Sat, Jul 17, 2021 at 8:28 AM Vinoth Chandar  wrote:

> Folks,
>
> I am incredibly happy to share the addition of Raymond Xu to the Hudi PMC.
> Raymond has been a valuable member of our community over the past few years,
> always hustling and taking on the most underappreciated but extremely
> valuable aspects of the project, most recently getting our tests working
> smoothly on Azure CI!
>
> Please join me in congratulating Raymond!
>
> Onwards,
> Vinoth
>
  

Re: Welcome New Committers: Pengzhiwei and DannyChan

2021-07-16 Thread vbal...@apache.org
 Many Congratulations to both of you !! Great contributions. Well deserved !!
Balaji.V
On Friday, July 16, 2021, 03:08:44 PM PDT, Bhavani Sudha 
 wrote:  
 
 Big congratulations to both of you. Very well deserved!

Cheers,
Sudha

On Fri, Jul 16, 2021 at 8:56 AM Sivabalan  wrote:

> Not to hijack the limelight from *Pengzhiwei *and* DannyChan.* btw, Big
> Kudos to the Chinese community at large. Great adoption and good going :)
> Really excited for the future of Hudi across the globe ! :) btw, fyi, I
> don't get to see the image you attached leesf.
>
>
> On Fri, Jul 16, 2021 at 11:29 AM Sivabalan  wrote:
>
> > Congrats guys! Well deserved.
> >
> > On Fri, Jul 16, 2021 at 9:12 AM Gary Li  wrote:
> >
> >> Congrats Zhiwei and Danny! It's awesome to work with you guys.
> >>
> >> Best,
> >> Gary
> >>
> >>
> >> On Fri, Jul 16, 2021 at 7:55 PM wangxianghu  wrote:
> >>
> >> > Congratulations! Well deserved!
> >> >
> >> > > On Jul 16, 2021, at 18:52, vino yang  wrote:
> >> > >
> >> > > Congratulations
> >> >
> >> >
> >>
> > --
> > Regards,
> > -Sivabalan
> >
>
>
> --
> Regards,
> -Sivabalan
>
  

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-02 Thread vbal...@apache.org
 +1 for both A and B. Makes sense to centralize bug tracking and RFCs in GitHub.
Balaji.V 


On Friday, July 2, 2021, 06:44:06 PM PDT, Vinoth Chandar 
 wrote:  
 
 Raymond - +1 on your thoughts.

Once we have more voices and alignment, we can do one final RFC on cWiki
covering everything.

Can more people please chime in? Ideally we will put this to a VOTE.

On Fri, Jul 2, 2021 at 12:54 PM Raymond Xu 
wrote:

> +1 for both A and B
>
> Also a related suggestion:
> we can put the release notes and new feature highlights in the release
> notes section in GitHub releases instead of separately writing them in the
> asf-site
>
>
> On Fri, Jul 2, 2021 at 11:25 AM Prashant Wason 
> wrote:
>
> > +1 for complete Github migration. JIRA is too cumbersome and painful to
> > use.
> >
> > GitHub PRs and wiki also improve visibility of the project, and I think may
> > increase community feedback and participation, as it's simpler to use.
> >
> > Prashant
> >
> >
> > On Thu, Jul 1, 2021 at 8:41 PM Vinoth Chandar  wrote:
> >
> > > Hi all,
> > >
> > > When we incubated Hudi, we made some initial choices around
> > > collaboration tools of choice. I am wondering if they are still optimal,
> > > given the scale of the community at this point.
> > >
> > > Specifically, two points.
> > >
> > > A) Our issue tracker is JIRA, while we just use GitHub Issues for support
> > > triage. While JIRA is pretty advanced and gives us the ability to track
> > > releases, versions and kanban boards, there are a few practical
> > > operational problems.
> > >
> > > - Developers often open bug fixes/PRs, which all need to be continuously
> > > tagged against a release version (fix version).
> > > - Referencing JIRAs from pull requests is not great (we cannot do things
> > > like `fixes #1234` to close issues when a PR lands, and there is no easy
> > > way to click and get to the JIRA).
> > > - Many more developers have a GitHub account; to contribute to Hudi,
> > > though, they need an additional sign-up on JIRA.
> > >
> > > So I am wondering if we should just use one thing, GitHub Issues, and
> > > build scripts/hubot or something to get the missing project management
> > > features from boards.
> > >
> > > B) Our design docs are on cWiki. Even though we link to them off the
> > > site, from my experience, many do not discover them.
> > > For large PRs, we need to manually enforce that design and code are in
> > > sync before we land. If we can, I would love to make the RFC being in good
> > > shape a prerequisite for landing the PR.
> > > Once again, a separate signup is needed to write design docs or comment on
> > > them.
> > >
> > > So I am wondering if we can move our process docs etc. into the GitHub
> > > wiki and RFCs to the master branch in an rfc folder, and just use GitHub
> > > PRs to raise RFCs and discuss them.
> > >
> > > This all also makes it easy for us to measure community activity and
> > > keep streamlining our processes.
> > >
> > > Personally, these different channels are overwhelming, to me at least :)
> > >
> > > Love to hear thoughts. Please specify if you are for/against each of A
> > > and B.
> > >
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>
  

Re: Welcome new committers and PMC Members!

2021-05-11 Thread vbal...@apache.org
 Many Congratulations Gary Li and Wenning Ding. Well deserved !!
Balaji.V
On Tuesday, May 11, 2021, 01:06:47 PM PDT, Bhavani Sudha 
 wrote:  
 
 Congratulations @Gary Li and @Wenning Ding!
On Tue, May 11, 2021 at 12:42 PM Vinoth Chandar  wrote:

Hello all,

Please join me in congratulating our newest set of committers and PMC members.

Wenning Ding (Committer): Wenning has been a consistent contributor to Hudi
over the past year or so. He has added some critical bug fixes and lots of good
contributions around Spark!

Gary Li (PMC Member): Gary is a regular feature on all our support channels. He
has contributed numerous features to Hudi, and evangelized across many
companies including Bosch/Bytedance. Most of all, he is a solid team player and
an asset to the project.

Thanks so much for your continued contributions, to make Hudi better and
better!

Thanks,
Vinoth

  

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread vbal...@apache.org
 +1. The rewording makes total sense.
Balaji.V
On Tuesday, April 13, 2021, 07:45:16 AM PDT, Gary Li  
wrote:  
 
 Awesome summary of Hudi! +1 as well. 

Gary Li
On 2021/04/13 14:13:24, Rubens Rodrigues  wrote: 
> Excellent, I agree
> 
> Em ter, 13 de abr de 2021 07:23, vino yang  escreveu:
> 
> > +1 Excited by this new vision!
> >
> > Best,
> > Vino
> >
> > Dianjin Wang  wrote on Tue, Apr 13, 2021 at 3:53 PM:
> >
> > > +1  The new brand is straightforward, a better description of Hudi.
> > >
> > > Best,
> > > Dianjin Wang
> > >
> > >
> > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha 
> > > wrote:
> > >
> > > > +1 . Cannot agree more. I think this makes total sense and will provide
> > > for
> > > > a much better representation of the project.
> > > >
> > > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Reading one more article today positioning Hudi as just a table
> > > > > format made me wonder if we have done enough justice in explaining
> > > > > what we have built together here.
> > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > > following components, of which one is a table format and one is a
> > > > > transactional storage layer.
> > > > > But the whole stack we have is definitely worth more than the sum of
> > > > > all the parts IMO (speaking from my own experience from the past 10+
> > > > > years of open source software dev).
> > > > >
> > > > > Here's what we have built so far.
> > > > >
> > > > > a) *table format* : something that stores table schema, a metadata
> > > table
> > > > > that stores file listing today, and being extended to store column
> > > ranges
> > > > > and more in the future (RFC-27)
> > > > > b) *aux metadata* : bloom filters, external record level indexes
> > today,
> > > > > bitmaps/interval trees and other advanced on-disk data structures
> > > > tomorrow
> > > > > c) *concurrency control* : we always supported MVCC based log based
> > > > > concurrency (serialize writes into a time ordered log), and we now
> > also
> > > > > have OCC for batch merge workloads with 0.8.0. We will have
> > multi-table
> > > > and
> > > > > fully non-blocking writers soon (see future work section of RFC-22)
> > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> > Hudi,
> > > > but
> > > > > we support primary/unique key constraints and we could add foreign
> > keys
> > > > as
> > > > > an extension, once our transactions can span tables.
> > > > > e) *table services*: a hudi pipeline today is self-managing: it sizes
> > > > > files, cleans, compacts, clusters data, and bootstraps existing data,
> > > > > with all these actions working off each other without blocking one
> > > > > another (for the most part).
> > > > > f) *data services*: we also have higher level functionality with
> > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > > callbacks, pre-commit validations are coming, error tables have been
> > > > > proposed. I could also envision us building towards streaming egress,
> > > > data
> > > > > monitoring.
> > > > >
> > > > > I also think we should build the following (subject to separate
> > > > > DISCUSS/RFCs)
> > > > >
> > > > > g) *caching service*: Hudi specific caching service that can hold
> > > mutable
> > > > > data and serve oft-queried data across engines.
> > > > > h) *timeline metaserver:* We already run a metaserver in spark
> > > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > > > turn
> > > > > it into a scalable, sharded metastore, that all engines can use to
> > > obtain
> > > > > any metadata.
> > > > >
> > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > opposed
> > > to
> > > > > "ingests & manages storage of large analytical datasets over DFS
> > (hdfs
> > > or
> > > > > cloud stores)." and convey the scope of our vision,
> > > > > given we have already been building towards that. It would also
> > provide
> > > > new
> > > > > contributors a good lens to look at the project from.
> > > > >
> > > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > > > > system to an event streaming platform, with the addition of
> > > > > MirrorMaker/Connect etc.)
> > > > >
> > > > > Please share your thoughts!
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
> 
  

Re: Request to join Project Committer Group

2021-04-01 Thread vbal...@apache.org
 
Welcome Harshit!! I have given you contributor access to the Apache Hudi JIRA.

Balaji.V

On Wednesday, March 31, 2021, 03:02:45 AM PDT, Danny Chan 
 wrote:  
 
 cc @vinoth

Best,
Danny Chan

harshit mittal  wrote on Wed, Mar 31, 2021 at 3:18 PM:

> Hi,
> I'd like to be added to the project committer group. Could somebody help me
> with this request? (jiraId: hmittal83, cwiki userId: hmittal83)
> --
> Best,
> Harshit
>
  

Re: [VOTE] Release 0.8.0, release candidate #1

2021-03-31 Thread vbal...@apache.org
 +1 binding

Compilation succeeded. Release validation succeeded.

```
balaji-varadarajan--C02CV6A6MD6R:scripts balaji.varadarajan$ ./release/validate_staged_release.sh --release=0.8.0 --rc_num=1
/tmp/validation_scratch_dir_001 ~/code/oss/upstream_hudi/scripts
Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
Validating hudi-0.8.0-rc1 with release type "dev"
Checking Checksum of Source Release
Checksum Check of Source Release - [OK]
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38466  100 38466    0     0  91585      0 --:--:-- --:--:-- --:--:-- 91368
Checking Signature
Signature Check - [OK]
Checking for binary files in source release
No Binary Files in Source Release? - [OK]
Checking for DISCLAIMER
DISCLAIMER file exists ? [OK]
Checking for LICENSE and NOTICE
License file exists ? [OK]
Notice file exists ? [OK]
Performing custom Licensing Check
Licensing Check Passed [OK]
Running RAT Check
RAT Check Passed [OK]
~/code/oss/upstream_hudi/scripts
```
On Wednesday, March 31, 2021, 01:13:55 PM PDT, nishith agarwal 
 wrote:  
 
 +1 binding

1. Compilation [OK]
2. Quick start (Spark 2.x, 3.x) [OK]
3. Signature [OK]

Thanks,
Nishith

On Wed, Mar 31, 2021 at 8:35 AM vino yang  wrote:

> +1 binding
>
> - ran `mvn clean package -DskipTests` [OK]
> - quick start (Spark 2.x, 3.x) [OK]
> - checked signature [OK]
>
> Best,
> Vino
>
>
> Sivabalan  wrote on Wed, Mar 31, 2021 at 12:32 PM:
>
> > +1 binding
> >
> > - Compilation Ok
> > - Quick start utils w/ spark3 Ok
> > - checksum Ok
> > - release validation script Ok
> > - Ran hudi test suite jobs. {COW, MOR} * {regular, metadata_enabled} 50
> > iterations w/ validating cleaning and archival. Ok
> >
> > ---
> > Checksum
> > shasum -a 512 hudi-0.8.0-rc1.src.tgz > sha512
> > diff sha512 hudi-0.8.0-rc1.src.tgz.sha512
> >
> > gpg --verify hudi-0.8.0-rc1.src.tgz.asc
> > gpg: assuming signed data in 'hudi-0.8.0-rc1.src.tgz'
> > gpg: Signature made Mon Mar 29 10:58:46 2021 EDT
> > gpg:                using RSA key
> E2A9714E0FBA3A087BDEE655E72873D765D6C406
> > gpg: Good signature from "YanJia Li " [unknown]
> > gpg: WARNING: This key is not certified with a trusted signature!
> > gpg:          There is no indication that the signature belongs to the
> > owner.
> > Primary key fingerprint: E2A9 714E 0FBA 3A08 7BDE  E655 E728 73D7 65D6
> C406
> >
> > Validation script:
> > ./release/validate_staged_release.sh --release=0.8.0 --rc_num=1
> > --release_type=dev
> > /tmp/validation_scratch_dir_001
> > ~/Documents/personal/projects/siva_hudi/temp_hudi/hudi-0.8.0-rc1/scripts
> > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > Validating hudi-0.8.0-rc1 with release type "dev"
> > Checking Checksum of Source Release
> > Checksum Check of Source Release - [OK]
> >
> >  % Total    % Received % Xferd  Average Speed  Time    Time    Time
> >  Current
> >                                  Dload  Upload  Total  Spent    Left
> >  Speed
> > 100 38466  100 38466    0    0  171k      0 --:--:-- --:--:-- --:--:--
> >  171k
> > Checking Signature
> > Signature Check - [OK]
> >
> > Checking for binary files in source release
> > No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > License file exists ? [OK]
> > Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > Licensing Check Passed [OK]
> >
> > Running RAT Check
> > RAT Check Passed [OK]
> >
> >
> >
> >
> >
> >
> > On Tue, Mar 30, 2021 at 3:48 PM Bhavani Sudha 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > - compile ok
> > > - quickstart ok
> > > - checksum ok
> > > - ran some ide tests - ok
> > > - release validation script - ok
> > > /tmp/validation_scratch_dir_001 ~/Downloads/hudi-0.8.0-rc1/scripts
> > > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > > Validating hudi-0.8.0-rc1 with release type "dev"
> > > Checking Checksum of Source Release
> > > Checksum Check of Source Release - [OK]
> > >
> > >  % Total    % Received % Xferd  Average Speed  Time    Time    Time
> > >  Current
> > >                                  Dload  Upload  Total  Spent    Left
> > >  Speed
> > > 100 38466  100 38466    0    0  77709      0 --:--:-- --:--:--
> --:--:--
> > > 77709
> > > Checking Signature
> > > Signature Check - [OK]
> > >
> > > Checking for binary files in source release
> > > No Binary Files in Source Release? - [OK]
> > >
> > > Checking for DISCLAIMER
> > > DISCLAIMER file exists ? [OK]
> > >
> > > Checking for LICENSE and NOTICE
> > > License file exists ? [OK]
> > > Notice file exists ? [OK]
> > >
> > > Performing custom Licensing Check
> > > Licensing Check Passed [OK]
> > >
> > > Running RAT Check
> > > RAT Check Passed [OK]
> > >
> > >
> > >
> > > On Mon, Mar 29, 2021 at 9:35 AM Gary Li 
> > 

Re: [VOTE] Release 0.7.0, release candidate #2

2021-01-23 Thread vbal...@apache.org
 
+1 (binding)

1. Ran release validation script successfully.
2. Build successful.
3. Quickstart succeeded.

```
Checking Checksum of Source Release
Checksum Check of Source Release - [OK]
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 34972  100 34972    0     0  88987      0 --:--:-- --:--:-- --:--:-- 88987
Checking Signature
Signature Check - [OK]
Checking for binary files in source release
No Binary Files in Source Release? - [OK]
Checking for DISCLAIMER
DISCLAIMER file exists ? [OK]
Checking for LICENSE and NOTICE
License file exists ? [OK]
Notice file exists ? [OK]
Performing custom Licensing Check
Licensing Check Passed [OK]
Running RAT Check
RAT Check Passed [OK]
~/code/oss/upstream_hudi/scripts
```

On Saturday, January 23, 2021, 05:55:10 AM PST, Sivabalan  wrote:
 
 Got it, I didn't do -1, but just wanted to remind you, so that you don't
miss it when you redo the steps again to promote the final one.

+1 binding.
But do ensure when you release, the staged repo (promoted candidate) has
only one set of artifacts and it's a new repo.


On Sat, Jan 23, 2021 at 2:03 AM nishith agarwal  wrote:

> +1 binding
>
> - Build Successful
> - Release validation script Successful
> - Quick start runs Successfully
>
> Checking Checksum of Source Release
> Checksum Check of Source Release - [OK]
>
>  % Total    % Received % Xferd  Average Speed  Time    Time    Time
>  Current
>                                  Dload  Upload  Total  Spent    Left
>  Speed
> 100 34972  100 34972    0    0  96076      0 --:--:-- --:--:-- --:--:--
> 96076
> Checking Signature
> Signature Check - [OK]
>
> Checking for binary files in source release
> No Binary Files in Source Release? - [OK]
>
> Checking for DISCLAIMER
> DISCLAIMER file exists ? [OK]
>
> Checking for LICENSE and NOTICE
> License file exists ? [OK]
> Notice file exists ? [OK]
>
> Performing custom Licensing Check
> Licensing Check Passed [OK]
>
> Running RAT Check
> RAT Check Passed [OK]
>
> Thanks,
> Nishith
>
> On Fri, Jan 22, 2021 at 9:28 PM Vinoth Chandar  wrote:
>
> > Thanks Siva! I am not sure if that's a required aspect for the binding
> > vote. It's a minor aspect that does not interfere with testing/validation
> > in any way. The actual release artifact needs to be rebuilt and repushed
> > anyway from a separate repo. Like I noted, I found the wiki instructions a
> > bit ambiguous, and I intend to make them clearer going forward so we can
> > avoid this in the future.
> >
> > I request everyone to consider this explanation, when casting your vote.
> >
> > Thanks
> > Vinoth
> >
> >
> > On Fri, Jan 22, 2021 at 8:35 PM Sivabalan  wrote:
> >
> > > - checksums and signatures [OK]
> > > - successfully built [OK]
> > > - ran quick start guide [OK]
> > > - Ran release validation guide [OK]
> > > - Ran test suite job w/ inserts, upserts, deletes and validation(spark
> > sql
> > > and hive). Also same job w/ metadata enabled as well [OK]
> > >
> > > - Artifacts in staging repo : should be in separate repo where only rc2
> > is
> > > present. Right now, I see both rc1 and rc2 are present in the same
> repo.
> > >
> > > Will add my binding vote once artifacts are fixed.
> > >
> > >
> > >
> > > On Fri, Jan 22, 2021 at 9:17 PM Udit Mehrotra 
> wrote:
> > >
> > > > +1
> > > > - Build successful
> > > > - Ran quickstart against S3
> > > > - Additional manual tests with MOR
> > > > - Additional manual testing with and without Metadata based listing
> > > enabled
> > > > - Release validation script successful
> > > >
> > > > Validating hudi-0.7.0-rc2 with release type "dev"
> > > > Checking Checksum of Source Release
> > > > -e Checksum Check of Source Release - [OK]
> > > >
> > > >  % Total    % Received % Xferd  Average Speed  Time    Time
>  Time
> > > >  Current
> > > >                                  Dload  Upload  Total  Spent
> Left
> > > >  Speed
> > > > 100 34972  100 34972    0    0  70937      0 --:--:-- --:--:--
> > --:--:--
> > > > 70793
> > > > Checking Signature
> > > > -e Signature Check - [OK]
> > > >
> > > > Checking for binary files in source release
> > > > -e No Binary Files in Source Release? - [OK]
> > > >
> > > > Checking for DISCLAIMER
> > > > -e DISCLAIMER file exists ? [OK]
> > > >
> > > > Checking for LICENSE and NOTICE
> > > > -e License file exists ? [OK]
> > > > -e Notice file exists ? [OK]
> > > >
> > > > Performing custom Licensing Check
> > > > -e Licensing Check Passed [OK]
> > > >
> > > > Running RAT Check
> > > > -e RAT Check Passed [OK]
> > > >
> > > > Thanks,
> > > > Udit
> > > >
> > > > On Fri, Jan 22, 2021 at 12:41 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Please review and vote on the release candidate #2 for the version
> > > 0.7.0,
> > > > > as follows:
> > > > >
> > > > > [ ] +1, Approve the release
> > > > >
> > > > > [ ] -1, Do not approve the release (please provide specific
> co

20201013 Weekly Sync Minutes

2020-10-13 Thread vbal...@apache.org
Please find the meeting notes below:
https://cwiki.apache.org/confluence/display/HUDI/20201013+Weekly+Sync+Minutes

Thanks,
Balaji.V

Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

2020-09-04 Thread vbal...@apache.org
 Regarding queries taking a hit, we need to distinguish queries which do not
need the _hoodie_record_key column vs. those that do. I would think the most
common case would be the ones that do not require _hoodie_record_key. This
is similar to how Hive query integration for bootstrapped tables is supported.
We should implement along similar lines to preserve the fast path.
At the end of the day, this is going to be an optional feature. For cases
where the storage overhead of _hoodie_record_key is higher, this feature would
help.

Thanks,
Balaji.V
On Wednesday, September 2, 2020, 05:07:49 PM PDT, Gary Li 
 wrote:  
 
 Yes, it works this way. IMO the performance of reading certain fields from the
row and then building the key from there is not as efficient as having a
separate key stored with the actual record. But if we only make it virtual in
the parquet, then this should be fine.

For parquet, the downside I could see is that we won't be able to use ColumnBatch
to fetch the key anymore. To get the key, we might need to scan the entire row.
But since most of the data is not being touched during the upsert, I could see
taking the storage benefit over the read performance.

From the reader side, are we trying to make _hoodie_record_key still exist in
the table view, or exclude it from the query side? If we want to still
include it, then we might need to support the virtual feature for multiple
query engines to get a consistent view.

No intent to stop this from making progress; just some questions that popped
into my head. I think my questions are all solvable. Happy to discuss more in
the RFC if we move forward :)

Best,
Gary

Gary Li

From: Vinoth Chandar 
Sent: Wednesday, September 2, 2020 11:07:23 PM
To: dev@hudi.apache.org 
Subject: Re: [DISCUSS] Support for `_hoodie_record_key` as a virtual column

Hi Gary,

It should not be that bad, right? We can just create a new implementation
of HoodieRecord where getKey() can dynamically fetch a field's value as the
recordKey.
Then the rest of the code should work as-is?
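For illustration, that dynamic-key idea could look roughly like the following
(a minimal, self-contained Scala sketch; all names here are hypothetical, not
Hudi's actual classes):

```
// Contrast a stored key (what the _hoodie_record_key meta column gives us
// today) with a "virtual" key resolved on demand from the record's fields.
final case class Row(fields: Map[String, String])

sealed trait RecordKeyResolver {
  def keyOf(row: Row): String
}

// Today: the key is materialized once and written alongside the record.
final case class StoredKey(value: String) extends RecordKeyResolver {
  def keyOf(row: Row): String = value
}

// Proposal: the key is computed from a source field whenever it is needed,
// so no extra column has to be persisted to storage.
final case class VirtualKey(sourceField: String) extends RecordKeyResolver {
  def keyOf(row: Row): String =
    row.fields.getOrElse(sourceField, sys.error(s"missing key field: $sourceField"))
}
```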

On Tue, Sep 1, 2020 at 5:22 PM Gary Li  wrote:

> Thanks for the proposal. I am a bit concerned about this feature.
> _hoodie_record_key is the primary key of the Hudi table, and we need this
> field for indexing, sorting, merging, etc. We use _hoodie_record_key very
> frequently, so I can see the overhead on both the write side and the read
> side. I am not sure if this is worth it. Making _hoodie_partition_path
> virtual makes more sense to me.
>
> On the implementation side, if we support this as an option, we need
> special handling in many places. For example, in the LogScanner,
> _hoodie_record_key is the key of the lookup HashMap, and this map could be
> spilled to disk as well. There should be a significant amount of work here,
> from my perspective.
>
> -1 for making _hoodie_record_key virtual.
> +1 for _hoodie_partition_path
>
> Best Regards,
> Gary Li
>
>
> On 8/22/20, 9:09 PM, "Sivabalan"  wrote:
>
>    Aah, yes. That’s right.
>
>    On Sat, Aug 22, 2020 at 2:43 AM Vinoth Chandar 
> wrote:
>
>    > All of the remaining meta fields compress very very nicely. They have
>    >
>    > almost no overhead.
>    >
>    >
>    >
>    > On Fri, Aug 21, 2020 at 12:00 PM Abhishek Modi  >
>    >
>    > wrote:
>    >
>    >
>    >
>    > > @sivabalan the current plan is to only add this for
> hoodie_record_key.
>    > But
>    >
>    > > I'm hoping to make the implementation general enough to add other
> columns
>    >
>    > > as well going forward :)
>    >
>    > >
>    >
>    > > On Fri, Aug 21, 2020 at 11:49 AM Sivabalan 
> wrote:
>    >
>    > >
>    >
>    > > > +1 for virtual record keys. Do you also propose to generalize
> this for
>    >
>    > > > partition path as well ?
>    >
>    > > >
>    >
>    > > >
>    >
>    > > > On Fri, Aug 21, 2020 at 4:20 AM Pratyaksh Sharma <
>    > pratyaks...@gmail.com>
>    >
>    > > > wrote:
>    >
>    > > >
>    >
>    > > > > This is a good option to have. :)
>    >
>    > > > >
>    >
>    > > > > On Thu, Aug 20, 2020 at 11:25 PM Vinoth Chandar <
> vin...@apache.org>
>    >
>    > > > wrote:
>    >
>    > > > >
>    >
>    > > > > > IIRC _hoodie_record_key was supposed to be this standardized key
>    > field.
>    >
>    > > :)
>    >
>    > > > > > Anyways, it's good to provide this option to the user.
>    >
>    > > > > > So +1 for. RFC/further discussion.
>    >
>    > > > > >
>    >
>    > > > > > To level set, I want to also share some of the benefits of
> having
>    > an
>    >
>    > > > > > explicit key column.
>    >
>    > > > > > a) if you build your data lake using a bunch of hudi tables,
> now
>    > you
>    >
>    > > > > have a
>    >
>    > > > > > standardized data model
>    >
>    > > > > > b) Even if your key generator changes, it does not affect the
>    >
>    > > existing
>    >
>    > > > > > data's keys. and updates will be matched correctly.
>    >
>    > > > > >
>    >
>    > > > > > On Thu, Aug 20, 2020 a

Re: HUDI-1232

2020-08-28 Thread vbal...@apache.org
 Hi Selvaraj,

We fixed the relevant perf issue in 0.6.0 ([HUDI-1144] Speedup spark read
queries by caching metaclient in HoodieROPathFilter (#1919)). Can you please
try 0.6.0?
Balaji.V
On Friday, August 28, 2020, 01:31:42 PM PDT, selvaraj periyasamy 
 wrote:  
 
 I have created this https://issues.apache.org/jira/browse/HUDI-1232 ticket
for tracking a couple of issues.

One of the concerns I have in my use cases is that I have a COW type table
named TRR. I see the below pasted logs rolling for all individual
partitions, even though my write is on only a couple of partitions, and it
takes up to 4 to 5 mins. I pasted only a few of them. I am wondering:
in the future, I will have 3 years' worth of data, and writing will be
very slow every time I write into only a couple of partitions.

20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@fed0a8b
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/01, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@285c67a9
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/02, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1,
ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of
type COPY_ON_WRITE from
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants
java.util.stream.ReferencePipeline$Head@2edd9c8
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for
partition :20200714/03, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView:
NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata
from base path:
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1
files under
hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS:
[hdfs://oprhqanameservice], Config:[Configuration: core-default.xml,
core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem:
[DFS[DFSClient[clientName

0.6.0 Bug : [HUDI-1230] Spark Data Source Batch Write on MOR table not shutting down

2020-08-26 Thread vbal...@apache.org
Dear Hudi Users,
We noticed an issue with 0.6.0 release and would like to notify you all.
https://issues.apache.org/jira/browse/HUDI-1230
This affects Spark DataSource batch writes on MOR tables only. This problem
will NOT be seen when inline compaction is enabled. Spark Structured Streaming
writes on MOR tables are also unaffected.

The implication of this bug is that spark-submit jobs running DataSource batch
writes on MOR tables will not shut down after the job is complete.

As a workaround, please set this redundant hoodie config in your spark-submit
job when running batch writes in Spark:
hoodie.datasource.compaction.async.enable=false
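For a Spark DataSource batch write, the workaround can be applied as a write
option, roughly as in the sketch below (the table name, key fields, and path
are illustrative, not from this advisory):

```
import org.apache.spark.sql.{DataFrame, SaveMode}

// Minimal sketch, assuming `df` holds the batch to write to a MOR table.
def writeMorBatch(df: DataFrame): Unit = {
  df.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "my_mor_table")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    // Workaround for HUDI-1230: disable async compaction for batch writes,
    // so the spark-submit job shuts down once the write completes.
    .option("hoodie.datasource.compaction.async.enable", "false")
    .mode(SaveMode.Append)
    .save("/tmp/hudi/my_mor_table")
}
```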

As mentioned above, if inline compaction is enabled, the above workaround is 
not needed at all.
Udit has already opened a PR to address this issue:
https://github.com/apache/hudi/pull/2046
Thanks,
Balaji.V






Re: [DISCUSS] Codestyle: force multiline indentation

2020-08-18 Thread vbal...@apache.org
 +1 on standardizing code formatting.

On Tuesday, August 18, 2020, 03:58:42 PM PDT, Vinoth Chandar  wrote:  
 
 Can more people please chime in? This will affect all of us on a daily
basis :)

On Thu, Aug 13, 2020 at 8:25 AM Gary Li  wrote:

> Vote for mvn spotless:apply to do the auto fix.
>
> On Thu, Aug 13, 2020 at 1:13 AM Vinoth Chandar  wrote:
>
> > Hi,
> >
> > Anyone has thoughts on this?
> >
> > esp leesf/vinoyang, given you both drove much of the initial cleanups.
> >
> > On Mon, Aug 10, 2020 at 7:16 PM Shiyan Xu 
> > wrote:
> >
> > > in that case, yes, all for automation.
> > >
> > > On Mon, Aug 10, 2020 at 7:12 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Overall, I think we should standardize this across the project.
> > > > But most importantly, may be revive the long dormant spotless effort
> > > first
> > > > to enable autofixing of checkstyle issues, before we add more
> checking?
> > > >
> > > > On Mon, Aug 10, 2020 at 7:04 PM Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I noticed that throughout the codebase, when method arguments wrap
> > to a
> > > > new
> > > > > line, there are cases where indentation is 4 and other cases align
> > the
> > > > > wrapped line to the previous line of argument.
> > > > >
> > > > > The latter is caused by intelliJ settings of "Align when multiline"
> > > > > enabled. This won't be flagged by checkstyle due to not setting
> > > > > *forceStrictCondition* to *true*
> > > > >
> > > > >
> > > >
> > >
> >
> https://checkstyle.sourceforge.io/config_misc.html#Indentation_Properties
> > > > >
> > > > > I'm suggesting setting this to true to avoid the discrepancy and
> > > > redundant
> > > > > diffs in PR caused by individual IDE settings. People who have set
> > > "Align
> > > > > when multiline" will need to disable it to pass the checkstyle
> > > > validation.
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Best,
> > > > > Raymond
> > > > >
> > > >
> > >
> >
>
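For reference, the two wrapping styles discussed above look like this (a small
Scala sketch for brevity, though the Hudi codebase is Java; all names are made
up):

```
object IndentDemo {
  val (firstOperand, secondOperand, thirdOperand, fourthOperand) = (1, 2, 3, 4)
  def computeTotal(a: Int, b: Int, c: Int, d: Int): Int = a + b + c + d

  // Continuation indent of 4: the style the stricter checkstyle setting keeps.
  def styleA: Int =
    computeTotal(firstOperand, secondOperand,
      thirdOperand, fourthOperand)

  // "Align when multiline": wrapped arguments aligned to the first argument;
  // this is what would be flagged once forceStrictCondition is set to true.
  def styleB: Int =
    computeTotal(firstOperand, secondOperand,
                 thirdOperand, fourthOperand)
}
```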
  

Re: [DISCUSS] Release 0.6.0 timelines

2020-08-12 Thread vbal...@apache.org
 > > > > PMC/Committers had additional kid care duties. Now we are back to
>> > > normal
>> > > > > cadence)
>> > > > >
>> > > > > Going forward, I plan to start a discussion around planning,
>> > > prioritizing
>> > > > > and other release processes after 0.6.0. Would be great to have
>> the
>> > > > > community weigh in even more in these things upfront.
>> > > > >
>> > > > > Thanks
>> > > > > Vinoth
>> > > > >
>> > > > > On Fri, Jul 31, 2020 at 6:49 PM Anton Zuyeu <
>> anton.zu...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > >> Hi All,
>> > > > >>
>> > > > >> I apologize for possibly dumb question but when was 0.6.0
>> planned to
>> > > be
>> > > > >> released? Can't find any dates on Hudi related pages.
>> > > > >>
>> > > > >> On Thu, Jul 30, 2020 at 10:36 AM Vinoth Chandar <
>> vin...@apache.org>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Is anyone able to help with the at risk items? :)
>> > > > >> >
>> > > > >> > On Thu, Jul 30, 2020 at 7:07 AM leesf 
>> > wrote:
>> > > > >> >
>> > > > >> > > @Vinoth Chandar  Thanks for the reminder,
>> > > marked
>> > > > >> to
>> > > > >> > > blocker, and next week would be ok to me.
>> > > > >> > >
>> > > > >> > > > Vinoth Chandar  wrote on Thu, Jul 30, 2020 at 11:35 AM:
>> > > > >> > >
>> > > > >> > > > @leesf   can we please mark the
>> relevant
>> > > > >> > ticket(s)
>> > > > >> > > > with blocker priority, so it's easier to track?
>> > > > >> > > >
>> > > > >> > > > Looks like we are nearing a choice for RM.
>> > > > >> > > > Any more thoughts on timelines? Looks like everyone so far
>> is
>> > > > >> leaning
>> > > > >> > > > towards completeness of the release over doing it sooner?
>> > > > >> > > >
>> > > > >> > > > On Wed, Jul 29, 2020 at 6:36 PM vino yang <
>> > > yanghua1...@gmail.com>
>> > > > >> > wrote:
>> > > > >> > > >
>> > > > >> > > > > +1 on Sudha being RM for the release. And looking
>> forward to
>> > > > >> 0.6.0.
>> > > > >> > > > >
>> > > > >> > > > > Best,
>> > > > >> > > > > Vino
>> > > > >> > > > >
>> > > > >> > > > > leesf  wrote on Thu, Jul 30, 2020 at 9:15 AM:
>> > > > >> > > > >
>> > > > >> > > > > > +1 on Sudha on being RM, and PR#1810
>> > > > >> > > > > > https://github.com/apache/hudi/pull/1810 (abstract
>> hive
>> > > sync
>> > > > >> > module)
>> > > > >> > > > > would
>> > > > >> > > > > > also goes to this release.
>> > > > >> > > > > >
>> > > > >> > > > > > Sivabalan  wrote on Thu, Jul 30, 2020 at 2:18 AM:
>> > > > >> > > > > >
>> > > > >> > > > > > > +1 on Sudha being RM for the release. Makes sense to
>> > push
>> > > > the
>> > > > >> > > release
>> > > > >> > > > > by
>> > > > >> > > > > > a
>> > > > >> > > > > > > week.
>> > > > >> > > > > > >
>> > > > >> > > > > > > On Wed, Jul 29, 2020 at 1:35 AM vbal...@apache.org <
>> > > > >> > > > vbal...@apache.org
>> > > > >> > > > > >
>> > > > >> > > > > > > wrote:
>> > > > >> > > > > > >
>> > > > >> > > > > > > >  +1 on Sudha o

Re: [DISCUSS] Release 0.6.0 timelines

2020-07-28 Thread vbal...@apache.org
 +1 on Sudha being RM for this release. Also agree on pushing the release
date by a week.
Balaji.V
On Tuesday, July 28, 2020, 10:08:41 PM PDT, Bhavani Sudha 
 wrote:  
 
 Thanks Vinoth for the update. I can volunteer to RM this release.

I understand the 0.6.0 release is delayed compared to what we originally
discussed. Q2 has been really hard with COVID and everything going on. Given
that we are at this point, if delaying the RC by a week or so more lets us get
some of the 'At risk' items in, I would vote for that. That is just my
personal opinion. I'll let others chime in.

Thanks,
Sudha

On Tue, Jul 28, 2020 at 9:48 PM Vinoth Chandar  wrote:

> Hello all,
>
> Just wanted to kickstart a thread to firm up the RC cut date for 0.6.0 and
> pick an RM (any volunteers? If not, I nominate myself).
>
> Here's an update on where we are at with the remaining release blockers. I
> have marked items as "At risk" assuming we cut the RC sometime next week.
> Please chime in with your thoughts. Ideally, we don't take any more
> blockers. If we also want to knock off the at-risk items, then we would
> push dates by at least another week (my guess).
>
> 0.6.0 Release blocker status (board
> <
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=397&projectKey=HUDI&view=detail&selectedIssue=HUDI-69
> >)
> ,
>
>    - Spark Datasource/MOR https://github.com/apache/hudi/pull/1848 needs
> to
>    be tested by gary/balaji
>    - Bootstrap
>      - Vinoth working on code review, tests for PR 1876,
>      - then udit will rework PR 1702
>      - then we will review, land PR 1870, 1869
>      - Also need to fix HUDI-999, HUDI-1021
>    - Bulk insert V2 PR 1834, lower risk, independent PR, well tested
> already
>      - Dependent PR 1149 to be landed,
>      - and modes to be respected in V2 impl as well (At risk)
>    - Upgrade Downgrade Hooks, PR 1858 : Siva has a PR out, code completing
>    this week
>    - HUDI-1054- Marker list perf improvement, Udit has a PR out
>    - HUDI-115 : Overwrite with... ordering issue, Sudha has a PR nearing
>    landing
>    - HUDI-1098 : Marker file issue with non-existent files. Siva to begin
>    impl
>    - Spark Streaming + Async Compaction , test complete, code review
>    comments and land PR 1752
>    - Spark DataSource/Hive MOR Incremental Query HUDI-920 (At risk)
>    - Flink/Multi Engine refactor, will need a large rebase and rework,
>    review, land (At risk for 0.6.0, high scope, may not have enough time)
>    - BloomIndex V2 - Global index implementation. (At risk)
>    - HUDI-845 : Parallel writing i.e allow multiple writers (At risk)
>    - HUDI-860 : Small File Handling without memory caching (At risk)
>
>
> Thanks
> Vinoth
>
  

Re: [DISCUSS] Adding Metrics to Hudi Common

2020-07-28 Thread vbal...@apache.org
 +1. Would love to see observability metrics exposed for file system RPC calls.
This would greatly help in figuring out RPC performance and bottlenecks across
the varied file systems that Hudi supports.
On Tuesday, July 28, 2020, 08:24:54 AM PDT, Nishith  
wrote:  
 
 +1 

Having the metrics flexibly in common will help in building observability in 
other modules.

Thanks,
Nishith

> On Jul 28, 2020, at 7:28 AM, Vinoth Chandar  wrote:
> 
> +1 as well.
> 
> Given we support many reporters now, could you please further
> improve/retain modularity?
> 
>> On Mon, Jul 27, 2020 at 6:30 PM vino yang  wrote:
>> 
>> Hi Modi,
>> 
>> +1 for this proposal.
>> 
>> I agree with your opinion that the metric report should not only report the
>> client's metrics.
>> 
>> And we should decouple the implementation of metrics from the client module
>> so that it could be developed independently.
>> 
>> Best,
>> Vino
>> 
>> Abhishek Modi  wrote on Tue, Jul 28, 2020 at 4:17 AM:
>> 
>>> Hi Everyone!
>>> 
>>> I'm hoping to have a discussion around adding a lightweight metrics class
>>> to Hudi Common. There are parts of Hudi Common that have large
>> performance
>>> implications, and I think adding metrics to these parts will help us
>> track
>>> Hudi's health in production and help us understand the performance
>>> implications of changes we make.
>>> 
>>> I've opened a Jira on this topic -
>>> https://issues.apache.org/jira/browse/HUDI-1025. This jira
>>> specifically suggests adding HoodieWrapperFileSystem as this class has
>>> performance implications not just for Hudi, but also for the underlying
>>> DFS.
>>> 
>>> Looking forward to everyone's opinions on this :)
>>> 
>>> Best,
>>> Modi
>>> 
>>   
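A lightweight metrics class of the kind proposed could have roughly the
following shape (a Scala sketch assuming a simple in-memory registry; these are
hypothetical names, not Hudi's actual classes):

```
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

// In-memory registry plus a timing helper that a filesystem wrapper such as
// HoodieWrapperFileSystem could use to count and time each call it forwards.
object LocalMetrics {
  private val counters = TrieMap.empty[String, AtomicLong]
  private val totalNanos = TrieMap.empty[String, AtomicLong]

  def timed[T](op: String)(body: => T): T = {
    val start = System.nanoTime()
    try body
    finally {
      counters.getOrElseUpdate(op, new AtomicLong()).incrementAndGet()
      totalNanos.getOrElseUpdate(op, new AtomicLong()).addAndGet(System.nanoTime() - start)
    }
  }

  // Returns (invocation count, total nanos) per operation name.
  def snapshot: Map[String, (Long, Long)] =
    counters.toMap.map { case (op, c) => (op, (c.get, totalNanos(op).get)) }
}

// Usage sketch: wrap a call site; the body here is a stand-in for a real
// fs.listStatus(path) call.
def listPartition(path: String): Seq[String] =
  LocalMetrics.timed("fs.listStatus") { Seq.empty }
```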

Re: Kafka Hudi pipeline design

2020-07-24 Thread vbal...@apache.org
 We are actively looking at how to support parallel writing. But, as you can
imagine, when it comes to updates and avoiding duplicates for inserts, only one
writer can be running. One of the design choices we have made so far was to
avoid any external dependency in running Hudi, to avoid any operational burden
for users. Having said that, there are users who have used a coordination
service like ZooKeeper to guarantee a mutex on top of Hudi jobs (typically run
using an orchestration service like Airflow); see the sketch below. You can
model your pipelines to achieve this. But, just to reiterate, we are looking
more closely at how to best support concurrent ingestion.

Please consider this when you are designing your pipelines.
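A rough sketch of that coordination-service pattern, using Apache Curator (the
connect string, lock path, and runHudiWrite are illustrative placeholders):

```
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

object SingleWriterJob {
  // Stand-in for the actual Hudi ingestion or backfill logic.
  def runHudiWrite(): Unit = ()

  def main(args: Array[String]): Unit = {
    val client = CuratorFrameworkFactory.newClient(
      "zk1:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()
    val lock = new InterProcessMutex(client, "/locks/hudi/my_table")
    lock.acquire() // blocks until this process holds the table-level mutex
    try runHudiWrite()
    finally {
      lock.release()
      client.close()
    }
  }
}
```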
Balaji.V

On Tuesday, July 21, 2020, 11:06:49 PM PDT, Lian Jiang 
 wrote:  
 
 Thanks Balaji.
Appreciate the answer and the jira creation. Below is the improved design after 
some investigation. 




The differences between it and my previous diagram are:

1. One delta streamer produces one hudi dataset (as opposed to one delta
streamer producing multiple hudi datasets). Delta streamer's --target-table
option indicates that one delta streamer job produces one hudi dataset. Also,
the --source-class option indicates that one delta streamer job can only have
one source. So one delta streamer cannot support streaming and backfill at the
same time.
2. Each delta streamer will have its own event extractor plugin to extract the
desired type of events.
3. All delta streamers will sync hudi datasets to Hive so that the users can
query via Hive without worrying whether the underlying format is parquet or
hudi.
4. Each hudi dataset's backfill is handled by a separate backfill job, assuming
the backfill job and delta streamer can work correctly when writing into the
same dataset concurrently.

Hope this design makes more sense than my previous one. I will inform you of
any issues in development.



Regarding your feedback,
" faithfully append event stream logs to S3 before you materialize in different 
order, you can try the "insert" mode in hudi, which would give you small file 
size handling."
I may need both "upsert" and "insert" for different hudi datasets. I will 
definitely prefer "insert" mode for appending only user cases.
"With 0.6, we are planning to allow multiple writers as long as there is 
guarantee that writers will be writing to different partitions. I think this 
will fit your requirement and also keep one timeline."
This is interesting. I want to expand my use cases a little, since I am
wondering how I can guarantee writers write to different partitions.

Case 1: (mentioned above) the streaming delta streamer and the backfill job
write into the same hudi dataset. I control both jobs.
Case 2: the delta streamer keeps ingesting, and a CCPA/GDPR job deletes some
customer data from the same hudi dataset from time to time. The CCPA job could
be from another infra team.

In case 1, how do I control my jobs to guarantee the delta streamer and the
backfill job write to different partitions, especially when there could be
late-arriving events that get written into a random early partition? In case 2,
it will be hard for different teams' jobs to coordinate with each other to
avoid partition conflicts.

As you can see, it may not be easy for applications to provide such a
guarantee. Is it possible for the hudi writers to coordinate among themselves
using some locking mechanism? IMHO, it is ok to sacrifice some performance to
make concurrent writing correct.

Appreciate your insight.

Regards,
Lian






On Tue, Jul 21, 2020 at 2:13 AM Balaji Varadarajan  
wrote:

 Please see answers inline...

    On Sunday, July 19, 2020, 10:08:09 PM PDT, Lian Jiang 
 wrote:  

 Hi,

I have a kafka topic using a kafka s3 connector to dump data into s3 hourly in
parquet format. These parquet files are partitioned by ingestion time, and each
record has fields which are deeply nested jsons. Each record is a monolithic
record containing multiple events, each with its own event time. This causes
two issues: 1. slow query by event time; 2. hard to use due to many levels of
exploding. I plan to use the below design to solve these problems.

In this design, I still use the s3 parquet dumped by the Kafka S3 connector as
a backfill for the hudi pipeline. This is because the S3 connector pipeline is
easier than the hudi pipeline to set up and will work before the hudi pipeline
is working. Also, the s3 connector pipeline may be more reliable than the hudi
pipeline, due to potential bugs in the delta streamer. The delta streamer will
decompose the monolithic kafka record into multiple event streams. Each event
stream is written into one hudi dataset partition and sorted by its
corresponding event time. Such hudi datasets are synced with hive, which is
exposed for user query, so that users don't need to care whether the underlying
table format is parquet or hudi. Hopefully, such a design improves the query
performance due to the fact tha

20200714 Weekly Sync Minutes

2020-07-21 Thread vbal...@apache.org
It was a very short meeting. The major highlight was: AWS Athena officially
supporting Apache Hudi as a queryable source.

https://cwiki.apache.org/confluence/display/HUDI/20200721+Weekly+Sync+Minutes

Thanks,
Balaji.V


Re: the contributor permission

2020-07-21 Thread vbal...@apache.org
 
Welcome to Hudi. I have added your JIRA id.

Balaji.V

On Tuesday, July 21, 2020, 10:19:21 AM PDT, zjing...@sina.com 
 wrote:  
 
 Hi,

I want to contribute to Apache Hudi. Would you please give me the
contributor permission? My JIRA ID is AndyZhang0419.

Re: [DISCUSS] Organizing ourselves for scale

2020-07-14 Thread vbal...@apache.org
 
+1 on the roles and responsibilities definition. I personally think this brings
structure and clarity to the different tracks.
It would be interesting to hear others' thoughts on this, and ideas on scaling
the different tracks.

Balaji.V

On Sunday, July 12, 2020, 08:07:22 PM PDT, Vinoth Chandar  wrote:  
 
 Hi all,

We have grown quite a bit as a community this year. I found myself,
personally, spread thin amongst too many roles and often prioritizing the
short term over the long term. Given we all have limited time/resources, I
think it may be pragmatic to assume not all of us have time for all the
roles. We need some structure, so we know where we lack and where we are
doing okay. I felt a good starting point is writing down the various roles
we are all assuming.

https://cwiki.apache.org/confluence/display/HUDI/Community

I have also tagged PMC/Committers who have expressed interest in given
roles - merely for purposes of knowing who else can take up tasks within
that role if you cannot find time to do it yourself that week.  These roles
are just for guiding ourselves. I did not intend to prescribe anything
beyond that.

Please chime in with your thoughts. Suggest any other roles we are missing,
anything to improve. and if you want to get tagged to a role etc.

Thanks
Vinoth
  

Re: Keeping Hive in Sync

2020-07-08 Thread vbal...@apache.org
I don't remember the root cause completely, Vinoth. I guess it was due to some
protocol mismatch.

Balaji.V

On Tuesday, July 7, 2020, 10:25:48 PM PDT, Vinoth Chandar 
 wrote:  
 
 Hi,

Yes. It can be an issue, probably good to get the table written using hive
style partitioning. I will check  on this more and get back to you

Balaji, do you know top of your head?

Thanks
Vinoth

On Sat, Jul 4, 2020 at 11:22 PM selvaraj periyasamy <
selvaraj.periyasamy1...@gmail.com> wrote:

> To add some more info: my join condition would look across a 180-day range of
> folders.
>
> On Sat, Jul 4, 2020 at 11:13 PM selvaraj periyasamy <
> selvaraj.periyasamy1...@gmail.com> wrote:
>
> > Team,
> >
> > I have a question on keeping hive in sync. Due to a shared Hadoop
> > environment restricting me from using hudi 0.5.1 or a higher version, I
> > ended up using 0.5.0. Currently my hadoop cluster has hive 1.2.x, which
> > does not support Hudi keeping hive in sync.
> >
> > So , I am not using the hive feature. I am reading it as below.
> >
> >
> > sparkSession.
> > read.
> > format("org.apache.hudi").
> > load("/projects/cdp/data/base/request_application/*/*").
> > createOrReplaceTempView(s"base_request_application")
> >
> >
> > I am going to store 3 years' worth of data partitioned by day/hour. When I
> > load 3 years of data, I would have (3*365*24) = 26280 directories. Using the
> > above approach and reading every time, I see all the directory names
> > are indexed. Would it impact the performance when joining with other tables,
> > if I don't use hive-style partition pruning?
> >
> > Thanks,
> > Selva
> >
> >
>
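As a rough mitigation for the directory-listing cost described above, the load
glob can be narrowed to just the date range a query needs, reusing the same
sparkSession and path layout from the snippet in this thread (the range below
is illustrative):

```
val recent = sparkSession.read
  .format("org.apache.hudi")
  .load("/projects/cdp/data/base/request_application/2020070*/*")
recent.createOrReplaceTempView("base_request_application_recent")
```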
  

Re: DISCUSS code, config, design walk through sessions

2020-07-06 Thread vbal...@apache.org
 +1.
On Monday, July 6, 2020, 09:11:47 AM PDT, Bhavani Sudha 
 wrote:  
 
 +1 this is a great idea!

On Mon, Jul 6, 2020 at 7:54 AM vino yang  wrote:

> +1
>
> Adam Feldman  wrote on Mon, Jul 6, 2020 at 9:55 PM:
>
> > Interested
> >
> > On Mon, Jul 6, 2020, 08:29 Sivabalan  wrote:
> >
> > > +1 for sure
> > >
> > > On Mon, Jul 6, 2020 at 4:42 AM Gurudatt Kulkarni 
> > > wrote:
> > >
> > > > +1
> > > > Really a great idea. Will help in understanding the project better.
> > > >
> > > > On Mon, Jul 6, 2020 at 1:35 PM Pratyaksh Sharma <
> pratyaks...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > This is a great idea and really helpful one.
> > > > >
> > > > > On Mon, Jul 6, 2020 at 1:09 PM  wrote:
> > > > >
> > > > > > +1
> > > > > > It can also attract more partners to join us.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 07/06/2020 15:34, Ranganath Tirumala wrote:
> > > > > > +1
> > > > > >
> > > > > > On Mon, 6 Jul 2020 at 16:59, David Sheard <
> > > > > > david.she...@datarefactory.com.au>
> > > > > > wrote:
> > > > > >
> > > > > > > Perfect
> > > > > > >
> > > > > > > On Mon, 6 Jul. 2020, 1:30 pm Vinoth Chandar, <
> vin...@apache.org>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > As we scale the community, its important that more of us are
> > able
> > > > to
> > > > > > help
> > > > > > > > users, users becoming contributors.
> > > > > > > >
> > > > > > > > In the past, we have drafted FAQs and troubleshooting guides.
> > But I
> > > > > feel
> > > > > > > > sometimes more hands-on walk-through sessions over video
> could
> > > > help.
> > > > > > > >
> > > > > > > > I am happy to spend 2 hours each on code/configs,
> > > > > > > design/perf/architecture.
> > > > > > > > Have the sessions recorded as well, for the future.
> > > > > > > >
> > > > > > > > What does everyone think?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Vinoth
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Ranganath Tirumala
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>  

20200623 Weekly Sync Minutes

2020-06-23 Thread vbal...@apache.org
Please find today's meeting minutes below:
https://cwiki.apache.org/confluence/display/HUDI/20200623+Weekly+Sync+Minutes

Thanks,
Balaji.V

Re: [DISCUSS] Publishing benchmarks for releases

2020-06-22 Thread vbal...@apache.org
 
+1 on adding benchmarks.

On Sunday, June 21, 2020, 11:18:05 PM PDT, Mario de Sá Vera  wrote:  
 
 +1 for performance reports

On Mon, 22 Jun 2020, 02:41 vino yang,  wrote:

> +1 as well,
>
> it would be helpful to measure the performance between different versions.
>
> > Shiyan Xu  wrote on Mon, Jun 22, 2020 at 8:37 AM:
>
> > +1 definitely useful info.
> >
> > On Sun, Jun 21, 2020 at 4:56 PM Sivabalan  wrote:
> >
> > > Hey folks,
> > >    Is it a common practice to publish benchmarks for releases? I have
> > > put up an initial PR  to add jmh
> > > benchmark support to a couple of Hudi operations. If the community
> feels
> > > positive on publishing benchmarks, we can add support for more
> operations
> > > and for every release, we could publish some benchmark numbers.
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>  
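For readers unfamiliar with jmh, a benchmark of that kind has roughly the
following shape (a sketch, runnable from Scala via the sbt-jmh plugin; the
measured operation is a stand-in, not an actual Hudi code path):

```
import org.openjdk.jmh.annotations.{Benchmark, Scope, State}

@State(Scope.Benchmark)
class KeyLookupBench {
  // Synthetic data; a real benchmark would exercise an actual Hudi operation.
  private val keys: Set[String] = (0 until 100000).map(i => s"key-$i").toSet

  @Benchmark
  def lookup(): Boolean = keys.contains("key-50000")
}
```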

Re: [DISCUSS] Introduce a write committed callback hook

2020-06-21 Thread vbal...@apache.org
 
+1. This would be a really good feature to have when building dependent ETL 
pipelines.

On Friday, June 19, 2020, 05:13:45 PM PDT, vino yang  
wrote:  
 
 Hi all,

Currently, we have a need to incrementally process and build a new table
based on an original hoodie table. We expect that after a new commit is
completed on the original hoodie table, it could be retrieved ASAP, so that
it can be used for incremental view queries. Based on the existing
capabilities, one approach we can use is to continuously poll Hoodie's
Timeline to check for new commits. This is a very common approach,
but it causes an unnecessary waste of resources.

We expect to introduce a proactive notification(event callback) mechanism.
For example, a hook can be introduced after a successful commit. External
processors interested in the commit, such as scheduling systems, can use
the hook as their own trigger. When a certain commit is completed, the
scheduling system can pull up the task of obtaining incremental data
through the API in the callback. Thereby completing the processing of
incremental data.

There is currently a `postCommit` method in Hudi's client module, and the
existing implementation is mainly used for compaction and cleanup after a
commit. But it triggers a little too early: it fires before everything is
processed, and we found that an exception may still cause the commit to be
rolled back. We need to find a new location to trigger this hook, to ensure
that the commit is final when the hook fires.

This is one of our scenario requirements. Combined with incremental queries,
it will be a very useful feature; it can make incremental processing more
timely.

We hope to hear what the community thinks of this proposal. Any comments
and opinions are appreciated.

Best,
Vino
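
To make the proposal concrete, here is a minimal sketch of what such a post-commit hook could look like. The interface and method names are illustrative only, not an actual Hudi API:

```
// Illustrative sketch only; the interface and method names are made up for
// this discussion and are not an actual Hudi API.
public interface WriteCommitCallbackSketch {

  // Invoked only after the commit is final and can no longer be rolled back,
  // addressing the "triggering time is a little early" concern above.
  void onCommitCompleted(String tableBasePath, String commitTime);
}

// Example consumer: a scheduler trigger that kicks off an incremental pull.
class SchedulerNotifyingCallback implements WriteCommitCallbackSketch {
  @Override
  public void onCommitCompleted(String tableBasePath, String commitTime) {
    System.out.println("Commit " + commitTime + " completed on "
        + tableBasePath + "; scheduling incremental pull.");
  }
}
```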
  

Re: Re:Re: [DISCUSS] Regarding nightly builds

2020-06-21 Thread vbal...@apache.org
 +1. It is a good idea to run hudi-test-suite on a daily basis with expanded tests.
Balaji.V

On Sunday, June 21, 2020, 08:16:39 AM PDT, Trevor-zhang <957029...@qq.com> wrote:  
 
 +1 as well.

-- Original Message --
From: "vino yang"

[1]: https://github.com/apachehudi-ci
[2]: https://cwiki.apache.org/confluence/display/FLINK/2020/03/22/Migrating+Flink%27s+CI+Infrastructure+from+Travis+CI+to+Azure+Pipelines

Vinoth Chandar 

Re: [VOTE] Release 0.5.3, release candidate #2

2020-06-11 Thread vbal...@apache.org
 
+1 (binding)

1. Ran integration tests locally
2. Manually reviewed the commits landing into 0.5.3 by comparing against 0.5.2
3. Ran deltastreamer in continuous mode with async compaction for a couple of hours on test data and verified no errors.
4. Release validation script passed locally:

```
varadarb-C02SH0P1G8WL:scripts varadarb$ ./release/validate_staged_release.sh 
--release=0.5.3 --rc_num=2

/tmp/validation_scratch_dir_001 ~/projects/new_ws/hudi/scripts

Checking Checksum of Source Release

  Checksum Check of Source Release - [OK]




  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

100 26722  100 26722    0     0  48234      0 --:--:-- --:--:-- --:--:-- 48234

Checking Signature

  Signature Check - [OK]




Checking for binary files in source release

  No Binary Files in Source Release? - [OK]




Checking for DISCLAIMER

  DISCLAIMER file exists ? [OK]




Checking for LICENSE and NOTICE

  License file exists ? [OK]

  Notice file exists ? [OK]




Performing custom Licensing Check 

  Licensing Check Passed [OK]




Running RAT Check

  RAT Check Passed [OK]




~/projects/new_ws/hudi/scripts

varadarb-C02SH0P1G8WL:scripts varadarb$ echo $?

0
```
On Wednesday, June 10, 2020, 02:57:35 PM PDT, Sivabalan 
 wrote:  
 
 Hi everyone,

Please review and vote on the release candidate #2 for the version 0.5.3,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

 The complete staging area is available for your review, which includes:

* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 001B66FA2B2543C151872CCC29A4FD82F1508833 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-0.5.3-rc2" [5],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.


Thanks,
Release Manager

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12348256

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.5.3-rc2/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1023/

[5] https://github.com/apache/hudi/tree/release-0.5.3-rc2
  

Re: CI/Master tests failing

2020-06-04 Thread vbal...@apache.org
 
I fixed a bunch of issues around flakiness (PR-1697) and have landed the change. There is still flakiness in CI, possibly related to leaks (HUDI-997), in the hudi-client unit tests. At this time, I need someone to take up HUDI-997 to make the CI tests stable. Any volunteers?
Balaji.V 

On Tuesday, June 2, 2020, 11:36:37 AM PDT, Vinoth Chandar 
 wrote:  
 
 Hi all,

This is PSA.. We are observing some flakiness with master and the last
three PR merges have failed. balaji is looking at the fix/issue..

But in the meantime, I'd ask committers to temporarily not merge more PRs
until this is resolved. It will help us fix this early.

Error looks something like this

[ERROR] Error occurred in starting fork, check output in log
14007[ERROR] Process Exit Code: 1
14008[ERROR] Crashed tests:
14009[ERROR] org.apache.hudi.table.TestHoodieMergeOnReadTable
14010[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException:
The forked VM terminated without properly saying goodbye. VM crash or
System.exit called?
14011[ERROR] Command was /bin/sh -c cd
/home/travis/build/apache/hudi/hudi-client &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2g -jar
/home/travis/build/apache/hudi/hudi-client/target/surefire/surefirebooter6746374343787497574.jar
/home/travis/build/apache/hudi/hudi-client/target/surefire
2020-06-02T12-27-45_220-jvmRun1 surefire6644367455247671601tmp
surefire_37399402556296347739tmp
14012[ERROR] Error occurred in starting fork, check output in log
14013[ERROR] Process Exit Code: 1
14014[ERROR] Crashed tests:
14015[ERROR] org.apache.hudi.table.TestHoodieMergeOnReadTable
14016[ERROR]     at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:690)
14017[ERROR]     at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:285)
14018[ERROR]     at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:248)
14019[ERROR]     at
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1217)
14020[ERROR]     at
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1063)
14021[ERROR]     at
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:889)
14022[ERROR]     at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
14023[ERROR]     at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
14024[ERROR]     at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
14025[ERROR]     at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
14026[ERROR]     at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
14027[ERROR]     at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
14028[ERROR]     at
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
14029[ERROR]     at
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
14030[ERROR]     at 
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
14031[ERROR]     at 
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
14032[ERROR]     at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
14033[ERROR]     at org.apache.maven.cli.MavenCli.execute(MavenCli.java:955)
14034[ERROR]     at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:290)
14035[ERROR]     at org.apache.maven.cli.MavenCli.main(MavenCli.java:194)
14036[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
14037[ERROR]     at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
14038[ERROR]     at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
14039[ERROR]     at java.lang.reflect.Method.invoke(Method.java:498)
14040[ERROR]     at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
14041[ERROR]     at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
14042[ERROR]     at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
14043[ERROR]     at
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
14044[ERROR] -> [Help 1]
14045[ERROR]
  

Re: [VOTE] Release 0.5.3, release candidate #1

2020-06-04 Thread vbal...@apache.org
 -1.
Siva, we found an issue that needs to be ported to 0.5.3.
Jira: https://jira.apache.org/jira/browse/HUDI-990
I will work with you to port the change to 0.5.3. We would need to create a new release candidate for this.
Balaji.V

On Tuesday, June 2, 2020, 08:51:54 PM PDT, yajunf...@163.com 
 wrote:  
 
 +1



yajunf...@163.com
 
From: Sivabalan
Date: 2020-06-03 11:22
To: dev
Subject: [VOTE] Release 0.5.3, release candidate #1
Hi everyone,
 
Please review and vote on the release candidate #1 for the version 0.5.3,
as follows:
 
[ ] +1, Approve the release
 
[ ] -1, Do not approve the release (please provide specific comments)
 
 
 
The complete staging area is available for your review, which includes:
 
* JIRA release notes [1],
 
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 001B66FA2B2543C151872CCC29A4FD82F1508833 [3],
 
* all artifacts to be deployed to the Maven Central Repository [4],
 
* source code tag "release-0.5.3-rc1" [5],
 
 
 
The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.
 
 
 
Thanks,
 
Release Manager
 
 
 
[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12348256
 
[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.5.3-rc1/
 
[3] https://dist.apache.org/repos/dist/release/hudi/KEYS
 
[4] https://repository.apache.org/content/repositories/orgapachehudi-1022/
 
[5] https://github.com/apache/hudi/tree/release-0.5.3-rc1
  

Re: [DISCUSS] querying commit metadata from spark DataSource

2020-05-31 Thread vbal...@apache.org
 
I strongly recommend using a separate datasource relation (option 1) to query the timeline. It is elegant and fits well with Spark APIs.
Thanks, Balaji.V

On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar  wrote:  
 
 Hi Satish,

Are you looking for similar functionality to HoodieDatasourceHelpers?

We have historically relied on cli to inspect the table, which does not
lend it self well to programmatic access.. overall in like option 1 -
allowing the timeline to be queryable with a standard schema does seem way
nicer.

I am wondering though if we should introduce a new view. Instead we can use
a different data source name -
spark.read.format("hudi-timeline").load(basepath). We can start by just
allowing querying of the active timeline and expand this to the archive timeline?

What do others think?




On Fri, May 29, 2020 at 2:37 PM Satish Kotha 
wrote:

> Hello folks,
>
> We have a use case to incrementally generate data for hudi table (say
> 'table2')  by transforming data from other hudi table(say, table1). We want
> to atomically store commit timestamps read from table1 into table2 commit
> metadata.
>
> This is similar to how DeltaStreamer operates with kafka offsets. However,
> DeltaStreamer is java code and can easily query the kafka offsets processed by
> creating a metaclient for the target table. We want to use pyspark and I don't
> see a good way to query commit metadata of table1 from DataSource.
>
> I'm considering making one of the below changes to hoodie to make this
> easier.
>
> Option1: Add new relation in hudi-spark to query commit metadata. This
> relation would present a 'metadata view' to query and filter metadata.
>
> Option2: Add other DataSource options on top of incremental querying to
> allow fetching from source table. For example, users can specify
> 'hoodie.consume.metadata.table: table2BasePath'  and issue incremental
> query on table1. Then, IncrementalRelation would go read table2 metadata
> first to identify 'consume.start.timestamp' and start incremental read on
> table1 with that timestamp.
>
> Option 2 looks simpler to implement. But it seems a bit hacky because we are
> reading metadata from table2 when the data source is table1.
>
> Option1 is a bit more complex. But, it is cleaner and not tightly coupled
> to incremental reads. For example, use cases other than incremental reads
> can leverage same relation to query metadata
>
> What do you guys think? Let me know if there are other simpler solutions.
> Appreciate any feedback.
>
> Thanks
> Satish
>  
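
For illustration, here is a caller-side sketch of the separate-datasource idea (Option 1). The "hudi-timeline" format name follows the suggestion above; the timeline columns (action, state, timestamp) are assumptions made up for this example, not an agreed schema:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.max;

public class TimelineQuerySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("timeline-query-sketch")
        .master("local[1]")
        .getOrCreate();

    // Proposed (not yet existing) datasource name for the Hudi timeline.
    Dataset<Row> timeline = spark.read()
        .format("hudi-timeline")
        .load("/path/to/table1");

    // Assumed timeline schema for the sketch: action, state, timestamp.
    // Example: find the last completed commit on table1 to seed an
    // incremental read, similar to how DeltaStreamer tracks kafka offsets.
    Row latest = timeline
        .filter("action = 'commit' AND state = 'COMPLETED'")
        .agg(max("timestamp"))
        .first();
    System.out.println("Last completed commit: " + latest.get(0));

    spark.stop();
  }
}
```

If implemented, the same relation would naturally be reachable from pyspark as well, which is what the use case above needs.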

Re: [DISCUSS] Logos on project front page.

2020-05-12 Thread vbal...@apache.org
 
I agree on following the best practices.
Balaji.VOn Tuesday, May 12, 2020, 06:52:59 PM PDT, Vinoth Chandar 
 wrote:  
 
 Hello all,

This was raised during the graduation discussion. We have been referred to
[1]. The doc ends saying. "These best practices for linking to outside
pages on project websites are meant as suggestions for projects. PMCs are
free to adopt (or not) any of these suggestions for their sites.".

But I would prefer to play by the best practices if we can..

Can you all chime in with your thoughts?



[1] https://www.apache.org/foundation/marks/linking
  

Re: [DISCUSS] should we do a 0.5.3 patch set release ?

2020-05-06 Thread vbal...@apache.org
 +1 for releasing 0.5.3.
Balaji.V
On Wednesday, May 6, 2020, 10:36:54 PM PDT, Y Ethan Guo 
 wrote:  
 
 +1

On Wed, May 6, 2020 at 6:29 PM vino yang  wrote:

> +1 for 0.5.3 as well
>
> Nishith  于2020年5月7日周四 上午8:16写道:
>
> > +1 on the idea
> >
> > Sent from my iPhone
> >
> > > On May 6, 2020, at 3:09 PM, Shiyan Xu 
> > wrote:
> > >
> >
>  

Re: [Discussion] Abstract common meta sync module support multiple meta service

2020-04-29 Thread vbal...@apache.org
 +1. Will review this RFC in a couple of days.
Balaji.V
On Wednesday, April 29, 2020, 06:13:00 PM PDT, hddong 
 wrote:  
 
 +1

vino yang  于2020年4月28日周二 下午11:49写道:

> +1
>
> leesf  于2020年4月28日周二 下午7:40写道:
>
> > +1 from me as well.
> >
> > Vinoth Chandar  于2020年4月28日周二 上午3:48写道:
> >
> > > +1
> > >
> > > Will get around to reviewing this more closely this week.
> > >
> > > On Mon, Apr 27, 2020 at 11:11 AM Gary Li 
> > wrote:
> > >
> > > > Hi Wei,
> > > >
> > > > Thanks for the proposal. +1 from my side. This is definitely a very
> > > useful
> > > > feature.
> > > >
> > > > Best Regards,
> > > > Gary Li
> > > >
> > > >
> > > > On 4/27/20, 5:16 AM, "wei li"  wrote:
> > > >
> > > >    Currently Hudi only supports sync
> > > >    dataset metadata to Hive through hive jdbc and IMetaStoreClient.
> > When
> > > >    you need to sync
> > > >    to other frameworks, such as aws glue, aliyun DataLake analytics,
> > > etc.
> > > >    You need to copy a lot of code from HoodieHiveClient, which
> > creates a
> > > >    lot of redundant code.
> > > >    So need to redesign the hudi-hive-sync module to support
> > > >    other frameworks and reuse current code as much as possible. Only
> > the
> > > >    interface is provided by Hudi, and the implement is customized by
> > > > different
> > > >    services as hive 、aws glue、aliyun DataLake analytics.
> > > >
> > > >    I created an RFC with more details
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service
> > > >    . Any feedback is appreciated.
> > > >
> > > >    Best Regards,
> > > >    Wei Li.
> > > >
> > >
> >
>  
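
To visualize the abstraction, here is a rough sketch of the kind of common interface the RFC argues for; Hive, AWS Glue and Aliyun DLA would each provide an implementation. All names here are hypothetical, not the RFC's actual API:

```
import java.util.List;
import java.util.Map;

// Rough sketch of a common meta-sync abstraction; each catalog service
// (Hive, AWS Glue, Aliyun DLA, ...) would ship its own implementation.
// Names are hypothetical stand-ins for discussion.
public interface MetaSyncClientSketch extends AutoCloseable {

  boolean tableExists(String tableName);

  // Schema as column name -> type, kept deliberately simple for the sketch.
  void createTable(String tableName, Map<String, String> schema);

  List<String> getAllPartitions(String tableName);

  void addPartitions(String tableName, List<String> partitionPaths);

  void updateLastCommitTimeSynced(String tableName, String commitTime);
}
```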

Re: [DISCUSS] Readiness for graduation to TLP

2020-04-28 Thread vbal...@apache.org
 +1. I strongly think we are ready for graduation.
On Tuesday, April 28, 2020, 07:38:16 AM PDT, lamberken 
 wrote:  
 
 +1

On 2020/04/28 05:05:44, Vinoth Chandar  wrote: 
> Hello all,
> 
> I would like to start a discussion on our readiness to pursue graduation to
> TLP and potentially follow up with a VOTE with a formal resolution. To seed
> the discussion, our  community's achievements since entering the Incubator
> in early 2018 include the following:
> 
> - Accepted > 500 patches from 90 contributors, including 15+ new  design
> proposals
> - Performed 3 releases with 3 different release managers
> - Invited 5 new committers (all of them accepted)
> - invited 3 of those new committers to join the PMC (all of them accepted)
> - Migrated our web site to ASF infrastructure [1]
> - Migrated developer conversations to the list at dev@hudi.apache.org
> - Migrated all issue tracking to JIRA [2]
> - Apache Hudi name search has been approved [3]
> - We have built a meritocratic, open collaborative process, the Apache way
> - Our PMC is diverse and consists of members from ~10 organizations
> 
> Please chime in with your thoughts.
> 
> Thanks
> Vinoth
> 
  

Re: [DISCUSS] moving blog from cwiki to website

2020-04-22 Thread vbal...@apache.org
 +1 on moving blogs to website.
On Wednesday, April 22, 2020, 08:35:02 AM PDT, leesf  
wrote:  
 
 +1

vino yang  于2020年4月22日周三 下午1:50写道:

> +1 from my side.
>
> Pratyaksh Sharma  于2020年4月22日周三 下午1:38写道:
>
> > +1
> >
> > I have seen other Apache projects having blogs on their website like
> Apache
> > Pinot.
> >
> > On Wed, Apr 22, 2020 at 11:05 AM Bhavani Sudha Saktheeswaran
> >  wrote:
> >
> > > +1
> > >
> > > On Tue, Apr 21, 2020 at 10:23 PM tison  wrote:
> > >
> > > > Hi Vinoth,
> > > >
> > > > +1 for moving blogs.
> > > >
> > > > cwiki looks belong to developer's scope and the first experience of
> > users
> > > > is more likely our website.
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > >
> > > > Vinoth Chandar  于2020年4月22日周三 下午1:09写道:
> > > >
> > > > > Hi community,
> > > > >
> > > > > What does everyone feel about moving blogs we have on cwiki now
> over
> > to
> > > > > site so they are better discovered?
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
>  

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-16 Thread vbal...@apache.org
 Satish,
Thanks for the proposal. I think an RFC would be useful here. Let me know your thoughts. It would be good to nail down other details, like whether/how to deal with external index management with this API.
Thanks, Balaji.V
On Thursday, April 16, 2020, 10:46:19 AM PDT, Balaji Varadarajan 
 wrote:  
 
 
+1 from me. This is a really cool feature. 
Yes, a new file slice (empty parquet) is indeed generated for every file group in a partition.
Regarding cleaning these "empty" file slices eventually by cleaner (to avoid 
cases where there are too many of them lying around) in a safe way, we can 
encode some MAGIC in the write-token component for Hudi readers to skip these 
files so that they can be safely removed. 
For metadata management, I think it would be useful to distinguish between this 
API and other insert APIs. At the very least, we would need a different operation type, which can be achieved with the same API (with flags).
Balaji.V

    On Thursday, April 16, 2020, 09:54:09 AM PDT, Vinoth Chandar 
 wrote:  
 
 Hi Satish,

Thanks for starting this..  Your use-cases do sounds very valuable to
support. So +1 from me.

IIUC, you are implementing a partition level overwrite, where existing
filegroups will be retained, but instead of merging, you will just reuse
the file names and write the incoming records into new file slices?
You probably already thought of this, but one thing to watch out for is :
we should generate a new file slice for every file group in a partition..
Otherwise, old data will be visible to queries.

if so, that makes sense to me.  We can discuss more on whether we can
extend the bulk_insert() API with additional flags instead of a new
insertOverwrite() API..

Others, thoughts?

Thanks
Vinoth

On Wed, Apr 15, 2020 at 11:03 AM Satish Kotha 
wrote:

> Hello
>
> I want to discuss adding a new high level API 'insertOverwrite' on
> HoodieWriteClient. This API can be used to
>
>    -
>
>    Overwrite specific partitions with new records
>    -
>
>      Example: partition has  'x' records. If insert overwrite is done with
>      'y' records on that partition, the partition will have just 'y'
> records (as
>      opposed to  'x union y' with upsert)
>      -
>
>    Overwrite entire table with new records
>    -
>
>      Overwrite all partitions in the table
>
> Usecases:
>
> - Tables where the majority of records change every cycle. So it is likely
> efficient to write new data instead of doing upserts.
>
> -  Operational tasks to fix a specific corrupted partition. We can do
> 'insert overwrite'  on that partition with records from the source. This
> can be much faster than restore and replay for some data sources.
>
> The functionality will be similar to hive definition of 'insert overwite'.
> But, doing this in Hoodie will provide better isolation between writer and
> readers. I can share possible implementation choices and some nuances if
> the community thinks this is a useful feature to add.
>
>
> Appreciate any feedback.
>
>
> Thanks
>
> Satish
>
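
For contrast with upsert, here is one possible, purely illustrative shape for the proposed API; neither the method signature nor the simplified types are final:

```
import java.util.List;

// Illustrative only: a possible shape for the proposed API, shown next to an
// upsert-style method for contrast. Types are simplified stand-ins.
public interface WriteClientSketch<T> {

  // Existing behavior: a partition holding 'x' records upserted with 'y'
  // records ends up with 'x union y'.
  List<String> upsert(List<T> records, String instantTime);

  // Proposed: partitions touched by 'records' end up with ONLY the new
  // records; partitions not touched are left as-is.
  List<String> insertOverwrite(List<T> records, String instantTime);
}
```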
    

Re: [VOTE] Release 0.5.2-incubating, release candidate #2

2020-03-21 Thread vbal...@apache.org
 +1 (binding)

Ran the following checks:
1. Checked out RC candidate source code and compiled successfully
2. Ran Apache Hudi quickstart steps successfully on 0.5.2-incubating-rc2
3. Ran release validation script successfully:

(base) varadarb-C02SH0P1G8WL:scripts varadarb$ ./release/validate_staged_release.sh --release=0.5.2 --rc_num=2

Checking Checksum of Source Release


Checksum Check of Source Release - [OK]



Checking Signature

Signature Check - [OK]

Checking for binary files in source release


No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER

DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE

License file exists ? [OK]

Notice file exists ? [OK]

Performing custom Licensing Check 

Licensing Check Passed [OK]

Running RAT Check
RAT Check Passed [OK]
Balaji.V

On Saturday, March 21, 2020, 08:23:44 AM PDT, Vinoth Chandar 
 wrote:  
 
 +1 binding

Repeated tests from RC1

On Sat, Mar 21, 2020 at 5:44 AM vino yang  wrote:

> +1 binding
>
> - checked signature & checksum
> - maven clean package -DskipTests
> - ran `release/validate_staged_release.sh`
> - check RAT (OK)
>
> Best,
> Vino
>
> Suneel Marthi  于2020年3月21日周六 下午8:33写道:
>
> > +1 binding
> >
> > - checked NOTICE and LICENSE
> > - verified checksum and signature
> > - mvn clean install
> >
> >
> > On Sat, Mar 21, 2020 at 7:01 AM leesf  wrote:
> >
> > > +1 (binding)
> > >
> > > - verified checksum and signature [OK]
> > > - mvn clean install -DskipTests [OK]
> > > - checked the modules in
> > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachehudi-1019/org/apache/hudi/
> > >  [OK]
> > >
> > > Best,
> > > Leesf
> > >
> > > vino yang  于2020年3月21日周六 下午5:20写道:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > We have prepared the third apache release candidate for Apache Hudi
> > > > (incubating). The version is: 0.5.2-incubating-rc2. Please review and
> > > vote
> > > > on the release candidate #2 for the version 0.5.2, as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint C3A96EC77149571AE89F82764C86684D047DE03C [3],
> > > >
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "release-0.5.2-incubating-rc2" [5],
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > >
> > > > Vino
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346606
> > > >
> > > > [2]
> > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.2-incubating-rc2/
> > > >
> > > > [3] https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
> > > >
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapachehudi-1019
> > > >
> > > > [5]
> > > >
> > >
> >
> https://github.com/apache/incubator-hudi/tree/release-0.5.2-incubating-rc2
> > > >
> > >
> >
>  

Re: updatePartitionsToTable() is time consuming and redundant.

2020-03-19 Thread vbal...@apache.org
 
Resurrecting this old thread and adding Udit.
Udit, 
I am not able to reproduce this issue with HDFS. Are you seeing this pattern where there are redundant alter-partition calls?
Although not related, I was looking into 
https://jira.apache.org/jira/browse/HUDI-325 and am wondering if we are seeing 
any discrepancies in hive-syncing between HDFS and non-HDFS clusters.
Balaji.V 
On Wednesday, February 19, 2020, 11:36:08 AM PST, vbal...@apache.org 
 wrote:  
 
 
Hi Pratyaksh/Purushotham,
I spent some time in the morning trying to reproduce this locally but was unable to. There is a unit test, TestHiveSyncTool.testSyncIncremental, which is quite close to the setup we need to repro.
I added the below check and it passed (meaning it works as expected, with no unnecessary update-partitions call). Can you use the below code to try reproducing it locally and in the real ecosystem to see what is happening?
Balaji.V
```
System.out.println("DUPLICATE CHECK");
String commitTime3 = "102";
TestUtil.addCOWPartitions(1, true, dateTime, commitTime3);
hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
writtenPartitionsSince = hiveClient.getPartitionsWrittenToSince(Option.of(commitTime2));
System.out.println("Added Partitions :" + writtenPartitionsSince);
assertEquals(1, writtenPartitionsSince.size());
hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
partitionEvents = hiveClient.getPartitionEvents(hivePartitions, writtenPartitionsSince);
assertEquals("No partition events", 0, partitionEvents.size());

tool = new HiveSyncTool(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
tool.syncHoodieTable();
// Sync should add the one partition
assertEquals(6, hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
assertEquals("The last commit that was sycned should be 102", commitTime3,
    hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
```
    On Wednesday, February 19, 2020, 04:08:39 AM PST, Pratyaksh Sharma 
 wrote:  
 
 Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see -

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path
=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
 wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and  lookup key are not exactly same. One has scheme and authority in
> its URI while the other is not. This is the reason why we are updating the
> same partition again.
> Some of the methods used here comes from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the below code to try to repro. Which version of Hadoop are you
> using in runtime. Can you  check if the stripped path (see test code below)
> still contains scheme and authority.
>
> ```public void testit() {
>    Path path = new
> Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>        + "=20191117\n");
>    System.out.println("Path is : " + path.toUri().getPath());
>    System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>    String stripped =
> Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>    System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>    On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar  wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>

Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread vbal...@apache.org
 Prashanth,
My concern was that we should not lose metadata about the clean operation.

But there is a way: as long as we faithfully copy the clean metadata that tracks the files which got cleaned, and store it in the restore metadata, we should be able to keep the metadata in sync.
Balaji.V



On Wednesday, March 18, 2020, 11:54:11 AM PDT, Prashant Wason 
 wrote:  
 
 Thanks for the info Vinoth / Balaji.

To me it feels like a split between an easier-to-understand design and the
current implementation. I feel it is simpler to reason (based on how file
systems work in general) that restoreToInstant is a complete point-in-time
shift to the past (like restoring a file system from a snapshot/backup).

If I have restored the table to commitTime=005, then having any instants
with commitTime > 005 is confusing, as it implies that even though my table
is at an older time, some future operations will be applied onto it at some
point.

I will have to read more about incremental timeline syncing and timeline
server to understand how it uses the clean instants. BTW, the comment on
the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
requires all writers (ingest and compact) to a table to be stopped before
proceeding". So probably the embedded timeline server can recreate the view
next time it comes back up?

Thanks
Prashant


On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
 wrote:

>  Prashanth,
> I think we should not be reverting clean operations here. Cleans are done
> on the oldest file slices and a restore/rollback is not completely undoing
> the work of clean that happened before it.
> For incremental timeline syncing, embedded timeline server needs to read
> these clean metadata to sync its cached file-system view.
> Let me know your thoughts.
> Balaji.V
>    On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason
>  wrote:
>
>  HI Team,
>
> I noticed that when a table is restored to a previous commit (
> HoodieWriteClient::restoreToInstant
> <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dhudi_blob_master_hudi-2Dclient_src_main_java_org_apache_hudi_client_HoodieWriteClient.java-23L735&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=c89AU9T1AVhM4r2Xi3ctZA&m=ASTWkm7UUMnhZ7sBzpXGPkTc1PhNTJeO7q5IXlBCprY&s=43rqua7SdhvO91hA0ZhOPNQw8ON1nL3bAsCue5o8aYw&e=
> >),
> only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> their corresponding files are deleted from the timeline. If there are some
> CLEAN instants, they are left over.
>
> Is there a reason why CLEAN are not removed? Won't they be referring to
> files  which are no longer present and hence not useful?
>
> Thanks
> Prashant
>  
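
A minimal sketch of the compromise described above: carry the metadata of reverted clean operations inside the restore metadata, so downstream consumers (like the timeline server) do not lose cleaning history. The classes are hypothetical stand-ins, not Hudi's actual model:

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model sketch: a restore that removes clean instants copies
// their metadata into its own, so consumers can still reconcile which files
// were cleaned. Not Hudi's actual classes.
public class RestoreMetadataSketch {

  private final String restoredToInstant;

  // clean instant time -> files that clean had deleted
  private final Map<String, List<String>> revertedCleans = new HashMap<>();

  public RestoreMetadataSketch(String restoredToInstant) {
    this.restoredToInstant = restoredToInstant;
  }

  public void recordRevertedClean(String cleanInstant, List<String> deletedFiles) {
    revertedCleans.put(cleanInstant, new ArrayList<>(deletedFiles));
  }

  public Map<String, List<String>> getRevertedCleans() {
    return revertedCleans;
  }

  public String getRestoredToInstant() {
    return restoredToInstant;
  }
}
```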

Re: [VOTE] Release 0.5.2-incubating, release candidate #1

2020-03-13 Thread vbal...@apache.org
 
+1 (binding)

1. Checked out RC candidate source code and compiled successfully
2. Ran Apache Hudi quickstart steps successfully on 0.5.2-rc1
3. Ran release validation script successfully:

(base) varadarb-C02SH0P1G8WL:scripts varadarb$ ./release/validate_staged_release.sh --release=0.5.2 --rc_num=1

Checking Checksum of Source Release


  Checksum Check of Source Release - [OK]



Checking Signature

  Signature Check - [OK]

Checking for binary files in source release


  No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER

  DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE

  License file exists ? [OK]

  Notice file exists ? [OK]

Performing custom Licensing Check 

  Licensing Check Passed [OK]

Running RAT Check
RAT Check Passed [OK]
Balaji.V

On Thursday, March 12, 2020, 10:50:39 PM PDT, Vinoth Chandar 
 wrote:  
 
 +1 binding

10:05:53 [hudi-0.5.2]$ shasum -a 512 hudi-${RC_VERSION}-${RC_NUM}.src.tgz >
sha512
10:06:01 [hudi-0.5.2]$ diff sha512
hudi-${RC_VERSION}-${RC_NUM}.src.tgz.sha512.txt | wc -l
      0
10:06:14 [hudi-0.5.2]


10:17:11 [hudi-0.5.2]$ gpg --verify
hudi-${RC_VERSION}-${RC_NUM}.src.tgz.asc.txt
hudi-${RC_VERSION}-${RC_NUM}.src.tgz
gpg: Signature made Thu Mar 12 01:24:37 2020 PDT
gpg:                using RSA key C3A96EC77149571AE89F82764C86684D047DE03C
gpg: Good signature from "vinoyang (apache gpg) "
[unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the
owner.
Primary key fingerprint: C3A9 6EC7 7149 571A E89F  8276 4C86 684D 047D E03C
10:21:43 [hudi-0.5.2]$

10:22:04 [hudi-0.5.2]$ tar -zxvf hudi-${RC_VERSION}-${RC_NUM}.src.tgz
10:22:22 [hudi-0.5.2]$ # Notice, DISCLAIMER-WIP, LICENSE
10:22:24 [hudi-0.5.2]$ ls hudi-${RC_VERSION}-${RC_NUM}/NOTICE
hudi-0.5.2-incubating-rc1/NOTICE
10:22:31 [hudi-0.5.2]$ ls hudi-${RC_VERSION}-${RC_NUM}/DISC*
hudi-0.5.2-incubating-rc1/DISCLAIMER
10:22:36 [hudi-0.5.2]$ ls hudi-${RC_VERSION}-${RC_NUM}/LICENSE
hudi-0.5.2-incubating-rc1/LICENSE

10:23:00 [hudi-0.5.2]$ find hudi-${RC_VERSION}-${RC_NUM}/ -name *.jar | wc
-l
      0
10:23:09 [hudi-0.5.2]$

10:23:09 [hudi-0.5.2]$ grep -LR "Licensed to the Apache Software
Foundation" hudi-${RC_VERSION}-${RC_NUM}
hudi-0.5.2-incubating-rc1/docker/demo/data/batch_2.json
hudi-0.5.2-incubating-rc1/docker/demo/data/batch_1.json
hudi-0.5.2-incubating-rc1/DISCLAIMER
hudi-0.5.2-incubating-rc1/NOTICE
hudi-0.5.2-incubating-rc1/hudi-common/src/test/resources/sample.data
hudi-0.5.2-incubating-rc1/hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java
hudi-0.5.2-incubating-rc1/hudi-utilities/src/test/resources/IncrementalPull.sqltemplate
10:23:30 [hudi-0.5.2]$ grep -e "AvroConversionHelper" -e
"ObjectSizeCalculator"  hudi-${RC_VERSION}-${RC_NUM}/LICENSE
This product includes code from
https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/objectsize/ObjectSizeCalculator.java
with the following license
* org.apache.hudi.AvroConversionHelper copied from classes in
org/apache/spark/sql/avro package
10:23:48 [hudi-0.5.2]$

10:24:37 [scripts]$ ./release/validate_staged_release.sh --release=0.5.2
--rc_num=1
/tmp/validation_scratch_dir_001
~/Cache/hudi-0.5.2/hudi-0.5.2-incubating-rc1/scripts
Checking Checksum of Source Release
Checksum Check of Source Release - [OK]

  % Total    % Received % Xferd  Average Speed  Time    Time    Time
 Current
                                Dload  Upload  Total  Spent    Left
 Speed
100 21027  100 21027    0    0  50297      0 --:--:-- --:--:-- --:--:--
50303
Checking Signature
Signature Check - [OK]

Checking for binary files in source release
No Binary Files in Source Release? - [OK]

Checking for DISCLAIMER
DISCLAIMER file exists ? [OK]

Checking for LICENSE and NOTICE
License file exists ? [OK]
Notice file exists ? [OK]

Performing custom Licensing Check
Licensing Check Passed [OK]

Running RAT Check
RAT Check Passed [OK]

10:26:15 [scripts]$


On Thu, Mar 12, 2020 at 7:15 PM Suneel Marthi  wrote:

> +1 binding
>
> 1. Verified Sigs and hashes
> 2. Downloaded tar and ran a maven compile
> 3. Verified the NOTICE and License files.
> 4. Ran thru the Quickstart guide.
>
>
>
> On Thu, Mar 12, 2020 at 9:01 PM vino yang  wrote:
>
> > Hi everyone,
> >
> >
> > We have prepared the third apache release candidate for Apache Hudi
> > (incubating). The version is: 0.5.2-incubating-rc1. Please review and
> vote
> > on the release candidate #1 for the version 0.5.2, as follows:
> >
> > [ ] +1, Approve the release
> >
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint C3A96EC77149571AE89F82764C86684D047DE03C [3],
> >
> > * all artifacts to be d

Re: Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-26 Thread vbal...@apache.org
 
This change was done as part of adding delete API support : 
https://github.com/apache/incubator-hudi/commit/7031445eb3cae5a4557786c7eb080944320609aa
 
I don't remember the reason behind this. 
Sivabalan, can you explain the reason when you get a chance?
Thanks, Balaji.V
On Wednesday, February 26, 2020, 06:03:53 AM PST, Pratyaksh Sharma 
 wrote:  
 
 Anybody got a chance to look at this?

On Mon, Feb 24, 2020 at 1:04 AM Pratyaksh Sharma 
wrote:

> Hi,
>
> While working on one of my PRs, I am stuck with the following test cases
> in TestHoodieDeltaStreamer -
> 1. testUpsertsCOWContinuousMode
> 2. testUpsertsMORContinuousMode
>
> For both of them, at line [1] and [2], we are adding 200 to totalRecords
> while asserting record count and distance count respectively. I am unable
> to understand what do these 200 records correspond to. Any leads are
> appreciated.
>
> I feel probably I am missing some piece of code where I need to do changes
> for the above tests to pass.
>
> [1]
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L425
> .
> [2]
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L426
> .
>
>
  

Weekly sync notes 20200225

2020-02-25 Thread vbal...@apache.org
Please find the weekly sync notes here:
https://cwiki.apache.org/confluence/display/HUDI/20200225+Weekly+Sync+Minutes

Thanks, Balaji.V

Re: [DISCUSS] How to correct the license header of entrypoint.sh script

2020-02-22 Thread vbal...@apache.org
 
+1 on ensuring all scripts in the Hudi codebase follow the same convention for licensing.
Balaji.V

On Saturday, February 22, 2020, 06:16:29 AM PST, Suneel Marthi  wrote:  
 
 Please go ahead and make the change @lamberken

I was just looking at scripts from Hive and Kafka projects, see below.

https://github.com/apache/hive/blob/master/bin/init-hive-dfs.sh
https://github.com/apache/hive/blob/master/bin/hive-config.sh

https://github.com/apache/kafka/blob/trunk/bin/connect-distributed.sh
https://github.com/apache/kafka/blob/trunk/bin/kafka-leader-election.sh

I suggest to fix all the script files to be consistent with apache license
guide.



On Sat, Feb 22, 2020 at 8:53 AM lamberken  wrote:

>
>
> Hi all,
>
>
> During the voting process on rc1 0.5.1-incubating release, Justin pointed
> out
> docker/hoodie/hadoop/base/entrypoint.sh has an incorrect license header,
> But, many script files used the same license header like "entrypoint.sh"
> has.
>
>
> From apache license guide[2], it says "The text should be enclosed in the
> appropriate comment syntax for the file format."
> So, need to remove the repeated "#", like following changes?
>
>
>
> 
> #  Licensed to the Apache Software Foundation (ASF) under one
> #  or more contributor license agreements.  See the NOTICE file
> #  distributed with this work for additional information
> #  regarding copyright ownership.  The ASF licenses this file
> #  to you under the Apache License, Version 2.0 (the
> #  "License"); you may not use this file except in compliance
> #  with the License.  You may obtain a copy of the License at
> #
> #      http://www.apache.org/licenses/LICENSE-2.0
> #
> #  Unless required by applicable law or agreed to in writing, software
> #  distributed under the License is distributed on an "AS IS" BASIS,
> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> #  See the License for the specific language governing permissions and
> # limitations under the License.
>
> 
>
>
> #
> #  Licensed to the Apache Software Foundation (ASF) under one
> #  or more contributor license agreements.  See the NOTICE file
> #  distributed with this work for additional information
> #  regarding copyright ownership.  The ASF licenses this file
> #  to you under the Apache License, Version 2.0 (the
> #  "License"); you may not use this file except in compliance
> #  with the License.  You may obtain a copy of the License at
> #
> #      http://www.apache.org/licenses/LICENSE-2.0
> #
> #  Unless required by applicable law or agreed to in writing, software
> #  distributed under the License is distributed on an "AS IS" BASIS,
> #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> #  See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
>
> Any thought are welcome, thanks.
>
>
> Thanks,
> Lamber-Ken
>
>
> [1]
> https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E
> [2] https://www.apache.org/licenses/LICENSE-2.0
>
>
  

Re: [DISCUSS] Code freeze date for next release(0.5.2)

2020-02-21 Thread vbal...@apache.org
 
+1 on 02/28 being the code freeze date. As this is a release focusing on compliance, we would have to move out some of the tickets from the list which are not being worked on and are unrelated to compliance.

On Friday, February 21, 2020, 01:01:45 AM PST, vino yang  wrote:  
 
 Dear Community,

As discussed before[1], the proposed release date of *end of Feb* for Hudi
0.5.2 is getting closer. And we have some bug fixes and features since the
0.5.1 release about one month ago.

To make the release version more stable, I would suggest a bug fixing and
testing period of two weeks to be on the safe side. Given the testing
period, I would propose to do the code freeze on the 28th of Feb 23:59 PST
in order to keep the release date. It means that we would cut the Hudi
0.5.2 release branch on this date and no more feature contributions would
be accepted for this branch. And the uncompleted features would be shipped
with the next release.

There are still 17 unfinished Jira issues[2]. All these issues have been
assigned. Hope everyone will do their best to catch up before the code
freeze.

What do you think about the proposed code freeze date? Glad to hear your
thoughts.

[1]:
https://lists.apache.org/thread.html/r70c6741b7396d845d1eb79ddfed922287e9683ae399abd245497a8f8%40%3Cdev.hudi.apache.org%3E
[2]:
https://jira.apache.org/jira/issues/?jql=project%20%3D%20HUDI%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20fixVersion%20%3D%200.5.2

Best,
Vino
  

Re: updatePartitionsToTable() is time consuming and redundant.

2020-02-19 Thread vbal...@apache.org
 
Hi Pratyaksh/Purushotham,
I spent some time in the morning trying to reproduce this locally but was unable to. There is a unit test, TestHiveSyncTool.testSyncIncremental, which is quite close to the setup we need to repro.
I added the below check and it passed (meaning it works as expected, with no unnecessary update-partitions call). Can you use the below code to try reproducing it locally and in the real ecosystem to see what is happening?
Balaji.V
```
System.out.println("DUPLICATE CHECK");
String commitTime3 = "102";
TestUtil.addCOWPartitions(1, true, dateTime, commitTime3);
hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
writtenPartitionsSince = hiveClient.getPartitionsWrittenToSince(Option.of(commitTime2));
System.out.println("Added Partitions :" + writtenPartitionsSince);
assertEquals(1, writtenPartitionsSince.size());
hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
partitionEvents = hiveClient.getPartitionEvents(hivePartitions, writtenPartitionsSince);
assertEquals("No partition events", 0, partitionEvents.size());

tool = new HiveSyncTool(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
tool.syncHoodieTable();
// Sync should add the one partition
assertEquals(6, hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
assertEquals("The last commit that was sycned should be 102", commitTime3,
    hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
```
On Wednesday, February 19, 2020, 04:08:39 AM PST, Pratyaksh Sharma 
 wrote:  
 
 Hi Balaji,

We are using Hadoop 3.1.0.

Here is the output of the function you wanted to see -

Path is : /data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Is Absolute :true
Stripped Path
=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191117

Stripped path does not contain scheme and authority.

On Mon, Feb 17, 2020 at 2:46 AM Balaji Varadarajan
 wrote:

>
> Sorry for the delay. From the logs, it is clear that the stored partition
> key and  lookup key are not exactly same. One has scheme and authority in
> its URI while the other is not. This is the reason why we are updating the
> same partition again.
> Some of the methods used here comes from hadoop-common and related
> packages. With Hadoop 2.7.3, I am NOT able to reproduce this issue locally.
> I used the below code to try to repro. Which version of Hadoop are you
> using in runtime. Can you  check if the stripped path (see test code below)
> still contains scheme and authority.
>
> ```public void testit() {
>    Path path = new
> Path("s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt"
>        + "=20191117\n");
>    System.out.println("Path is : " + path.toUri().getPath());
>    System.out.println("Is Absolute :" + path.isUriPathAbsolute());
>    String stripped =
> Path.getPathWithoutSchemeAndAuthority(path).toUri().getPath();
>    System.out.println("Stripped Path =" + stripped);
> }
> ```
> Balaji.V
>
>
>    On Wednesday, February 5, 2020, 12:53:57 AM PST, Purushotham
> Pushpavanthar  wrote:
>
>  Hi Balaji/Vinoth,
>
> Below is the log we obtained from Hudi.
>
> 20/01/22 10:30:03 INFO HiveSyncTool: Last commit time synced was found to
> be 20200122094611
> 20/01/22 10:30:03 INFO HoodieHiveClient: Last commit time synced is
> 20200122094611, Getting commits since then
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180108, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180108
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180221, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180221
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20180102, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20180102
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191007, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191007
> 20/01/22 10:30:04 INFO HoodieHiveClient: Partition Location changes.
> StorageVal=20191128, Existing Hive
> Path=/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128,
> New
>
> Location=s3a://dataplatform.internal.warehouse/data/warehouse/hudi/external/prod_db/sales_order_item/dt=20191128
> 20/0

Re: [Help] Hudi NOTICE need more work

2020-02-18 Thread vbal...@apache.org
 Thanks Justin for the information. To clarify, in the case of Apache Hudi, we only do a source release (source code + jars). I am assuming "bundling" (in the link you provided) refers to the fat-jars that we publish. Kindly let me know if this is a wrong assumption.
Leesf, if the above statement is true, then for the fat-jars that we publish (hudi-[hadoop-mr/hive/spark/utilities]-bundle), we need to copy the relevant portions of the NOTICE files for each included dependency (if present) into the top-level NOTICE file.
Balaji.V

On Tuesday, February 18, 2020, 03:09:12 PM PST, Justin Mclean  wrote:  
 
 Hi,

See [1]

Thanks,
Justin

1. http://www.apache.org/dev/licensing-howto.html#alv2-dep
  

Re: [DISCUSS] Next Apache Release(0.5.2)

2020-02-18 Thread vbal...@apache.org
 
+1 on a minor release focusing on Apache compliance.
+1 on Vino Yang being the Release Manager.
The compliance issues reported on the build process in
https://lists.apache.org/list.html?gene...@incubator.apache.org:lte=1M:Hudi
should also be looked at and be on the JIRA list (if not already).
Thanks, Balaji.V

On Tuesday, February 18, 2020, 11:11:49 AM PST, Vinoth Chandar  wrote:  
 
 +1 on vinoyang as the release manager

+1 on making a shorter 0.5.2 release. My only suggestion is to have a
concrete focus for this release as "ensuring the hudi release is apache
compliant fully" (so it will count towards graduation).

if you all agree: top of my mind, we need to probably do/double-check the
following.

- DISCLAIMER-WIP :  We still have a WIP disclaimer, this probably excuses
some
- be fully compliant with the project maturity model
https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+Maturity+Matrix
, tick of the remaining items here.
- Fix the NOTICE issue
- Confirm with someone (mentors or incubator or asf docs) on what more
needs to be done, to be ASF compliant..
- We could also start tagging JIRAs with 0.5.2, with the above focus.





On Tue, Feb 18, 2020 at 5:23 AM vino yang  wrote:

> Hi Leesf,
>
> Thanks for kicking the discussion off.
>
> +1 for planning to release Hudi 0.5.2.
>
> The 0.5.2 version is a minor version before 0.6 version, more quickly
> release can solve some small problems.
>
> After releasing Hudi 0.5.1 version, we also fixed some bugs and developed
> some features. So it is very suitable.
>
> I am volunteering as a release manager for Hudi 0.5.2.
>
> WDYT?
>
> Best,
> Vino
>
> leesf  于2020年2月18日周二 下午7:42写道:
>
> > Hello all,
> >
> > In the spirit of making Apache Hudi (incubating) releases at regular
> > cadence,
> > we are starting this thread to kickstart the planning and preparatory
> work
> > for next release (0.5.2).
> >
> > As 0.5.2 is a minor release version and contains some features, bug
> fixes,
> > code cleanup and some apache compliance issues, and some of them
> > have been completed. So I would like to propose the next release date by
> > the end of this month(2.29). What do you think?
> >
> > As described in the release guide (see References), the first step would
> be
> > identify the release manager for 0.5.2. This is a consensus-based
> decision
> > of the entire community. The only requirements is that the release
> manager
> > be Apache Hudi Committer as they have permissions to perform some of the
> > release manager's work. The committer would still need to work with PPMC
> to
> > write to Apache release repositories.
> >
> > There’s no formal process, no vote requirements, and no timing
> requirements
> > when identifying release manager. Any objections should be resolved by
> > consensus before starting the release.
> >
> > In general, the community prefers to have a rotating set of 3-5 Release
> > Managers. Keeping a small core set of managers allows enough people to
> > build expertise in this area and improve processes over time, without
> > Release Managers needing to re-learn the processes for each release. That
> > said, if you are a committer interested in serving the community in this
> > way, please reach out to the community on the dev@ mailing list.
> >
> > If any Hudi committer is interested in being the next release manager,
> > please reply to this email.
> >
> > References:
> > Planned Tickets:  Jira Tickets
> > <
> >
> >
> https://jira.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+fixVersion+%3D+
> > 0.5.
> > <
> >
> https://jira.apache.org/jira/issues/?jql=project+%3D+HUDI+AND+fixVersion+%3D+0.5.1
> > >
> > 2>
> > Release Guide:  Release Guide
> > <
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+%28incubating%29+-+Release+Guide
> > >
> >
> > Thanks,
> > Leesf
> > (On behalf of Apache Hudi PPMC)
> >
>  

Re: Discussion Thread: HUDI File Listing and Query Planning Improvements

2020-02-17 Thread vbal...@apache.org
 
Big +1 on the requirement. This would also help datasets using cloud storage by avoiding costly listings there. Will look closely at the design and implementation in the RFC to comment.
Balaji.V

On Monday, February 17, 2020, 02:06:59 PM PST, Balajee Nagasubramaniam  wrote:  
 
 Abstract

In the current implementation, HUDI Writer Client (in the write path) and
HUDI queries (through Inputformat in the read path) have to perform a “list
files” operation on the file system to get the current view of the file
system.  In HDFS, listing all the files in the dataset is a NameNode
intensive operation for large data sets. For example, one of our HUDI
datasets has thousands of date partitions with each partition having
thousands of data files.

With this effort, we want to:

  1. Eliminate the requirement of “list files” operation
      1. This will be done by proactively maintaining metadata about the
      list of files
      2. Reading the file list from a single file should be faster than
      large number of NameNode operations
  2. Create Column Indexes for better query planning and faster lookups by
  Readers
      1. For a column in the dataset, min/max range per Parquet file can be
      maintained.
      2. Just by reading this index file, the query planning system should
      be able to get the view of potential Parquet files for a range query.
      3. Reading Column information from an index file should be faster
      than reading the individual Parquet Footers.

This should provide the following benefits:

  1. Reducing the number of file listing operations improves NameNode
  scalability and reduces NameNode burden.
  2. Query Planner is optimized as the planning is done by reading 1
  metadata file and is mostly bounded regardless of the size of the dataset
  3. Can allow for performing partition path agnostic queries in a
  performant way


We seek Hudi development community's input on this proposal, to explore
this further and to implement a solution that is beneficial to the Hudi
community, meeting various use cases/requirements.

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements

Thanks,
Balajee, Prashant and Nishith  
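
The query-planning benefit of per-file min/max column ranges (item 2 above) can be illustrated with a tiny pruning example; the data structures here are invented purely for the illustration:

```
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Tiny illustration of range-based pruning: with per-file min/max ranges for
// a column, the planner keeps only files whose range can overlap the query
// predicate, without reading any Parquet footers. Names are invented.
public class RangePruningSketch {

  static class FileRange {
    final String file;
    final long min;
    final long max;
    FileRange(String file, long min, long max) {
      this.file = file;
      this.min = min;
      this.max = max;
    }
  }

  public static void main(String[] args) {
    List<FileRange> index = Arrays.asList(
        new FileRange("f1.parquet", 0, 99),
        new FileRange("f2.parquet", 100, 199),
        new FileRange("f3.parquet", 150, 400));

    long lo = 120, hi = 160; // query: WHERE col BETWEEN 120 AND 160
    List<String> candidates = index.stream()
        .filter(r -> r.max >= lo && r.min <= hi) // range may overlap predicate
        .map(r -> r.file)
        .collect(Collectors.toList());

    System.out.println(candidates); // prints [f2.parquet, f3.parquet]
  }
}
```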

Re: Please welcome our new PPMCs and Committer

2020-02-14 Thread vbal...@apache.org
 Congratulations to Leesf, Vino Yang and Siva.
+1 Very well deserved :) Looking forward to your continued contributions.
Balaji.V
On Friday, February 14, 2020, 12:11:18 PM PST, Bhavani Sudha 
 wrote:  
 
 Hearty congratulations to all of you - @leesf   @vinoyang
and @Sivabalan . Very well deserved.

Thanks,
Sudha

On Fri, Feb 14, 2020 at 11:58 AM Vinoth Chandar  wrote:

> Hello all,
>
> I am incredibly excited to share that we have two new PPMC members :
> *leesf*
> and *vinoyang*, who have been doing such sustained, great work on the
> project over a good part of the last year! I and rest of the PPMC, do hope
> there a bigger and better things to come!
>
> We also have a new committer : *Sivabalan*, who has stepped up to own the
> indexing component in the past few months, and has already delivered
> several key contributions and currently driving some foundational work on
> record level indexing.
>
> Please join me in congratulating them!
>
> Thanks
> Vinoth
>
  

Re: Commit time issue in DeltaStreamer (Real-Time)

2019-12-27 Thread vbal...@apache.org
 
Hi Shahida,
To confirm whether it is similar to the issue reported in PR-1128, I would need to see some more of the stack trace. Would you mind opening a JIRA with more log messages and stack traces from the shutdown?
Balaji.V

On Friday, December 27, 2019, 10:58:42 AM PST, vbal...@apache.org  wrote:  
 
 
Looking into this.
Balaji.V

On Friday, December 27, 2019, 08:10:45 AM PST, Shahida Khan  wrote:  
 
 @lamberken.. thank you for the clarification .. will check the same...

@Vinoth, this might sound like a very dumb question, but if you can help
me with this: do you have any idea which module is
responsible for this issue ..???



On Fri, 27 Dec 2019 at 9:10 PM, Vinoth Chandar  wrote:

> Hi Shahida,
>
> It seems like a bug from the rename changes we did recently. Could you
> please raise a JIRA, tagged to 0.5.1 release?
> Should be easy to fix, but DeltaStreamer should not be reading the clean
> timeline.. (Also btw this is also related to the one issue that you filed).
>
> Balaji, could you take a look at this when you get a chance and see if this
> is related.
>
> Thanks
> Vinoth
>
>
>
> On Fri, Dec 27, 2019 at 5:38 AM lamberken  wrote:
>
> >
> >
> > Hi @Shahida Khan,
> >
> >
> > In the past few days, I faced a similar issue. This bug seems to have happened
> after
> > HUDI-398 was merged.
> > You can try to build the source before that commit, then continue your work.
> >
> >
> > Here are the details:
> >
> >
> https://lists.apache.org/thread.html/f7834b3389e67b2b66b65386f59eb6646942206865133300c0416a6a%40%3Cdev.hudi.apache.org%3E
> >
> >
> > best,
> > lamber-ken
> > On 12/27/2019 21:02,Shahida Khan wrote:
> > @lamberken, when I checked, the .aux folder was empty ...
> > :(
> >
> > On Fri, 27 Dec 2019 at 6:28 PM, lamberken  wrote:
> >
> >
> >
> > Hi @Shahida Khan,
> >
> >
> > I have a question: is the size of the *.clean.requested files 0 ?
> >
> >
> > best,
> > lamber-ken
> >
> >
> >
> >
> > On 12/27/2019 19:54,Shahida Khan wrote:
> > Hi,
> >
> > Greetings!!
> > I am currently using Delta Streamer and upserting data via hudi in
> > real-time.
> > Have used the latest master branch.
> > The jobs were running fine for the last 10 days; suddenly, most of the streaming
> jobs
> > started failing and below is the error which I am facing :
> >
> > *java.util.concurrent.ExecutionException:
> > org.apache.hudi.exception.HoodieException: Could not read commit
> > details from
> >
> >
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
> > at
> >
> >
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> > at
> > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> at
> >
> >
> org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
> > at
> >
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
> > at
> >
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
> > at
> > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at
> > java.lang.reflect.Method.invoke(Method.java:498)  at
> >
> >
> >
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)Caused
> > by: org.apache.hudi.exception.HoodieException: Could not read commit
> > details from
> >
> >
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
> > at
> >
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
> > at
> >
> >
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> > at
> >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at
> > java.lang.Thread.run(Thread.java:748)*
> >
> >
> > It seems this issue has already been raised : hudi-1128
> > <https://github.com/apache/incubator-hudi/pull/1128/files>
> > Is this related to the same issue which I am facing..??
> >
> >
> > *Regards,*
> > *Shahida R. Khan*
> >
> > --
> > Regards,
> > Shahida Rashid Khan
> > 9167538366
> >
> >
> >
> >
> > kindly ignore typo error  Sent from handheld device ...*
> >
>
-- 
Regards,
Shahida Rashid Khan
9167538366




kindly ignore typo error  Sent from handheld device ...*    
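
For context on the exception above: it is thrown while deserializing a zero-length
(or never-written) 20191226153400.clean.requested plan file, which matches the
empty .aux folder reported earlier in the thread. A minimal sketch, using only the
plain Hadoop FileSystem API, of the kind of guard that tolerates such empty plan
files; the class and method names are illustrative, not Hudi's actual fix:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative guard (not Hudi code): treat a zero-length .clean.requested
// plan file as "no plan serialized yet" instead of failing the whole sync.
public class PlanFileGuard {
  public static byte[] readPlanBytes(Configuration conf, Path planFile) throws IOException {
    FileSystem fs = planFile.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(planFile);
    if (status.getLen() == 0) {
      return new byte[0]; // empty .requested file: nothing to deserialize
    }
    byte[] bytes = new byte[(int) status.getLen()];
    try (FSDataInputStream in = fs.open(planFile)) {
      in.readFully(bytes);
    }
    return bytes;
  }
}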

Re: Commit time issue in DeltaStreamer (Real-Time)

2019-12-27 Thread vbal...@apache.org
 
Looking into this.
Balaji.V

Re: Re:Re: [DISCUSS] RFC-12 : Efficient migration of large parquet tables to Apache Hudi

2019-12-17 Thread vbal...@apache.org
 Thanks everyone for reviewing the RFC. I will address the comments in the wiki 
once I am back from vacation. Meanwhile, I have created subtasks for this 
effort in https://jira.apache.org/jira/browse/HUDI-242
Thanks,
Balaji.V

 


On Sunday, December 15, 2019, 07:24:08 PM PST, Sivabalan 
 wrote:  
 
 Nice one Balaji. Have left a few comments. Overall looks good :)

On Sun, Dec 15, 2019 at 9:30 AM Balaji Varadarajan
 wrote:

>  Hi Nicholas,
> Once I get high level comments on the RFC,  we can have concrete subtasks
> around this.
> Balaji.V
>
> On Saturday, December 14, 2019, 07:04:52 PM PST, 蒋晓峰 <
> programg...@163.com> wrote:
>
>  Hi Balaji,
> Regarding the plan for "Efficient migration of large parquet tables to Apache
> Hudi", have you split it into multiple subtasks?
> Thanks,
> Nicholas
>
>
> At 2019-12-14 00:18:12, "Vinoth Chandar"  wrote:
> >+1 (per asf policy)
> >
> >+100 per my own excitement :) .. Happy to review this!
> >
> >On Fri, Dec 13, 2019 at 3:07 AM Balaji Varadarajan 
> >wrote:
> >
> >> With Apache Hudi growing in popularity, one of the fundamental
> challenges
> >> for users has been about efficiently migrating their historical
> datasets to
> >> Apache Hudi. Apache Hudi maintains per record metadata to perform core
> >> operations such as upserts and incremental pull. To take advantage of
> >> Hudi’s upsert and incremental processing support, users would need to
> >> rewrite their whole dataset to make it a Hudi table. This RFC provides a
> >> mechanism to efficiently migrate their datasets without the need to
> rewrite
> >> the entire dataset.
> >>
> >>  Please find the link for the RFC below.
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi
> >>
> >> Please review and let me know your thoughts.
> >>
> >> Thanks,
> >> Balaji.V
> >>
>



-- 
Regards,
-Sivabalan  
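
For reference, the full-rewrite baseline that this RFC is designed to avoid looks
roughly like the sketch below. Paths, table and column names are hypothetical, the
option keys are the standard datasource write configs, and the cost is a full read
plus bulk_insert of every record:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class NaiveMigrationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    // read the whole legacy parquet table and rewrite every record as Hudi
    Dataset<Row> df = spark.read().parquet("hdfs:///warehouse/legacy_table");
    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "legacy_table_hudi")
        .option("hoodie.datasource.write.recordkey.field", "id")      // assumed key column
        .option("hoodie.datasource.write.partitionpath.field", "ds")  // assumed partition column
        .option("hoodie.datasource.write.operation", "bulk_insert")
        .mode(SaveMode.Overwrite)
        .save("hdfs:///warehouse/legacy_table_hudi");
  }
}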

[20191210] Meeting Notes

2019-12-10 Thread vbal...@apache.org
Hello all,

Please find the meeting minutes for today's weekly sync meeting in the link 
below
 https://cwiki.apache.org/confluence/display/HUDI/20191210+Weekly+Sync+Minutes

Thanks,
Balaji.V


Re: [DISCUSS] Scaling community support

2019-12-10 Thread vbal...@apache.org
 
Regarding (1), I support the "on-call" model for answering dev@ emails and 
triaging GH and Jira. This would help reduce context-switching for the contributor 
community as a whole. Also, trying to answer questions about Hudi is a good way 
to ramp up on its internals. A 2-day schedule would be a good way to start, 
and we can try this to see how the on-call model works for us.
Regarding (2), code reviews are critical to building our community. Maybe a 
recognition model for the most active code-reviewers each month would help here. We 
can send a recognition/appreciation email identifying new and the most active 
code-reviewers.
Regarding (3), how about we assume that if a ticket owner is not responding 
for more than 1 or 2 weeks, then they are not working on it, and we can 
re-assign if it is a critical feature that needs to go into a release. The 
response from ticket owners need not be a complete answer, but just enough 
communication that they are actively working on it. I agree with @leesf that 
this depends on each person's situation, but clear communication regarding ETA 
expectations, and sticking to it, would help in project planning. 
Thanks,
Balaji.V
 

On Saturday, December 7, 2019, 12:01:28 PM PST, Vinoth Chandar 
 wrote:  
 
 Hello all,

As we grow, we need a scalable way for new users/contributors to either
easily use Hudi or ramp up on the project. Last month alone, we had close
to 1600 notifications on commits@, and a few hundred emails on this list. In
addition to authoring RFCs and implementing JIRAs, we need to share the
following responsibilities amongst us to be able to scale this process.

1) Answering issues on this mailing list or GH issues or occasionally
slack. We need a clear owner to triage the problem, reproduce it if needed,
either provide suggestions or file a JIRA - AND always look for ways to
update the FAQ. We need a clear hand-off process also.
2) Code review process currently spreads the load amongst all the
committers. But PRs vary dramatically in their complexity and we need more
committers who can review any part of the codebase.
3) Responding to pings/clarifications and unblocking . IMHO committers
should prioritize this higher than working on their own stuff (I know I
have been doing this at some cost to my productivity on the project). This
is the only way to scale and add new committers. committers need to be
nurturing in this process.

I don't have a clear proposals for scaling 2 & 3, which fall heavily on
committers.. Love to hear suggestions.

But for 1, I propose we have 2-3 day "Support Rotations" where any
contributor can assume responsibility for supporting the community. This
brings more focus to support and also fast-tracks learning/ramping for the
person on the rotation. It also minimizes interruptions for other folks and
we gain more velocity. I am sure this is familiar to a lot of you at your
own companies. We have at least 10-15 active contributors at this point..
So the investment is minimal: doing this once a month.

 A committer and a PMC member will always be designated secondary/backup in
case the primary cannot field a question. I am happy to additionally
volunteer as "always on rotation" as a third level backup, to get this
process booted up.

Please let me know what you all think. Please be specific about which issue
([1], [2] or [3]) you are talking about in your feedback.

thanks
vinoth
  

Re: [QUESTION] Why is TimestampBasedKeyGenerator part of hudi-utilities?

2019-12-04 Thread vbal...@apache.org
 Hi Gurudatt,
Good point. There is no specific technical reason behind keeping this class in 
hudi-utilities. We can move the core logic to the hudi-spark package. Would you be 
interested in filing a JIRA and submitting a PR?
There are delta-streamer-specific configs (TimestampBasedKeyGenerator.Config) 
which we need to retain for backwards-compatibility reasons, but we can introduce 
separate configs in DataSourceWriteOptions (hoodie.datasource.xxx) for 
configuring TimestampBasedKeyGenerator as part of the datasource write (see the 
sketch at the end of this thread). 
Balaji.V
On Wednesday, December 4, 2019, 10:30:40 PM PST, Gurudatt Kulkarni 
 wrote:  
 
 Hi All,

All other key generators are part of hudi-spark
except TimestampBasedKeyGenerator, which causes issues when using just
hudi-spark directly in a Spark job. Any specific reason for this? Can we
move this to the hudi-spark module?

Regards,
Gurudatt
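
To make the datasource-side configs concrete, here is a minimal sketch of what
configuring TimestampBasedKeyGenerator through the Spark datasource could look
like after such a move. The keygen option keys are the delta-streamer era names,
the class package is where it lived in hudi-utilities at the time, and the paths
and field names are hypothetical; verify all of them against your build:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class TimestampKeyGenExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    Dataset<Row> df = spark.read().parquet("hdfs:///tmp/events"); // hypothetical input
    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "events")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.partitionpath.field", "event_ts")
        // point the datasource at the timestamp-based key generator
        .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator")
        // keygen configs retained from the delta-streamer for backwards compatibility
        .option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "UNIX_TIMESTAMP")
        .option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd")
        .mode(SaveMode.Append)
        .save("hdfs:///tmp/events_hudi");
  }
}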
  

[20191126] Meeting Notes

2019-11-27 Thread vbal...@apache.org
Hello all,

Please find the meeting minutes for today's weekly sync meeting in the link 
below
 https://cwiki.apache.org/confluence/display/HUDI/20191126+Weekly+Sync+Minutes

Thanks,
Balaji.V



Re: EMR + HUDI

2019-11-15 Thread vbal...@apache.org
This is massive news!! Many thanks to Udit, Rahul and the AWS team for working 
with us patiently and making HUDI part of EMR. This is indeed a marathon 
effort!! 
Looking forward to continued collaboration in providing a great data lake 
experience and to improving Hudi overall!!
Balaji.V
   On Friday, November 15, 2019, 10:40:39 AM PST, Bhavani Sudha Saktheeswaran 
 wrote:  
 
 This is great news. Kudos to all contributors.

On Fri, Nov 15, 2019 at 10:22 AM Vinoth Chandar  wrote:

> Hello all,
>
> In case you did not notice, AWS EMR now has Hudi support, which should make
> life easier for folks on AWS.
>
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi.html
>
> Thanks to our wonderful contributors from AWS (Udit & team) for making it
> happen
>
> Thanks
> Vinoth
>  

Re: [DISCUSS] Intent to RFC: Restructuring and auto-generation of docs

2019-11-13 Thread vbal...@apache.org
 +1 on the initiative.
I have given cwiki access to both Ethan and Vinoth.
Balaji.V

On Wednesday, November 13, 2019, 09:14:16 AM PST, Vinoth Chandar 
 wrote:  
 
 Thanks for initiating this, Ethan. Will send detailed comments in a while.

@raymond, I actually think this deserves an RFC for two reasons.
(1) docs is as important as code and something developers have to deal with
all the time. so good to get broad feedback on this.
(2) We actually expanded RFCs to include even "ideas", providing way for
someone to write down thoughts and share with the group in a structured
way. Epics can be created if the RFC ultimately results in planned work.
What do you think?

@balaji or @sudha. Seems like I lost admin rights on the cwiki space somehow
(maybe the recent change to role-based administration?). Can you please
provide Ethan access and also reinstate my permissions? :)


On Wed, Nov 13, 2019 at 8:01 AM Raymond Xu  wrote:

> Hi Ethan,
>
> I like the idea and I'm all for it. Actually this is one of the roadmap
> items under "Ease of Use" for 0.5.1
>
> My only concern is: does this fit into an RFC? I believe an RFC is about
> adding a new feature to the framework while having better docs fall within
> the dev experience realm IMO. Nonetheless, what you described definitely
> deserves an Epic in the JIRA board at least.
>
> Best,
> Raymond
>
> On Wed, Nov 13, 2019 at 4:37 AM leesf  wrote:
>
> > +1. It is very practical and thanks for driving the discussion.
> >
> > vinoth and balaji would give you cwiki permission.
> >
> > Best,
> > Leesf
> >
> > vino yang  于2019年11月13日周三 下午5:02写道:
> >
> > > Hi Ethan,
> > >
> > > Thanks for starting this discussion thread.
> > > +1 from my side
> > >
> > > Best,
> > > Vino
> > >
> > > Y Ethan Guo  于2019年11月13日周三 下午4:05写道:
> > >
> > > > Hey Folks,
> > > >
> > > > I plan to start an RFC in the Docs Overhaul track. The scope of this
> > RFC
> > > > will be the restructuring and auto-generation of docs, with the
> > following
> > > > goals:
> > > >
> > > > - Make it easier for users to understand Hudi's main features and
> > > access
> > > > docs of each release
> > > > - Separate the actual documentation specific to each release (onto
> > > > master) from Hudi's landing page (asf-site, to include general
> > > > information
> > > > about Hudi) going forward
> > > > - Manually generate the docs pages of previous three releases (
> > > > https://issues.apache.org/jira/browse/HUDI-226
> > )
> > > > - Restructure the page layout for users to check out docs of
> > > > different releases
> > > > - Add online javadocs (
> > > > https://issues.apache.org/jira/browse/HUDI-319
> > )
> > > > - Reduce the amount of maintenance work on docs update and release
> for
> > > > developers/committers
> > > > - Add a script for generating release docs pages, which can be used
> > > > when cutting a release
> > > > - Add a script for updating/deploying the landing page of Hudi
> > > > website (https://issues.apache.org/jira/browse/HUDI-132
> > )
> > > >
> > > > Let me know if the scope is good and any other relevant work can also
> > be
> > > > added in this RFC.
> > > >
> > > > I need cwiki write permission to create a new RFC. My cwiki handle is
> > > > yihua.
> > > >
> > > > Thanks,
> > > > - Ethan
> > > >
> > >
> >
>
>
> --
> *Raymond Xu*
> Sr. Software Engineer | Zendesk Inc.
> LinkedIn  | GitHub
> 
>  

Re: Unable to run Integration tests

2019-11-01 Thread vbal...@apache.org
 Agree that we need to keep HDFS data transient across integration test runs. I 
have removed the volumes in the compose file and updated the PR 
https://github.com/apache/incubator-hudi/pull/989 
Hopefully, this should fix the flakiness.
Balaji.V

 On Friday, November 1, 2019, 08:26:38 AM PDT, Vinoth Chandar 
 wrote:  
 
 Update on this thread.. There has been progress and we have a few fixes being 
tested
https://github.com/vinothchandar/incubator-hudi/tree/hudi-312-flaky-tests 
https://github.com/apache/incubator-hudi/pull/989 

It boiled down to remnants from the previous run hanging around and causing 
invalid states. We also had a threadpool that wasn't closed upon such an 
unexpected error, causing the JVM to hang around (see the sketch at the end of 
this thread). 
@Balaji Varadarajan  I think it's best to rebuild and publish new images which 
use local storage for HDFS. wdyt? 

Also filed a few follow ups : HUDI-322, HUDI-323 


On Sat, Oct 26, 2019 at 9:36 AM Vinoth Chandar  wrote:

Disabling UI is not doing the trick. I think it gets stuck while starting up 
(and not while exiting like I assumed incorrectly before). 

On Fri, Oct 25, 2019 at 9:00 AM Vinoth Chandar  wrote:

Could we disable the UI and try again? It's either the jetty threads or the two 
HDFS threads that are hanging on. Cannot understand why the JVM wouldn't exit 
otherwise. 
On Fri, Oct 25, 2019 at 5:27 AM Bhavani Sudha  wrote:

https://gist.github.com/bhasudha/5aac43d93a942f68bcab413a26229292
 Took a thread dump. Seems like jetty threads are not shutting down? Don't
see any hudi/spark-related activity that is pending. The only threads in
RUNNABLE state are the jetty ones.

On Fri, Oct 25, 2019 at 1:54 AM Pratyaksh Sharma 
wrote:

> Hi Vinoth,
>
> > can you try
> - Do : docker ps -a and make sure there are no lingering containers.
> - if so, run : cd docker; ./stop_demo.sh
> - cd ..
> - mvn clean verify -DskipUTs=true -B
>
> I ran the above 3 times. Twice it was successful but once it incurred the
> same errors I listed in the previous mail.
>
> On Fri, Oct 25, 2019 at 8:26 AM Vinoth Chandar <
> mail.vinoth.chan...@gmail.com> wrote:
>
> > Got the integ test to hang once, at the same spot as Pratyaksh
> mentioned..
> > So it would be a good candidate to drill into.
> >
> > @nishith in this state, the containers are all open. So you could just
> hop
> > in and stack trace to see what's going on.
> >
> >
> > On Thu, Oct 24, 2019 at 9:14 AM Nishith  wrote:
> >
> > > I’m going to look into the flaky tests on Travis sometime today.
> > >
> > > -Nishith
> > >
> > > Sent from my iPhone
> > >
> > > > On Oct 23, 2019, at 10:23 PM, Vinoth Chandar 
> > wrote:
> > > >
> > > > Just to make sure we are on the same page,
> > > >
> > > > can you try
> > > > - Do : docker ps -a and make sure there are no lingering containers.
> > > > - if so, run : cd docker; ./stop_demo.sh
> > > > - cd ..
> > > > - mvn clean verify -DskipUTs=true -B
> > > >
> > > > and this always gets stuck? The failures on CI seem to be random
> > > timeouts.
> > > > Not very related to this.
> > > >
> > > > FWIW I ran the above 3 times, without glitches so far.. So if you can
> > > > confirm then it ll help
> > > >
> > > >> On Wed, Oct 23, 2019 at 7:04 AM Vinoth Chandar 
> > > wrote:
> > > >>
> > > >> I saw someone else share the same experience. Can't think of
> anything
> > > that
> > > >> could have caused this to become flaky recently.
> > > >> I already created https://issues.apache.org/jira/browse/HUDI-312
> > > >> <
> > >
> >
> https://issues.apache.org/jira/browse/HUDI-312?filter=12347468&jql=project%20%3D%20HUDI%20AND%20fixVersion%20%3D%200.5.1%20AND%20(status%20%3D%20Open%20OR%20status%20%3D%20%22In%20Progress%22)%20ORDER%20BY%20assignee%20ASC
> > >
> > > to
> > > >> look into some flakiness on travis.
> > > >>
> > > >> any volunteers to drive this? (I am in the middle of fleshing out an
> > > RFC)
> > > >>
> > > >> On Wed, Oct 23, 2019 at 6:43 AM Pratyaksh Sharma <
> > pratyaks...@gmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >>> It gets stuck forever while running the following -
> > > >>>
> > > >>> Container : /adhoc-1, Running command :spark-submit --class
> > > >>> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
> > > >>>
> > >
> /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar
> > > >>> --storage-type MERGE_ON_READ  --source-class
> > > >>> org.apache.hudi.utilities.sources.JsonDFSSource
> > > --source-ordering-field ts
> > > >>> --target-base-path /user/hive/warehouse/stock_ticks_mor
> > --target-table
> > > >>> stock_ticks_mor --props /var/demo/config/dfs-source.properties
> > > >>> --schemaprovider-class
> > > >>> org.apache.hudi.utilities.schema.FilebasedSchemaProvider
> > > >>> --disable-compaction  --enable-hive-sync  --hoodie-conf
> > > >>> hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:1
> > > >>> --hoodie-conf hoodie.datasource.hive_sync.username=hive
> > --hoodie-conf
> > > >>> hoodie.datasource.hive_sync.password=hive  --hoodie-conf
> > > >>> hoo
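
On the JVM-not-exiting symptom in this thread: a thread pool built from non-daemon
threads will keep the JVM alive after main() returns unless it is explicitly shut
down, which matches the unclosed-threadpool issue mentioned earlier. A minimal,
standalone illustration (not Hudi code) of the failure mode and the
shutdown-in-finally remedy:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolShutdownExample {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2); // non-daemon worker threads
    try {
      pool.submit(() -> System.out.println("doing work"));
      throw new RuntimeException("unexpected error mid-run");
    } catch (RuntimeException e) {
      System.err.println("caught: " + e.getMessage());
    } finally {
      pool.shutdown(); // without this, the idle workers prevent JVM exit
      try {
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
          pool.shutdownNow();
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        pool.shutdownNow();
      }
    }
  }
}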

Re: Compile failing on master

2019-10-18 Thread vbal...@apache.org
 I just tried deleting the confluent jars from my .m2 repo and reran "mvn clean 
package -DskipTests". I am able to download them without problems (see logs below).
My mvn version is : Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T11:33:14-07:00)
Try manually curling the URL (see logs below) within docker to see if this is 
some network issue.
Balaji.V

[INFO] ---< org.apache.hudi:hudi-utilities >---

[INFO] Building hudi-utilities 0.5.0-incubating-rc6                      [8/27]

[INFO] [ jar ]-

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-avro-serializer/3.0.0/kafka-avro-serializer-3.0.0.pom

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-avro-serializer/3.0.0/kafka-avro-serializer-3.0.0.pom
 (2.4 kB at 7.4 kB/s)

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-schema-registry-parent/3.0.0/kafka-schema-registry-parent-3.0.0.pom

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-schema-registry-parent/3.0.0/kafka-schema-registry-parent-3.0.0.pom
 (5.9 kB at 196 kB/s)

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-schema-registry-client/3.0.0/kafka-schema-registry-client-3.0.0.pom

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-schema-registry-client/3.0.0/kafka-schema-registry-client-3.0.0.pom
 (1.6 kB at 86 kB/s)

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/common-config/3.0.0/common-config-3.0.0.pom

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/common-config/3.0.0/common-config-3.0.0.pom
 (1.6 kB at 9.1 kB/s)

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/common/3.0.0/common-3.0.0.pom

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/common/3.0.0/common-3.0.0.pom 
(4.8 kB at 21 kB/s)

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/common-utils/3.0.0/common-utils-3.0.0.pom

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/common-utils/3.0.0/common-utils-3.0.0.pom
 (2.1 kB at 8.4 kB/s)

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-avro-serializer/3.0.0/kafka-avro-serializer-3.0.0.jar

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/common-config/3.0.0/common-config-3.0.0.jar

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/common-utils/3.0.0/common-utils-3.0.0.jar

Downloading from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-schema-registry-client/3.0.0/kafka-schema-registry-client-3.0.0.jar

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-schema-registry-client/3.0.0/kafka-schema-registry-client-3.0.0.jar
 (36 kB at 426 kB/s)

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/common-utils/3.0.0/common-utils-3.0.0.jar
 (18 kB at 78 kB/s)

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/kafka-avro-serializer/3.0.0/kafka-avro-serializer-3.0.0.jar
 (24 kB at 100 kB/s)

Downloaded from confluent: 
https://packages.confluent.io/maven/io/confluent/common-config/3.0.0/common-config-3.0.0.jar
 (19 kB at 51 kB/s)
 On Friday, October 18, 2019, 06:17:45 AM PDT, Gurudatt Kulkarni 
 wrote:  
 
 Hi Balaji / Pratyaksh,

I just pulled the maven:3.6.2-jdk-8 docker image and cloned the repo in it and
ran `mvn clean install -DskipTests -DskipITs` but it has the same issue.


On Fri, Oct 18, 2019 at 4:36 PM Pratyaksh Sharma 
wrote:

> Hi Balaji,
>
> I do not have a custom settings.xml file and faced the same problem. I was
> able to fix it by first running the command as mentioned on the page
> <https://hudi.apache.org/quickstart.html> -
>
> mvn clean install -DskipTests -DskipITs
>
> After running this command, you can run the other commands without any
> error.
>
> On Fri, Oct 18, 2019 at 4:24 PM vbal...@apache.org 
> wrote:
>
> >
> > I have seen this happen sometimes if you have a custom
> ~/.m2/settings.xml
> > file. Try removing it and check.
> > Balaji.V    On Friday, October 18, 2019, 03:31:09 AM PDT, Gurudatt
> > Kulkarni  wrote:
> >
> >  Hi All,
> >
> > I ran `mvn compile` on master, but it fails to build completely because
> it
> > was unable to find Confluent dependencies. I added
> > https://packages.confluent.io/maven2 in repositories in the main pom.xml
> > and it built successfully. Just curious, how you guys are compiling? Can
> we
> > add confluent maven repository to the main pom.xml file?
> >
> > Here's the error tha

Re: Compile failing on master

2019-10-18 Thread vbal...@apache.org
 
I have seen this happen sometimes if you have a custom ~/.m2/settings.xml 
file. Try removing it and check.
Balaji.V

On Friday, October 18, 2019, 03:31:09 AM PDT, Gurudatt Kulkarni 
 wrote:  
 
 Hi All,

I ran `mvn compile` on master, but it fails to build completely because it
was unable to find Confluent dependencies. I added
https://packages.confluent.io/maven2 in repositories in the main pom.xml
and it built successfully. Just curious, how are you guys compiling? Can we
add confluent maven repository to the main pom.xml file?

Here's the error that I got

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process
(process-resource-bundles) on project hudi-cli: Failed to resolve
dependencies for one or more projects in the reactor. Reason: Missing:
[ERROR] --
[ERROR] 1) io.confluent:kafka-avro-serializer:jar:3.0.0
[ERROR]
[ERROR]  Try downloading the file manually from the project website.
[ERROR]
[ERROR]  Then, install it using the command:
[ERROR]      mvn install:install-file -DgroupId=io.confluent
-DartifactId=kafka-avro-serializer -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file
[ERROR]
[ERROR]  Alternatively, if you host your own repository you can deploy the
file there:
[ERROR]      mvn deploy:deploy-file -DgroupId=io.confluent
-DartifactId=kafka-avro-serializer -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]  Path to dependency:
[ERROR]  1) org.apache.hudi:hudi-cli:jar:0.5.1-SNAPSHOT
[ERROR]  2) org.apache.hudi:hudi-utilities:jar:0.5.1-SNAPSHOT
[ERROR]  3) io.confluent:kafka-avro-serializer:jar:3.0.0
[ERROR]
[ERROR] 2) io.confluent:common-config:jar:3.0.0
[ERROR]
[ERROR]  Try downloading the file manually from the project website.
[ERROR]
[ERROR]  Then, install it using the command:
[ERROR]      mvn install:install-file -DgroupId=io.confluent
-DartifactId=common-config -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file
[ERROR]
[ERROR]  Alternatively, if you host your own repository you can deploy the
file there:
[ERROR]      mvn deploy:deploy-file -DgroupId=io.confluent
-DartifactId=common-config -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]  Path to dependency:
[ERROR]  1) org.apache.hudi:hudi-cli:jar:0.5.1-SNAPSHOT
[ERROR]  2) org.apache.hudi:hudi-utilities:jar:0.5.1-SNAPSHOT
[ERROR]  3) io.confluent:common-config:jar:3.0.0
[ERROR]
[ERROR] 3) io.confluent:kafka-schema-registry-client:jar:3.0.0
[ERROR]
[ERROR]  Try downloading the file manually from the project website.
[ERROR]
[ERROR]  Then, install it using the command:
[ERROR]      mvn install:install-file -DgroupId=io.confluent
-DartifactId=kafka-schema-registry-client -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file
[ERROR]
[ERROR]  Alternatively, if you host your own repository you can deploy the
file there:
[ERROR]      mvn deploy:deploy-file -DgroupId=io.confluent
-DartifactId=kafka-schema-registry-client -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]  Path to dependency:
[ERROR]  1) org.apache.hudi:hudi-cli:jar:0.5.1-SNAPSHOT
[ERROR]  2) org.apache.hudi:hudi-utilities:jar:0.5.1-SNAPSHOT
[ERROR]  3) io.confluent:kafka-schema-registry-client:jar:3.0.0
[ERROR]
[ERROR] 4) io.confluent:common-utils:jar:3.0.0
[ERROR]
[ERROR]  Try downloading the file manually from the project website.
[ERROR]
[ERROR]  Then, install it using the command:
[ERROR]      mvn install:install-file -DgroupId=io.confluent
-DartifactId=common-utils -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file
[ERROR]
[ERROR]  Alternatively, if you host your own repository you can deploy the
file there:
[ERROR]      mvn deploy:deploy-file -DgroupId=io.confluent
-DartifactId=common-utils -Dversion=3.0.0 -Dpackaging=jar
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]  Path to dependency:
[ERROR]  1) org.apache.hudi:hudi-cli:jar:0.5.1-SNAPSHOT
[ERROR]  2) org.apache.hudi:hudi-utilities:jar:0.5.1-SNAPSHOT
[ERROR]  3) io.confluent:common-utils:jar:3.0.0
[ERROR]
[ERROR] --
[ERROR] 4 required artifacts are missing.
[ERROR]
[ERROR] for artifact:
[ERROR]  org.apache.hudi:hudi-cli:jar:0.5.1-SNAPSHOT
[ERROR]
[ERROR] from the specified remote repositories:
[ERROR]  libs-milestone (https://repo.spring.io/libs-milestone/,
releases=true, snapshots=true),
[ERROR]  libs-release (https://repo.spring.io/libs-release/,
releases=true, snapshots=true),
[ERROR]  Maven Central (https://repo.maven.apache.org/maven2,
releases=true, snapshots=false),
[ERROR]  cloudera-repo-releases (
https://repository.cloudera.com/artifactory/public/, releases=true,
snapshots=false),
[ERROR]  apache.snapshots (https://repository.apache.org/snapshots,
releases=false, snapshots=true),
[ERROR]  central (https://repo.maven.apache.org/maven2, releases=true,
snapshots=false)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven wit
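
On Gurudatt's question of adding the Confluent repository to the main pom.xml: the
entry would look roughly like the sketch below. This is illustrative only, not a
committed change; the id/name are arbitrary placeholders, and the URL is the one
visible in the download logs earlier in this thread.

<!-- goes under <repositories> in the root pom.xml -->
<repository>
  <id>confluent</id>
  <name>Confluent Maven Repository</name>
  <url>https://packages.confluent.io/maven/</url>
</repository>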

Re: [VOTE] Release 0.5.0-incubating, release candidate #6

2019-10-16 Thread vbal...@apache.org
 
Forgot to mention that this release candidate addresses the licensing concerns 
that came up during voting in general@incubator. The email thread is in : 
https://lists.apache.org/thread.html/02d40e3dbababc069c5210928aa4dd335c41ab1837d5a894954f5c9f@%3Cgeneral.incubator.apache.org%3E

The PR which addresses it : https://github.com/apache/incubator-hudi/pull/953

Balaji.V

On Wednesday, October 16, 2019, 10:20:35 AM PDT, vbal...@apache.org 
 wrote:  
 
 Hi everyone,We have a new release candidate for first release of Apache Hudi 
(incubating). The version is : 0.5.0-incubating-rc6. To run automated source 
release validation script, please follow the below steps  
  - If you have not checkout out hudi, please do      
      - git clone g...@github.com:apache/incubator-hudi.git;

  - If you already have incubator-hudi, please do
  
  - git checkout master  && git pull origin master
  
  - cd incubator-hudi/scripts;
  - ./release/validate_staged_release.sh --release=0.5.0 --rc_num=6  

To compile, run "mvn compile". To run unit-test, run "mvn test"Please review 
and vote on the release candidate #6 for the version 0.5.0, as follows:[ ] +1, 
Approve the release [ ]   0 I don't feel strongly about it, but I'm okay with 
the release
[ ] -1, Do not approve the release (please provide specific comments)The 
complete staging area is available for your review, which includes:  
  - JIRA release notes [1]
  - The official Apache source release and binary convenience releases to be 
deployed to dist.apache.org [2], which are signed with the key with fingerprint 
AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],  

  - all artifacts to be deployed to the Maven Central Repository [4]  

  - source code tag "release-0.5.0-incubating-rc6" [5]  

The vote will be open for at least 72 hours. 
It is adopted by majority approval, with at least 3 PMC affirmative votes.  
  - 
https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
  - 
https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc6/
  - https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
  - https://repository.apache.org/content/repositories/orgapachehudi-1006/
  - https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc6

Thanks,Balaji.V(on behalf of Apache Hudi PPMC)  

[VOTE] Release 0.5.0-incubating, release candidate #6

2019-10-16 Thread vbal...@apache.org
Hi everyone,
We have a new release candidate for the first release of Apache Hudi 
(incubating). The version is: 0.5.0-incubating-rc6. To run the automated source 
release validation script, please follow the steps below:
   - If you have not checked out hudi, please do
      - git clone g...@github.com:apache/incubator-hudi.git;
   - If you already have incubator-hudi, please do
      - git checkout master && git pull origin master
   - cd incubator-hudi/scripts;
   - ./release/validate_staged_release.sh --release=0.5.0 --rc_num=6

To compile, run "mvn compile". To run unit tests, run "mvn test".
Please review and vote on the release candidate #6 for the version 0.5.0, as 
follows:
[ ] +1, Approve the release
[ ]  0, I don't feel strongly about it, but I'm okay with the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
   - JIRA release notes [1]
   - The official Apache source release and binary convenience releases to be 
deployed to dist.apache.org [2], which are signed with the key with fingerprint 
AF9BAF79D311A3D3288E583F24A499037262AAA4 [3]
   - all artifacts to be deployed to the Maven Central Repository [4]
   - source code tag "release-0.5.0-incubating-rc6" [5]

The vote will be open for at least 72 hours. 
It is adopted by majority approval, with at least 3 PMC affirmative votes.

   - [1] https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
   - [2] https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc6/
   - [3] https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
   - [4] https://repository.apache.org/content/repositories/orgapachehudi-1006/
   - [5] https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc6

Thanks,
Balaji.V
(on behalf of Apache Hudi PPMC)

Apache Hudi (incubating) 0.5.0 [RC5] voting in general@incubator

2019-10-10 Thread vbal...@apache.org
Hello Mentors,
This morning, we sent an email to general@incubator for a second round of voting 
to release Apache Hudi (incubating) 0.5.0-incubating-rc5. Kindly review and 
vote on that thread. 

Thanks,
Balaji.V 


Re: [VOTE] Release 0.5.0-incubating, release candidate #5

2019-10-09 Thread vbal...@apache.org
 
Thanks everyone. The voting has concluded. 
We had 7 +1 votes, out of which 3 were binding:
   
   - leesf 
   - Gurudatt Kulkarni
   - Bhavani Sudha Saktheeswaran
   - Nishith Agarwal (Binding)
   - Vinoth Chandar (Binding)
   - Prasanna Rajaperumal (Binding)
   - Udit Mehrotra

There was one -0 vote from Thomas Weise. He pointed out the missing 
"incubating" keyword in the NOTICE file, but also noted that this inconsistency is 
seen in other incubating projects. This inconsistency has been fixed in master 
now 
(https://github.com/apache/incubator-hudi/commit/834c591955cfa8a9c5f286967d693932564b6764).
The vote has passed. Thanks everyone!
Regards,
Balaji.V


Re: [VOTE] Release 0.5.0-incubating, release candidate #5

2019-10-09 Thread vbal...@apache.org
 
Thanks Thomas. It looks like the data-sketches team will be fixing the NOTICE file 
in master. We will also fix this in master. Also, other incubating projects 
(e.g. incubator-gobblin) which have released earlier have a similar NOTICE format (no 
"incubating"). 
I will wait for a couple of hours to see if there are any replies to your email 
in general@incubator. If not, I am inclined to move forward with voting on 
general@. If this inconsistency turns out to be a show-stopper, having a failed 
vote in general@ would give us an opportunity to collect more feedback comments. 
This will help us reduce the overall turnaround time in releasing the first 
version.

Thanks,
Balaji.V 
On Wednesday, October 9, 2019, 10:05:22 AM PDT, Thomas Weise 
 wrote:  
 
 FYI there is inconsistency regarding "(incubating)" in the NOTICE file
between other projects I looked at. I asked the question on another release
VOTE on general@

I would recommend to make the change anyways, but depending on what
feedback there is regarding the above, can still consider move forward with
this RC (that's why -0)

On Wed, Oct 9, 2019 at 9:55 AM Thomas Weise  wrote:

> -0 (binding)
>
> NOTICE needs to include "(incubating)"
>
> Minor: VOTE refers to "The official Apache source release and binary
> convenience releases" but there is no binary release.
>
> Otherwise things look good.
>
> Signature check  PASS
> No binaries  PASS
> mvn test  PASS
>
>
  

Re: [VOTE] Release 0.5.0-incubating, release candidate #2

2019-10-04 Thread vbal...@apache.org
 
Sure Thomas. I will treat this as a non-blocking change and remove the KEYS 
file from master.
Balaji.V

On Friday, October 4, 2019, 07:55:25 PM PDT, Thomas Weise 
 wrote:  
 
 One more question: What's the purpose of this KEYS file?
https://github.com/apache/incubator-hudi/blob/master/KEYS

It is *not a blocker*, but the source of truth for KEYS for the release
verification is dist and having the copy in git may just create confusion.

Thomas


On Wed, Oct 2, 2019 at 10:14 AM vbal...@apache.org 
wrote:

>
> Thanks Thomas for the review. I had created a new PR  to address your
> comments https://github.com/apache/incubator-hudi/pull/935
> Please review when you get a chance.
> Thanks,
> Balaji.V
>    On Wednesday, October 2, 2019, 09:26:49 AM PDT, Thomas Weise <
> t...@apache.org> wrote:
>
>  I looked at the PR and I see a disturbing number of LICENSE file
> repetitions in it. There should be no need for that as LICENSE can be
> included automatically by the ASF parent pom (or similar project specific
> solution):
>
> https://github.com/apache/maven-apache-parent/blob/master/pom.xml#L308
>
> Please also check that the following was resolved:
>
> $ grep -R Uber .
> ./docker/hoodie/hadoop/prestobase/pom.xml:  ~ Copyright (c) 2016 Uber
> Technologies, Inc. (hoodie-dev-gr...@uber.com)
> ./pom.xml:      Uber
> ./pom.xml:      Uber
> ./pom.xml:      Uber
> ./pom.xml:      Uber
> ./pom.xml:      Uber
> ./pom.xml:      Uber
>
> Thanks,
> Thomas
>
>
>

[VOTE] Release 0.5.0-incubating, release candidate #5

2019-10-04 Thread vbal...@apache.org
Hi everyone,
We have a new release candidate for the first release of Apache Hudi 
(incubating). The version is: 0.5.0-incubating-rc5. Please note that previous 
release candidates RC#3 and RC#4 were not sent for voting, as we discovered 
compliance issues before we could send them for voting. These issues were 
subsequently fixed as part of PR-935 and PR-939, and RC#5 has been built. We also 
have a new release validation script available in master to automate the usual 
checks. To run this:
   - If you have not checked out hudi, please do
      - git clone g...@github.com:apache/incubator-hudi.git;
   - If you already have hudi, please do
      - git checkout master && git pull origin master
   - cd incubator-hudi/scripts;
   - ./release/validate_staged_release.sh --release=0.5.0 --rc_num=5

Please review and vote on the release candidate #5 for the version 0.5.0, as 
follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
   - JIRA release notes [1]
   - The official Apache source release and binary convenience releases to be 
deployed to dist.apache.org [2], which are signed with the key with fingerprint 
AF9BAF79D311A3D3288E583F24A499037262AAA4 [3]
   - all artifacts to be deployed to the Maven Central Repository [4]
   - source code tag "release-0.5.0-incubating-rc5" [5]

The vote will be open for at least 72 hours. 
Please cast your votes before *Oct. 9 2019, 19:00 PST*. 

It is adopted by majority approval, with at least 3 PMC affirmative votes.

   - [1] https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
   - [2] https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc5/
   - [3] https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
   - [4] https://repository.apache.org/content/repositories/orgapachehudi-1005/
   - [5] https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc5

Thanks,
Balaji.V 


Re: [DISCUSS] cleaning up git history from Notice/License changes

2019-10-03 Thread vbal...@apache.org
 
+1 on both cleanups. This would keep the git history clean and consistent with 
contributions.
Balaji.V

On Thursday, October 3, 2019, 09:53:46 AM PDT, Vinoth Chandar 
 wrote:  
 
 Folks,

As we iterate across the RCs, we have added to and removed from the
NOTICE/LICENSE files a lot. Does anyone feel the need to clean up the
history and do a one-time force push? There is also an issue with github
contribution stats not showing everyone's commits (due to email changes
etc). We could also tackle that.

thanks
vinoth
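
(On the contribution-stats point above: one standard way to audit it is the stock
git command "git shortlog -sne", which lists the exact author name/email pairs git
has recorded, and a .mailmap file at the repo root can then map old addresses to
canonical ones. Mentioned here only as a possible approach, not a decision.)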
  

Re: [VOTE] Release 0.5.0-incubating, release candidate #2

2019-10-02 Thread vbal...@apache.org
 
Thanks Thomas for the review. I have created a new PR to address your comments: 
https://github.com/apache/incubator-hudi/pull/935 
Please review when you get a chance.
Thanks,
Balaji.V
On Wednesday, October 2, 2019, 09:26:49 AM PDT, Thomas Weise 
 wrote:  
 
 I looked at the PR and I see a disturbing number of LICENSE file
repetitions in it. There should be no need for that as LICENSE can be
included automatically by the ASF parent pom (or similar project specific
solution):

https://github.com/apache/maven-apache-parent/blob/master/pom.xml#L308

Please also check that the following was resolved:

$ grep -R Uber .
./docker/hoodie/hadoop/prestobase/pom.xml:  ~ Copyright (c) 2016 Uber
Technologies, Inc. (hoodie-dev-gr...@uber.com)
./pom.xml:      Uber
./pom.xml:      Uber
./pom.xml:      Uber
./pom.xml:      Uber
./pom.xml:      Uber
./pom.xml:      Uber

Thanks,
Thomas



On Mon, Sep 30, 2019 at 11:30 AM Thomas Weise  wrote:

> Sorry, I mistakenly assumed that this RC had fixes for the previously
> discussed issues.
>
> --
> sent from mobile
>
> On Mon, Sep 30, 2019, 10:06 AM vbal...@apache.org 
> wrote:
>
>>
>> Hi Thomas,
>> Yes, Luciano also referred to this binary issue earlier. We had addressed
>> the comments (including binary presence, RAT automation and release
>> automation scripts and basic check) as part of
>> https://github.com/apache/incubator-hudi/pull/918. Luciano had earlier
>> reviewed the PR and I have addressed his follow-up comments. We had
>> requested mentors to help review this PR to see if anything is still
>> inconsistent.
>> If there are no other comments on this PR till this afternoon, I will
>> address any pending comments, create a new RC candidate and will send an
>> email along with scripted basic validation to help check the new RC
>> candidate.
>> I do have a working document that I am making changes to capture release
>> process. I will be publishing them to a wiki once the first release
>> candidate is approved.
>> Balaji.V
>>
>>
>>
>>
>>    On Monday, September 30, 2019, 08:44:12 AM PDT, Thomas Weise <
>> t...@apache.org> wrote:
>>
>>  The source release contains at least one binary:
>>
>> hudi-0.5.0-incubating-rc2 $ find . -name *.jar
>> ./hudi-cli/lib/dnl/utils/textutils/0.3.3/textutils-0.3.3.jar
>>
>> There could be more, this was just the first check run.
>>
>> Have you already scripted building the release candidate from clean source
>> and the basic checks?
>> Ideally it's done consistently by the release managers and verified as
>> part
>> of voting.
>>
>>
>> On Thu, Sep 26, 2019 at 2:49 PM Vinoth Chandar  wrote:
>>
>> > @mentors Hopefully we are very close. Your eyes on this will
>> significantly
>> > help us to get it right!
>> >
>> > On Thu, Sep 26, 2019 at 1:42 PM vbal...@apache.org 
>> > wrote:
>> >
>> > >
>> > > Thanks Luciano for the comments.
>> > > I  looked at other projects that are currently incubating to see how
>> they
>> > > setup top-level LICENSE and NOTICE files. As you mentioned, these
>> files
>> > are
>> > > generated for source release. I have updated HUDI's NOTICE and LICENSE
>> > > files in the same way.
>> > >  I have also addressed other comments. Please review the changes.
>> > > Thanks,Balaji.V
>> > > For Reference, NOTICE and LICENSE in other incubating projects
>> > > 1. https://github.com/apache/incubator-gobblin/blob/master/LICENSE
>> > > 2. https://github.com/apache/incubator-gobblin/blob/master/NOTICE
>> > > 3. https://github.com/apache/incubator-heron/blob/master/LICENSE
>> > > 4. https://github.com/apache/incubator-heron/blob/master/NOTICE
>> > >
>> > >    On Tuesday, September 24, 2019, 06:54:29 AM PDT, Luciano Resende <
>> > > luckbr1...@gmail.com> wrote:
>> > >
>> > >  I will look into this and get back to you tonight.
>> > >
>> > > On Mon, Sep 23, 2019 at 10:30 vbal...@apache.org 
>> > > wrote:
>> > >
>> > > >  Hi Luciano,
>> > > > I went through the licensing link you provided and have addressed
>> all
>> > the
>> > > > comments in this PR :
>> > https://github.com/apache/incubator-hudi/pull/918
>> > > > I have described the steps I used to generate the final NOTICE file.
>> > Can
>> > > > you please review this PR and see if it makes sense.
>> > > >
>> > > > T

Re: [VOTE] Release 0.5.0-incubating, release candidate #2

2019-09-30 Thread vbal...@apache.org
 
Hi Thomas,
Yes, Luciano also referred to this binary issue earlier. We had addressed the 
comments (including binary presence, RAT automation and release automation 
scripts and basic check) as part of 
https://github.com/apache/incubator-hudi/pull/918. Luciano had earlier reviewed 
the PR and I have addressed his follow-up comments. We had requested mentors to 
help review this PR to see if anything is still inconsistent.
If there are no other comments on this PR by this afternoon, I will address any pending comments, create a new RC, and send an email along with a scripted basic validation to help check it.
I have a working document in which I am capturing the release process. I will publish it to a wiki once the first release candidate is approved.
Balaji.V




On Monday, September 30, 2019, 08:44:12 AM PDT, Thomas Weise 
 wrote:  
 
 The source release contains at least one binary:

hudi-0.5.0-incubating-rc2 $ find . -name *.jar
./hudi-cli/lib/dnl/utils/textutils/0.3.3/textutils-0.3.3.jar

There could be more, this was just the first check run.

Have you already scripted building the release candidate from clean source
and the basic checks?
Ideally it's done consistently by the release managers and verified as part
of voting.


On Thu, Sep 26, 2019 at 2:49 PM Vinoth Chandar  wrote:

> @mentors Hopefully we are very close. Your eyes on this will significantly
> help us to get it right!
>
> On Thu, Sep 26, 2019 at 1:42 PM vbal...@apache.org 
> wrote:
>
> >
> > Thanks Luciano for the comments.
> > I  looked at other projects that are currently incubating to see how they
> > setup top-level LICENSE and NOTICE files. As you mentioned, these files
> are
> > generated for source release. I have updated HUDI's NOTICE and LICENSE
> > files in the same way.
> >  I have also addressed other comments. Please review the changes.
> > Thanks,Balaji.V
> > For Reference, NOTICE and LICENSE in other incubating projects
> > 1. https://github.com/apache/incubator-gobblin/blob/master/LICENSE
> > 2. https://github.com/apache/incubator-gobblin/blob/master/NOTICE
> > 3. https://github.com/apache/incubator-heron/blob/master/LICENSE
> > 4. https://github.com/apache/incubator-heron/blob/master/NOTICE
> >
> >    On Tuesday, September 24, 2019, 06:54:29 AM PDT, Luciano Resende <
> > luckbr1...@gmail.com> wrote:
> >
> >  I will look into this and get back to you tonight.
> >
> > On Mon, Sep 23, 2019 at 10:30 vbal...@apache.org 
> > wrote:
> >
> > >  Hi Luciano,
> > > I went through the licensing link you provided and have addressed all
> the
> > > comments in this PR :
> https://github.com/apache/incubator-hudi/pull/918
> > > I have described the steps I used to generate the final NOTICE file.
> Can
> > > you please review this PR and see if it makes sense.
> > >
> > > Thanks,Balaji.V
> > >    On Friday, September 20, 2019, 03:47:56 PM PDT, Luciano Resende <
> > > luckbr1...@gmail.com> wrote:
> > >
> > >  Based on the current DISCLAIMER I am assuming fully compliant release.
> > >
> > > -1 (binding)
> > >
> > > Signatures ok,
> > >
> > > the source distribution contains a binary jar which is not allowed
> > > ./hudi-cli/lib/dnl/utils/textutils/0.3.3/textutils-0.3.3.jar
> > >
> > > Missing headers:
> > >  !? ./README.md
> > >  !? ./RELEASE_NOTES.md
> > > !? ./docker/hoodie/hadoop/prestobase/Dockerfile
> > > !? ./packaging/README.md
> > >
> > > Your notice has too many unnecessary mentions, please see the guide
> here
> > > http://www.apache.org/dev/licensing-howto.html
> > >
> > > Also, you should not add the additional lines such as
> > > "Licensed under the Apache License, Version 2.0 (the "License"); you
> > > may not use this file except in compliance with the License. You may
> > > obtain a copy of the License at"
> > >
> > > " Unless required by applicable law or agreed to in writing, software
> > > distributed under the License is distributed on an "AS IS" BASIS,
> > > WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > > implied. See the License for the specific language governing
> > > permissions and limitations under the License."
> > >
> > > these are already built into the license, and properly worded.
> > >
> > >
> > > On Tue, Sep 17, 2019 at 5:02 PM vbal...@apache.org  >
> > > wrote:
> > > >
> 

Re: [VOTE] Release 0.5.0-incubating, release candidate #2

2019-09-26 Thread vbal...@apache.org
 
Thanks Luciano for the comments.
I looked at other projects that are currently incubating to see how they set up their top-level LICENSE and NOTICE files. As you mentioned, these files are generated for the source release. I have updated Hudi's NOTICE and LICENSE files in the same way.
I have also addressed the other comments. Please review the changes.
Thanks,
Balaji.V
For Reference, NOTICE and LICENSE in other incubating projects
1. https://github.com/apache/incubator-gobblin/blob/master/LICENSE
2. https://github.com/apache/incubator-gobblin/blob/master/NOTICE
3. https://github.com/apache/incubator-heron/blob/master/LICENSE
4. https://github.com/apache/incubator-heron/blob/master/NOTICE

On Tuesday, September 24, 2019, 06:54:29 AM PDT, Luciano Resende 
 wrote:  
 
 I will look into this and get back to you tonight.

On Mon, Sep 23, 2019 at 10:30 vbal...@apache.org  wrote:

>  Hi Luciano,
> I went through the licensing link you provided and have addressed all the
> comments in this PR : https://github.com/apache/incubator-hudi/pull/918
> I have described the steps I used to generate the final NOTICE file. Can
> you please review this PR and see if it makes sense.
>
> Thanks,Balaji.V
>    On Friday, September 20, 2019, 03:47:56 PM PDT, Luciano Resende <
> luckbr1...@gmail.com> wrote:
>
>  Based on the current DISCLAIMER I am assuming fully compliant release.
>
> -1 (binding)
>
> Signatures ok,
>
> the source distribution contains a binary jar which is not allowed
> ./hudi-cli/lib/dnl/utils/textutils/0.3.3/textutils-0.3.3.jar
>
> Missing headers:
>  !? ./README.md
>  !? ./RELEASE_NOTES.md
> !? ./docker/hoodie/hadoop/prestobase/Dockerfile
> !? ./packaging/README.md
>
> Your notice has too many unnecessary mentions, please see the guide here
> http://www.apache.org/dev/licensing-howto.html
>
> Also, you should not add the additional lines such as
> "Licensed under the Apache License, Version 2.0 (the "License"); you
> may not use this file except in compliance with the License. You may
> obtain a copy of the License at"
>
> " Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> implied. See the License for the specific language governing
> permissions and limitations under the License."
>
> these are already built into the license, and properly worded.
>
>
> On Tue, Sep 17, 2019 at 5:02 PM vbal...@apache.org 
> wrote:
> >
> > Hi everyone,We have a new release candidate after addressing issues
> reported in first release candidate (see email thread)The new version is :
> 0.5.0-incubating-rc2. Please review and vote on the release candidate #2
> for version 0.5.0, as follows:
> > [ ] +1, Approve the release[ ] -1, Do not approve the release (please
> provide specific comments)The complete staging area is available for your
> review, which includes:
> >    - JIRA release notes [1]
> >    - The official Apache source release and binary convenience releases
> to be deployed to dist.apache.org [2], which are signed with the key with
> fingerprint AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],
> >
> >    - all artifacts to be deployed to the Maven Central Repository [4]
> >
> >    - source code tag "release-0.5.0-incubating-rc2" [5]
> >
> > The vote will be open for at least 72 hours.
> > Please cast your votes before *Sep. 20 2019, 21:00 UTC*.
> >
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >    -
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
> >    -
> https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc2/
> >    - https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
> >    -
> https://repository.apache.org/content/repositories/orgapachehudi-1002/
> >    -
> https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc2
> >
> >
> > P.S. : As this is a first time where Hudi community will be performing
> release voting, you can look at
> https://lists.apache.org/thread.html/75e40ed5a6e0c3174728a0bcfe86cbcd99ae4778ebe94b839f0674cd@%3Cdev.flink.apache.org%3E
> for some understanding of validations community does to cast their
> votes.Thanks,Balaji.V
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

-- 
Sent from my Mobile device
  

Re: FAQ page

2019-09-23 Thread vbal...@apache.org
 
+1. Awesome job, Vinoth and Nishith, for compiling the initial version of the FAQ. Agree on the idea of replying using the FAQ.
Balaji.V

On Monday, September 23, 2019, 04:41:03 PM PDT, Vinoth Chandar wrote:
 
 First version of the page is now fully completed.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185

Please try to use the FAQs when answering questions on ML and GH. It will
only get better if we manage this effectively and keep improving it.

On Sun, Sep 15, 2019 at 9:41 PM Vinoth Chandar  wrote:

> Thanks! Will work this week to fill out most answers!
> Your help reviewing would also be much appreciated.
> Will keep this thread posted..
>
> On Tue, Sep 10, 2019 at 6:10 PM vino yang  wrote:
>
>> Hi Vinoth,
>>
>> Great job! Thanks for your efforts!
>> I think this page is good for users and developers to let them know Hudi
>> well.
>>
>> Best,
>> Vino
>>
>>
>>
>> Vinoth Chandar  于2019年9月11日周三 上午2:27写道:
>>
>> > Hi all,
>> >
>> > I wrote a list of questions based on mailing list conversations and
>> issues.
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185
>> > <
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185
>> > >
>> >
>> > While I am still working through answers, I thought this can be a good
>> > community driven process.
>> >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-ContributingtoFAQ
>> > <
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-ContributingtoFAQ
>> > >
>> >
>> > Please help by contributing answers or new questions if you can!
>> >
>> > thanks
>> > vinoth
>> >
>>
>  

Re: [VOTE] Release 0.5.0-incubating, release candidate #2

2019-09-23 Thread vbal...@apache.org
 Hi Luciano,
I went through the licensing link you provided and have addressed all the comments in this PR: https://github.com/apache/incubator-hudi/pull/918
I have described the steps I used to generate the final NOTICE file. Can you please review this PR and see if it makes sense?

Thanks,
Balaji.V
On Friday, September 20, 2019, 03:47:56 PM PDT, Luciano Resende 
 wrote:  
 
 Based on the current DISCLAIMER I am assuming fully compliant release.

-1 (binding)

Signatures ok,

the source distribution contains a binary jar which is not allowed
./hudi-cli/lib/dnl/utils/textutils/0.3.3/textutils-0.3.3.jar

Missing headers:
 !? ./README.md
 !? ./RELEASE_NOTES.md
!? ./docker/hoodie/hadoop/prestobase/Dockerfile
!? ./packaging/README.md

Your notice has too many unnecessary mentions, please see the guide here
http://www.apache.org/dev/licensing-howto.html

Also, you should not add the additional lines such as
"Licensed under the Apache License, Version 2.0 (the "License"); you
may not use this file except in compliance with the License. You may
obtain a copy of the License at"

" Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License."

these are already built into the license, and properly worded.


On Tue, Sep 17, 2019 at 5:02 PM vbal...@apache.org  wrote:
>
> Hi everyone,We have a new release candidate after addressing issues reported 
> in first release candidate (see email thread)The new version is : 
> 0.5.0-incubating-rc2. Please review and vote on the release candidate #2 for 
> version 0.5.0, as follows:
> [ ] +1, Approve the release[ ] -1, Do not approve the release (please provide 
> specific comments)The complete staging area is available for your review, 
> which includes:
>    - JIRA release notes [1]
>    - The official Apache source release and binary convenience releases to be 
>deployed to dist.apache.org [2], which are signed with the key with 
>fingerprint AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],
>
>    - all artifacts to be deployed to the Maven Central Repository [4]
>
>    - source code tag "release-0.5.0-incubating-rc2" [5]
>
> The vote will be open for at least 72 hours.
> Please cast your votes before *Sep. 20 2019, 21:00 UTC*.
>
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>    - 
>https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
>    - 
>https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc2/
>    - https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
>    - https://repository.apache.org/content/repositories/orgapachehudi-1002/
>    - 
>https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc2
>
>
> P.S. : As this is a first time where Hudi community will be performing 
> release voting, you can look at 
> https://lists.apache.org/thread.html/75e40ed5a6e0c3174728a0bcfe86cbcd99ae4778ebe94b839f0674cd@%3Cdev.flink.apache.org%3E
>  for some understanding of validations community does to cast their 
> votes.Thanks,Balaji.V



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/  

Re: [VOTE] Release 0.5.0-incubating, release candidate #2

2019-09-21 Thread vbal...@apache.org
Thanks everyone for the help in validating the release. Let's close this voting thread. I will create a new release candidate after addressing the compliance concerns.
Balaji.V 
On Friday, September 20, 2019, 03:47:56 PM PDT, Luciano Resende 
 wrote:  
 
 Based on the current DISCLAIMER I am assuming fully compliant release.

-1 (binding)

Signatures ok,

the source distribution contains a binary jar which is not allowed
./hudi-cli/lib/dnl/utils/textutils/0.3.3/textutils-0.3.3.jar

Missing headers:
 !? ./README.md
 !? ./RELEASE_NOTES.md
!? ./docker/hoodie/hadoop/prestobase/Dockerfile
!? ./packaging/README.md

Your notice has too many unnecessary mentions, please see the guide here
http://www.apache.org/dev/licensing-howto.html

Also, you should not add the additional lines such as
"Licensed under the Apache License, Version 2.0 (the "License"); you
may not use this file except in compliance with the License. You may
obtain a copy of the License at"

" Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License."

these are already built into the license, and properly worded.


On Tue, Sep 17, 2019 at 5:02 PM vbal...@apache.org  wrote:
>
> Hi everyone,We have a new release candidate after addressing issues reported 
> in first release candidate (see email thread)The new version is : 
> 0.5.0-incubating-rc2. Please review and vote on the release candidate #2 for 
> version 0.5.0, as follows:
> [ ] +1, Approve the release[ ] -1, Do not approve the release (please provide 
> specific comments)The complete staging area is available for your review, 
> which includes:
>    - JIRA release notes [1]
>    - The official Apache source release and binary convenience releases to be 
>deployed to dist.apache.org [2], which are signed with the key with 
>fingerprint AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],
>
>    - all artifacts to be deployed to the Maven Central Repository [4]
>
>    - source code tag "release-0.5.0-incubating-rc2" [5]
>
> The vote will be open for at least 72 hours.
> Please cast your votes before *Sep. 20 2019, 21:00 UTC*.
>
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>    - 
>https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
>    - 
>https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc2/
>    - https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
>    - https://repository.apache.org/content/repositories/orgapachehudi-1002/
>    - 
>https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc2
>
>
> P.S. : As this is a first time where Hudi community will be performing 
> release voting, you can look at 
> https://lists.apache.org/thread.html/75e40ed5a6e0c3174728a0bcfe86cbcd99ae4778ebe94b839f0674cd@%3Cdev.flink.apache.org%3E
>  for some understanding of validations community does to cast their 
> votes.Thanks,Balaji.V



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/  

Re: Running compaction using spark

2019-09-19 Thread vbal...@apache.org
 
You can look at 
https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java
 

Balaji.V
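
For reference, a minimal sketch of running it via spark-submit. The option names are taken from the HoodieCompactor class linked above, but the jar, paths and instant time are illustrative, so verify them against the class before use; the compaction instant typically comes from a previously scheduled compaction (e.g., via the hoodie-cli "compaction schedule" command):

   spark-submit \
     --class org.apache.hudi.utilities.HoodieCompactor \
     hudi-utilities-bundle.jar \
     --base-path hdfs://path/to/hudi_table \
     --table-name my_table \
     --instant-time 20190919123000 \
     --schema-file hdfs://path/to/table_schema.avsc \
     --parallelism 100 \
     --spark-memory 4g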
On Thursday, September 19, 2019, 08:15:28 AM PDT, Jaimin Shah 
 wrote:  
 
 Hi
  I am currently running compaction using hoodie-cli. But now I want to run
compaction directly from spark program. Is there any example which I can
refer to for running  compaction using spark program.

Thanks.
  

Re: Facing issues when decoding bytes to Avro

2019-09-19 Thread vbal...@apache.org
 
Hi Pratyaksh,
Looks like you forgot to attach the target schema. Anyway, to debug these schema issues, can you try printing both the schema and the record during encoding and decoding (i.e., when writing records) to get some idea of what is happening?
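For example, something along these lines; "record" and "targetSchema" are illustrative names for the objects in scope at the encode/decode call sites:

   // Schema the record actually carries (used when encoding it to bytes)
   System.out.println("record schema: " + record.getSchema().toString(true));
   // Schema being used to decode the bytes back into a GenericRecord
   System.out.println("target schema: " + targetSchema.toString(true));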
Balaji.V

On Thursday, September 19, 2019, 12:16:00 AM PDT, Pratyaksh Sharma wrote:
 
 Hi Balaji, 

Yes target schema is different from source schema in my case. I am attaching 
sample schemas for your reference. 
I was able to solve this issue by using HoodieJsonPayload along with using 
AvroKafkaSource, where at the time of creating payload, I am calling 
HoodieJsonPayload(.toString()) in HoodieDeltaStreamer so as to 
use the constructor HoodieJsonPayload (String record), via ReflectionUtils. All 
the flattening is still getting done by Transformer class. 
However I am still trying to understand why there was an issue with 
OverwriteWithLatestAvroPayload. I tried googling around, the most common reason 
for mentioned exception seems to be the schemas are having some issue. But if 
schemas had any issue, then it should not even work with HoodieJsonPayload. 


On Wed, Sep 18, 2019 at 9:34 PM vbal...@apache.org  wrote:

 
Hi Pratyaksh,
Since you are using Transformer and altering the schema, you need to make sure 
targetSchema (flattened) is different from source schema (nested). 
https://github.com/apache/incubator-hudi/blob/227785c022939cd2ba153c2a4f7791ab3394c6c7/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/schema/FilebasedSchemaProvider.java#L76


Can you check if you are doing that ?
Regarding storing payload in bytes instead of AvroRecords, this is for 
performance reasons. The specific avro Record object keeps reference to schema 
which results in bloating up the RDD size.
Balaji.V      On Wednesday, September 18, 2019, 02:33:06 AM PDT, Pratyaksh 
Sharma  wrote:  

 Also I am trying to understand why are we storing the
OverwriteWithLatestAvroPayload in the form of bytes and not the actual
record. Apologies if it is a very basic question, I am working on Avro for
the first time.

On Wed, Sep 18, 2019 at 2:25 PM Pratyaksh Sharma 
wrote:

> Hi,
>
> I am trying to use Hudi (hoodie-0.4.7) for building CDC pipeline. I am
> using AvroKafkaSource and FilebasedSchemaProvider. The source schema looks
> something like this where all the columns are nested in a field called
> 'columns' -
>
> {
>
>  "name": "rawdata",
>
>  "type": "record",
>
>  "fields": [
>
>    {
>
>      "name": "type",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "timestamp",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "database",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "table_name",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "binlog_filename",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "binlog_position",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "columns",
>
>      "type": {"type": "map", "values": ["null","string"]}
>
>    }
>
>  ]
>
> }
>
> The target schema has all the columns and I am using transformer class to
> extract the actual column fields from 'columns' field. Everything seems to
> be working fine, however at the time of actual writing, I am getting the
> below exception -
>
> ERROR com.uber.hoodie.io.HoodieIOHandle  - Error writing record
> HoodieRecord{key=HoodieKey { recordKey=123 partitionPath=2019/06/20},
> currentLocation='null', newLocation='null'}
> java.lang.ArrayIndexOutOfBoundsException: 123
> at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
> at
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
> at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> at
> com.uber.hoodie.common.util.

Re: Facing issues when decoding bytes to Avro

2019-09-18 Thread vbal...@apache.org
 
Hi Pratyaksh,
Since you are using Transformer and altering the schema, you need to make sure 
targetSchema (flattened) is different from source schema (nested). 
https://github.com/apache/incubator-hudi/blob/227785c022939cd2ba153c2a4f7791ab3394c6c7/hoodie-utilities/src/main/java/com/uber/hoodie/utilities/schema/FilebasedSchemaProvider.java#L76


Can you check if you are doing that?
Regarding storing the payload as bytes instead of Avro records: this is for performance reasons. The specific Avro record object keeps a reference to its schema, which bloats up the RDD size.
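For reference, a minimal sketch of the resolving decode that fails here, using the plain Avro API. Here writerSchema must be the schema the bytes were serialized with and readerSchema the flattened target one; both names are illustrative:

   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.BinaryDecoder;
   import org.apache.avro.io.DecoderFactory;

   // Resolves from the writer schema to the reader schema while decoding.
   GenericDatumReader<GenericRecord> datumReader =
       new GenericDatumReader<>(writerSchema, readerSchema);
   BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
   GenericRecord record = datumReader.read(null, decoder);

If the bytes were actually written with a schema different from writerSchema, the resolution state machine mis-parses the stream, which is one way to end up with errors like the ArrayIndexOutOfBoundsException in this thread.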
Balaji.V

On Wednesday, September 18, 2019, 02:33:06 AM PDT, Pratyaksh Sharma wrote:
 
 Also I am trying to understand why are we storing the
OverwriteWithLatestAvroPayload in the form of bytes and not the actual
record. Apologies if it is a very basic question, I am working on Avro for
the first time.

On Wed, Sep 18, 2019 at 2:25 PM Pratyaksh Sharma 
wrote:

> Hi,
>
> I am trying to use Hudi (hoodie-0.4.7) for building CDC pipeline. I am
> using AvroKafkaSource and FilebasedSchemaProvider. The source schema looks
> something like this where all the columns are nested in a field called
> 'columns' -
>
> {
>
>  "name": "rawdata",
>
>  "type": "record",
>
>  "fields": [
>
>    {
>
>      "name": "type",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "timestamp",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "database",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "table_name",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "binlog_filename",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "binlog_position",
>
>      "type": "string"
>
>    },
>
>    {
>
>      "name": "columns",
>
>      "type": {"type": "map", "values": ["null","string"]}
>
>    }
>
>  ]
>
> }
>
> The target schema has all the columns and I am using transformer class to
> extract the actual column fields from 'columns' field. Everything seems to
> be working fine, however at the time of actual writing, I am getting the
> below exception -
>
> ERROR com.uber.hoodie.io.HoodieIOHandle  - Error writing record
> HoodieRecord{key=HoodieKey { recordKey=123 partitionPath=2019/06/20},
> currentLocation='null', newLocation='null'}
> java.lang.ArrayIndexOutOfBoundsException: 123
> at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
> at
> org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
> at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> at
> com.uber.hoodie.common.util.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:86)
> at
> com.uber.hoodie.OverwriteWithLatestAvroPayload.getInsertValue(OverwriteWithLatestAvroPayload.java:69)
> at
> com.uber.hoodie.func.CopyOnWriteLazyInsertIterable$HoodieInsertValueGenResult.(CopyOnWriteLazyInsertIterable.java:70)
> at
> com.uber.hoodie.func.CopyOnWriteLazyInsertIterable.lambda$getTransformFunction$0(CopyOnWriteLazyInsertIterable.java:83)
> at
> com.uber.hoodie.common.util.queue.BoundedInMemoryQueue.insertRecord(BoundedInMemoryQueue.java:175)
> at
> com.uber.hoodie.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
> at
> com.uber.hoodie.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:94)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> I have verified the schemas and the data types are fine and in sync. Has
> anyone else faced this issue? Any leads will be helpful.
>
  

[VOTE] Release 0.5.0-incubating, release candidate #2

2019-09-17 Thread vbal...@apache.org
Hi everyone,

We have a new release candidate after addressing the issues reported in the first release candidate (see email thread). The new version is 0.5.0-incubating-rc2. Please review and vote on the release candidate #2 for version 0.5.0, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)
The complete staging area is available for your review, which includes:
   - JIRA release notes [1]
   - The official Apache source release and binary convenience releases to be 
deployed to dist.apache.org [2], which are signed with the key with fingerprint 
AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],   

   - all artifacts to be deployed to the Maven Central Repository [4]   

   - source code tag "release-0.5.0-incubating-rc2" [5]   

The vote will be open for at least 72 hours. 
Please cast your votes before *Sep. 20 2019, 21:00 UTC*. 

It is adopted by majority approval, with at least 3 PMC affirmative votes.   
   - 
https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
   - 
https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.0-incubating-rc2/
   - https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS
   - https://repository.apache.org/content/repositories/orgapachehudi-1002/
   - https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc2 
  
   

P.S.: As this is the first time the Hudi community will be performing release voting, you can look at
https://lists.apache.org/thread.html/75e40ed5a6e0c3174728a0bcfe86cbcd99ae4778ebe94b839f0674cd@%3Cdev.flink.apache.org%3E
for some understanding of the validations the community does to cast their votes.

Thanks,
Balaji.V


Re: Merging schema's during Incremental load

2019-09-16 Thread vbal...@apache.org
 
Hi Gautam,
Independent of using Hudi, it is best practice to manage the schemas of your organization's datasets using some central mechanism like a schema registry. Without this, it is pretty difficult to evolve schemas. It is the schema registry's responsibility to provide the correct schema for your incoming batch.
As you have noted, DeltaStreamer comes with integrations for FilebasedSchemaProvider and Confluent's schema registry. It is pretty easy to add an integration with any other schema provider.
IIUC, the loading from Parquet is one-time (bootstrap); please take a look at https://hudi.apache.org/migration_guide.html#option-1. Regarding CSV incremental upsert, we have an active HIP (https://cwiki.apache.org/confluence/display/HUDI/HIP-1) for supporting CSV sources in DeltaStreamer. So, if you want to use DeltaStreamer as-is, you can do a simple conversion of CSV to JSON and DeltaStreamer will be able to ingest them now.
Balaji.V
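
A minimal sketch of such a custom provider, assuming the SchemaProvider base-class contract used by FilebasedSchemaProvider (getSourceSchema/getTargetSchema and a (TypedProperties, JavaSparkContext) constructor); the class name and property keys are hypothetical, and here the properties are assumed to hold the Avro schema JSON strings:

   import org.apache.avro.Schema;
   import org.apache.spark.api.java.JavaSparkContext;

   // Returns the incoming batch's schema as the source and a hand-maintained
   // union (superset) schema as the target, so every run writes all columns.
   public class UnionSchemaProvider extends SchemaProvider {

     private final Schema sourceSchema;
     private final Schema unionSchema;

     public UnionSchemaProvider(TypedProperties props, JavaSparkContext jssc) {
       super(props, jssc);
       this.sourceSchema = new Schema.Parser()
           .parse(props.getString("example.schemaprovider.source.schema"));
       this.unionSchema = new Schema.Parser()
           .parse(props.getString("example.schemaprovider.union.schema"));
     }

     @Override
     public Schema getSourceSchema() { return sourceSchema; }

     @Override
     public Schema getTargetSchema() { return unionSchema; }
   }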






On Saturday, September 14, 2019, 12:38:03 PM PDT, Gautam Nayak 
 wrote:  
 
 Thanks Balaji for the detailed information.One of our pipeline sources data 
from databases [1] (incremental Sqoop) as parquet files and the other pipeline 
sources system generated incremental CSV’s [2]  , both of which have to be 
persisted and read as Hive/Presto tables.In both these cases, We are seeing 
columns getting removed over time.We want our warehouse tables to keep track of 
all the columns from the day we started ingesting, Which we are currently doing 
using bulk merge (Spark) and custom schema evolution which is not appropriate 
for large datasets considering the full data scan it has to go through.

For [1], We are looking to use HoodieDeltaStreamer but since the data is in 
parquet, we are not sure if it's supported.
For [2], We are unsure of using HoodieDeltaStreamer, So we also want to have an 
option of using Datasource writer.

As you have mentioned that for HoodieDeltaStreamer, We need to provide a custom 
schema-provider class which will union the schema, but I am not sure if this 
will be an adhoc process which will first read the incremental data , infer the 
schema and then union with existing schema ? Because the implementations that I 
see in hudi-utilities are related to reading schema from File, Row, 
SchemaRegistry and nothing related to unioning schema. Does hoodie provide this 
functionality ?
Thanks Gautam


> On Sep 14, 2019, at 12:34 AM, vbal...@apache.org wrote:
> 
> 
> 
> [External Email]
> 
> 
> Hi Gautam,
> What I understood was you are trying to incrementally ingest from 
> RowBasedSource. It is not clear to me if this upstream source is another 
> HoodieIncrSource. If that is the case, not sure how the second batch will 
> miss the columns. Can you elaborate more on the setup and what your upstream 
> source is ?
> Anyways, It is ok for incremental dataset (second batch to be ingested) to 
> have fewer columns than those (in the first batch) as long as the missing 
> columns are nullable (Avro backwards compatible).  But per contract, Hudi 
> needs the latest schema (union schema) for every ingestion run. If you had 
> passed the schema (with columns missing), then its possible to lose the 
> columns. Hudi COW reads the older version of the file and creates newer 
> version using the schema passed. So, if the schema passed has missing 
> columns, both the old record and new records which were in the same file will 
> be missing the column.
>  IIUC, you would need to provide a schema-provider in HoodieDeltaStreamer 
>execution (--schema-provider-class) where the schema returned is the 
>union-schema.
> Let me know if this makes sense. Also please elaborate on your pipeline setup.
> Thanks,Balaji.V
> 
>    On Friday, September 13, 2019, 02:33:16 PM PDT, Gautam Nayak 
> wrote:  
> 
> Hi,
> We have been evaluating Hudi and there is one use case we are trying to 
> solve, where incremental datasets can have fewer columns than the ones that 
> have been already persisted in Hudi format.
> 
> For example : In initial batch , We have a total of 4 columns
>    val initial = Seq(("id1", "col1", "col2", 123456)).toDF("pk", "col1", 
>"col2", "ts")
> 
> and in the incremental batch, We have 3 columns
> val incremental = Seq(("id2", "col1", 123879)).toDF("id", "col1", "ts")
> 
> We want to have a union of initial and incremental schemas such that col2 of 
> id2 has some default type associated to it. But what we are seeing is the 
> latest schema(incremental) for both the records when we persist the data 
> (COW) and read it back through Spark. The actual incrementals datasets would 
> be in Avro format but we do not maintain their schemas.
> I tried looking through the documentation to see if there is a specific 
> configuration to achieve this, but couldn’t find any.
> We would also want to achieve this via Deltastreamer and then query these 
> results from Presto.
> 
> Thanks,
> Gautam
> 
> 
> 
> 
> 

  

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vbal...@apache.org
 
+1. This is a pretty large undertaking. While the community is getting their hands dirty and ramping up on Hudi internals, it would be productive if Vinoth shepherds this.
Balaji.V

On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar wrote:
 
 sg. :)

I will wait for others on this thread as well to chime in.

On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala  wrote:

> Vinoth, I think right now given your experience with the project you should
> be scoping out what needs to be done to take us there. So +1 for giving you
> more work :)
>
> We want to reach a point where we can start scoping out addition of Flink
> and Beam components within. Then I think will tremendous progress.
>
> On Mon, Sep 16, 2019, 11:43 PM Vinoth Chandar  wrote:
>
> > I still feel the key thing here is reimplementing HoodieBloomIndex
> without
> > needing spark caching.
> >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design&Architecture-BloomIndex(non-global)
> >  documents the spark DAG in detail.
> >
> > If everyone feels, it's best for me to scope the work out, then happy to
> do
> > it!
> >
> > On Mon, Sep 16, 2019 at 10:23 AM Taher Koitawala 
> > wrote:
> >
> > > Guys I think we are slowing down on this again. We need to start
> planning
> > > small small tasks towards this VC please can you help fast track this?
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > > On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar 
> wrote:
> > >
> > > > Look forward to the analysis. A key class to read would be
> > > > HoodieBloomIndex, which uses a lot of spark caching and shuffles.
> > > >
> > > > On Tue, Aug 13, 2019 at 7:52 PM vino yang 
> > wrote:
> > > >
> > > > > >> Currently Spark Streaming micro batching fits well with Hudi,
> > since
> > > it
> > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> micro
> > > > batch
> > > > > = 1 hudi commit
> > > > > With the per-record model in Flink, I am not sure how useful it
> will
> > be
> > > > to
> > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit, it
> > will
> > > > be
> > > > > inefficient..
> > > > >
> > > > > Yes, if 1 input record = 1 hudi commit, it would be inefficient.
> > About
> > > > > Flink streaming, we can also implement the "batch" and
> "micro-batch"
> > > > model
> > > > > when process data. For example:
> > > > >
> > > > >    - aggregation: use flexibility window mechanism;
> > > > >    - non-aggregation: use Flink stateful state API cache a batch
> data
> > > > >
> > > > >
> > > > > >> On first focussing on decoupling of Spark and Hudi alone, yes a
> > full
> > > > > summary of how Spark is being used in a wiki page is a good start
> > IMO.
> > > We
> > > > > can then hash out what can be generalized and what cannot be and
> > needs
> > > to
> > > > > be left in hudi-client-spark vs hudi-client-core
> > > > >
> > > > > agree
> > > > >
> > > > > Vinoth Chandar  于2019年8月14日周三 上午8:35写道:
> > > > >
> > > > > > >> We should only stick to Flink Streaming. Furthermore if there
> > is a
> > > > > > requirement for batch then users
> > > > > > >> should use Spark or then we will anyway have a beam
> integration
> > > > coming
> > > > > > up.
> > > > > >
> > > > > > Currently Spark Streaming micro batching fits well with Hudi,
> since
> > > it
> > > > > > amortizes the cost of indexing, workload profiling etc. 1 spark
> > micro
> > > > > batch
> > > > > > = 1 hudi commit
> > > > > > With the per-record model in Flink, I am not sure how useful it
> > will
> > > be
> > > > > to
> > > > > > support hudi.. for e.g, 1 input record cannot be 1 hudi commit,
> it
> > > will
> > > > > be
> > > > > > inefficient..
> > > > > >
> > > > > > On first focussing on decoupling of Spark and Hudi alone, yes a
> > full
> > > > > > summary of how Spark is being used in a wiki page is a good start
> > > IMO.
> > > > We
> > > > > > can then hash out what can be generalized and what cannot be and
> > > needs
> > > > to
> > > > > > be left in hudi-client-spark vs hudi-client-core
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 3:57 AM vino yang  >
> > > > wrote:
> > > > > >
> > > > > > > Hi Nick and Taher,
> > > > > > >
> > > > > > > I just want to answer Nishith's question. Reference his old
> > > > description
> > > > > > > here:
> > > > > > >
> > > > > > > > You can do a parallel investigation while we are deciding on
> > the
> > > > > module
> > > > > > > structure.  You could be looking at all the patterns in Hudi's
> > > Spark
> > > > > APIs
> > > > > > > usage (RDD/DataSource/SparkContext) and see if such support can
> > be
> > > > > > achieved
> > > > > > > in theory with Flink. If not, what is the workaround.
> Documenting
> > > > such
> > > > > > > patterns would be valuable when multiple engineers are working
> on
> > > it.
> > > > > For
> > > > > > > e:g, Hudi relies on    (a) custom partitioning logic for
> > upserts,
> > > > > >  (b)
> > > > > > > caching RDDs to avoid reruns

Re: [BUG] Null Pointer Exception in SourceFormatAdapter

2019-09-16 Thread vbal...@apache.org
Yes, it makes sense to add validations with descriptive messages. Please open a ticket and send a PR for this.
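For instance, a sketch of the kind of check meant here; the message text and placement are illustrative:

   // Fail fast with a descriptive message instead of an NPE when an Avro
   // source is used without a schema provider.
   if (schemaProvider == null) {
     throw new IllegalArgumentException(
         "Please supply a schema provider (--schemaprovider-class) when using Avro "
             + "sources such as AvroKafkaSource; the schema cannot be inferred from "
             + "Avro byte payloads.");
   }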
Thanks,
Balaji.V

On Monday, September 16, 2019, 01:11:12 AM PDT, Pratyaksh Sharma wrote:
 
 Hi Balaji,

I get your point. However I feel in such cases, instead of throwing a Null
Pointer, we should handle the case gracefully. The exception should be
thrown with proper user-facing message. Please let me know your thoughts
on this.

On Fri, Sep 13, 2019 at 7:26 PM Balaji Varadarajan
 wrote:

>  Hi Pratyaksh,
> This is expected. You need to pass a schema-provider since you are using
> Avro Sources.For RowBased sources, DeltaStreamer can deduce schema from Row
> type information available from Spark Dataset.
> Balaji.V
>    On Friday, September 13, 2019, 02:57:37 AM PDT, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
>  Hi,
>
> I am trying to build a CDC pipeline using Hudi working on tag hoodie-0.4.7.
> Here is the command I used for running DeltaStreamer -
>
> spark-submit --files jaas.conf --conf
> 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf'
> --conf
>
> 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf'
> --master yarn --deploy-mode cluster --num-executors 2 --class
> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> /path/to/hoodie-utilities-0.4.7.jar --storage-type COPY_ON_WRITE
> --source-class com.uber.hoodie.utilities.sources.AvroKafkaSource
> --source-ordering-field  --target-base-path hdfs://path/to/cow_table
> --target-table cow_table --props hdfs://path/to/fg-kafka-source.properties
> --transformer-class com.uber.hoodie.utilities.transform.DebeziumTransformer
> --spark-master yarn-cluster --source-limit 5000
>
> Basically I have not passed any SchemaProvider class in the command. When I
> run the above command, I get the below exception in SourceFormatAdapter and
> the job gets killed -
>
> java.lang.NullPointerException
> at
>
> com.uber.hoodie.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94)
> at
>
> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:224)
> at
>
> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:504)
>
> In HoodieDeltaStreamer class, we try to initiate RowBasedSchemaProvider
> before registering Avro Schemas if the schemaProvider variable is null.
> Hence I am trying to understand if the above exception is expected
> behaviour.
>
> Please help.
>
  

Re: [VOTE] Release 0.5.0-incubating, release candidate #1

2019-09-15 Thread vbal...@apache.org
 
Thanks Thomas for the advice. I have started looking into other voting threads in the archives to collect a list of things to do.
I will fix the release folder structure of Hudi as part of publishing RC2. I was following the Flink release process model, but Flink has already graduated, hence this issue.

Thanks,
Balaji.V

On Sunday, September 15, 2019, 08:03:05 AM PDT, Thomas Weise 
 wrote:  
 
 There is an issue with the dist folder structure.

Incubator releases should be under
https://dist.apache.org/repos/dist/release/incubator/
<https://dist.apache.org/repos/dist/release/incubator/hudi/>

How was the top level hudi folder created?
https://dist.apache.org/repos/dist/release/hudi/



On Sun, Sep 15, 2019 at 7:55 AM Thomas Weise  wrote:

> Please note that after this release is approved by the PPMC, another vote
> will be needed on general@incubator.
>
> I would recommend to review comments and issues of other incubator
> releases votes in the archive as some tend to repeat for first releases of
> a podling.
>
>
> On Sat, Sep 14, 2019 at 7:39 PM vbal...@apache.org 
> wrote:
>
>>  Good point Vinoth. It looks like this is needed for incubator projects.
>> Let me go ahead and add this DISCLAIMER file and also check if NOTICE
>> files can be cleaned up.
>> Folks,  I am going to -1 and drop release candidate-1.  I will create a
>> new RC release after I fixed the above. Thanks a lot for verifying other
>> aspects. I will send information about RC-2 shortly.
>> Balaji.V
>>    On Saturday, September 14, 2019, 07:21:41 PM PDT, Bhavani Sudha
>> Saktheeswaran  wrote:
>>
>>  +1 (non-binding) on other aspects.
>> - verified checksums and signatures [SUCCESS]
>> - built from source release (mvn clean install -DskipTests) [SUCCESS]
>> - ran local docker tests [SUCCESS]
>> - ran some IDE tests [SUCCESS]
>>
>> Thanks,
>> Sudha
>>
>>
>> On Sat, Sep 14, 2019 at 6:46 PM Vinoth Chandar  wrote:
>>
>> > -1 (binding)
>> >
>> > - Checksums & Signatures verify
>> > - Built the branch & tests pass
>> > - My own test jobs seem to work
>> >  - Checked pom for version
>> >  - NOTICE and LICENSE I think were updated right before RC was cut.
>> Should
>> > be good to go
>> >  - Source files all have ASF license . Tested rat plugin fails build if
>> > java/scala files don't have license.
>> >
>> > But, checked other vote threads on general@incubator to understand any
>> > gaps
>> > [1] and have some concerns
>> > Most discussions mention DISCLAIMER. is this the disclaimer we have on
>> > site? or a separate file like this
>> >
>> >
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dheron_blob_master_DISCLAIMER-3F&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=6OACEXf4lzybehn_i2Q0YBCpsZTN98wE4ii347BUKDU&s=luHVLuRJrqND1UCrOiTq176TCoQJ9SIfZGNRPyGRAc4&e=
>> > If
>> > latter, I think we need to add it.
>> >
>> > Release manager, kindly take note if we need to do anything to handle
>> these
>> > before the general vote
>> >
>> > P.S: Found this to be a great resource for verifying the package
>> >
>> >
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.apache.org_info_verification.html&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=6OACEXf4lzybehn_i2Q0YBCpsZTN98wE4ii347BUKDU&s=VE-T7bsasyA-9IBP5mztzksc5FHjQTHK9sEySUco8wA&e=
>> >
>> > On Sat, Sep 14, 2019 at 9:17 AM Prasanna Rajaperumal <
>> prasa...@apache.org>
>> > wrote:
>> >
>> > > +1 (binding)
>> > >
>> > > Great job getting the RC out!
>> > >
>> > > - verified checksums
>> > > - verified signatures
>> > > - Built the branch and my tests pass
>> > >
>> > >
>> > > On 2019/09/14 13:19:25, leesf  wrote:
>> > > > +1 (non-binding)
>> > > >
>> > > > - verified checksums and signatures - OK
>> > > > - checked that all pom.xml files point to the same
>> > > > version(0.5.0-incubating-rc1) - OK
>> > > > - built from source(mvn clean install -DskipTests) - OK
>> > > > - ran some tests in IDE - OK
>> > > >
>> > > > Best,
>> > > > Leesf
>> > > >
>> > > > vbal...@apache.org  于2019年9月14日周六 上午6:32写道:
>> > > >
>> > > > > Hi everyone

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-14 Thread vbal...@apache.org
 
+1. Agree with everyone's point. Go for it Taher !!
Balaji.V

On Saturday, September 14, 2019, 07:44:04 PM PDT, Bhavani Sudha
Saktheeswaran  wrote:  
 
 +1 I  think adding new sources to DeltaStreamer is really valuable.

Thanks,
Sudha

On Sat, Sep 14, 2019 at 7:52 AM vino yang  wrote:

> Hi Taher,
>
> IMO, it's a good supplement to Hudi.
>
> So +1 from my side.
>
> Vinoth Chandar  于2019年9月14日周六 下午10:23写道:
>
> > Hi Taher,
> >
> > I am fully onboard on this. This is such a frequently asked question and
> > having it all doable with a simple DeltaStreamer command would be really
> > powerful.
> >
> > +1
> >
> > - Vinoth
> >
> > On 2019/09/14 05:51:05, Taher Koitawala  wrote:
> > > Hi All,
> > >          Currently, we are trying to pull data incrementally from our
> > RDBMS
> > > sources, however the way we are doing this is with HUDI is to create a
> > > spark table on top of the JDBC source using [1] which writes raw data
> to
> > an
> > > HDFS dir. We then use DeltaStreamer dfs-source to write that to a HUDI
> > > upsert COPY_ON_WRITE table.
> > >
> > >          However, I think it would be really helpful in such use cases
> > > where DeltaStreamer had something like a JDBC-source instead of sqoop
> or
> > > temp tables and then we could leave that in a continuous mode with a
> > > timestamp column and an interval which allows us to express how
> > frequently
> > > DeltaStreamer should check for new updates or inserts on RDBMS.
> > >
> > > 1: CREATE TABLE mysql_temp_table
> > > USING org.apache.spark.sql.jdbc
> > > OPTIONS (
> > >      url  "jdbc:mysql://
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__data.source.mysql.com&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=kd2JZkFO9u_nWk8s__l1rNlfZ0cQ_zXOjURNBNmmJo4&s=zIAG-Ct3xm-8XBHg7Gv4mxPF7YpQJ5wxWTarYnJlJDE&e=
> :3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL
> > > ",
> > >      dbtable "database.table_name",
> > >      fetchSize "100",
> > >      partitionColumn "contact_id", lowerBound "1",
> > > upperBound "2962429",
> > > numPartitions "62"
> > > );
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> >
>  

Re: [VOTE] Release 0.5.0-incubating, release candidate #1

2019-09-14 Thread vbal...@apache.org
Good point, Vinoth. It looks like this is needed for incubator projects. Let me go ahead and add this DISCLAIMER file and also check if the NOTICE files can be cleaned up.
Folks, I am going to -1 and drop release candidate 1. I will create a new RC after I fix the above. Thanks a lot for verifying the other aspects. I will send information about RC2 shortly.
Balaji.V
On Saturday, September 14, 2019, 07:21:41 PM PDT, Bhavani Sudha 
Saktheeswaran  wrote:  
 
 +1 (non-binding) on other aspects.
- verified checksums and signatures [SUCCESS]
- built from source release (mvn clean install -DskipTests) [SUCCESS]
- ran local docker tests [SUCCESS]
- ran some IDE tests [SUCCESS]

Thanks,
Sudha


On Sat, Sep 14, 2019 at 6:46 PM Vinoth Chandar  wrote:

> -1 (binding)
>
> - Checksums & Signatures verify
> - Built the branch & tests pass
> - My own test jobs seem to work
>  - Checked pom for version
>  - NOTICE and LICENSE I think were updated right before RC was cut. Should
> be good to go
>  - Source files all have ASF license . Tested rat plugin fails build if
> java/scala files don't have license.
>
> But, checked other vote threads on general@incubator to understand any
> gaps
> [1] and have some concerns
> Most discussions mention DISCLAIMER. is this the disclaimer we have on
> site? or a separate file like this
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_incubator-2Dheron_blob_master_DISCLAIMER-3F&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=6OACEXf4lzybehn_i2Q0YBCpsZTN98wE4ii347BUKDU&s=luHVLuRJrqND1UCrOiTq176TCoQJ9SIfZGNRPyGRAc4&e=
> If
> latter, I think we need to add it.
>
> Release manager, kindly take note if we need to do anything to handle these
> before the general vote
>
> P.S: Found this to be a great resource for verifying the package
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.apache.org_info_verification.html&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=6OACEXf4lzybehn_i2Q0YBCpsZTN98wE4ii347BUKDU&s=VE-T7bsasyA-9IBP5mztzksc5FHjQTHK9sEySUco8wA&e=
>
> On Sat, Sep 14, 2019 at 9:17 AM Prasanna Rajaperumal 
> wrote:
>
> > +1 (binding)
> >
> > Great job getting the RC out!
> >
> > - verified checksums
> > - verified signatures
> > - Built the branch and my tests pass
> >
> >
> > On 2019/09/14 13:19:25, leesf  wrote:
> > > +1 (non-binding)
> > >
> > > - verified checksums and signatures - OK
> > > - checked that all pom.xml files point to the same
> > > version(0.5.0-incubating-rc1) - OK
> > > - built from source(mvn clean install -DskipTests) - OK
> > > - ran some tests in IDE - OK
> > >
> > > Best,
> > > Leesf
> > >
> > > vbal...@apache.org  于2019年9月14日周六 上午6:32写道:
> > >
> > > > Hi everyone, We have prepared the first apache release candidate for
> > > > Apache Hudi (incubating). The version is : 0.5.0-incubating-rc1.
> Please
> > > > review and vote on the release candidate #1 for the version 0.5.0, as
> > > > follows:[ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > The complete staging area is available for your review, which
> includes:
> > > >    - JIRA release notes [1]
> > > >    - The official Apache source release and binary convenience
> > releases to
> > > > be deployed to
> https://urldefense.proofpoint.com/v2/url?u=http-3A__dist.apache.org&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGdyS50OKdxizX0&m=6OACEXf4lzybehn_i2Q0YBCpsZTN98wE4ii347BUKDU&s=ecQHabJSqEgYr80kZcxsVRV6TD2cI1GC08OdVtbIHyo&e=
> [2], which are signed with the key with
> > > > fingerprint AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],
> > > >
> > > >    - all artifacts to be deployed to the Maven Central Repository [4]
> > > >
> > > >    - source code tag "release-0.5.0-incubating-rc1" [5]
> > > >
> > > > The vote will be open for at least 72 hours.
> > > > Please cast your votes before *Sep. 18th 2019, 23:00 UTC*.
> > > >
> > > > It is adopted by majority approval, with at least 3 PMC affirmative
> > > > votes.
> > > >    -
> > > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__jira.apache.org_jira_secure_ReleaseNote.jspa-3FprojectId-3D12322822-26version-3D12346087&d=DwIFaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=oyPDRKU5b-LuEWWyf8gacx4mFFydIGd

Re: Merging schema's during Incremental load

2019-09-14 Thread vbal...@apache.org
 Hi Gautam,
What I understood was that you are trying to incrementally ingest from a RowBasedSource. It is not clear to me if this upstream source is another HoodieIncrSource; if that is the case, I am not sure how the second batch would miss the columns. Can you elaborate more on the setup and what your upstream source is?
Anyway, it is OK for the incremental dataset (the second batch to be ingested) to have fewer columns than the first batch, as long as the missing columns are nullable (Avro backwards compatible). But per contract, Hudi needs the latest schema (the union schema) for every ingestion run. If the schema you pass has missing columns, then it is possible to lose those columns: Hudi COW reads the older version of the file and creates a newer version using the schema passed, so both the old records and the new records in the same file will be missing the column.
IIUC, you would need to provide a schema provider in the HoodieDeltaStreamer execution (--schemaprovider-class) where the schema returned is the union schema.
Let me know if this makes sense. Also please elaborate on your pipeline setup.
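For example, a column that may disappear from later batches should be declared nullable with a default in that union schema, along these lines (illustrative Avro JSON):

   {
     "name": "col2",
     "type": ["null", "string"],
     "default": null
   }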
Thanks,
Balaji.V

On Friday, September 13, 2019, 02:33:16 PM PDT, Gautam Nayak 
 wrote:  
 
 Hi,
We have been evaluating Hudi and there is one use case we are trying to solve, 
where incremental datasets can have fewer columns than the ones that have been 
already persisted in Hudi format.

For example: in the initial batch, we have a total of 4 columns
    val initial = Seq(("id1", "col1", "col2", 123456)).toDF("pk", "col1",
"col2", "ts")

and in the incremental batch, we have 3 columns
    val incremental = Seq(("id2", "col1", 123879)).toDF("pk", "col1", "ts")

We want to have a union of the initial and incremental schemas such that col2
of id2 gets some default value. But what we are seeing is the latest
(incremental) schema for both records when we persist the data (COW) and read
it back through Spark. The actual incremental datasets would be in Avro
format, but we do not maintain their schemas.
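Concretely, using the values above, the result we would expect to read back
after the second write is roughly:

    pk  | col1 | col2 | ts
    id1 | col1 | col2 | 123456
    id2 | col1 | null | 123879

i.e., col2 is null (defaulted) for id2 rather than dropped from the table's
schema altogether.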
I tried looking through the documentation to see if there is a specific 
configuration to achieve this, but couldn’t find any.
We would also want to achieve this via Deltastreamer and then query these 
results from Presto.

Thanks,
Gautam




  

[VOTE] Release 0.5.0-incubating, release candidate #1

2019-09-13 Thread vbal...@apache.org
Hi everyone,

We have prepared the first Apache release candidate for Apache Hudi
(incubating). The version is 0.5.0-incubating-rc1. Please review and vote on
release candidate #1 for version 0.5.0, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)
The complete staging area is available for your review, which includes:   
   - JIRA release notes [1]
   - The official Apache source release and binary convenience releases to be 
deployed to dist.apache.org [2], which are signed with the key with fingerprint 
AF9BAF79D311A3D3288E583F24A499037262AAA4  [3],   

   - all artifacts to be deployed to the Maven Central Repository [4]   

   - source code tag "release-0.5.0-incubating-rc1" [5]   

The vote will be open for at least 72 hours. 
Please cast your votes before *Sep. 18th 2019, 23:00 UTC*. 

It is adopted by majority approval, with at least 3 PMC affirmative votes.   
   [1]
https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346087
   [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.5.0-incubating-rc1/
   [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
   [4] https://repository.apache.org/content/repositories/orgapachehudi-1001/
   [5] https://github.com/apache/incubator-hudi/tree/release-0.5.0-incubating-rc1
  
   

P.S.: As this is the first time the Hudi community is performing release
voting, you can look at
https://lists.apache.org/thread.html/75e40ed5a6e0c3174728a0bcfe86cbcd99ae4778ebe94b839f0674cd@%3Cdev.flink.apache.org%3E
for an idea of the validations the community performs before casting votes.

Thanks,
Balaji.V
votes.Thanks,Balaji.V 


Re: [BUG] Exception when running HoodieDeltaStreamer

2019-09-13 Thread vbal...@apache.org
 
Hi Pratyaksh,
For boolean flags, you don't need to pass true or false; the presence of the
flag already implies true. Just pass "--enable-hive-sync" on the command line
with no trailing value.
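To see why, here is a standalone sketch against plain jcommander (the same
library that shows up shaded as com.uber.hoodie.com.beust.jcommander in your
stack trace); FlagDemo is a made-up class, not the actual
HoodieDeltaStreamer.Config source. jcommander treats a Boolean field as an
arity-0 flag, so a trailing "false" is parsed as an unexpected main
parameter, which is exactly the error you hit:

import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;

public class FlagDemo {
  // Boolean fields are arity-0 flags: the flag's presence sets them to true.
  @Parameter(names = "--enable-hive-sync", description = "Enable Hive sync")
  public Boolean enableHiveSync = false;

  public static void main(String[] args) {
    FlagDemo cfg = new FlagDemo();
    new JCommander(cfg, "--enable-hive-sync");  // flag alone => true
    System.out.println(cfg.enableHiveSync);     // prints: true

    // This variant throws ParameterException: "Was passed main parameter
    // 'false' but no main parameter was defined"
    // new JCommander(new FlagDemo(), "--enable-hive-sync", "false");
  }
}

So in your spark-submit invocation, simply drop the value after
--enable-hive-sync.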
Balaji.V

On Friday, September 13, 2019, 03:06:38 AM PDT, Pratyaksh Sharma
 wrote:
 
 Hi,

I am trying to run HoodieDeltaStreamer and am working on tag hoodie-0.4.7.
I am using spark version 2.3.2. I was trying to enable hive sync along with
running HoodieDeltaStreamer by passing the flag --enable-hive-sync as true.
Here is the command I used -

spark-submit --master local[1] --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
/home/ubuntu/pratyaksh/hoodie-utilities-debezium-0.4.7.jar
--enable-hive-sync false --storage-type COPY_ON_WRITE --source-class
com.uber.hoodie.utilities.sources.JsonDFSSource --source-ordering-field
  --target-base-path hdfs://path/to/cow_table --target-table cow_table
--props hdfs://path/to/fg-kafka-source.properties --schemaprovider-class
com.uber.hoodie.utilities.schema.FilebasedSchemaProvider

However I got the below exception -

Exception in thread "main"
com.uber.hoodie.com.beust.jcommander.ParameterException: Was passed main
parameter 'false' but no main parameter was defined
at
com.uber.hoodie.com.beust.jcommander.JCommander.getMainParameter(JCommander.java:914)
at
com.uber.hoodie.com.beust.jcommander.JCommander.parseValues(JCommander.java:759)
at
com.uber.hoodie.com.beust.jcommander.JCommander.parse(JCommander.java:282)
at
com.uber.hoodie.com.beust.jcommander.JCommander.parse(JCommander.java:265)
at
com.uber.hoodie.com.beust.jcommander.JCommander.<init>(JCommander.java:210)
at
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:493)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/09/12 13:59:16 INFO ShutdownHookManager: Shutdown hook called

I was able to fix this by changing the enableHiveSync variable in the
HoodieDeltaStreamer.Config class to an integer type.

Has anybody faced the above issue? Any leads will be appreciated.
  

Re: Apache Pulsar component for Hudi

2019-09-12 Thread vbal...@apache.org
 
+1 for adding new types of data pipes.

On Thursday, September 12, 2019, 01:39:20 AM PDT, taher koitawala
 wrote:
 
 Hi All,
        A Jira has been opened to track the issue.
https://issues.apache.org/jira/projects/HUDI/issues/HUDI-246

On Thu, Sep 12, 2019 at 10:48 AM Vinoth Chandar  wrote:

> yes JIRA would be great to scope out the work.
>
> On Wed, Sep 11, 2019 at 10:00 PM Bhavani Sudha Saktheeswaran
>  wrote:
>
> > +1 for integrating Apache Pulsar.
> >
> > On Wed, Sep 11, 2019 at 8:58 PM taher koitawala 
> > wrote:
> >
> > > Should we file a jira? If everyone agrees?
> > >
> > > On Thu, Sep 12, 2019, 6:30 AM vino yang  wrote:
> > >
> > > > +1 to welcome Pulsar connector
> > > >
> > > > Vinoth Chandar wrote on Thursday, September 12, 2019, at 6:57 AM:
> > > >
> > > > > +1 Always welcome new sources. Any takers for a PulsarSource in
> > > > > DeltaStreamer?
> > > > >
> > > > > On Tue, Sep 10, 2019 at 4:33 AM taher koitawala <
> taher...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Vinoth,
> > > > > >              Apache Pulsar is a pub/sub messaging system like
> > Kafka,
> > > > > > however, it has a few more functions which makes it different
> like
> > > > > > serverless per record etl at Pulsar level. Pulsar auto service
> > > > discovery
> > > > > > etc.
> > > > > >            As Pulsar is picking up pace should we bring that as a
> > > > > component
> > > > > > in DeltaStreamer?
> > > > > >
> > > > > > Thanks,
> > > > > > Taher Koitawala
> > > > > >
> > > > >
> > > >
> > >
> >
>  

Re: ApacheCon NA 19 slides

2019-09-11 Thread vbal...@apache.org
 
Thanks guys. The talk was primarily focused on a high-level design for
building data lakes, hence we did not go too deep into lower-level details.
Not sure if/when ApacheCon is going to publish the talk video; we will
meanwhile add the slides to the powered-by section.

On Wednesday, September 11, 2019, 03:34:15 AM PDT, leesf
 wrote:
 
 Also, for easy access by others, we can link the talk and slides to Talk &
Presentations on this page (https://hudi.apache.org/powered_by.html).

Best
Leesf

leesf wrote on Wednesday, September 11, 2019, at 6:21 PM:

> The slides are very detailed and can help others know better about hudi.
> Thanks for sharing!
>
> Best
> Leesf
>
> taher koitawala wrote on Wednesday, September 11, 2019, at 3:26 PM:
>
>> Hi Vinoth,
>>              Slides look amazing to me. However, shouldn't we give out
>> some more clarity on Hoodie Index, Compactions and also how we can do UDFs
>> when pulling data to Hudi? Other than that, the slides and explanation are
>> great.
>>
>> Regards,
>> Taher Koitawala
>>
>> On Wed, Sep 11, 2019 at 12:44 PM Vinoth Chandar 
>> wrote:
>>
>> > Hi all,
>> >
>> > You might have noticed reduced responses this week. Reason was that
>> Balaji
>> > and I were prepping for our talk at ApacheCon.
>> >
>> > Shared the slides here
>> >
>> >
>> >
>> https://docs.google.com/presentation/d/1FHhsvh70ZP6xXlHdVsAI0g__B_6Mpto5KQFlZ0b8-mM
>> >
>> > Thanks
>> > Vinoth
>> >
>>
>  

Re: Dropping support for Spark 2.2 and lower

2019-09-10 Thread vbal...@apache.org
 +1 Would be super useful.
Balaji.V
On Tuesday, September 10, 2019, 04:22:53 AM PDT, taher koitawala 
 wrote:  
 
 +1 we can drop that

On Tue, Sep 10, 2019 at 4:45 PM Kabeer Ahmed  wrote:

> +1.
>
> I am on spark 2.3 but would love to move to Spark 2.4.
> On Sep 10 2019, at 12:16 am, Vinoth Chandar  wrote:
> > Hello all,
> >
> > I am trying to gauge what Spark version everyone is on. We would like to
> > move the Spark version to 2.4 and simplify a whole bunch of stuff. As a
> > best effort, we can try to make 2.3 work reliably. Any objections?
> >
> > Note that if you are using the RDD based hudi-client primarily, this
> should
> > not affect you per se.
> >
> > Thanks
> > Vinoth
> >
>
>
  

Re: [For Mentors] Readiness for IP Clearance

2019-09-04 Thread vbal...@apache.org
 
Pinging to see if one of the mentors can update the XML page :)

Thanks,
Balaji.V

On Friday, August 30, 2019, 05:29:52 PM PDT, Thomas Weise
 wrote:
 
 The signed CCLA was recorded on 2019/05/09


On Fri, Aug 30, 2019 at 5:24 PM Vinoth Chandar  wrote:

> Good point. Balaji was following up on this? Maybe all we need is another
> LEGAL/INFRA ticket?
> I see the name search status is reflected now.
>
> On Fri, Aug 30, 2019 at 5:13 PM Thomas Weise  wrote:
>
> > So then the question is why we see "No Software Grant and No IP Clearance
> > Filed" on https://whimsy.apache.org/roster/ppmc/hudi and we are not sure
> > whether the signed CCLA was submitted or not?
> >
> > On Fri, Aug 30, 2019 at 4:59 PM Vinoth Chandar 
> wrote:
> >
> > > This is the confirmation
> > >
> > >
> >
> https://lists.apache.org/thread.html/49a42ca7dbbd8d50c62d9c936b4733a862eff3de54fdd11b2bfbf532@
> > > 
> > > But, I think there is a snag. They had to resubmit it signed. So I dont
> > > have an email showing the final acknowledgement.
> > >
> > > On Fri, Aug 30, 2019 at 4:51 PM Thomas Weise  wrote:
> > >
> > > > In the future it would be helpful to keep the conversation in a
> single
> > > > place. In this case, it diverted to slack and the list would be more
> > > > appropriate.
> > > >
> > > > Do you have the reference for the SGA recording?
> > > >
> > > > I don't think that another IP clearance is required, since it was
> > already
> > > > done as part of accepting the project to the incubator. See [1] for
> > this
> > > > specific question and check [2] for past clearances filed.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/c5fa334b99c49bd18394f2f5a817ea16ebea76737cdda10bc057b83f@%3Cgeneral.incubator.apache.org%3E
> > > > [2] https://incubator.apache.org/ip-clearance/
> > > >
> > > >
> > > > On Fri, Aug 30, 2019 at 9:53 AM Suneel Marthi 
> > > wrote:
> > > >
> > > > > I can do this later tonite after I get home from work - if any of
> the
> > > > other
> > > > > mentors get to it before me please go ahead and do the needful.
> > > > >
> > > > > On Fri, Aug 30, 2019 at 11:46 AM Vinoth Chandar  >
> > > > wrote:
> > > > >
> > > > > > Ping Ping :)
> > > > > >
> > > > > > On 2019/08/28 17:54:12, "vbal...@apache.org"  >
> > > > wrote:
> > > > > > > Dear Mentors,
> > > > > > >
> > > > > > > We are able to setup nightly snapshot builds. At this moment,
> we
> > > have
> > > > > > the following steps done (Master Jira:
> > > > > > https://jira.apache.org/jira/browse/HUDI-121)
> > > > > > >    - Software Grant : Software grant from Uber to Apache has
> been
> > > > > > completed
> > > > > > >    - Contributor CLA : Done
> > > > > > >    - License Conformance : All dependencies have been verified
> to
> > > be
> > > > > > conforming as per  https://apache.org/legal/resolved.html
> > > > > > >
> > > > > > >    - Apache Style Package : We have renamed source code to
> follow
> > > > > > "org.apache.hudi" package namespace. Migration guide for
> developers
> > > and
> > > > > > customers :
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Migration+Guide+From+com.uber.hoodie+to+org.apache.hudi
> > > > > > >    - KEYS uploaded to dist.apache.org
> > > > > > >    - Nightly snapshot builds setup (
> > > > > > https://builds.apache.org/job/hudi-snapshot-deployment-0.5/)
> > > > > > >
> > > > > > > I am working on getting the release branch cut and built. I
> will
> > > soon
> > > > > > send the first release candidate for voting. While this is
> > happening,
> > > > can
> > > > > > one of you file IP clearance request to ASF in parallel  (
> > > > > >
> > https://incubator.apache.org/ip-clearance/ip-clearance-template.html
> > > > ) ?
> > > > > > > Thanks,Balaji.V
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
  

Re: Reg: Hudi Jira Ticket Conventions

2019-08-29 Thread vbal...@apache.org
 Yes, I just opened a new ticket (https://jira.apache.org/jira/browse/HUDI-228) 
for this.
Balaji.V




On Thursday, August 29, 2019, 09:45:57 AM PDT, Vinoth Chandar 
 wrote:  
 
 Pratyaksh, I am fine with that. Please go ahead.

(Balaji, correct me if I am wrong. I dont think there is a task yet for
this?)

On Thu, Aug 29, 2019 at 8:26 AM vino yang  wrote:

> +1 for the conventions
>
> Pratyaksh Sharma wrote on Thursday, August 29, 2019, at 3:03 PM:
>
> > Hi Vinoth,
> >
> > I would like to take up this task.
> >
> > On Thu, Aug 29, 2019 at 8:49 AM Vinoth Chandar 
> wrote:
> >
> > > +1 can we add this to contributing/community pages. As well
> > >
> > > On Wed, Aug 28, 2019 at 2:33 PM vbal...@apache.org  >
> > > wrote:
> > >
> > > > To all contributors of Hudi:
> > > > Dear folks,
> > > > When filing or updating a JIRA for Apache Hudi, kindly make sure the
> > > issue
> > > > type and versions (when resolving the ticket) are set correctly.
> Also,
> > > the
> > > > summary needs to be descriptive enough to catch the essence of the
> > > > problem/features. This greatly helps in generating release notes.
> > > > Thanks,Balaji.V
> > >
> >
>  

Reg: Hudi Jira Ticket Conventions

2019-08-28 Thread vbal...@apache.org
To all contributors of Hudi:
Dear folks,
When filing or updating a JIRA for Apache Hudi, kindly make sure the issue
type and versions (when resolving the ticket) are set correctly. Also, the
summary needs to be descriptive enough to capture the essence of the problem
or feature. This greatly helps in generating release notes.

Thanks,
Balaji.V

Re: Upsert after Delete

2019-08-28 Thread vbal...@apache.org
 
Hi Kabeer,
I have requested some information in the GitHub ticket.

Balaji.V

On Wednesday, August 28, 2019, 10:46:04 AM PDT, Kabeer Ahmed
 wrote:
 
Thanks for the quick response, Vinoth. That is what I would have thought:
there is nothing complex or different about an upsert after a delete. Yes, I
can reproduce the issue with the simple example written in this email.

I have dug into the issue in detail and it seems it is a bug. I have filed it 
at: https://github.com/apache/incubator-hudi/issues/859. Let me know if more
information is required.
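For context, the delete path we use looks roughly like the sketch below. It
is illustrative only: class and method names follow the com.uber.hoodie write
client discussed in this thread (verify the exact signatures against your
version), and writeClient / keysToDelete are assumed to be set up by the
caller:

import com.uber.hoodie.HoodieWriteClient;
import com.uber.hoodie.common.model.HoodieKey;
import com.uber.hoodie.common.model.HoodieRecord;
import org.apache.spark.api.java.JavaRDD;

public class DeletePath {
  // Deletes are issued as upserts of records whose payload resolves to
  // empty (the EmptyHoodieRecordPayload quoted further below).
  public static void deleteRecords(
      HoodieWriteClient<EmptyHoodieRecordPayload> writeClient,
      JavaRDD<HoodieKey> keysToDelete) {
    JavaRDD<HoodieRecord<EmptyHoodieRecordPayload>> deletes = keysToDelete.map(
        // combineAndGetUpdateValue/getInsertValue return Optional.empty(),
        // so these records are dropped on the next write.
        key -> new HoodieRecord<>(key, new EmptyHoodieRecordPayload(null, null)));
    String commitTime = writeClient.startCommit();
    writeClient.upsert(deletes, commitTime);
  }
}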
Thank you,

On Aug 23 2019, at 1:37 am, Vinoth Chandar  wrote:
> yes. I was asking about the HUDI storage type..
>
> There is nothing complex about upsert() after delete(). It almost as if a
> delete() for (2, vinoth) happened in between.
>
> Are you able to repro this literally with this tiny example with 3 records?
> Some things to check
>
> - This sequence would have created 3 commits. You can look at the commit
> files and see if the numbers of records updated, inserted, and deleted
> match expectations.
> - if they do, then you can use spark.read.parquet(...) on the individual
> parquet files and see what records they actually contain.
>
> This should shed some light on the pattern of failure and when exactly (2,
> vinoth) disappeared.
>
> Alternatively, if you can give a small snippet that reproduces this, we can
> debug from there.
>
>
>
>
>
>
> On Thu, Aug 22, 2019 at 3:06 PM Kabeer Ahmed  wrote:
> > And if you meant HUDI storage type, I have left it to default COW - Copy
> > On Write.
> >
> > If anyone has tried this please let me know if you have hit similar issue.
> > Any experience would be greatly helpful.
> > On Aug 22 2019, at 11:01 pm, Kabeer Ahmed  wrote:
> > > Hi Vinoth - thanks for the quick response.
> > >
> > > I have followed the mail thread for deletes ->
> > http://mail-archives.apache.org/mod_mbox/hudi-commits/201904.mbox/<
> > 16722511.2660.9583626796839453...@gitbox.apache.org>
> > >
> > > For your convenience, the code that I use is below at the end of the
> > email. EmptyHoodieRecord is inserted for the relevant records that need to
> > be deleted. After the delete, I can query from Hive and confirm that the
> > rows intended to be deleted are no longer present and the records not
> > deleted can be seen in the Hive table via Hive and Presto.
> > > The issue starts when the upsert is done after a delete.
> > > The storage type is S3 and I don't think there is any eventual
> > > consistency in play, as the upserted record is visible but the old
> > > records that weren't deleted are not visible.
> > > And for the sake of completion, my insert and upsert logic is based out
> >
> > of the code below:
> > https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L43
> > > Thanks
> > > Kabeer.
> > >
> > > > import java.util.Optional;
> > > > import org.apache.avro.Schema;
> > > > import org.apache.avro.generic.GenericRecord;
> > > > import org.apache.avro.generic.IndexedRecord;
> > > >
> > > > /**
> > > >  * Empty payload used for deletions.
> > > >  */
> > > > public class EmptyHoodieRecordPayload
> > > >     implements HoodieRecordPayload<EmptyHoodieRecordPayload> {
> > > >
> > > >   public EmptyHoodieRecordPayload(GenericRecord record,
> > > >       Comparable orderingVal) { }
> > > >
> > > >   @Override
> > > >   public EmptyHoodieRecordPayload preCombine(EmptyHoodieRecordPayload another) {
> > > >     return another;
> > > >   }
> > > >
> > > >   // Empty value on update: the existing record is dropped on rewrite.
> > > >   @Override
> > > >   public Optional<IndexedRecord> combineAndGetUpdateValue(
> > > >       IndexedRecord currentValue, Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > >
> > > >   // Empty value on insert: the record never materializes.
> > > >   @Override
> > > >   public Optional<IndexedRecord> getInsertValue(Schema schema) {
> > > >     return Optional.empty();
> > > >   }
> > > > }
> > >
> > > -- Forwarded Message -
> > >
> > > From: Vinoth Chandar 
> > > Subject: Re: Upsert after Delete
> > > Date: Aug 22 2019, at 8:38 pm
> > > To: dev@hudi.apache.org
> > >
> > > That’s interesting. Can you also share details on storage type and how
> > you
> > > are issuing the deletes and also the table/view (ro, rt) that you are
> > > querying?
> > >
> > > On Thu, Aug 22, 2019 at 9:49 AM Kabeer Ahmed 
> > wrote:
> > > > Hudi experts and Users,
> > > > Has anyone attempted an upsert after a delete? Here is a weird thing
> > >
> >
> > that
> > > > I have bumped into and it is a shame that this has come up when
> > >
> >
> > someone in
> > > > the team tested this whilst I failed to run this test.
> > > > Use case:
> > > > Insert data into a table. Say records (1, kabeer | 2, vinoth)
> > > >
> > > > Delete a record (1, kabeer). Data in the table is: (2, vinoth) and it
> > is
> > > > visible via sql through Presto/Hive.
> > > >
> > > > Upsert a new record into the same table (3, balaji). Query the table
> > and
> > > > only record that is visible is: (3, balaji). The record (2, vinoth) is
> > >

[For Mentors] Readiness for IP Clearance

2019-08-28 Thread vbal...@apache.org
Dear Mentors,

We have been able to set up nightly snapshot builds. At this moment, we have
the following steps done (master JIRA:
https://jira.apache.org/jira/browse/HUDI-121):
   - Software Grant : Software grant from Uber to Apache has been completed
   - Contributor CLA : Done
   - License Conformance : All dependencies have been verified to be conforming 
as per  https://apache.org/legal/resolved.html   

   - Apache Style Package : We have renamed source code to follow 
"org.apache.hudi" package namespace. Migration guide for developers and 
customers : 
https://cwiki.apache.org/confluence/display/HUDI/Migration+Guide+From+com.uber.hoodie+to+org.apache.hudi
   - KEYS uploaded to dist.apache.org
   - Nightly snapshot builds setup 
(https://builds.apache.org/job/hudi-snapshot-deployment-0.5/)   

I am working on getting the release branch cut and built. I will soon send
the first release candidate for voting. While this is happening, can one of
you file the IP clearance request to ASF in parallel
(https://incubator.apache.org/ip-clearance/ip-clearance-template.html)?

Thanks,
Balaji.V


Re: [Hudi Improvement]: Introduce secondary source-ordering-field for breaking ties while writing

2019-08-28 Thread vbal...@apache.org
Sure Pratyaksh, whatever field works for your use case is good enough. You do
have the flexibility to generate a derived field or use one of the source
fields.

Balaji.V

On Wednesday, August 28, 2019, 06:48:44 AM PDT, Pratyaksh Sharma
 wrote:
 
 Hi Balaji,

Sure, I can do that. However, after a considerable amount of time, the
binlog position will get exhausted. To handle this, we can use the
ingestion_timestamp (the time when I push the event to Kafka to be consumed
by DeltaStreamer) as the secondary ordering field, which will always work.

Please suggest.

On Thu, Aug 22, 2019 at 9:49 PM vbal...@apache.org 
wrote:

>  Hi Pratyaksh,
> The usual way we support this is to make use of
> com.uber.hoodie.utilities.transform.Transformer plugin in
> HoodieDeltaStreamer.  You can implement your own Transformer to add a new
> derived field which could be a combination of timestamp and
> binlog-position. You can then configure this new field to be used as source
> ordering field.
> Balaji.V
>
>    On Wednesday, August 21, 2019, 07:35:40 AM PDT, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
>  Hi,
>
> While building a CDC pipeline for capturing data changes in SQL using
> HoodieDeltaStreamer, I came across the following problem. We need to read
> SQL's bin log file for fetching all the modifications made to a particular
> table. However in production environment where we are handling hundreds
> of transactions per second (TPS), it is possible to have the same table row
> getting modified multiple times within a second.
>
> Here comes the problem with Mysql binlog as it has 32 bit timestamp upto
> seconds resolution. If we build CDC pipeline on top of such a table
> with huge TPS, then breaking ties between records with the same Hoodie key
> will not be possible with a single source-ordering-field (mentioned in
> HoodieDeltaStreamer.Config), which is binlog timestamp in this case.
>
> Example -  https://github.com/zendesk/maxwell/issues/925.
>
> Hence as a part of Hudi improvement, the proposal is to add one
> secondary-source-ordering-field for breaking ties among incoming records in
> such cases.  For example, we could have ingestion_timestamp or
> binlog_position as the secondary field.
>
> Please suggest. I have raised the issue here
> <https://issues.apache.org/jira/browse/HUDI-207>.
>
  

Re: [DISCUSS] Suggestion for Docs UI

2019-08-22 Thread vbal...@apache.org
 
+1, I like the idea. It would also make the whole page modular.
Balaji.V

On Thursday, August 22, 2019, 12:40:11 PM PDT, Vinoth Chandar
 wrote:
 
 +1 I was thinking along similar lines for the demo page

Our doc theme should already support this
https://idratherbewriting.com/documentation-theme-jekyll/mydoc_navtabs.html



On Thu, Aug 22, 2019 at 12:04 PM Bhavani Sudha Saktheeswaran
 wrote:

> Hi all,
>
> I was going through the documentation and thought, in some places, tab view
> (like this:
>
> https://ci.apache.org/projects/flink/flink-docs-master/getting-started/tutorials/local_setup.html#read-the-code
> )
> can be adopted where we showcase how each query engine (Hive. Sparksql,
> Presto) works. This would improve readability and also shorten the page
> length. I am happy to work on it if we are okay with this change. Any
> thoughts?
>
>
> Thanks,
> Sudha
>
  

Re: [Hudi Improvement]: Introduce secondary source-ordering-field for breaking ties while writing

2019-08-22 Thread vbal...@apache.org
 Hi Pratyaksh,
The usual way we support this is through the
com.uber.hoodie.utilities.transform.Transformer plugin in HoodieDeltaStreamer.
You can implement your own Transformer to add a derived field that combines
the timestamp and the binlog position, and then configure that new field as
the source ordering field (--source-ordering-field).
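A minimal sketch follows. Treat it as illustrative: the Transformer method
signature and the TypedProperties import path should be checked against your
Hudi version, and the binlog_ts / binlog_pos column names are assumptions
about your source:

import com.uber.hoodie.common.util.TypedProperties;
import com.uber.hoodie.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.format_string;

public class OrderingFieldTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
      Dataset<Row> rowDataset, TypedProperties properties) {
    // Zero-pad the binlog position so lexicographic order of the derived
    // string matches numeric order; ties within one second then break on
    // the position within the binlog.
    return rowDataset.withColumn("_event_seq",
        concat(col("binlog_ts").cast("string"),
            format_string("%012d", col("binlog_pos"))));
  }
}

You would then run HoodieDeltaStreamer with --transformer-class set to this
class and --source-ordering-field _event_seq.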
Balaji.V

On Wednesday, August 21, 2019, 07:35:40 AM PDT, Pratyaksh Sharma 
 wrote:  
 
 Hi,

While building a CDC pipeline with HoodieDeltaStreamer to capture data
changes in MySQL, I came across the following problem. We need to read
MySQL's binlog for all the modifications made to a particular table. However,
in a production environment handling hundreds of transactions per second
(TPS), the same table row can be modified multiple times within a second.

This is where the MySQL binlog becomes a problem: it carries a 32-bit
timestamp with only seconds resolution. If we build a CDC pipeline on top of
such a table with high TPS, breaking ties between records with the same
Hoodie key is not possible with a single source-ordering-field (mentioned in
HoodieDeltaStreamer.Config), which is the binlog timestamp in this case.

Example -  https://github.com/zendesk/maxwell/issues/925.

Hence, as a Hudi improvement, the proposal is to add a
secondary-source-ordering-field for breaking ties among incoming records in
such cases. For example, we could use ingestion_timestamp or binlog_position
as the secondary field.

Please suggest. I have raised the issue here:
https://issues.apache.org/jira/browse/HUDI-207.
  
