Re: [VOTE] Release 0.5.1-incubating, release candidate #1

2020-01-22 Thread Balaji Varadarajan
+1 (binding)

Ran the following validation steps:

1. Checked out RC candidate source code and compiled successfully
2. Ran Apache Hudi quickstart steps successfully on 0.5.1-rc1
3. Ran a long-running DeltaStreamer test for half a day without any
exceptions.
4. Compliance : Ran "./release/validate_staged_release.sh --release=0.5.1
--rc_num=1" successfully

Checking Checksum of Source Release
Checksum Check of Source Release - [OK]
Checking Signature
  Signature Check - [OK]
Checking for binary files in source release
   No Binary Files in Source Release? - [OK]
Checking for DISCLAIMER-WIP
   DISCLAIMER file exists ? [OK]
Checking for LICENSE and NOTICE
   License file exists ? [OK]
   Notice file exists ? [OK]
Performing custom Licensing Check
   Licensing Check Passed [OK]
Running RAT Check
   RAT Check Passed [OK]
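
For readers who want to repeat the checksum step by hand, here is a minimal sketch. It is not the logic of validate_staged_release.sh; it assumes the staged artifact ships with a companion .sha512 file that contains the hex digest as one whitespace-separated token (sha512sum style). The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """True if any token in the checksum file equals the artifact's digest.

    Assumption: the digest is stored as a single token, as sha512sum
    writes it. Other layouts (for example digests split into groups)
    would need extra normalization.
    """
    actual = sha512_of(artifact)
    tokens = checksum_file.read_text().split()
    return any(token.lower() == actual for token in tokens)
```

The signature check is a separate step and would still need gpg and the KEYS file.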

Thanks,
Balaji.V


On Wed, Jan 22, 2020 at 11:20 AM leesf  wrote:

> Hi everyone,
>
> We have prepared the second apache release candidate for Apache Hudi
> (incubating). The version is : 0.5.1-incubating-rc1. Please review and vote
> on the release candidate #1 for the version 0.5.1, as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 623E08E06DB376684FB9599A3F5953147903948A [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "release-0.5.1-incubating-rc1" [5],
>
>
>
> The vote will be open for at least 72 hours.
> Please cast your votes before *Jan. 27th 2020, 16:00 UTC*.
>
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
>
> Thanks,
> Leesf
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12346183
> [2]
>
> https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.1-incubating-rc1/
> [3] https://dist.apache.org/repos/dist/dev/incubator/hudi//KEYS
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1014
> [5]
> https://github.com/apache/incubator-hudi/tree/release-0.5.1-incubating-rc1
>


[DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-22 Thread lamberken


Hello everyone, 


I have redrawn the Hudi data lake architecture diagram on the landing page. If you
have time, please compare the current Hudi website [1] with the test site [2].
Any thoughts are welcome, thanks very much. :)


[1] https://hudi.apache.org
[2] https://lamber-ken.github.io


Thanks
Lamber-Ken

Re: HoodieDeltaStreamerException during upsert with DeltaStreamer.sync()

2020-01-22 Thread Vinoth Chandar
Hi Venkatesh,

It should keep writing an empty commit with checkpoints. I think your
situation is different, though: you are seeing a commit with no checkpoint
key.
From the code, we always set this.
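
One way to confirm whether a given commit carries a checkpoint is to inspect the commit file directly. The sketch below is only an illustration: it assumes the commit file is JSON and that the checkpoint is stored under the key "deltastreamer.checkpoint.key" somewhere in the metadata. Both points should be checked against the Hudi version in use, and the function names are hypothetical.

```python
import json
from typing import Any, Optional

# Assumed metadata key; verify it against the Hudi version in use.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"


def find_key(node: Any, key: str) -> Optional[str]:
    """Depth-first search of parsed JSON for the first value stored under key."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_key(item, key)
            if found is not None:
                return found
    return None


def checkpoint_in_commit(commit_text: str, key: str = CHECKPOINT_KEY) -> Optional[str]:
    """Return the checkpoint recorded in a commit file, or None if it is missing."""
    if not commit_text.strip():
        return None  # an empty commit file carries no metadata at all
    return find_key(json.loads(commit_text), key)
```

Searching the whole document avoids hard-coding the name of the metadata field that holds the key.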

Since I see com.emr.java on the stack trace, can any of the amazon folks
confirm any code changes on the EMR code that is getting invoked?

Thanks
Vinoth

On Tue, Jan 21, 2020 at 12:57 PM Venki g  wrote:

> Hi Vinoth,
>
> Thanks for looking into this.
>
> The source delta file had 7877 records in it,
>
> Driver log showing the number of records - 20/01/17 03:51:04 INFO
> HoodieBloomIndex: TotalRecords 7877, TotalFiles 30, TotalAffectedPartitions
> 29, TotalComparisons 7877, SafeParallelism 1
>
> Looking at the commit history, looks like some changes were done on the
> partition(esp delete).
>
> 20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.duration,
> value=93112
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.clean.numFilesDeleted, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.commitTime, value=1579233069000
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.duration, value=127138
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalBytesWritten, value=6908850
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalCompactedRecordsUpdated, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalCreateTime, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalFilesInsert, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalFilesUpdate, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalInsertRecordsWritten, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalLogFilesCompacted, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalLogFilesSize, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalPartitionsWritten, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalRecordsWritten, value=95705
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalScanTime, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalUpdateRecordsWritten, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalUpsertTime, value=7013
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.deltastreamer.duration, value=153732
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.deltastreamer.hiveSyncDuration, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.finalize.duration, value=437
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.finalize.numFilesFinalized, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.index.update.duration, value=0
>
> Assuming, it was an empty commit. Shouldn't it be still writing the
> checkpoint key read from the last commit to the empty commit file? Since
> checkpoint key is always needed on the recent commit file to avoid this
> exception(*HoodieDeltaStreamerException: Unable to find previous
> checkpoint. Please double check if this table was indeed built via delta
> streamer*).
>
> Can I workaround the problem by passing the most recent checkpoint key in
> the config while calling deltastreamer.sync()?
>
> Thanks
> Venkatesh
>
> On Mon, Jan 20, 2020 at 5:07 PM Vinoth Chandar  wrote:
>
> > Hi Venki,
> >
> > Thanks for reporting this. The latest commit file seems to be empty? I am
> > wondering if this is happening because there was no new data to process
> and
> > the tool wrote an empty commit file..
> > Can you confirm if this seems to match the case?
> >
> > Thanks
> > Vinoth
> >
> >
> > On Mon, Jan 20, 2020 at 4:00 PM Venki g  wrote:
> >
> > > Correcting the link to commit file
> > >
> > > On Mon, Jan 20, 2020 at 3:50 PM Venki g  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using a spark job to upsert the incremental delta files from S3
> > into
> > > > Hudi storage using HoodieDeltaStreamer.sync() API , The incremental
> > spark
> > > > job is failing with below exception
> > > >
> > > > java.lang.RuntimeException:
> > > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> > Unable
> > > to
> > > > find previous checkpoint. Please double check if this table was
> indeed
> > > > built via delta streamer
> > > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > > > at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > > > at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> 

Re: [DISCUSS] Unify Hudi code cleanup and improvement

2020-01-22 Thread Vinoth Chandar
Hi,

Thanks everyone for sharing your views!

Some of this conversation is starting to feel like boiling the ocean. I
believe in refactoring with purpose and discussing class-by-class or
module-by-module does not make sense to me. Can we first list down what we
want to achieve? So far, I have only heard fixing IDE/IntelliJ warnings.
Also instead of focussing on new work, how about looking at the pending
JIRAs under "Testing" "Code Cleanup" components first and see if those are
worth tackling.

We went down this path for code formatting and today we still have
inconsistencies. Looking back, I feel we should have clearly defined end
goals for the cleanups and we can then rank them based on ROI.

Thanks
Vinoth

On Wed, Jan 22, 2020 at 7:05 PM vino yang  wrote:

> Hi Shiyan and Bhavani:
>
> Thanks for sharing your thoughts.
>
> As I originally stated. The advantage of using modules as a unit to split
> work is that the decomposition is clear, but the disadvantage is that the
> volume of changes may be huge, which brings huge risks (considering that
> Hudi's test coverage is still not very high) and the workload of review.
> The advantage of splitting by class is that the volume of changes is small
> and the review is more convenient, but the disadvantages are too many tasks
> and high maintenance costs.
>
>
> *In addition, we need to define the boundaries of the "code cleanup" I
> expressed in this topic: it is limited to the smart tips shown by Intellij
> IDEA. If the boundaries are too wide, then this discussion will lose
> control.*
> I agree with Bhavani that we don't take it as the actual goal. But we are
> not opposed to the community to help improve the quality of the code
> (basically, these tips given by the IDE are more reasonable).
>
>
> So, I still give my thoughts: We manage this work with Jira. Before we
> start working, we need to find a committer as a mentor. The mentor must
> decide whether the scale of the subtasks is reasonable and whether
> additional unit tests need to be added to verify the changes. And the
> mentor should be responsible for merged changes.
>
> What do you think?
>
> Best,
> Vino
>
> On Wed, Jan 22, 2020 at 2:22 PM Bhavani Sudha  wrote:
>
> > Hi @vinoyang thanks for bringing this to discussion. I feel it would be
> > less disruptive to clean up code as part of individual classes being
> > touched for a specific goal rather than code cleanup being the actual
> goal.
> > This would narrow the touch point and ensure test coverage (both unit and
> > integration tests)  catches any accidental/unintentional changes. Also it
> > would give chance to change any documentation quoting/referencing that
> > code. Wanted to share my personal opinion.
> >
> > Thanks,
> > Sudha
> >
> >
> >
> > On Tue, Jan 21, 2020 at 11:36 AM Shiyan Xu 
> > wrote:
> >
> > > The clean-up work can actually be split by modules.
> > >
> > > Though it is generally a good practice to follow, my concern is the
> > > clean-up is likely to cause conflicts with some on-going changes. If I
> > may
> > > suggest, the dedicated clean-up tasks should avoid
> > > - modules that are undergoing multiple feature changes/PRs
> > > - modules that are planned to have major refactoring due to design
> > changes
> > > (since clean-up can be done altogether during refactoring)
> > >
> > > On Tue, Jan 21, 2020 at 4:17 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Not sure if I fully agree with sweeping statements being made. But,
> +1
> > > for
> > > > structuring this work via Jiras and having some committer “accept”
> the
> > > > issue first.  Some of these tend to be subjective and we do need to
> > make
> > > > different tradeoffs.
> > > >
> > > > On Tue, Jan 21, 2020 at 1:28 AM vino yang 
> > wrote:
> > > >
> > > > > Hi Pratyaksh,
> > > > >
> > > > > Thanks for your thought.
> > > > >
> > > > > Let's listen to others' comments. If there is no objection, we will
> > > > follow
> > > > > this way.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > >
> > > > > > On Tue, Jan 21, 2020 at 4:56 PM Pratyaksh Sharma  wrote:
> > > > >
> > > > > > Hi Vino,
> > > > > >
> > > > > > Big +1 for this initiative. I have done this code cleanup for
> test
> > > > > classes
> > > > > > in the past and strongly feel there is a need to do the same at
> > other
> > > > > > places as well. I would definitely like to volunteer for this.
> > > > > >
> > > > > > On Tue, Jan 21, 2020 at 1:52 PM vino yang  >
> > > > wrote:
> > > > > >
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > Currently, the code quality of some Hudi module is not very
> well.
> > > As
> > > > > many
> > > > > > > developers have seen, the Intellij IDEA has shown many
> > intellisense
> > > > > about
> > > > > > > cleanup and improvement. The community does not object to doing
> > the
> > > > > > cleanup
> > > > > > > and improvement work and the work has been started via some
> > direct
> > > > > > "minor"
> > > > > > > PRs by some volunteers. The current way is unorganized and hard
> > to
> > > > > 

[VOTE] Release 0.5.1-incubating, release candidate #1

2020-01-22 Thread leesf
Hi everyone,

We have prepared the second apache release candidate for Apache Hudi
(incubating). The version is : 0.5.1-incubating-rc1. Please review and vote
on the release candidate #1 for the version 0.5.1, as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 623E08E06DB376684FB9599A3F5953147903948A [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "release-0.5.1-incubating-rc1" [5],



The vote will be open for at least 72 hours.
Please cast your votes before *Jan. 27th 2020, 16:00 UTC*.

It is adopted by majority approval, with at least 3 PMC affirmative votes.
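
The adoption rule stated above can be sketched as a small tally. This is only an illustration: it assumes that "majority approval" means more +1 than -1 votes among binding (PMC) votes, and the Vote type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1 to approve, -1 to reject
    binding: bool  # True for PMC votes


def release_passes(votes: Iterable[Vote], min_binding: int = 3) -> bool:
    """Apply the rule from the vote email to a list of cast votes.

    Assumption: only binding votes count toward the majority.
    """
    binding = [v for v in votes if v.binding]
    approvals = sum(1 for v in binding if v.value > 0)
    rejections = sum(1 for v in binding if v.value < 0)
    return approvals >= min_binding and approvals > rejections
```

Non-binding votes are ignored by this sketch, although they are still useful feedback for the release manager.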


Thanks,
Leesf


[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12346183
[2]
https://dist.apache.org/repos/dist/dev/incubator/hudi/hudi-0.5.1-incubating-rc1/
[3] https://dist.apache.org/repos/dist/dev/incubator/hudi//KEYS
[4] https://repository.apache.org/content/repositories/orgapachehudi-1014
[5]
https://github.com/apache/incubator-hudi/tree/release-0.5.1-incubating-rc1


Re: [SUPPORT] hudi-integ-test fails after adding licenses

2020-01-22 Thread leesf
Thanks Sivabalan for your confirmation. :)

On Thu, Jan 23, 2020 at 1:22 AM Sivabalan  wrote:

> cool. thanks man for fixing it. Presto command that takes in a file for
> list of queries doesn't recognize # as comments.
> presto --server presto-coordinator-1:8090 --catalog hive --schema default
> -f /usr/hive/data/input/presto-batch1.commands
>
>
> On Wed, Jan 22, 2020 at 12:17 PM leesf  wrote:
>
> > Update: seems like would be fixed via
> > https://github.com/apache/incubator-hudi/pull/1273, and verified in my
> own
> > travis  https://api.travis-ci.org/v3/job/640530602/log.txt
> >
> > On Thu, Jan 23, 2020 at 12:35 AM leesf  wrote:
> >
> > > Hi all,
> > >
> > > After merging the PR(
> https://github.com/apache/incubator-hudi/pull/1271)
> > > which adds the missing apache license in
> > > docker/demo/presto-batch1.commands
> > > <
> >
> https://github.com/apache/incubator-hudi/pull/1271/files#diff-992e43ced3eb0102cf50773f98ed0554
> > >
> > > ,docker/demo/presto-batch2-after-compaction.commands
> > > <
> >
> https://github.com/apache/incubator-hudi/pull/1271/files#diff-4a23593a9c83f2ac3d423e756c84d1ef
> > >
> > > , docker/demo/presto-table-check.commands
> > > <
> >
> https://github.com/apache/incubator-hudi/pull/1271/files#diff-d7c0616cefe7404855e6fed66d81e7ea
> > >,
> > > then the travis fails with the exception
> >
> ITTestHoodieDemo.testDemo:81->testPrestoAfterFirstBatch:192->ITTestBase.executePrestoCommandFile:223->ITTestBase.executeCommandStringInDocker:190->ITTestBase.executeCommandInDocker:168
> > > Command ([presto, --server, presto-coordinator-1:8090, --catalog, hive,
> > > --schema, default, -f,
> /usr/hive/data/input/presto-table-check.commands])
> > > expected to succeed. Exit (1) expected:<0> but was:<1>
> > >
> > > and the detail log link
> > https://api.travis-ci.org/v3/job/640489853/log.txt
> > >
> > > Any ideas? thanks.
> > >
> > > Best,
> > > Leesf
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [SUPPORT] hudi-integ-test fails after adding licenses

2020-01-22 Thread Sivabalan
Cool, thanks for fixing it. The Presto command that takes a file with a
list of queries doesn't recognize # as the start of a comment:
presto --server presto-coordinator-1:8090 --catalog hive --schema default
-f /usr/hive/data/input/presto-batch1.commands
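
A minimal sketch of one possible workaround: rewrite a '#' license header as SQL line comments before the file is handed to the CLI. It assumes the Presto CLI accepts standard SQL '--' comments in a command file, and it is not necessarily what the fix in the PR does. The function name is hypothetical.

```python
def to_sql_comments(text: str) -> str:
    """Rewrite lines whose first non-blank character is '#' as '--' comments."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            # Keep the original indentation and the text after the '#'.
            indent = line[: len(line) - len(stripped)]
            out.append(indent + "--" + stripped[1:])
        else:
            out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

Only lines that begin with '#' are touched, so a '#' inside a query is left alone.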


On Wed, Jan 22, 2020 at 12:17 PM leesf  wrote:

> Update: seems like would be fixed via
> https://github.com/apache/incubator-hudi/pull/1273, and verified in my own
> travis  https://api.travis-ci.org/v3/job/640530602/log.txt
>
> On Thu, Jan 23, 2020 at 12:35 AM leesf  wrote:
>
> > Hi all,
> >
> > After merging the PR(https://github.com/apache/incubator-hudi/pull/1271)
> > which adds the missing apache license in
> > docker/demo/presto-batch1.commands
> > <
> https://github.com/apache/incubator-hudi/pull/1271/files#diff-992e43ced3eb0102cf50773f98ed0554
> >
> > ,docker/demo/presto-batch2-after-compaction.commands
> > <
> https://github.com/apache/incubator-hudi/pull/1271/files#diff-4a23593a9c83f2ac3d423e756c84d1ef
> >
> > , docker/demo/presto-table-check.commands
> > <
> https://github.com/apache/incubator-hudi/pull/1271/files#diff-d7c0616cefe7404855e6fed66d81e7ea
> >,
> > then the travis fails with the exception
> ITTestHoodieDemo.testDemo:81->testPrestoAfterFirstBatch:192->ITTestBase.executePrestoCommandFile:223->ITTestBase.executeCommandStringInDocker:190->ITTestBase.executeCommandInDocker:168
> > Command ([presto, --server, presto-coordinator-1:8090, --catalog, hive,
> > --schema, default, -f, /usr/hive/data/input/presto-table-check.commands])
> > expected to succeed. Exit (1) expected:<0> but was:<1>
> >
> > and the detail log link
> https://api.travis-ci.org/v3/job/640489853/log.txt
> >
> > Any ideas? thanks.
> >
> > Best,
> > Leesf
> >
>


-- 
Regards,
-Sivabalan


Re: [SUPPORT] hudi-integ-test fails after adding licenses

2020-01-22 Thread leesf
Update: it seems this will be fixed via
https://github.com/apache/incubator-hudi/pull/1273, which I verified in my own
Travis build: https://api.travis-ci.org/v3/job/640530602/log.txt

On Thu, Jan 23, 2020 at 12:35 AM leesf  wrote:

> Hi all,
>
> After merging the PR(https://github.com/apache/incubator-hudi/pull/1271)
> which adds the missing apache license in
> docker/demo/presto-batch1.commands
> 
> ,docker/demo/presto-batch2-after-compaction.commands
> 
> , docker/demo/presto-table-check.commands
> ,
> then the travis fails with the exception 
> ITTestHoodieDemo.testDemo:81->testPrestoAfterFirstBatch:192->ITTestBase.executePrestoCommandFile:223->ITTestBase.executeCommandStringInDocker:190->ITTestBase.executeCommandInDocker:168
> Command ([presto, --server, presto-coordinator-1:8090, --catalog, hive,
> --schema, default, -f, /usr/hive/data/input/presto-table-check.commands])
> expected to succeed. Exit (1) expected:<0> but was:<1>
>
> and the detail log link https://api.travis-ci.org/v3/job/640489853/log.txt
>
> Any ideas? thanks.
>
> Best,
> Leesf
>