Re: [VOTE] hudi-rs 0.1.0, release candidate #2

2024-07-14 Thread Vino Yang
+1 (binding)

- checked signature and checksum successfully;
- pip installed the package successfully;
- ran the example locally

Best,
Vino

leesf wrote on Sun, Jul 14, 2024 at 11:14:

> +1
>
> - ran the Python quickstart.
>
> sagar sumit wrote on Sun, Jul 14, 2024 at 11:07:
>
> > +1 (binding)
> >
> > - verified source release
> > - verified python quickstart in readme
> >
> > Regards,
> > Sagar
> >
> > On Sat, Jul 13, 2024 at 4:23 AM Shiyan Xu 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Fri, Jul 12, 2024 at 5:29 PM Bhavani Sudha  >
> > > wrote:
> > >
> > > > +1 (binding)
> > > > - tested the README page examples
> > > > - ran the verification script for checksums and signatures successfully.
> > > >
> > > > Thanks,
> > > > Sudha
> > > >
> > > > On Thu, Jul 11, 2024 at 8:19 PM Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Please review and vote on hudi-rs 0.1.0-rc.2 as follows:
> > > > >
> > > > > [ ] +1, Approve the release
> > > > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > > > >
> > > > > The complete staging area is available for you to review:
> > > > >
> > > > > * Release tracking issue is up-to-date [1]
> > > > > * Categorized changelog is available here [2]
> > > > > * Source release has been deployed to dist.apache.org [3]
> > > > > * Source release can be verified using this script [4]
> > > > > * Source code commit is tagged as "release-0.1.0-rc.2" [5]
> > > > > * Source code commit CI has passed [6]
> > > > > * Python artifacts have been published to pypi.org [7]
> > > > > * Rust artifacts have been published to crates.io [8]
> > > > >
> > > > > The vote will be open for at least 72 hours. It is adopted by
> > majority
> > > > > approval, with at least 3 PMC affirmative votes.
> > > > >
> > > > > Thanks,
> > > > > Release Manager
> > > > >
> > > > > [1] https://github.com/apache/hudi-rs/issues/62
> > > > > [2]
> > > https://github.com/apache/hudi-rs/issues/62#issuecomment-2224322166
> > > > > [3]
> https://dist.apache.org/repos/dist/dev/hudi/hudi-rs-0.1.0-rc.2/
> > > > > [4]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/hudi-rs/blob/7b2d199c180bf36e2fac5e03559fffbfe00bf5fe/release/verify_src_release.sh
> > > > > [5]
> > https://github.com/apache/hudi-rs/releases/tag/release-0.1.0-rc.2
> > > > > [6] https://github.com/apache/hudi-rs/actions/runs/9901188924
> > > > > [7] https://pypi.org/project/hudi/0.1.0rc2/
> > > > > [8] https://crates.io/crates/hudi/0.1.0-rc.2
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>
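
For anyone reproducing the checks reported in these votes, here is a minimal sketch of the usual flow; the artifact file names under dist.apache.org [3] are assumed, not confirmed, so adjust to what the staging directory actually contains:

```
# Hedged sketch: artifact names under [3] are assumed.
svn export https://dist.apache.org/repos/dist/dev/hudi/hudi-rs-0.1.0-rc.2/
cd hudi-rs-0.1.0-rc.2
sha512sum *.tgz             # compare the digest against the .sha512 file
gpg --verify *.tgz.asc      # signature check (import the project KEYS first)
pip install hudi==0.1.0rc2  # the Python artifact published to pypi.org [7]
```

The release's own verify_src_release.sh script [4] bundles these steps; running it is the authoritative check.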


Re: [ANNOUNCE] New Apache Hudi Committer - Zhaojing Yu

2022-03-31 Thread Vino Yang
Congrats!

Best,
Vino

Gary Li wrote on Fri, Mar 25, 2022 at 19:11:
>
> Congrats!
>
> Best,
> Gary
>
> On Fri, Mar 25, 2022 at 4:07 PM Shiyan Xu 
> wrote:
>
> > Congrats!
> >
> > On Fri, Mar 25, 2022 at 1:40 PM Danny Chan  wrote:
> >
> > > Hi everyone,
> > >
> > > On behalf of the PMC, I'm very happy to announce Zhaojing Yu as a new
> > > Hudi committer.
> > >
> > > Zhaojing is very active in Flink Hudi contributions; many cool
> > > features, such as the Flink streaming bootstrap, the compaction service,
> > > and all kinds of writing modes, were contributed by him. He also fixed
> > > many critical bugs on the Flink side.
> > >
> > > Besides that, Zhaojing is also active in promoting Hudi use cases in
> > > China, and he is very active in answering user questions in our DingTalk
> > > group. He now works at ByteDance, pushing forward the Volcano Engine
> > > cloud service's Hudi products!
> > >
> > > Please join me in congratulating Zhaojing for becoming a Hudi committer!
> > >
> > > Cheers,
> > > Danny
> > >
> >
> >
> > --
> > --
> > Best,
> > Shiyan
> >


Re: [DISCUSS] Change data feed for spark sql

2022-02-12 Thread vino yang
+1 for this feature, looking forward to seeing more details or a design doc.

Best,
Vino

Xianghu Wang wrote on Sat, Feb 12, 2022 at 17:06:

> This is definitely a great feature.
> +1
>
> On 2022/02/12 02:32:32 Forward Xu wrote:
> > Hi All,
> >
> > I want to support change data feed for Spark SQL. This feature can be
> > achieved in two ways.
> >
> > 1. Call Procedure Command
> > sql syntax
> > CALL system.table_changes('tableName',  start_timestamp, end_timestamp)
> > example:
> > CALL system.table_changes('tableName', TIMESTAMP '2021-01-23 04:30:45',
> > TIMESTAMP '2021-02-23 6:00:00')
> >
> > 2. Support querying MOR(CDC) table as of a savepoint
> > SELECT * FROM A.B TIMESTAMP AS OF 1643119574;
> > SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58' ;
> >
> > SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58'  AND '2021-02-23
> > 6:00:00' ;
> > SELECT * FROM A.B VERSION AS OF 'Snapshot123456789';
> >
> > Any feedback is welcome!
> >
> > Thank you.
> >
> > Regards,
> > Forward Xu
> >
> > Related Links:
> > [1] Call Procedure Command <
> https://issues.apache.org/jira/browse/HUDI-3161>
> > [2] Support querying a table as of a savepoint
> > 
> > [3] Change data feed
> > <
> https://docs.databricks.com/delta/delta-change-data-feed.html#language-sql
> >
> >
>
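
For context on what exists today: the proposed SQL would be sugar over Hudi's incremental query path, which is already reachable through the Spark datasource. Below is a minimal sketch under stated assumptions; the option keys are the standard Hudi datasource options, while the instant times and table path are illustrative only:

```
// Sketch: reading incremental changes via the Spark datasource today.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalRead {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-incr").getOrCreate();
    Dataset<Row> changes = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20210123043045") // illustrative
        .option("hoodie.datasource.read.end.instanttime", "20210223060000")   // illustrative
        .load("/path/to/hudi/table");  // assumed base path
    changes.show();  // rows changed between the two instants
  }
}
```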


Re: [VOTE] Release 0.10.1, release candidate #2

2022-01-24 Thread vino yang
+1 binding

- ran `mvn package -DskipTests` [OK]
- verified checksum and signature [OK]
- ran some flink related tests [OK]

Best,
Vino

Mehrotra, Udit wrote on Tue, Jan 25, 2022 at 03:44:

> +1 binding
>
> - Compilation for Spark 2 and Spark 3 [OK]
> - RC validation [OK]
> - QuickStart [OK]
>
> Thanks,
> Udit
>
> On 1/25/22, 12:11 AM, "Balaji Varadarajan" 
> wrote:
>
>
>  +1 binding. RC passed.
> Balaji.V
>
> On Monday, January 24, 2022, 10:28:58 AM PST, Bhavani Sudha <
> bhavanisud...@gmail.com> wrote:
>
>  +1 binding
>
> Ran RC check, quickstart and some IDE tests.
>
> Thanks,
> Sudha
>
> On Mon, Jan 24, 2022 at 9:23 AM sagar sumit 
> wrote:
>
> > +1
> >
> > - Builds for Spark2/3 [OK]
> > - Spark quickstart [OK]
> > - Docker Demo (Hive/Presto querying) [OK]
> > - Long-running deltastreamer continuous mode with async
> > compaction/clustering [OK]
> >
> > Regards,
> > Sagar
> >
> > On Mon, Jan 24, 2022 at 10:23 PM Sivabalan 
> wrote:
> >
> >> Hey folks,
> >>  Can we get some attention on this? I expect participation from PMCs
> >> and committers at least. Would appreciate it if you folks can spare some
> >> time on RC testing and voting.
> >>
> >>
> >> On Mon, 24 Jan 2022 at 07:54, Pratyaksh Sharma <
> pratyaks...@gmail.com>
> >> wrote:
> >>
> >> > +1
> >> >
> >> > - Compilation OK
> >> > - Validation script OK
> >> >
> >> > On Sun, Jan 23, 2022 at 8:09 PM Nishith 
> wrote:
> >> >
> >> > > +1 binding
> >> > >
> >> > > -Nishith
> >> > >
> >> > > > On Jan 22, 2022, at 7:49 PM, Vinoth Chandar <
> vin...@apache.org>
> >> wrote:
> >> > > >
> >> > > > +1 (binding)
> >> > > >
> >> > > > Ran my rc checks on updated link and changing my vote to a +1
> >> > > >
> >> > > >> On Sat, Jan 22, 2022 at 4:10 AM Sivabalan <
> n.siv...@gmail.com>
> >> wrote:
> >> > > >>
> >> > > >> my bad, the link([2]) was wrong. It is
> >> > > >> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.1-rc2/
> .
> >> > > >> Can you take a look please?
> >> > > >>
> >> > > >>> On Sat, 22 Jan 2022 at 00:08, Vinoth Chandar <
> vin...@apache.org>
> >> > > wrote:
> >> > > >>>
> >> > > >>> -1
> >> > > >>>
> >> > > >>> The artifact version is wrong! It should be 0.10.*1*
> >> > > >>>
> >> > > >>>
> >> > > >>>  - hudi-0.10.0-rc2.src.tgz
> >> > > >>>  <
> >> > > >>>
> >> > > >>
> >> > >
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz
> >> > > 
> >> > > >>>  - hudi-0.10.0-rc2.src.tgz.asc
> >> > > >>>  <
> >> > > >>>
> >> > > >>
> >> > >
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz.asc
> >> > > 
> >> > > >>>  - hudi-0.10.0-rc2.src.tgz.sha512
> >> > > >>>  <
> >> > > >>>
> >> > > >>
> >> > >
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz.sha512
> >> > > 
> >> > > >>>
> >> > > >>> grep version hudi-0.10.0-rc2/pom.xml | grep rc2
> >> > > >>>  0.10.0-rc2
> >> > > >>>
> >> > > >>>
> >> > > >>> Why are all the arc
> >> > > >>>
> >> > >  On Thu, Jan 20, 2022 at 3:53 AM Sivabalan <
> n.siv...@gmail.com>
> >> > wrote:
> >> > > >>>
> >> > >  Hi everyone,
> >> > > 
> >> > >  Please review and vote on the release candidate #2 for the
> >> version
> >> > > >>> 0.10.1,
> >> > >  as follows:
> >> > > 
> >> > >  [ ] +1, Approve the release
> >> > > 
> >> > >  [ ] -1, Do not approve the release (please provide specific
> >> > comments)
> >> > > 
> >> > > 
> >> > >  The complete staging area is available for your review,
> which
> >> > > includes:
> >> > > 
> >> > >  * JIRA release notes [1],
> >> > > 
> >> > >  * the official Apache source release and binary convenience
> >> releases
> >> > > to
> >> > > >>> be
> >> > >  deployed to dist.apache.org [2], which are signed with
> the key
> >> with
> >> > >  fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
> >> > > 
> >> > >  * all artifacts to be deployed to the Maven Central
> Repository
> >> [4],
> >> > > 
> >> > >  * source code tag "release-0.10.1-rc2" [5],
> >> > > 
> >> > > 
> >> > >  The vote will be open for at least 72 hours. It is adopted
> by
> >> > majority
> >> > >  approval, with at least 3 PMC affirmative votes.
> >> > > 
> >>
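
A minimal sketch of the key and signature checks referenced above; the artifact names are assumed to follow the `hudi-0.10.1-rc2.src.tgz` pattern implied by [2]:

```
# Hedged sketch: verify the source bundle against the stated fingerprint [3].
curl -O https://dist.apache.org/repos/dist/dev/hudi/KEYS
gpg --import KEYS
gpg --verify hudi-0.10.1-rc2.src.tgz.asc hudi-0.10.1-rc2.src.tgz
# The "Primary key fingerprint" printed by gpg should match
# ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122
```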

Re: Regular minor/patch releases

2021-12-15 Thread vino yang
+1

Agree that minor releases are mostly for bug-fix purposes.

Best,
Vino

Danny Chan wrote on Wed, Dec 15, 2021 at 10:35:

> I guess we must do that for the current rapid development and iteration. As
> for release 0.10.0, within only a few days of the announcement we have
> received a bunch of bug reports through the GitHub issues, such as:
>
> - the empty meta file: https://github.com/apache/hudi/issues/4249
> - and the timeline based marker files:
> https://github.com/apache/hudi/issues/4230
>
> With features rushed in without enough tests, I'm afraid the major
> release version is never ready for production, unless there is production
> validation like Uber has internally.
>
> And minor releases should only include bug fixes, with no breaking
> changes and no features, so it should not be hard work, I think.
>
> Best,
> Danny
>
> Sivabalan wrote on Tue, Dec 14, 2021 at 4:06 AM:
>
> > +1 in general, but yeah, not sure if we have the resources to do this
> > for every major release.
> >
> > On Mon, Dec 13, 2021 at 10:01 AM Vinoth Chandar 
> wrote:
> >
> > > Hi all,
> > >
> > > In the past we had plans for minor releases [1], but invariably we end
> up
> > > doing major ones, which also deliver the bug fixes.
> > >
> > > The reason was the cost involved in doing a release. We have made some
> > good
> > > progress towards regression/integration test, which prompts me to
> revive
> > > this.
> > >
> > > What does everyone think about a monthly bugfix release on the last
> > > major/minor version. (not on every major release, we still don't have
> > > enough contributors to pull that off IMO). So we would be trying to do
> a
> > > 0.10.1 early jan for e.g, in this model?
> > >
> > > [1]
> https://cwiki.apache.org/confluence/display/HUDI/Release+Management
> > >
> > > Thanks
> > > Vinoth
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [VOTE] Release 0.10.0, release candidate #3

2021-12-06 Thread vino yang
+1 (binding)

- built successfully
- ran spark quickstart
- verified checksum

Best,
Vino

Y Ethan Guo wrote on Mon, Dec 6, 2021 at 14:25:

> +1 (non-binding)
>
> - [OK] Ran release validation script [1]
> - [OK] Built the source (Spark 2/3)
> - [OK] Ran Spark Guide in Quick Start using Spark 3.1.2
>
> [1] https://gist.github.com/yihua/39ef5b07a08ed5780fa9c43819b326cb
>
> Best,
> - Ethan
>
> On Sat, Dec 4, 2021 at 1:27 PM Bhavani Sudha 
> wrote:
>
> > +1 (binding)
> >
> > - [OK] checksums and signatures
> > - [OK] ran validation script
> > - [OK] built successfully
> > - [OK] ran spark quickstart
> > - [OK] Ran few tests in IDE
> >
> >
> >
> > bsaktheeswaran@Bhavanis-MacBook-Pro scripts %
> > ./release/validate_staged_release.sh --release=0.10.0 --rc_num=3
> > /tmp/validation_scratch_dir_001 ~/Sudha/hudi/scripts
> > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > Validating hudi-0.10.0-rc3 with release type "dev"
> > Checking Checksum of Source Release
> > Checksum Check of Source Release - [OK]
> >
> >   % Total% Received % Xferd  Average Speed   TimeTime Time
> >  Current
> >  Dload  Upload   Total   SpentLeft
> >  Speed
> > 100 45904  100 459040 0  85323  0 --:--:-- --:--:-- --:--:--
> > 85165
> > Checking Signature
> > Signature Check - [OK]
> >
> > Checking for binary files in source release
> > No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > License file exists ? [OK]
> > Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > Licensing Check Passed [OK]
> >
> > Running RAT Check
> > RAT Check Passed [OK]
> >
> > Thanks,
> > Sudha
> >
> > On Sat, Dec 4, 2021 at 6:59 AM Vinoth Chandar  wrote:
> >
> > > +1 (binding)
> > >
> > > Ran the RC checks in [1] . This is a huge release, thanks everyone for
> > all
> > > the hard work!
> > >
> > > [1]
> > https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b
> > >
> > > On Sat, Dec 4, 2021 at 5:20 AM Danny Chan 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #3 for the version
> > > 0.10.0,
> > > > as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1],
> > > >
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 9A48922F682AB05D1AE4A3E7C2931E4BDB03D5AE [3],
> > > >
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > >
> > > > * source code tag "release-0.10.0-rc3" [5],
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > >
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350285
> > > >
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc3/
> > > >
> > > > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > > >
> > > > [4]
> > > >
> > > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachehudi-1048/org/apache/hudi/
> > > >
> > > > [5] https://github.com/apache/hudi/tree/release-0.10.0-rc3
> > > >
> > >
> >
>


Re: please give me the contributor permission

2021-12-02 Thread vino yang
Hi,

I have given you Jira contributor permission.
Welcome to Hudi community!

Best,
Vino

hl z wrote on Thu, Dec 2, 2021 at 12:40 AM:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My Confluence ID is zhouhl.
>


Re: apply for permission

2021-09-29 Thread vino yang
Hi yao.zhou:

Done! And welcome to the Hudi community!

Best,
Vino

121 <1058249...@qq.com.invalid> wrote on Wed, Sep 29, 2021 at 5:00 PM:

> Hi, I would like to start contributing to Hudi. Can anyone grant me proper
> privileges in Confluence & Jira?
> username: yao.z...@yuanxi.onaliyun.com
> fullname: yao.zhou
> Thanks in advance!!


Re: Monthly or Bi-Monthly Dev meeting?

2021-09-23 Thread vino yang
+1 for monthly

Best,
Vino

Pratyaksh Sharma wrote on Thu, Sep 23, 2021 at 9:36 PM:

> Monthly should be good. Been a long time since we connected in these
> meetings. :)
>
> On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
> mail.vinoth.chan...@gmail.com> wrote:
>
> > 1 hour monthly is what I was proposing to be specific.
> >
> > On Thu, Sep 23, 2021 at 6:30 AM Gary Li  wrote:
> >
> > > +1 for monthly.
> > >
> > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Once upon a time, we used to have a weekly community sync. Wondering
> if
> > > > there is interest in having a monthly or bi-monthly dev meeting?
> > > >
> > > > Agenda could be
> > > > - Update/Summary of all dev work tracks
> > > > - Show and tell, where people can present their ongoing work
> > > > - Open floor discussions, bring up new issues.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>


Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread vino yang
Hi Ethan,

Big +1 for the proposal.

Actually, we have discussed this topic before.[1]

Will review your refactor PR later.

Best,
Vino

[1]:
https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E


Y Ethan Guo wrote on Wed, Sep 15, 2021 at 3:34 PM:

> Hi all,
>
> hudi-client module has core Hudi abstractions and client logic for
> different engines like Spark, Flink, and Java.  While previous effort
> (HUDI-538 [1]) has decoupled the integration with Spark, there is quite
> some code duplication across different engines for almost the same logic
> due to the current interface design.  Some part also has divergence among
> engines, making debugging and support difficult.
>
> I propose to further refactor the hudi-client module with the goal of
> improving the code reuse across multiple engines and reducing the
> divergence of the logic among them, so that the core Hudi action execution
> logic should be shared across engines, except for engine specific
> transformations.  Such a pattern also allows easy support of core Hudi
> functionality for all engines in the future.  Specifically,
>
> (1) Abstracts the transformation boilerplates inside the
> HoodieEngineContext and implements the engine-specific data transformation
> logic in the subclasses.  Type cast can be done inside the engine context.
> (2) Creates new HoodieData abstraction for passing input and output along
> the flow of execution, and uses it in different Hudi abstractions, e.g.,
> HoodieTable, HoodieIOHandle, BaseActionExecutor, instead of enforcing type
> parameters tied to RDD and List, which are
> one source of duplication.
> (3) Extracts common execution logic to hudi-client-common module from
> multiple engines.
>
> As a first step and exploration for item (1) and (3) above, I've tried to
> refactor the rollback actions and the PR is here [HUDI-2433][2].  In this
> PR, I completely remove all engine-specific rollback packages and only keep
> one rollback package in hudi-client-common, adding ~350 LoC while deleting
> 1.3K LoC.  My next step is to refactor the commit action which encompasses
> item (2) above.
>
> What do you folks think and any other suggestions?
>
> [1] [HUDI-538] [UMBRELLA] Restructuring hudi client module for multi engine
> support
> https://issues.apache.org/jira/browse/HUDI-538
> [2] PR: [HUDI-2433] Refactor rollback actions in hudi-client module
> https://github.com/apache/hudi/pull/3664/files
>
> Best,
> - Ethan
>
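
To make item (2) above concrete, here is a minimal sketch of what an engine-agnostic HoodieData could look like; the names and signatures here are illustrative only, not the final API:

```
// Sketch of item (2): one engine-agnostic handle so action executors are
// written once against HoodieData<T> instead of JavaRDD<T> or List<T>.
import java.util.List;
import java.util.function.Function;

public abstract class HoodieData<T> {
  // Spark would need a serializable function type here; plain Function is
  // used only to keep the sketch self-contained.
  public abstract <O> HoodieData<O> map(Function<T, O> func);

  // Materialize small results for the driver/local code path.
  public abstract List<T> collectAsList();

  // Spark engine: wrap a JavaRDD<T> and delegate map/collect to it.
  // Java engine: wrap a List<T> and implement map via stream().map().
}
```

With something like this, a BaseActionExecutor can express the commit or rollback flow once, and only the HoodieData subclasses differ per engine.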


Re: [ANNOUNCEMENT] CI changes

2021-09-06 Thread vino yang
Awesome! Great job!

Thanks for driving and landing this big infra improvement!

Best,
Vino

Raymond Xu wrote on Sat, Sep 4, 2021 at 9:42 AM:

> Hi all,
>
> As you may have noticed, we have been running Azure Pipelines for the tests
> for some time and have recently retired Travis CI in this PR
> .
>
> Background
>
> It was a pain for the CI process in the past with Travis, which from time
> to time queued up CI jobs forever. This severely affected the developer
> experience for making contributions, and also the release process.
>
> The New Setup
>
> Thanks to the Flink community, who pioneered the CI setup, and MS Azure,
> who provided the free resources, we are able to mirror the repo and PRs to
> a separate GitHub organization  and run
> the tests in Azure Pipelines. Hudi's ci-bot
>  (forked from Flink's ci-bot
> ) runs on a GCP server and
> periodically
> scans recently changed PRs for CI submission. CI results are commented back
> to the PR by hudi-bot . Full details about
> the
> setup are documented in this
> <
> https://cwiki.apache.org/confluence/display/HUDI/Guide+on+CI+infrastructure
> >
> page
> <
> https://cwiki.apache.org/confluence/display/HUDI/Guide+on+CI+infrastructure
> >
> .
>
> Azure Pipelines provides 10 free managed parallel jobs. CI tests are split
> into 5 jobs. We have dedicated resources to test 2 PRs in parallel.
>
>- master builds:
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build?definitionId=3
>- branch builds:
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build?definitionId=5
>
> Note: PRs against asf-site (website updates) will be ignored by this setup.
>
> Additionally, we make use of GitHub Actions to build against different
> Spark and Scala versions. GitHub Actions jobs also provide fast feedback
> for compliance like checkstyle and apache-rat.
>
> For PR Owners and Reviewers
>
> With these changes, PR owners and reviewers should pay attention to the
> following:
>
>- CI results are indicated in hudi-bot's comment
>- A new commit in the same PR will trigger a new build and cancel any
>existing build
>- Comment `@hudi-bot run azure` to manually trigger a new build
>- GitHub Actions jobs will show as checks in the PR
>- Minimum conditions to merge:
>   - Azure CI report shows success, and
>   - GitHub Actions jobs passed
>- For website update PRs (for asf-site branch), owners post screenshots
>to show the changes in lieu of CI tests.
>
>
> Hope this contributes towards a more seamless developer experience. Please
> reach out to the community for CI issues or further questions.
>
>
> Best,
> Raymond
>
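
As a rough illustration of the compliance checks mentioned above, they can usually be reproduced locally before pushing; the plugin goal prefixes below are assumed from the common Maven plugin names, so check the repo's pom.xml for the authoritative setup:

```
# Hedged sketch: run the fast compliance checks locally before pushing.
mvn checkstyle:check     # code style
mvn apache-rat:check     # Apache RAT license-header check
# On an open PR, re-trigger CI by commenting:  @hudi-bot run azure
```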


Re: Apache Hudi release voting process

2021-08-23 Thread vino yang
Hi,

+1 for:

>> We have never encountered an issue like this before. So it's an
opportunity to draft an agreed-upon set of criteria for what qualifies for
a valid -1; this also needs us as a community investing in nightly
integration tests, and more volunteers to ensure our tests are not flaky
and exercise all complex scenarios. Without this, I think we have to
enforce agreed-upon timelines very stringently.

---

In my opinion, we must make a reasonable trade-off between ensuring quality
and following the release plan of the project; otherwise, the release may
be delayed endlessly. It is necessary for us to clearly define what counts
as a "release blocker" issue, since these issues are the reasons for a new RC.

Considering that we are going to release a major version, the RM needs to
ensure sufficient testing and preparation time, and the community may
therefore delay the development or merge plan. So, we should use the "-1"
right very carefully.

"-1" is a right owned by community members, but we need to clearly define
under what circumstances a "-1" is valid.

Best,
Vino

Vinoth Chandar wrote on Mon, Aug 23, 2021 at 9:23 AM:

> Hi all,
>
> First of all, it's great to see us debating around ensuring high quality,
> timely releases. Shows we have developers
> who care and are passionate about the project!
>
> Thanks for establishing the timelines, Siva. I would like to add the
> following data points that all 4 PRs in question
> (raised on the voting thread) have been submitted well after the Aug 13th
> cutoff date we had originally agreed upon.
> If there was communication to the RM or PMC before the cutoff around these
> JIRAs, please chime in. In the absence of
> this, it seems like this is an issue of some last minute feature requests
> and 1 hot fix around SparkSQL. I am pretty
> concerned about setting this precedent here, that RCs can be voted down if
> certain bug fixes did not make it in time.
>
> We have never encountered an issue like this before. So it's an opportunity
> to draft an agreed upon set of criteria
> for what qualifies for a valid -1; this also needs us as a community
> investing in nightly integration tests, and more volunteers
> to ensure our tests are not flaky and exercise all complex scenarios.
> Without this, I think we have to enforce agreed upon timelines
> very stringently.
>
> I added my +1 to 0.9.0, since I perceive all 4 PRs issues to be
> non-blocking (even though the sparkSQL bug is a serious limitation).
> But, I'd still have us honor the timelines we agreed upon, rather than cut
> another RC3 for these.
>
> Apache Voting guidelines explicitly state that "Releases may not be
> vetoed. Generally the community will cancel the release
> vote if anyone identifies serious problems, but in most cases the ultimate
> decision lies with the individual serving as release manager"
>
> Love to hear more from the other PMC members and the RM. Looks like the RM
> has the clear final decision here, unless the majority
> binding votes for the RC cannot be obtained from the PMC.
>
> Thanks
> Vinoth
>
> On Sun, Aug 22, 2021 at 1:25 PM Sivabalan  wrote:
>
> > Hi folks,
> > Wanted to start a thread to discuss our guidelines on the release
> > process with Apache Hudi. You can find our existing release process here
> > <
> >
> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide
> > >.
> > On
> > a high level, our release process is as follows.
> >
> > 1. Call for a release and rough timeline.
> > 2. Nominate a release manager.
> > 3. RM collects all release blockers and starts an email thread. This
> > email is to call for any other jiras/release blockers to be included as
> > part of the release. Also, asks respective owners of release blockers to
> > commit to a time for closing it out.
> > 4. After the deadline, if not all release blockers have landed: a. move
> > any pending release blockers that seem not very critical to the next
> > release; b. if there are some essential release blockers, ask the
> > respective owners to get them to closure and extend the deadlines to get
> > them in.
> > 5. Once all release blockers are landed, works on RC1. Verifies the
> > candidate and puts it out for voting.
> > 6. If approved by majority, proceed with actual release.
> > 7. If not approved by majority, waits for the fix to get merged and
> > works on RC2.
> >
> > Coming to our 0.9.0 release, here is how the timeline looks like.
> >
> > Jul 14: Decided on RM
> > Aug  3: RM sent an email with all release blockers. He called out for
> > missed our release blockers and asked for respective owners to be mindful
> > of the deadline for release.
> > Aug  5: Decided the deadline for release blockers to be landed as 13th
> Aug.
> > Aug 14: All release blockers were landed. Those that could not be landed
> > were rolled over to 0.10.0
> > Aug 15: RC1 was announced.
> > Aug 17: voted -1 by PMC due to config name issue. Existing jobs from
> older
> > hudi

Re: [DISCUSS] Enable Github Discussions

2021-08-11 Thread vino yang
+1

Best,
Vino

Pratyaksh Sharma wrote on Thu, Aug 12, 2021 at 2:16 AM:

> +1
>
> I have never used it, but we can try this out. :)
>
> On Thu, Jul 15, 2021 at 9:43 AM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > I would like to propose that we explore the use of github discussions.
> Few
> > other apache projects have also been trying this out.
> >
> > Please chime in
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Hudi is the data lake platform

2021-07-30 Thread vino yang
+1

Pratyaksh Sharma wrote on Fri, Jul 30, 2021 at 1:47 AM:

> Guess we should rebrand Hudi in the README.md file as well -
> https://github.com/apache/hudi#readme?
>
> This page still mentions the following -
>
> "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> Incrementals. Hudi manages the storage of large analytical datasets on
> DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
>
> On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar  wrote:
>
>> Thanks Vino! Got a bunch of emoticons on the PR as well.
>>
>> Will land this monday, giving it more time over the weekend as well.
>>
>>
>> On Wed, Jul 21, 2021 at 7:36 PM vino yang  wrote:
>>
>> > Thanks vc
>> >
>> > Very good blog, in-depth and forward-looking. I learned a lot!
>> >
>> > Best,
>> > Vino
>> >
>> > Vinoth Chandar wrote on Thu, Jul 22, 2021 at 3:58 AM:
>> >
>> > > Expanding to users@ as well.
>> > >
>> > > Hi all,
>> > >
>> > > Since this discussion, I started to pen down a coherent strategy and
>> > convey
>> > > these ideas via a blog post.
>> > > I have also done my own research, talked to (ex)colleagues I respect
>> to
>> > get
>> > > their take and refine it.
>> > >
>> > > Here's a blog that hopefully explains this vision.
>> > >
>> > > https://github.com/apache/hudi/pull/3322
>> > >
>> > > Look forward to your feedback on the PR. We are hoping to land this
>> early
>> > > next week, if everyone is aligned.
>> > >
>> > > Thanks
>> > > Vinoth
>> > >
>> > > On Wed, Apr 21, 2021 at 9:01 PM wei li  wrote:
>> > >
>> > > > +1, cannot agree more.
>> > > > *aux metadata* and the metadata table can give Hudi large performance
>> > > > optimizations on the query side, and can be developed continuously.
>> > > > A cache service may be a necessary component in a cloud-native
>> > > > environment.
>> > > >
>> > > > On 2021/04/13 05:29:55, Vinoth Chandar  wrote:
>> > > > > Hello all,
>> > > > >
>> > > > > Reading one more article today, positioning Hudi, as just a table
>> > > format,
>> > > > > made me wonder, if we have done enough justice in explaining what
>> we
>> > > have
>> > > > > built together here.
>> > > > > I tend to think of Hudi as the data lake platform, which has the
>> > > > following
>> > > > > components, of which - one if a table format, one is a
>> transactional
>> > > > > storage layer.
>> > > > > But the whole stack we have is definitely worth more than the sum
>> of
>> > > all
>> > > > > the parts IMO (speaking from my own experience from the past 10+
>> > years
>> > > of
>> > > > > open source software dev).
>> > > > >
>> > > > > Here's what we have built so far.
>> > > > >
>> > > > > a) *table format* : something that stores table schema, a metadata
>> > > table
>> > > > > that stores file listing today, and being extended to store column
>> > > ranges
>> > > > > and more in the future (RFC-27)
>> > > > > b) *aux metadata* : bloom filters, external record level indexes
>> > today,
>> > > > > bitmaps/interval trees and other advanced on-disk data structures
>> > > > tomorrow
>> > > > > c) *concurrency control* : we always supported MVCC based log
>> based
>> > > > > concurrency (serialize writes into a time ordered log), and we now
>> > also
>> > > > > have OCC for batch merge workloads with 0.8.0. We will have
>> > multi-table
>> > > > and
>> > > > > fully non-blocking writers soon (see future work section of
>> RFC-22)
>> > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
>> > Hudi,
>> > > > but
>> > > > > we support primary/unique key constraints and we could add foreign
>> > keys
>> > > > as
>> > > > > an extension, once our transactions can span tables.
>> > > > > e) *table services*: a hudi pipeline today is self-managing -
>> sizes
>>

Re: [DISCUSS] Hudi is the data lake platform

2021-07-21 Thread vino yang
Thanks vc

Very good blog, in-depth and forward-looking. I learned a lot!

Best,
Vino

Vinoth Chandar wrote on Thu, Jul 22, 2021 at 3:58 AM:

> Expanding to users@ as well.
>
> Hi all,
>
> Since this discussion, I started to pen down a coherent strategy and convey
> these ideas via a blog post.
> I have also done my own research, talked to (ex)colleagues I respect to get
> their take and refine it.
>
> Here's a blog that hopefully explains this vision.
>
> https://github.com/apache/hudi/pull/3322
>
> Look forward to your feedback on the PR. We are hoping to land this early
> next week, if everyone is aligned.
>
> Thanks
> Vinoth
>
> On Wed, Apr 21, 2021 at 9:01 PM wei li  wrote:
>
> > +1, cannot agree more.
> > *aux metadata* and the metadata table can give Hudi large performance
> > optimizations on the query side, and can be developed continuously.
> > A cache service may be a necessary component in a cloud-native environment.
> >
> > On 2021/04/13 05:29:55, Vinoth Chandar  wrote:
> > > Hello all,
> > >
> > > Reading one more article today, positioning Hudi, as just a table
> format,
> > > made me wonder, if we have done enough justice in explaining what we
> have
> > > built together here.
> > > I tend to think of Hudi as the data lake platform, which has the
> > following
> > > components, of which - one if a table format, one is a transactional
> > > storage layer.
> > > But the whole stack we have is definitely worth more than the sum of
> all
> > > the parts IMO (speaking from my own experience from the past 10+ years
> of
> > > open source software dev).
> > >
> > > Here's what we have built so far.
> > >
> > > a) *table format* : something that stores table schema, a metadata
> table
> > > that stores file listing today, and being extended to store column
> ranges
> > > and more in the future (RFC-27)
> > > b) *aux metadata* : bloom filters, external record level indexes today,
> > > bitmaps/interval trees and other advanced on-disk data structures
> > tomorrow
> > > c) *concurrency control* : we always supported MVCC based log based
> > > concurrency (serialize writes into a time ordered log), and we now also
> > > have OCC for batch merge workloads with 0.8.0. We will have multi-table
> > and
> > > fully non-blocking writers soon (see future work section of RFC-22)
> > > d) *updates/deletes* : this is the bread-and-butter use-case for Hudi,
> > but
> > > we support primary/unique key constraints and we could add foreign keys
> > as
> > > an extension, once our transactions can span tables.
> > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > files,
> > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > actions working off each other without blocking one another. (for most
> > > parts).
> > > f) *data services*: we also have higher level functionality with
> > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > callbacks, pre-commit validations are coming, error tables have been
> > > proposed. I could also envision us building towards streaming egress,
> > data
> > > monitoring.
> > >
> > > I also think we should build the following (subject to separate
> > > DISCUSS/RFCs)
> > >
> > > g) *caching service*: Hudi specific caching service that can hold
> mutable
> > > data and serve oft-queried data across engines.
> > > h) t*imeline metaserver:* We already run a metaserver in spark
> > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > turn
> > > it into a scalable, sharded metastore, that all engines can use to
> obtain
> > > any metadata.
> > >
> > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed
> to
> > > "ingests & manages storage of large analytical datasets over DFS (hdfs
> or
> > > cloud stores)." and convey the scope of our vision,
> > > given we have already been building towards that. It would also provide
> > new
> > > contributors a good lens to look at the project from.
> > >
> > > (This is very similar to for e.g, the evolution of Kafka from a pub-sub
> > > system, to an event streaming platform - with addition of
> > > MirrorMaker/Connect etc. )
> > >
> > > Please share your thoughts!
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>


Re: [VOTE] Move content off cWiki

2021-07-20 Thread vino yang
+1

Navinder Brar wrote on Tue, Jul 20, 2021 at 11:01 AM:

> +1
> Navinder
>
>
> Sent from Yahoo Mail for iPhone
>
>
> On Tuesday, July 20, 2021, 7:28 AM, Sivabalan  wrote:
>
> +1
>
> On Mon, Jul 19, 2021 at 9:19 PM Nishith  wrote:
>
> > +1
> >
> > -Nishith
> >
> > > On Jul 19, 2021, at 6:15 PM, Udit Mehrotra 
> > wrote:
> > >
> > > +1
> > >
> > > Best,
> > > Udit
> > >
> > >> On Mon, Jul 19, 2021 at 6:04 PM wangxianghu  wrote:
> > >>
> > >> +1 - Approve the move
> > >>
> > >>> On Jul 20, 2021, at 8:37 AM, Danny Chan wrote:
> > >>>
> > >>> +1 - Approve the move
> > >>
> > >>
> >
>
>
> --
> Regards,
> -Sivabalan
>
>
>
>


Re: Welcome our PMC Member, Raymond Xu

2021-07-16 Thread vino yang
Congrats! Well deserved!

Best,
Vino

Vinoth Chandar wrote on Sat, Jul 17, 2021 at 8:28 AM:

> Folks,
>
> I am incredibly happy to share the addition of Raymond Xu to the Hudi PMC.
> Raymond has been a valuable member of our community over the past few
> years now, always hustling and taking on the most underappreciated but
> extremely valuable aspects of the project, most recently getting our
> tests working smoothly on Azure CI!
>
> Please join me in congratulating Raymond!
>
> Onwards,
> Vinoth
>


Re: Welcome New Committers: Pengzhiwei and DannyChan

2021-07-16 Thread vino yang
Congratulations to both of you! Well deserved!

Best,
Vino

leesf wrote on Fri, Jul 16, 2021 at 6:38 PM:

> Hi all,
>
> Please join me in congratulating our newest committers *Pengzhiwei *and
> * DannyChan.*
>
> *Pengzhiwei* has been a consistent contributor to Hudi. He has
> contributed numerous features, such as the Spark SQL integration with
> Hudi, the Spark Structured Streaming source for Hudi, and the Spark
> FileIndex for Hudi, along with lots of other good contributions around
> Spark, and he is also very active in answering users' questions. He is a
> solid team player and an asset to the project.
>
> *DannyChan* has contributed many good features, such as the new streaming
> write pipeline for Flink with automatic compaction and cleaning (COW and
> MOR), the batch and streaming readers for Flink (COW and MOR), and the
> Flink SQL connectors (reader and writer). He actively joins the ML and
> answers users' questions; he also wrote a Hudi Flink integration guide and
> launched a live show to promote the Hudi Flink integration for Chinese users.
>
> Thanks so much for your continued contributions to make Hudi better and
> better!
>
> Also, I would like to introduce the current state of Hudi in China. Hudi
> is becoming more and more popular in China with the help of all community
> members, and has been adopted by almost all top companies in China,
> including Alibaba, Baidu, ByteDance, Huawei, Tencent and others, from
> startups to large companies, with data scale from TB to PB. You can find
> the logo wall below (PS: *unofficial statistics*; it lists just some of
> them, and you can contact me to add your company logo if wanted).
>
> We would not achieve this without such a good community and the
> contribution of all community members. Cheers and Go!
>
> [image: poweredby-0706.png]
>
> Thanks,
> Leesf
>


Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-15 Thread vino yang
+1 for option B.

Best,
Vino

Sivabalan wrote on Fri, Jul 16, 2021 at 10:35 AM:

> +1 on B. Not sure on A though. I understand the intent to have it all in
> one place, but I am not very sure we can get all the functionality
> (version, type, component, status, parent-child relation, etc.) ported
> over to GitHub. I assume labels are the only option we have to achieve
> these. Probably, we should also document the labels in detail so that
> anyone looking at untriaged issues knows how/where to look.
> If we plan to use GH issues for all, I am sure there will be a lot of
> proliferation of issues.
>
> On Fri, Jul 9, 2021 at 12:29 PM Vinoth Chandar  wrote:
>
> > Based on this, I will start consolidating more of the cWiki content to
> > github wiki and master branch?
> >
> > JIRA vs GH Issue still probably needs more feedback. I do see the
> tradeoffs
> > there.
> >
> > On Fri, Jul 9, 2021 at 2:39 AM wei li  wrote:
> >
> > > +1
> > >
> > > On 2021/07/02 03:40:51, Vinoth Chandar  wrote:
> > > > Hi all,
> > > >
> > > > When we incubated Hudi, we made some initial choices around
> > collaboration
> > > > tools of choice. I am wondering if they are still optimal, given the
> > > scale
> > > > of the community at this point.
> > > >
> > > > Specifically, two points.
> > > >
> > > > A) Our issue tracker is JIRA, while we just use Github Issues for
> > support
> > > > triage. While JIRA is pretty advanced and gives us the ability to
> track
> > > > releases, versions and kanban boards, there are few practical
> > operational
> > > > problems.
> > > >
> > > > - Developers often open bug fixes/PR which all need to be
> continuously
> > > > tagged against a release version (fix version)
> > > > - Referencing JIRAs from Pull Requests is not great (we cannot do
> > > > things like `fixes #1234` to close issues when the PR lands, and
> > > > there is no easy way to click through to the JIRA)
> > > > - Many more developers have a github account, to contribute to Hudi
> > > though,
> > > > they need an additional sign-up on jira.
> > > >
> > > > So wondering if we should just use one thing - Github Issues, and
> build
> > > > scripts/hubot or something to get the missing project management from
> > > > boards.
> > > >
> > > > B) Our design docs are on cWiki. Even though we link it off the site,
> > > from
> > > > my experience, many do not discover them.
> > > > For large PRs, we need to manually enforce that design and code are
> in
> > > sync
> > > > before we land. If we can, I would love to make RFC being in good
> > shape a
> > > > pre-requisite for landing the PR.
> > > > Once again, separate signup is needed to write design docs or comment
> > on
> > > > them.
> > > >
> > > > So, wondering if we can move our process docs etc into Github Wiki
> and
> > > RFCs
> > > > to the master branch in a rfc folder, and we just use github PRs to
> > raise
> > > > RFCs and discuss them.
> > > >
> > > > This all also makes it easy for us to measure community activity and
> > keep
> > > > streamlining our processes.
> > > >
> > > > personally, these different channels are overwhelming to me at-least
> :)
> > > >
> > > > Love to hear thoughts. Please specify if you are for,against each of
> A
> > > and
> > > > B.
> > > >
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: Confluence & Jira Permissions

2021-06-10 Thread vino yang
Hi Thiru,

Done and Welcome to Hudi community!

Best,
Vino

Thiru Malai wrote on Wed, Jun 9, 2021 at 10:43 PM:

> Hi,
>
> I would like to start contributing to Hudi. Can anyone grant me proper
> privileges in Confluence & Jira?
> username: thirumalai.raj
>
> Thanks in advance!!
>
> Regards
> Thirumalai Raj
>


Re: Request contributor permission

2021-06-09 Thread vino yang
Done and welcome!

Best,
Vino

w.gh123 wrote on Wed, Jun 9, 2021 at 5:11 PM:

> Hi,
>
>
> I want to contribute to Apache Hudi.
> Would you please give me the contributor permission?
> My JIRA ID is hapihu


Re: permission opening application

2021-06-03 Thread vino yang
Hi,

Can you clarify whether you need Jira permission or Confluence permission?
They are different.

Best,
Vino

曹明 <13760436...@163.com> wrote on Thu, Jun 3, 2021 at 5:49 PM:

> How can I initiate an RFC?


Re: jira necessary permission

2021-05-26 Thread vino yang
Hi taylor,

I have given you jira contributor permission.
Welcome and look forward to your contribution.

Best,
Vino

廖辉轩 <726830...@qq.com> wrote on Wed, May 26, 2021 at 9:20 AM:

> Hi,
> I want to contribute to Apache Hudi.
> Could you please give me the necessary permission?
> Jira id - taylor liao
>
>
> Regards,
> taylor liao


Re: [DISCUSS] Improving hudi user experience by providing more ways to configure hudi jobs

2021-05-24 Thread vino yang
Also +1.

IMO, simplifying the complexity of configuration and reducing the cost of
entry for new users are very important for improving user experience.

It is a good proposal to simplify the configuration complexity by
introducing some built-in enumerations.

But at the same time, it is necessary to allow the fully qualified name of
the configuration class (for advanced requirements that need
self-extension).

Best,
Vino

Pratyaksh Sharma wrote on Sat, May 22, 2021 at 8:24 PM:

> +1 from my side.
>
> Introducing new configs based on types definitely improves user experience
> as compared to supplying full class names. We just need to define the enums
> properly.
>
> On Sat, May 22, 2021 at 9:13 AM wangxianghu  wrote:
>
> > Hi community:
> >
> >
> >
> > Here I want to start a discussion about improving the hudi user
> experience.
> >
> >
> >
> >
> > Now Hudi has more and more users all over the world, but most of them
> > don't know Hudi like Uber engineers or us.
> > When they start Hudi tasks, they need to do a lot of configuration, much
> > of which is not user-friendly.
> > which are not user-friendly.
> >
> >
> >
> >
> > such as:
> > ```
> >
> > hoodie.datasource.write.keygenerator.class   ->
> > org.apache.hudi.keygen.SimpleKeyGenerator
> >
> > hoodie.datasource.write.payload.class ->
> > org.apache.hudi.OverwriteWithLatestAvroPayload`
> >
> > --schemaprovider-class` -> subclass of org.apache.hudi.utilities.schema
> >
> > --transformer-class -> full class names to act transform
> >
> > --sync-tool-classes -> full class names of sync tool
> >
> > --source-class -> Subclass of org.apache.hudi.utilities.sources
> > ...
> > ```
> >
> > I think asking users to provide the full name of the class is not very
> > friendly, especially for new users.
> >
> > so, maybe we can provide more ways to configure parameters, just like the
> > case of `HoodieIndex`.
> >
> >
> >
> >
> > In `HoodieIndex` case, The users can configure one of the index type or
> > index class names to tell hudi which index to use.
> >
> > ```
> >
> > hoodie.index.type -> HBASE
> >
> > ```
> >
> > or
> >
> > ```
> >
> > hoodie.index.class -> org.apache.hudi.index.hbase.SparkHoodieHBaseIndex
> >
> > ```
> >
> > I believe more users like the `hoodie.index.type` way.
> >
> >
> >
> >
> > So, I think we can make some of the configurations above support being
> > set by type, and keep the class-name way of configuration at the same
> > time, in case some users need to customize functions on their own.
> >
> >
> >
> >
> > I'm looking forward to your feedback. Any suggestions are appreciated
>
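
Applied to the key generator example above, the proposal would read something like the following; the `.type` key and its enum value below are illustrative of the idea, mirroring the existing `hoodie.index.type` pattern, not a confirmed config name:

```
hoodie.datasource.write.keygenerator.type -> SIMPLE
```

while still allowing the class-based form for custom implementations:

```
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.SimpleKeyGenerator
```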


Re: Welcome new committers and PMC Members!

2021-05-11 Thread vino yang
Congrats to Gary and Wenning!

wangxianghu wrote on Wed, May 12, 2021 at 8:40 AM:

> Congratulations @Gary Li and @Wenning Ding!
>
> > On May 12, 2021, at 7:18 AM, Prashant Wason wrote:
> >
> > Congratulations Gary and Wenning!
> >
> > On Tue, May 11, 2021 at 3:59 PM Raymond Xu 
> > wrote:
> >
> >> Big congrats to Gary and Wenning!
> >>
> >> On Tue, May 11, 2021 at 1:14 PM vbal...@apache.org 
> >> wrote:
> >>
> >>> Many Congratulations Gary Li and Wenning Ding. Well deserved !!
> >>> Balaji.V
> >>>On Tuesday, May 11, 2021, 01:06:47 PM PDT, Bhavani Sudha <
> >>> bhavanisud...@gmail.com> wrote:
> >>>
> >>> Congratulations @Gary Li and @Wenning Ding!
> >>> On Tue, May 11, 2021 at 12:42 PM Vinoth Chandar 
> >> wrote:
> >>>
> >>> Hello all,
> >>> Please join me in congratulating our newest set of committers and PMCs.
> >>>
> >>> Wenning Ding (Committer): Wenning has been a consistent contributor to
> >>> Hudi over the past year or so. He has added some critical bug fixes and
> >>> lots of good contributions around Spark!
> >>>
> >>> Gary Li (PMC Member): Gary is a regular feature on all our support
> >>> channels. He has contributed numerous features to Hudi, and evangelized
> >>> across many companies including Bosch/ByteDance. Most of all, he is a
> >>> solid team player and an asset to the project.
> >>>
> >>> Thanks so much for your continued contributions, to make Hudi better
> >>> and better!
> >>>
> >>> Thanks,
> >>> Vinoth
> >>>
> >>>
> >>
>
>


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread vino yang
+1 Excited by this new vision!

Best,
Vino

Dianjin Wang wrote on Tue, Apr 13, 2021 at 3:53 PM:

> +1  The new brand is straightforward, a better description of Hudi.
>
> Best,
> Dianjin Wang
>
>
> On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha 
> wrote:
>
> > +1 . Cannot agree more. I think this makes total sense and will provide
> for
> > a much better representation of the project.
> >
> > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar 
> wrote:
> >
> > > Hello all,
> > >
> > > Reading one more article today, positioning Hudi, as just a table
> format,
> > > made me wonder, if we have done enough justice in explaining what we
> have
> > > built together here.
> > > I tend to think of Hudi as the data lake platform, which has the
> > following
> > > components, of which - one if a table format, one is a transactional
> > > storage layer.
> > > But the whole stack we have is definitely worth more than the sum of
> all
> > > the parts IMO (speaking from my own experience from the past 10+ years
> of
> > > open source software dev).
> > >
> > > Here's what we have built so far.
> > >
> > > a) *table format* : something that stores table schema, a metadata
> table
> > > that stores file listing today, and being extended to store column
> ranges
> > > and more in the future (RFC-27)
> > > b) *aux metadata* : bloom filters, external record level indexes today,
> > > bitmaps/interval trees and other advanced on-disk data structures
> > tomorrow
> > > c) *concurrency control* : we always supported MVCC based log based
> > > concurrency (serialize writes into a time ordered log), and we now also
> > > have OCC for batch merge workloads with 0.8.0. We will have multi-table
> > and
> > > fully non-blocking writers soon (see future work section of RFC-22)
> > > d) *updates/deletes* : this is the bread-and-butter use-case for Hudi,
> > but
> > > we support primary/unique key constraints and we could add foreign keys
> > as
> > > an extension, once our transactions can span tables.
> > > e) *table services*: a hudi pipeline today is self-managing - sizes
> > files,
> > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > actions working off each other without blocking one another. (for most
> > > parts).
> > > f) *data services*: we also have higher level functionality with
> > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > callbacks, pre-commit validations are coming, error tables have been
> > > proposed. I could also envision us building towards streaming egress,
> > data
> > > monitoring.
> > >
> > > I also think we should build the following (subject to separate
> > > DISCUSS/RFCs)
> > >
> > > g) *caching service*: Hudi specific caching service that can hold
> mutable
> > > data and serve oft-queried data across engines.
> > > h) t*imeline metaserver:* We already run a metaserver in spark
> > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's
> > turn
> > > it into a scalable, sharded metastore, that all engines can use to
> obtain
> > > any metadata.
> > >
> > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed
> to
> > > "ingests & manages storage of large analytical datasets over DFS (hdfs
> or
> > > cloud stores)." and convey the scope of our vision,
> > > given we have already been building towards that. It would also provide
> > new
> > > contributors a good lens to look at the project from.
> > >
> > > (This is very similar to for e.g, the evolution of Kafka from a pub-sub
> > > system, to an event streaming platform - with addition of
> > > MirrorMaker/Connect etc. )
> > >
> > > Please share your thoughts!
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>


Re: Apache Hudi 0.8.0 Released

2021-04-09 Thread vino yang
Thanks Gary, great work!

Best,
Vino

Danny Chan wrote on Sat, Apr 10, 2021 at 10:27 AM:

> Cheers ~
>
> Best,
> Danny Chan
>
> Vinoth Chandar wrote on Sat, Apr 10, 2021 at 12:43 AM:
>
> > Thanks Gary! +1 fantastic job with the release!
> >
> > Please also announce on Slack (if not done already)
> >
> > I shared some tweets at https://twitter.com/apachehudi
> >
> > On Fri, Apr 9, 2021 at 7:44 AM leesf  wrote:
> >
> > > Thanks Gary for driving the release, great job.
> > >
> > > Pratyaksh Sharma wrote on Fri, Apr 9, 2021 at 10:40 PM:
> > >
> > > > Great news!
> > > >
> > > > On Fri, Apr 9, 2021 at 11:42 AM Sivabalan 
> wrote:
> > > >
> > > > > Awesome! Great job Gary on the release work!
> > > > >
> > > > > On Fri, Apr 9, 2021 at 1:59 AM Gary Li 
> > > wrote:
> > > > >
> > > > > > Thanks Vinoth.
> > > > > >
> > > > > > The page for 0.8.0 is ready
> > > > > > https://hudi.apache.org/docs/0.8.0-spark_quick-start-guide.html.
> > > > > > The release note could be found here
> > > > > https://hudi.apache.org/releases.html
> > > > > >
> > > > > > Best,
> > > > > > Gary Li
> > > > > >
> > > > > > On Thu, Apr 8, 2021 at 12:15 AM Vinoth Chandar <
> vin...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > This is awesome! Thanks for sharing, Gary!
> > > > > > >
> > > > > > > Are we waiting for the site to be rendered with 0.8.0 release
> > info
> > > > and
> > > > > > > homepage update?
> > > > > > >
> > > > > > > On Wed, Apr 7, 2021 at 7:54 AM Gary Li <
> yanjia.gary...@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > We are excited to share that Apache Hudi 0.8.0 has been
> > > > > > > > released. Since the 0.7.0 release, we resolved 97 JIRA tickets
> > > > > > > > and made 120 code commits. We implemented many new features,
> > > > > > > > bug fixes, and performance improvements. Thanks to all the
> > > > > > > > contributors who made this happen.
> > > > > > > >
> > > > > > > > *Release Highlights*
> > > > > > > >
> > > > > > > > *Flink Integration*
> > > > > > > > Since the initial support for the Hudi Flink Writer in the
> > 0.7.0
> > > > > > release,
> > > > > > > > the Hudi community made great progress on improving the
> > > Flink/Hudi
> > > > > > > > integration, including redesigning the Flink writer pipeline
> > with
> > > > > > better
> > > > > > > > performance and scalability, state-backed indexing with
> > bootstrap
> > > > > > > support,
> > > > > > > > Flink writer for MOR table, batch reader for COW&MOR table,
> > > > streaming
> > > > > > > > reader for MOR table, and Flink SQL connector for both source
> > and
> > > > > sink.
> > > > > > > In
> > > > > > > > the 0.8.0 release, the user is able to use all those features
> > > with
> > > > > > Flink
> > > > > > > > 1.11+.
> > > > > > > >
> > > > > > > > Please see [RFC-24](
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal
> > > > > > > > )
> > > > > > > > for more implementation details of the Flink writer and
> follow
> > > this
> > > > > > > [page](
> > > > > > > > https://hudi.apache.org/docs/flink-quick-start-guide.html)
> to
> > > get
> > > > > > > started
> > > > > > > > with Flink!
> > > > > > > >
> > > > > > > > *Parallel Writers Support*
> > > > > > > > As many users requested, now Hudi supports multiple ingestion
> > > > writers
> > > > > > to
> > > > > > > > the same Hudi Table with optimistic concurrency control. Hudi
> > > > > supports
> > > > > > > file
> > > > > > > > level OCC, i.e., for any 2 commits (or writers) happening to
> > the
> > > > same
> > > > > > > > table, if they do not have writes to overlapping files being
> > > > changed,
> > > > > > > both
> > > > > > > > writers are allowed to succeed. This feature is currently
> > > > > experimental
> > > > > > > and
> > > > > > > > requires either Zookeeper or HiveMetastore to acquire locks.
> > > > > > > >
> > > > > > > > Please see [RFC-22](
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers
> > > > > > > > )
> > > > > > > > for more implementation details and follow this [page](
> > > > > > > > https://hudi.apache.org/docs/concurrency_control.html) to
> get
> > > > > started
> > > > > > > with
> > > > > > > > concurrency control!
> > > > > > > >
> > > > > > > > *Writer side improvements*
> > > > > > > > - InsertOverwrite Support for Flink writer client.
> > > > > > > > - Support CopyOnWriteTable in Java writer client.
> > > > > > > >
> > > > > > > > *Query side improvements*
> > > > > > > > - Support Spark Structured Streaming read from Hudi table.
> > > > > > > > - Performance improvement of Metadata table.
> > > > > > > > - Performance improvement of Clustering.
> > > > > > > >
> > > > > > > > *Raw Release Notes*
> > > > > 
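
For readers trying out the parallel-writers feature described above, here is a minimal sketch of the writer configuration with the Zookeeper-based lock provider; the property names are as documented for the 0.8.0 concurrency-control page linked above, while the Zookeeper address, port, lock key, and base path values are placeholders:

```
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=my_table
hoodie.write.lock.zookeeper.base_path=/hudi/locks
```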

Re: Request for JIRA permissions for ticket assignment

2021-04-06 Thread vino yang
Hi susu,

I have given you Jira contributor permission.

Welcome!

Best,
Vino

Susu Dong wrote on Mon, Apr 5, 2021 at 12:37 AM:

> Hi Hudi team,
>
> My name is Susu, and I am a software engineer currently based in Tokyo.
> We’re heavy Hudi users within our platform, and I personally would love to
> contribute to the Hudi project. :)
> Particularly, this request is for me to work on ticket HUDI-1740
>  as a start, in which I
> was involved in some deep discussions for issue-2707
> .
>
> Please kindly grant me permissions to access JIRA, and my JIRA username is
> *susudong*. Thanks!
>
> Best regards,
> Susu
>


Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread vino yang
>> Oops, the image is broken; for "change flags", I mean: insert,
>> update (before and after) and delete.

Yes, the image I attached is also about these flags.
[image: image (3).png]

+1 for the idea.

Best,
Vino


Danny Chan  于2021年4月1日周四 上午10:03写道:

> Oops, the image is broken; for "change flags", I mean: insert, update (before
> and after) and delete.
>
> The Flink engine can propagate the change flags internally between its
> operators, if HUDI can send the change flags to Flink, the incremental
> calculation of CDC would be very natural (almost transparent to users).
>
> Best,
> Danny Chan
>
> vino yang  于2021年3月31日周三 下午11:32写道:
>
> > Hi Danny,
> >
> > Thanks for kicking off this discussion thread.
> >
> > Yes, incremental query( or says "incremental processing") has always been
> > an important feature of the Hudi framework. If we can make this feature
> > better, it will be even more exciting.
> >
> > In the data warehouse, in some complex calculations, I have not found a
> > good way to conveniently use some incremental change data (similar to the
> > concept of a retraction stream in Flink?) to locally "correct" the
> > aggregation result (these aggregation results may belong to the DWS
> > layer).
> >
> > BTW: Yes, I do admit that some simple calculation scenarios (a single
> > table, or an algorithm that can be very easily retracted) can be handled
> > based on the incremental calculation of CDC.
> >
> > Of course, what "incremental calculation" means in various contexts is
> > sometimes not very clear. Maybe we can discuss it more concretely in
> > specific scenarios.
> >
> > >> If HUDI can keep and propagate these change flags to its consumers, we
> > can
> > use HUDI as the unified format for the pipeline.
> >
> > Regarding the "change flags" here, do you mean the flags like the one
> > shown in the figure below?
> >
> > [image: image.png]
> >
> > Best,
> > Vino
> >
> > Danny Chan  于2021年3月31日周三 下午6:24写道:
> >
>> Hi dear HUDI community ~ Here I want to start a discussion about using HUDI
>> as the unified storage/format for data warehouse/lake incremental
>> computation.
> >>
>> Usually people divide data warehouse production into several levels, such
>> as the ODS (operation data store), DWD (data warehouse details), DWS (data
>> warehouse service), ADS (application data service).
> >>
> >>
> >> ODS -> DWD -> DWS -> ADS
> >>
>> In the NEAR-REAL-TIME (or pure realtime) computation cases, a big topic is
>> syncing the change log (CDC pattern) from all kinds of RDBMS into the
>> warehouse/lake; the CDC pattern records and propagates the change flags
>> (insert, update before/after, and delete) for the consumer, and with these
>> flags, the downstream engines can run realtime accumulation computation.
> >>
>> Using a streaming engine like Flink, we can have a fully NEAR-REAL-TIME
>> computation pipeline for each of the layers.
> >>
> >> If HUDI can keep and propagate these change flags to its consumers, we
> can
> >> use HUDI as the unified format for the pipeline.
> >>
> >> I'm expecting your nice ideas here ~
> >>
> >> Best,
> >> Danny Chan
> >>
> >
>


Re: [VOTE] Release 0.8.0, release candidate #1

2021-03-31 Thread vino yang
+1 binding

- ran `mvn clean package -DskipTests` [OK]
- quick start (Spark 2.x, 3.x) [OK]
- checked signature [OK]

Best,
Vino


Sivabalan  于2021年3月31日周三 下午12:32写道:

> +1 binding
>
> - Compilation Ok
> - Quick start utils w/ spark3 Ok
> - checksum Ok
> - release validation script Ok
> - Ran hudi test suite jobs. {COW, MOR} * {regular, metadata_enabled} 50
> iterations w/ validating cleaning and archival. Ok
>
> ---
> Checksum
> shasum -a 512 hudi-0.8.0-rc1.src.tgz > sha512
> diff sha512 hudi-0.8.0-rc1.src.tgz.sha512
>
> gpg --verify hudi-0.8.0-rc1.src.tgz.asc
> gpg: assuming signed data in 'hudi-0.8.0-rc1.src.tgz'
> gpg: Signature made Mon Mar 29 10:58:46 2021 EDT
> gpg: using RSA key E2A9714E0FBA3A087BDEE655E72873D765D6C406
> gpg: Good signature from "YanJia Li " [unknown]
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:  There is no indication that the signature belongs to the
> owner.
> Primary key fingerprint: E2A9 714E 0FBA 3A08 7BDE  E655 E728 73D7 65D6 C406
>
> Validation script:
> ./release/validate_staged_release.sh --release=0.8.0 --rc_num=1
> --release_type=dev
> /tmp/validation_scratch_dir_001
> ~/Documents/personal/projects/siva_hudi/temp_hudi/hudi-0.8.0-rc1/scripts
> Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> Validating hudi-0.8.0-rc1 with release type "dev"
> Checking Checksum of Source Release
> Checksum Check of Source Release - [OK]
>
>   % Total% Received % Xferd  Average Speed   TimeTime Time
>  Current
>  Dload  Upload   Total   SpentLeft
>  Speed
> 100 38466  100 384660 0   171k  0 --:--:-- --:--:-- --:--:--
>  171k
> Checking Signature
> Signature Check - [OK]
>
> Checking for binary files in source release
> No Binary Files in Source Release? - [OK]
>
> Checking for DISCLAIMER
> DISCLAIMER file exists ? [OK]
>
> Checking for LICENSE and NOTICE
> License file exists ? [OK]
> Notice file exists ? [OK]
>
> Performing custom Licensing Check
> Licensing Check Passed [OK]
>
> Running RAT Check
> RAT Check Passed [OK]
>
>
>
>
>
>
> On Tue, Mar 30, 2021 at 3:48 PM Bhavani Sudha 
> wrote:
>
> > +1 (binding)
> >
> > - compile ok
> > - quickstart ok
> > - checksum ok
> > - ran some ide tests - ok
> > - release validation script - ok
> > /tmp/validation_scratch_dir_001 ~/Downloads/hudi-0.8.0-rc1/scripts
> > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > Validating hudi-0.8.0-rc1 with release type "dev"
> > Checking Checksum of Source Release
> > Checksum Check of Source Release - [OK]
> >
> >   % Total% Received % Xferd  Average Speed   TimeTime Time
> >  Current
> >  Dload  Upload   Total   SpentLeft
> >  Speed
> > 100 38466  100 384660 0  77709  0 --:--:-- --:--:-- --:--:--
> > 77709
> > Checking Signature
> > Signature Check - [OK]
> >
> > Checking for binary files in source release
> > No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > License file exists ? [OK]
> > Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > Licensing Check Passed [OK]
> >
> > Running RAT Check
> > RAT Check Passed [OK]
> >
> >
> >
> > On Mon, Mar 29, 2021 at 9:35 AM Gary Li 
> wrote:
> >
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #1 for the version
> 0.8.0,
> > > as follows:
> > >
> > > [ ] +1, Approve the release
> > >
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > >
> > >
> > > The complete staging area is available for your review, which includes:
> > >
> > > * JIRA release notes [1],
> > >
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint E2A9714E0FBA3A087BDEE655E72873D765D6C406 [3],
> > >
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > >
> > > * source code tag "release-0.8.0-rc1" [5],
> > >
> > >
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Release Manager
> > >
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12349423
> > >
> > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.8.0-rc1/
> > >
> > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > >
> > > [4]
> > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachehudi-1032/org/apache/hudi/
> > >
> > > [5] https://github.com/apache/hudi/tree/release-0.8.0-rc1
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread vino yang
Hi Danny,

Thanks for kicking off this discussion thread.

Yes, incremental query (or, say, "incremental processing") has always been
an important feature of the Hudi framework. If we can make this feature
better, it will be even more exciting.

In the data warehouse, in some complex calculations, I have not found a
good way to conveniently use some incremental change data (similar to the
concept of a retraction stream in Flink?) to locally "correct" the
aggregation result (these aggregation results may belong to the DWS layer).

BTW: Yes, I do admit that some simple calculation scenarios (a single table,
or an algorithm that can be very easily retracted) can be handled based on
the incremental calculation of CDC.

Of course, what "incremental calculation" means in various contexts is
sometimes not very clear. Maybe we can discuss it more concretely in
specific scenarios.

>> If HUDI can keep and propagate these change flags to its consumers, we
>> can use HUDI as the unified format for the pipeline.

Regarding the "change flags" here, do you mean the flags like the one shown
in the figure below?

[image: image.png]
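
(Side note for illustration: in Flink, these change flags correspond to the
RowKind of a changelog row. Below is a minimal Java sketch of mapping textual
flags to org.apache.flink.types.RowKind, which is a real Flink 1.11+ enum; the
ChangeFlagMapper class and the string encoding of the flags are hypothetical,
purely for illustration.)

import org.apache.flink.types.RowKind;

// A minimal sketch: map a (hypothetical) textual change flag to Flink's
// changelog row kind, so downstream operators can retract/accumulate.
public final class ChangeFlagMapper {

  private ChangeFlagMapper() {
  }

  public static RowKind toRowKind(String flag) {
    switch (flag) {
      case "+I": return RowKind.INSERT;        // a new record
      case "-U": return RowKind.UPDATE_BEFORE; // the old image of an update
      case "+U": return RowKind.UPDATE_AFTER;  // the new image of an update
      case "-D": return RowKind.DELETE;        // a deletion
      default:
        throw new IllegalArgumentException("Unknown change flag: " + flag);
    }
  }
}

If HUDI persisted such flags, a Flink source could attach the mapped RowKind
to each emitted row, and the engine would propagate the retractions
automatically.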

Best,
Vino

Danny Chan  于2021年3月31日周三 下午6:24写道:

> Hi dear HUDI community ~ Here I want to start a discussion about using HUDI as
> the unified storage/format for data warehouse/lake incremental computation.
>
> Usually people divide data warehouse production into several levels, such
> as the ODS (operation data store), DWD (data warehouse details), DWS (data
> warehouse service), ADS (application data service).
>
>
> ODS -> DWD -> DWS -> ADS
>
> In the NEAR-REAL-TIME (or pure realtime) computation cases, a big topic is
> syncing the change log (CDC pattern) from all kinds of RDBMS into the
> warehouse/lake; the CDC pattern records and propagates the change flags
> (insert, update before/after, and delete) for the consumer, and with these
> flags, the downstream engines can run realtime accumulation computation.
>
> Using a streaming engine like Flink, we can have a fully NEAR-REAL-TIME
> computation pipeline for each of the layers.
>
> If HUDI can keep and propagate these change flags to its consumers, we can
> use HUDI as the unified format for the pipeline.
>
> I'm expecting your nice ideas here ~
>
> Best,
> Danny Chan
>


Re: Request to be added to Project Contributor Group

2021-03-29 Thread vino yang
Hi,

>> Just wanted to know what all tests I need to run on my side before
>> raising the PR?

It would be better to run all the test cases and make sure all of them
pass before opening a PR.

>> Is there any pipeline wherein I can test my branch directly?
Just do not skip the tests when you compile and package the project.

>> Also, do we need a docker setup for running unit tests as well?
IMO, the docker setup is only required for the integration tests.

>> I tried "mvn
clean install -DskipITs" but it got stuck trying indefinitely to connect to
localhost:54522.

You can check which process is listening on that port and then try to find a
solution.
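
(For example, on Linux/macOS, something like the following shows the
listener; illustrative only.)

lsof -nP -iTCP:54522 -sTCP:LISTEN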

Best,
Vino


Aditya Tiwari  于2021年3月29日周一 下午1:49写道:

> Thanks Vino for the approval.
>
> Just wanted to know what all tests I need to run on my side before
> raising the PR? Is there any pipeline wherein I can test my branch directly?
> Also, do we need a docker setup for running unit tests as well? I tried "mvn
> clean install -DskipITs" but it got stuck trying indefinitely to connect to
> localhost:54522.
>
> On Mon, Mar 29, 2021 at 7:53 AM vino yang  wrote:
>
> > Hi,
> >
> > I have given you Jira contributor permission.
> >
> > > Also, kindly guide me to any doc/ url for running both unit test and
> > integration tests on the project.
> > Is anything else required other than running sample queries from
> > docker_demo (https://hudi.apache.org/docs/docker_demo.html)?
> >
> > Do you have any detailed exceptions or information?
> > IMO, you need to start the docker app locally.
> > If you have any more questions, please let us know.
> >
> > Best,
> > Vino
> >
> > Aditya Tiwari  于2021年3月27日周六 下午4:11写道:
> >
> > > Hi,
> > >
> > > I would like to be added to the Project Contributor Group with
> reference
> > to
> > > issue: https://issues.apache.org/jira/browse/HUDI-1716
> > >
> > > JiraID: aditiwari (commented on the same issue)
> > >
> > > Also, kindly guide me to any doc/ url for running both unit test and
> > > integration tests on the project.
> > > Is anything else required other than running sample queries from
> > > docker_demo (https://hudi.apache.org/docs/docker_demo.html)?
> > >
> > > Thanks
> > > Aditya Tiwari
> > >
> >
>


Re: [Dev X] Azure Pipelines for CI

2021-03-28 Thread vino yang
Great job, Raymond! It will be very helpful to the community.
Best,
Vino

Sivabalan  于2021年3月29日周一 上午1:35写道:

> Awesome, Raymond. Every time we are nearing a release, we get
> blocked by Travis. This will definitely be very helpful!
>
>
> On Sun, Mar 28, 2021 at 7:29 AM Raymond Xu 
> wrote:
>
> > Hi all,
> >
> > Just want to give some updates on setting up Azure Pipelines for running
> > tests.
> >
> > We have been experimenting with Azure Pipelines to run all CI tests with
> > the goal of switching over from Travis to it. I wrote this wiki page to
> > document CI related info.
> >
> https://cwiki.apache.org/confluence/display/HUDI/Guide+on+CI+infrastructure
> >
> > 2 blockers to complete the switch-over
> > - Jobs are all queued up, likely due to a service disruption
> > . Waiting for Azure DevOps service to get
> > back to normal
> > - Some tests are failing in Azure Pipelines' environment in a previously
> > set up project. Once the Pipeline service comes back, we'll see those
> > failures in detail and fix them accordingly.
> >
> > Once migration is over, we shall have faster development cycles and
> better
> > dev experience.
> >
> > What could this enable in the near future?
> > - Nightly builds
> > - Performance benchmarking
> >
> > Thanks.
> >
> > Cheers,
> > Raymond
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: Request to be added to Project Contributor Group

2021-03-28 Thread vino yang
Hi,

I have given you Jira contributor permission.

> Also, kindly guide me to any doc/url for running both unit tests and
> integration tests on the project.
> Is anything else required other than running sample queries from
> docker_demo (https://hudi.apache.org/docs/docker_demo.html)?

Do you have any detailed exceptions or information?
IMO, you need to start the docker app locally.
If you have any more questions, please let us know.

Best,
Vino

Aditya Tiwari  于2021年3月27日周六 下午4:11写道:

> Hi,
>
> I would like to be added to the Project Contributor Group with reference to
> issue: https://issues.apache.org/jira/browse/HUDI-1716
>
> JiraID: aditiwari (commented on the same issue)
>
> Also, kindly guide me to any doc/ url for running both unit test and
> integration tests on the project.
> Is anything else required other than running sample queries from
> docker_demo (https://hudi.apache.org/docs/docker_demo.html)?
>
> Thanks
> Aditya Tiwari
>


Re: Apply contributing to Apache Hudi

2021-03-17 Thread vino yang
Hi Bing,

I have given you Jira contributor permission.

Looking forward to your contribution!

Best,
Vino

shenbingl...@163.com  于2021年3月18日周四 上午10:06写道:

> Hi,
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is ShenBingLife
>
>
>
> shenbingl...@163.com
>


Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-10 Thread vino yang
Hi,

I configured the lgtm service to let it scan my hudi repository (the mirror
of the official apache-hudi).

It found 50 alerts in the project, and I exported them into a file (SARIF
format, attached).

We can use "sarif-web-component"[1] to view it.

Generally speaking, each alert it found links to a rule detail page.[2]
However, I could not find a complete rule list.

Best,
Vino

[1]: https://microsoft.github.io/sarif-web-component/
[2]: https://lgtm.com/rules/9980075/

vino yang  于2021年3月5日周五 下午5:33写道:

> OK, let me try to know more about it and test it via one PR.
>
> nishith agarwal  于2021年3月5日周五 上午2:20写道:
>
>> I see, thanks Vino!
>>
>> "*Prevent bugs from ever making it to your project'  - *That's an
>> extremely bold statement for anyone to make :)
>>
>> Like it mentions, although it tries to reduce the false positive rate, we
>> will probably still get some noise. Can we try it with one of the PRs to
>> see its worth before adopting it?
>>
>> -Nishith
>>
>>
>> On Wed, Mar 3, 2021 at 6:23 PM vino yang  wrote:
>>
>>> Hi,
>>>
>>> It did not provide much public information, but gave a description on
>>> the official website:
>>>
>>>
>>>
>>> *“Prevent bugs from ever making it to your project by using automated
>>> reviews that let you know when your code changes would introduce alerts
>>> into your project. We support GitHub and Bitbucket. We put a large emphasis
>>> on reducing the false positive rate of our standard queries, so you won’t
>>> suffer from a torrent of uninteresting alerts every time someone submits
>>> code.”*
>>>
>>> From the official website, you can see that it supports mainstream
>>> programming languages: C/C++, C#, Go, Java, JavaScript, Python.
>>>
>>> I speculate that maybe it integrates some bug static scanning tools.
>>>
>>> Best,
>>> Vino
>>>
>>> nishith agarwal  于2021年3月4日周四 上午4:43写道:
>>>
>>>> This is a good idea @vino yang 
>>>>
>>>> Have you looked into what the "automated code review" actually does ?
>>>>
>>>> -Nishith
>>>>
>>>> On Wed, Mar 3, 2021 at 7:38 AM vino yang  wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I want to introduce a code analysis service called lgtm[1] in the
>>>>> community. Recently, in the Kylin community, I found it in my
>>>>> colleague's
>>>>> PR.[2]
>>>>>
>>>>> lgtm is a code analysis platform for finding zero-days and preventing
>>>>> critical vulnerabilities. Some features listed here (copied from its
>>>>> official website): [1]
>>>>>
>>>>>
>>>>>- Unparalleled security analysis;
>>>>>- Automated code review
>>>>>- Free for open source
>>>>>
>>>>>
>>>>> We can see that it can be integrated with Github[3] and exist in the
>>>>> form
>>>>> of a robot triggered by a git hook.[2]
>>>>>
>>>>> With the development of the community, more and more people
>>>>> participate in
>>>>> the development of the community, and the workload of the code review
>>>>> has
>>>>> become more onerous. Introducing it, we can use some of the existing
>>>>> automated scanning and analysis capabilities to make up for the lack of
>>>>> knowledge or experience of the reviewer.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> Any thoughts and opinions are welcome and appreciated!
>>>>>
>>>>> [1]: https://lgtm.com/
>>>>> [2]: https://github.com/apache/kylin/pull/1596#issuecomment-788935493
>>>>> [3]: https://github.com/marketplace/lgtm
>>>>>
>>>>> Best,
>>>>> Vino
>>>>>
>>>>


Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-05 Thread vino yang
OK, let me try to know more about it and test it via one PR.

nishith agarwal  于2021年3月5日周五 上午2:20写道:

> I see, thanks Vino!
>
> "*Prevent bugs from ever making it to your project'  - *That's an
> extremely bold statement for anyone to make :)
>
> Like it mentions, although it tries to reduce the false positive rate, we
> will probably still get some noise. Can we try it with one of the PRs to
> see its worth before adopting it?
>
> -Nishith
>
>
> On Wed, Mar 3, 2021 at 6:23 PM vino yang  wrote:
>
>> Hi,
>>
>> It did not provide much public information, but gave a description on the
>> official website:
>>
>>
>>
>> *“Prevent bugs from ever making it to your project by using automated
>> reviews that let you know when your code changes would introduce alerts
>> into your project. We support GitHub and Bitbucket. We put a large emphasis
>> on reducing the false positive rate of our standard queries, so you won’t
>> suffer from a torrent of uninteresting alerts every time someone submits
>> code.”*
>>
>> From the official website, you can see that it supports mainstream
>> programming languages: C/C++, C#, Go, Java, JavaScript, Python.
>>
>> I speculate that maybe it integrates some bug static scanning tools.
>>
>> Best,
>> Vino
>>
>> nishith agarwal  于2021年3月4日周四 上午4:43写道:
>>
>>> This is a good idea @vino yang 
>>>
>>> Have you looked into what the "automated code review" actually does ?
>>>
>>> -Nishith
>>>
>>> On Wed, Mar 3, 2021 at 7:38 AM vino yang  wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I want to introduce a code analysis service called lgtm[1] in the
>>>> community. Recently, in the Kylin community, I found it in my
>>>> colleague's
>>>> PR.[2]
>>>>
>>>> lgtm is a code analysis platform for finding zero-days and preventing
>>>> critical vulnerabilities. Some features listed here (copied from its
>>>> official website): [1]
>>>>
>>>>
>>>>- Unparalleled security analysis;
>>>>- Automated code review
>>>>- Free for open source
>>>>
>>>>
>>>> We can see that it can be integrated with Github[3] and exist in the
>>>> form
>>>> of a robot triggered by a git hook.[2]
>>>>
>>>> With the development of the community, more and more people participate
>>>> in
>>>> the development of the community, and the workload of the code review
>>>> has
>>>> become more onerous. Introducing it, we can use some of the existing
>>>> automated scanning and analysis capabilities to make up for the lack of
>>>> knowledge or experience of the reviewer.
>>>>
>>>> WDYT?
>>>>
>>>> Any thoughts and opinions are welcome and appreciated!
>>>>
>>>> [1]: https://lgtm.com/
>>>> [2]: https://github.com/apache/kylin/pull/1596#issuecomment-788935493
>>>> [3]: https://github.com/marketplace/lgtm
>>>>
>>>> Best,
>>>> Vino
>>>>
>>>


Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-03 Thread vino yang
Hi,

It did not provide much public information, but gave a description on the
official website:



*“Prevent bugs from ever making it to your project by using automated
reviews that let you know when your code changes would introduce alerts
into your project. We support GitHub and Bitbucket. We put a large emphasis
on reducing the false positive rate of our standard queries, so you won’t
suffer from a torrent of uninteresting alerts every time someone submits
code.”*

From the official website, you can see that it supports mainstream
programming languages: C/C++, C#, Go, Java, JavaScript, Python.

I suspect it integrates some static bug-scanning tools.

Best,
Vino

nishith agarwal  于2021年3月4日周四 上午4:43写道:

> This is a good idea @vino yang 
>
> Have you looked into what the "automated code review" actually does ?
>
> -Nishith
>
> On Wed, Mar 3, 2021 at 7:38 AM vino yang  wrote:
>
>> Hi guys,
>>
>> I want to introduce a code analysis service called lgtm[1] in the
>> community. Recently, in the Kylin community, I found it in my colleague's
>> PR.[2]
>>
>> lgtm is a code analysis platform for finding zero-days and preventing
>> critical vulnerabilities. Some features listed here (copied from its
>> official website): [1]
>>
>>
>>- Unparalleled security analysis;
>>- Automated code review
>>- Free for open source
>>
>>
>> We can see that it can be integrated with Github[3] and exist in the form
>> of a robot triggered by a git hook.[2]
>>
>> With the development of the community, more and more people participate in
>> the development of the community, and the workload of the code review has
>> become more onerous. Introducing it, we can use some of the existing
>> automated scanning and analysis capabilities to make up for the lack of
>> knowledge or experience of the reviewer.
>>
>> WDYT?
>>
>> Any thoughts and opinions are welcome and appreciated!
>>
>> [1]: https://lgtm.com/
>> [2]: https://github.com/apache/kylin/pull/1596#issuecomment-788935493
>> [3]: https://github.com/marketplace/lgtm
>>
>> Best,
>> Vino
>>
>


[DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-03 Thread vino yang
Hi guys,

I want to introduce a code analysis service called lgtm[1] to the
community. I recently came across it in a colleague's PR in the Kylin
community.[2]

lgtm is a code analysis platform for finding zero-days and preventing
critical vulnerabilities. Some features listed here (copied from its
official website): [1]


   - Unparalleled security analysis;
   - Automated code review
   - Free for open source


We can see that it can be integrated with GitHub[3] and runs as a bot
triggered by a git hook.[2]

As the community grows, more and more people participate in its
development, and the code review workload has become more onerous. By
introducing lgtm, we can use its existing automated scanning and analysis
capabilities to compensate for gaps in a reviewer's knowledge or
experience.

WDYT?

Any thoughts and opinions are welcome and appreciated!

[1]: https://lgtm.com/
[2]: https://github.com/apache/kylin/pull/1596#issuecomment-788935493
[3]: https://github.com/marketplace/lgtm

Best,
Vino


Re: Request Access to Apache Hudi Jira

2021-03-02 Thread vino yang
Hi,

I have given you Jira contributor permission. Looking forward to your
contribution.

Best,
Vino

陶克路  于2021年3月1日周一 下午11:06写道:

> Hi,
>
> I want to contribute to Apache Hudi.
> Would you please give me the contributor permission?
> My JIRA ID is legendtkl.
>
> --
>
> Hello, Find me here: www.legendtkl.com.
>


Re: [DISCUSS] Rethink the abstraction of current client

2021-02-02 Thread vino yang
Hi,

> I think the proposed interfaces indeed look more intuitive and could
> simplify the code structures. My concern is mostly around the ROI of such
> refactoring work. Probably I lack some direct involvement in the flink
> client work but it looks like it's mainly about code restructuring and
> simplification for a new engine implementation?

My original intention for this proposal is as you said: refactoring the
code abstraction and simplifying the client implementation. But Danny also
has an idea to redesign the abstraction around the DataFlow model. It
depends on whether we want to solve all the problems in one shot; maybe
splitting the work into multiple steps would keep us focused.

Regarding ROI, I explained it in the original proposal. The current
abstraction is too deep and the class count has expanded seriously. Any
new engine implementation will create a dozen new classes. When we make
adjustments in the Spark write client (such as adding a new feature or
fixing a bug), the other engines have to absorb the change at a high cost.

> Speaking of simplifying the client implementation, I wonder about the
> possibility of removing the table type concept, i.e., making COW/MOR
> tables the same thing by configuring each insert/upsert operation

About this idea, I will let @Vinoth Chandar  chime in.

Best,
Vino

Raymond Xu  于2021年1月21日周四 上午11:04写道:

> I think the proposed interfaces indeed look more intuitive and could
> simplify the code structures. My concern is mostly around the ROI of such
> refactoring work. Probably I lack some direct involvement in the flink
> client work but it looks like it's mainly about code restructuring and
> simplification for a new engine implementation?
>
> Speaking of simplifying the client implementation, I wonder about the
> possibility of removing the table type concept, i.e., making COW/MOR
> tables the same thing by configuring each insert/upsert operation
> - COW table should be ok to take in a new delta commit by just taking log
> files alongside with base files
> - MOR table should be ok to do a one-time compaction for all log files and
> the incoming records
> This also looks like big refactoring work, so I am also concerned about the ROI.
> - Benefits I see: unifying the concepts for Hudi as a table format, less
> classes to implement for clients, more flexibility in writing
> - Some downsides: too much code change; MOR->COW can be too expensive (skip
> this case maybe?)
>
> Just thinking if we do carry out the client abstraction work, could this
> table type simplification also be done at the same time?
>
> On Tue, Jan 19, 2021 at 1:38 AM vino yang  wrote:
>
> > Hi guys,
> >
> > *I open this thread to discuss if we can separate the attributes and
> > behaviors of HoodieTable, and rethink the abstraction of the client.*
> >
> > Currently, in the hudi-client-common module, there is a HoodieTable
> class,
> > which contains a set of attributes and behaviors. For different engines,
> it
> > has different implementations. The existing classes include:
> >
> >- HoodieSparkTable;
> >- HoodieFlinkTable;
> >- HoodieJavaTable;
> >
> > In addition, for two different table types: COW and MOR, these classes
> are
> > further split. For example, HoodieSparkTable is split into:
> >
> >- HoodieSparkCopyOnWriteTable;
> >- HoodieSparkMergeOnReadTable;
> >
> > HoodieSparkTable degenerates into a factory to initialize these classes.
> >
> > This model looks clear but brings some problems.
> >
> > First of all, HoodieTable is a mixture of attributes and behaviors. The
> > attributes are independent of the engines, but the behavior varies
> > depending on the engine. Semantically speaking, HoodieTable should belong
> > to hudi-common, and should not only be associated with
> hudi-client-common.
> >
> > Second, the behaviors contained in HoodieTable, such as:
> >
> >- upsert
> >- insert
> >- delete
> >- insertOverwrite
> >
> > They are similar to the APIs provided by the client, but it is not
> > implemented directly in HoodieTable. Instead, the implementation is
> handed
> > over to a bunch of actions (executors), such as:
> >
> >- commit
> >- compact
> >- clean
> >- rollback
> >
> > In addition, these actions do not completely contain the implementation
> > logic. Part of their logic is separated into some Helper classes under
> the
> > same package, such as:
> >
> >- SparkWriteHelper
> >- SparkMergeHelper
> >- SparkDeleteHelper
> >
> > To sum up, for abstraction, the implementation is moved backward layer by
> > layer (m

Re: Apply permission

2021-02-01 Thread vino yang
Hi feifei,

I have given you contributor permission. Welcome and looking forward to
your contribution.

Best,
Vino

黄飞飞  于2021年2月1日周一 下午5:56写道:

> *Hi,*
>
> *I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is huangfeifei.*
>


Re: Congrats to our newest committers!

2021-01-27 Thread vino yang
Congrats to both of them!
Well deserved!

Best,
Vino

Trevor Zhang  于2021年1月27日周三 下午7:20写道:

> Congratulations to  Wang Xianghu and  Li Wei.
>
> Best ,
>
> Trevor
>
> leesf  于2021年1月27日周三 下午7:16写道:
>
> > Hi all,
> >
> > I am very happy to announce our newest committers.
> >
> > Wang Xianghu: Xianghu has done a great job in decoupling hudi from spark
> > and implemented the first version of flink support, and contributed bug
> > fixes; also
> > he is very active in answering user questions in the China WeChat group.
> >
> > Li Wei: Liwei has also done a great job in driving major features like
> > RFC-19 together with Satish, and has contributed many features and bug
> > fixes in core modules.
> >
> > Please join me in congratulating them!
> >
> > Thanks,
> > Leesf
> >
>


Re: [VOTE] Release 0.7.0, release candidate #2

2021-01-24 Thread vino yang
+1 binding

- ran `mvn clean package -DskipTests` [OK]
- ran QuickStart [OK]
- checked signature and checksum [OK]
- tested flink write client in local [OK]

Best,
Vino

leesf  于2021年1月24日周日 下午8:30写道:

> +1 binding
>
> - Build successful
> - Ran quickstart successful.
> - Additional manual testing with and without Metadata based listing enabled
> for COW and MOR table against aliyun OSS.
>
> Sivabalan  于2021年1月23日周六 下午9:55写道:
>
> > Got it, I didn't do -1, but just wanted to remind you, so that you don't
> > miss it when you redo the steps to promote the final release.
> >
> > +1 binding.
> > But do ensure when you release, the staged repo (promoted candidate) has
> > only one set of artifacts and it's a new repo.
> >
> >
> > On Sat, Jan 23, 2021 at 2:03 AM nishith agarwal 
> > wrote:
> >
> > > +1 binding
> > >
> > > - Build Successful
> > > - Release validation script Successful
> > > - Quick start runs Successfully
> > >
> > > Checking Checksum of Source Release
> > > Checksum Check of Source Release - [OK]
> > >
> > >   % Total% Received % Xferd  Average Speed   TimeTime Time
> > >  Current
> > >  Dload  Upload   Total   SpentLeft
> > >  Speed
> > > 100 34972  100 349720 0  96076  0 --:--:-- --:--:--
> --:--:--
> > > 96076
> > > Checking Signature
> > > Signature Check - [OK]
> > >
> > > Checking for binary files in source release
> > > No Binary Files in Source Release? - [OK]
> > >
> > > Checking for DISCLAIMER
> > > DISCLAIMER file exists ? [OK]
> > >
> > > Checking for LICENSE and NOTICE
> > > License file exists ? [OK]
> > > Notice file exists ? [OK]
> > >
> > > Performing custom Licensing Check
> > > Licensing Check Passed [OK]
> > >
> > > Running RAT Check
> > > RAT Check Passed [OK]
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Fri, Jan 22, 2021 at 9:28 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Thanks Siva! I am not sure if that's a required aspect for the binding
> > > > vote.
> > > > It's a minor aspect that does not interfere with testing/validation in
> > > > any way. The actual release artifact needs to be rebuilt and repushed
> > > > anyway
> > > > from a separate repo. Like I noted, I found the wiki instructions a bit
> > > > ambiguous and I intend to make them clearer going forward so we can
> > > > avoid
> > > > this in the future.
> > > >
> > > > I request everyone to consider this explanation, when casting your
> > vote.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > >
> > > > On Fri, Jan 22, 2021 at 8:35 PM Sivabalan 
> wrote:
> > > >
> > > > > - checksums and signatures [OK]
> > > > > - successfully built [OK]
> > > > > - ran quick start guide [OK]
> > > > > - Ran release validation guide [OK]
> > > > > - Ran test suite job w/ inserts, upserts, deletes and
> > validation(spark
> > > > sql
> > > > > and hive). Also same job w/ metadata enabled as well [OK]
> > > > >
> > > > > - Artifacts in staging repo : should be in separate repo where only
> > rc2
> > > > is
> > > > > present. Right now, I see both rc1 and rc2 are present in the same
> > > repo.
> > > > >
> > > > > Will add my binding vote once artifacts are fixed.
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 22, 2021 at 9:17 PM Udit Mehrotra 
> > > wrote:
> > > > >
> > > > > > +1
> > > > > > - Build successful
> > > > > > - Ran quickstart against S3
> > > > > > - Additional manual tests with MOR
> > > > > > - Additional manual testing with and without Metadata based
> listing
> > > > > enabled
> > > > > > - Release validation script successful
> > > > > >
> > > > > > Validating hudi-0.7.0-rc2 with release type "dev"
> > > > > > Checking Checksum of Source Release
> > > > > > -e Checksum Check of Source Release - [OK]
> > > > > >
> > > > > >   % Total% Received % Xferd  Average Speed   TimeTime
> > >  Time
> > > > > >  Current
> > > > > >  Dload  Upload   Total   Spent
> > > Left
> > > > > >  Speed
> > > > > > 100 34972  100 349720 0  70937  0 --:--:-- --:--:--
> > > > --:--:--
> > > > > > 70793
> > > > > > Checking Signature
> > > > > > -e Signature Check - [OK]
> > > > > >
> > > > > > Checking for binary files in source release
> > > > > > -e No Binary Files in Source Release? - [OK]
> > > > > >
> > > > > > Checking for DISCLAIMER
> > > > > > -e DISCLAIMER file exists ? [OK]
> > > > > >
> > > > > > Checking for LICENSE and NOTICE
> > > > > > -e License file exists ? [OK]
> > > > > > -e Notice file exists ? [OK]
> > > > > >
> > > > > > Performing custom Licensing Check
> > > > > > -e Licensing Check Passed [OK]
> > > > > >
> > > > > > Running RAT Check
> > > > > > -e RAT Check Passed [OK]
> > > > > >
> > > > > > Thanks,
> > > > > > Udit
> > > > > >
> > > > > > On Fri, Jan 22, 2021 at 12:41 PM Vinoth Chandar <
> vin...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > Please review and vote on the release candidate #2 for the
> > version
> > > > > 0.7.0,
> > > > > > > as fo

Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite

2021-01-21 Thread vino yang
Hi zhiwei,

Done! Now, you should have cwiki permission.

Best,
Vino

pzwpzw  于2021年1月22日周五 上午12:06写道:

> That is great! Can you give me permission to the cwiki? My cwiki id
> is: zhiwei.
> I will move it there and continue the discussion.
>
> 2021年1月21日 下午11:19,Gary Li  写道:
>
> Hi pengzhiwei,
>
> Thanks for the proposal. That’s a great feature. Can we move the design
> doc to cwiki page as a new RFC? We can continue the discussion from there.
>
> Thanks,
>
> Best Regards,
> Gary Li
>
>
> From: pzwpzw 
> Reply-To: "dev@hudi.apache.org" 
> Date: Wednesday, January 20, 2021 at 11:52 PM
> To: "dev@hudi.apache.org" 
> Cc: "dev@hudi.apache.org" 
> Subject: Re: Reply:Re: [DISCUSS] SQL Support using Apache Calcite
>
> Hi, we have implemented the spark sql extension for hudi in our internal
> version. Here is the main design, including the extended sql syntax and the
> implementation scheme on spark. I look forward to your feedback.
> Any comments are welcome~
>
>
> https://docs.google.com/document/d/1KC6Rae67CUaCUpKoIAkM6OTAGuOWFPD9qtNfqchAl1o/edit#heading=h.oeoy1y14sifu
>
>
> 2020年12月23日 上午12:30,Vinoth Chandar  写道:
> Sounds great. There will be an RFC/DISCUSS thread once 0.7.0 is out, I think.
> Love to have you involved.
>
> On Tue, Dec 22, 2020 at 3:20 AM pzwpzw 
> wrote:
>
>
> Yes, it looks good.
> We are building the spark sql extensions to support hudi in
> our internal version.
> I am interested in participating in the extension of SparkSQL on hudi.
> 2020年12月22日 下午4:30,Vinoth Chandar  写道:
>
> Hi,
>
> I think what we are landing on finally is.
>
> - Keep pushing for SparkSQL support using Spark extensions route
> - Calcite effort will be a separate/orthogonal approach, down the line
>
> Please feel free to correct me, if I got this wrong.
>
> On Mon, Dec 21, 2020 at 3:30 AM pzwpzw 
> wrote:
>
> Hi 受春柏, here is my point. We can use Calcite to build a common sql layer
> to process engine-independent SQL, for example most of the DDL and Hoodie CLI
> commands, and also provide a parser for the common SQL extensions (e.g. Merge
> Into). The engine-related syntax can be handed to the respective engines to
> process. If the common sql layer can handle the input sql, it handles it;
> otherwise the sql is routed to the engine for processing. In the long term,
> the common layer will become richer and more complete.
>
> 2020年12月21日 下午4:38,受春柏  写道:
>
>
> Hi all,
>
> That's very good. Hudi SQL syntax can support Flink, Hive and other analysis
> components at the same time.
> But there are some questions about SparkSQL. SparkSQL syntax is in
> conflict with Calcite syntax. Is our strategy
> user migration or syntax compatibility?
> In addition, will it also support write SQL?
> conflict with Calctite syntax.Is our strategy
>
> user migration or syntax compatibility?
>
> In addition ,will it also support write SQL?
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 在 2020-12-19 02:10:16,"Nishith"  写道:
>
>
> That’s awesome. Looks like we have a consensus on Calcite. Look forward to
>
> the RFC as well!
>
>
>
> -Nishith
>
>
>
> On Dec 18, 2020, at 9:03 AM, Vinoth Chandar  wrote:
>
>
>
> Sounds good. Look forward to a RFC/DISCUSS thread.
>
>
>
> Thanks
>
>
> Vinoth
>
>
>
> On Thu, Dec 17, 2020 at 6:04 PM Danny Chan  wrote:
>
>
>
> Yes, Apache Flink basically reuses the DQL syntax of Apache Calcite; I would
>
> add support for SQL connectors of Hoodie Flink soon ~
>
>
> Currently, I'm preparing a refactoring of the current Flink writer code.
>
>
>
> Vinoth Chandar  于2020年12月18日周五 上午6:39写道:
>
>
>
> Thanks Kabeer for the note on gmail. Did not realize that. :)
>
>
>
> My desired use case is user use the Hoodie CLI to execute these SQLs.
>
>
> They can choose what engine to use by a CLI config option.
>
>
>
> Yes, that is also another attractive aspect of this route. We can build
>
>
> out
>
>
> a common SQL layer and have this translate to the underlying engine
>
>
> (sounds
>
>
> like Hive huh)
>
>
> Longer term, if we really think we can more easily implement a full DML +
>
>
> DDL + DQL, we can proceed with this.
>
>
>
> As others pointed out, for Spark SQL, it might be good to try the Spark
>
>
> extensions route, before we take this on more fully.
>
>
>
> The other part where Calcite is great is, all the support for
>
>
> windowing/streaming in its syntax.
>
>
> Danny, I guess if we should be able to leverage that through a deeper
>
>
> Flink/Hudi integration?
>
>
>
>
> On Thu, Dec 17, 2020 at 1:07 PM Vinoth Chandar 
>
>
> wrote:
>
>
>
> I think Dongwook is investigating on the same lines. and it does seem
>
>
> better to pursue this first, before trying other approaches.
>
>
>
>
>
> On Tue, Dec 15, 2020 at 1:38 AM pzwpzw 
>
> .invalid>
>
>
> wrote:
>
>
>
> Yeah I agree with Nishith that an option way is to look at the
>
>
> ways
>
>
> to
>
>
> plug in custom logical and physical plans in Spark. It can simplify
>
>
> the
>
>
> implementation and reuse the Spark SQL syntax. And also users
>
>
> familiar
>
>
> with
>
>
> Spark SQL will be able to use HUDi's SQL features more quickly.
>
>
> In fact, spark have provided the SparkSessionExtensions int

Re: [DISCUSS] Rethink the abstraction of current client

2021-01-19 Thread vino yang
>> For the Spark client, it is true because, whether the Spark batch or Spark
>> streaming engine, they write as batches; but things are different for pure
>> streaming engines like Flink: Flink writes per record and does not
>> accumulate buffers.

Yes, what I mean about the "batch" is not about the behavior or mechanism
of the write operation, but for a transaction that involves processing a
batch of data.

Anyway, IMO, we both understand each other.

>> So I suggest:
>> 1. The new operation supports writing as per-record but not batches.
>> 2. The new operation should expose plugins like indexing/partitioning out
>> so that the engine can control it freely.
>> 3. The new operation should expose the write handle (or file handle), so
>> the engine can roll over when it is necessary.

Since you mentioned the behavior of write operations: yes, we can provide a
better design so that the write operations of the different computing
engines work well.

My original intention in starting this discussion is that I think the
current client implementation is a bit too "bloated". But as I said, when
we think about this in depth, we will find that we need to focus on many
details. Perhaps in the process, we can also consider refactoring some core
components (as you mentioned).

Best,
Vino

Danny Chan  于2021年1月20日周三 上午10:25写道:

> > It contains three components:
>
>- Two objects: a table, a batch of records;
>
> For the Spark client, it is true because, whether the Spark batch or Spark
> streaming engine, they write as batches; but things are different for pure
> streaming engines like Flink: Flink writes per record and does not
> accumulate buffers.
>
> The current Spark client also encapsulates logic like the *index* and the
> bucketing/partitioning strategies; both are heavy, resource-consuming
> operations.
> That may be easy for Spark because, with a SparkEngineContext there, the
> pipeline can fire a new sub-pipeline whenever it wishes. Things are
> different for Flink: the Flink engine cannot fire sub-pipelines; all the
> pipelines are fixed into one, and once it fires, it runs there forever.
>
> The client also encapsulates the file rollover action, which makes it hard
> for Flink to tie flushing to checkpoints for exactly-once semantics.
>
> So I suggest:
> 1. The new operation supports writing as per-record but not batches.
> 2. The new operation should expose plugins like indexing/partitioning out
> so that the engine can control it freely.
> 3. The new operation should expose the write handle (or file handle), so
> the engine can roll over when it is necessary.
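
(Illustrative only: a per-record write operation that exposes its handle, in
the spirit of suggestions 1 and 3 above, might be shaped like the Java below.
The names are hypothetical, not an agreed Hudi API.)

import java.io.IOException;

interface RecordWriteOperation<R> {

  // Write a single record through the currently open file handle.
  void write(R record) throws IOException;

  // Flush and roll over to a new file; a Flink operator could call this on
  // checkpoint boundaries to align file boundaries with exactly-once state.
  void rollOver() throws IOException;

  // Close the operation, flushing any open handle.
  void close() throws IOException;
}
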
>
>
> vino yang  于2021年1月19日周二 下午5:39写道:
>
> > Hi guys,
> >
> > *I open this thread to discuss if we can separate the attributes and
> > behaviors of HoodieTable, and rethink the abstraction of the client.*
> >
> > Currently, in the hudi-client-common module, there is a HoodieTable
> class,
> > which contains a set of attributes and behaviors. For different engines,
> it
> > has different implementations. The existing classes include:
> >
> >- HoodieSparkTable;
> >- HoodieFlinkTable;
> >- HoodieJavaTable;
> >
> > In addition, for two different table types: COW and MOR, these classes
> are
> > further split. For example, HoodieSparkTable is split into:
> >
> >- HoodieSparkCopyOnWriteTable;
> >- HoodieSparkMergeOnReadTable;
> >
> > HoodieSparkTable degenerates into a factory to initialize these classes.
> >
> > This model looks clear but brings some problems.
> >
> > First of all, HoodieTable is a mixture of attributes and behaviors. The
> > attributes are independent of the engines, but the behavior varies
> > depending on the engine. Semantically speaking, HoodieTable should belong
> > to hudi-common, and should not only be associated with
> hudi-client-common.
> >
> > Second, the behaviors contained in HoodieTable, such as:
> >
> >- upsert
> >- insert
> >- delete
> >- insertOverwrite
> >
> > They are similar to the APIs provided by the client, but it is not
> > implemented directly in HoodieTable. Instead, the implementation is
> handed
> > over to a bunch of actions (executors), such as:
> >
> >- commit
> >- compact
> >- clean
> >- rollback
> >
> > In addition, these actions do not completely contain the implementation
> > logic. Part of their logic is separated into some Helper classes under
> the
> > same package, such as:
> >
> >- SparkWriteHelper
> >- SparkMergeHelper
> >- SparkDeleteHelper
> >
> > To sum up, for abstraction, the implementation is moved ba

[DISCUSS] Rethink the abstraction of current client

2021-01-19 Thread vino yang
Hi guys,

*I open this thread to discuss if we can separate the attributes and
behaviors of HoodieTable, and rethink the abstraction of the client.*

Currently, in the hudi-client-common module, there is a HoodieTable class,
which contains a set of attributes and behaviors. For different engines, it
has different implementations. The existing classes include:

   - HoodieSparkTable;
   - HoodieFlinkTable;
   - HoodieJavaTable;

In addition, for two different table types: COW and MOR, these classes are
further split. For example, HoodieSparkTable is split into:

   - HoodieSparkCopyOnWriteTable;
   - HoodieSparkMergeOnReadTable;

HoodieSparkTable degenerates into a factory to initialize these classes.

This model looks clear but brings some problems.

First of all, HoodieTable is a mixture of attributes and behaviors. The
attributes are independent of the engines, but the behavior varies
depending on the engine. Semantically speaking, HoodieTable should belong
to hudi-common, and should not only be associated with hudi-client-common.

Second, the behaviors contained in HoodieTable, such as:

   - upsert
   - insert
   - delete
   - insertOverwrite

They are similar to the APIs provided by the client, but it is not
implemented directly in HoodieTable. Instead, the implementation is handed
over to a bunch of actions (executors), such as:

   - commit
   - compact
   - clean
   - rollback

In addition, these actions do not completely contain the implementation
logic. Part of their logic is separated into some Helper classes under the
same package, such as:

   - SparkWriteHelper
   - SparkMergeHelper
   - SparkDeleteHelper

To sum up, for the sake of abstraction, the implementation is pushed back
layer by layer (mainly into the executor + helper classes), which forces
each client to implement the basic API through many classes with similar
patterns, so the class count expands seriously.

Let us reorganize it:

What a write client does is to insert or upsert a batch of records to a
table with transaction semantics, and provide some additional operations to
the table. It contains three components:

   - Two objects: a table, a batch of records;
   - One type of operation: insert or upsert (focus on records)
   - One type of additional operation: compact / clean (focus on the table
   itself)

Therefore, the following improvements are proposed here:

   - The table object does not contain behavior, the table should be public
   and engine independent;
   - Classify and abstract the operation behavior:
  - TableInsertOperation(interface)
  - TableUpsertOperation(interface)
  - TableTransactionOperation
  - TableManageOperation(compact/clean…)

This kind of abstraction is more intuitive and focused, so that there is
only one point of materialization. For example, the Spark engine's insert
operation will produce the following concrete implementation classes:

   - CoWTableSparkInsertOperation;
   - MoRTableSparkInsertOperation;

Of course, we can provide a factory class named TableSparkInsertOperation,
which is optional.

Based on the new abstraction, a new engine only needs to reimplement the
interfaces of the above behaviors, and then provide a new client to
instantiate them.
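
To make the shape of this concrete, here is a minimal Java sketch of the
proposed interfaces. The interface and class names come from this proposal;
the generic parameters and method signatures are illustrative assumptions,
not an agreed API:

// T = table representation, I = input records, O = write result; the
// concrete bindings are chosen by the engine, not by the abstraction.
interface TableInsertOperation<T, I, O> {
  O insert(T table, String instantTime, I records);
}

interface TableUpsertOperation<T, I, O> {
  O upsert(T table, String instantTime, I records);
}

// Additional operations that focus on the table itself.
interface TableManageOperation<T> {
  void compact(T table, String compactionInstantTime);
  void clean(T table, String cleanInstantTime);
}

A Spark COW insert would then be one focused class, e.g.
CoWTableSparkInsertOperation implementing TableInsertOperation with
JavaRDD-based record and result types, while a Flink engine binds its own
collection types to the same interfaces.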

In order to focus here, I deliberately ignored an important object: the
index. The index should also be in the hudi-common module, and its
implementation may be engine-related, providing acceleration capabilities
for writing and querying at the same time.

The above is just a preliminary idea, there are still many details that
have not been considered. I hope to hear your thoughts on this.

Any opinions and thoughts are appreciated and welcome.

Best,
Vino


Re: [DISCUSS] Support multiple ordering fields

2021-01-18 Thread vino yang
+1,

We have found that such flexibility is needed in some scenarios.

Best,
Vino

Raymond Xu  于2021年1月17日周日 上午3:38写道:

> Just want to discuss a small improvement for setting ordering fields.
> For
> - property `hoodie.payload.ordering.field` and
> - deltastreamer --source-ordering-field
> I think it can be useful to support multiple fields (configured via a
> comma-separated list) to determine the order in some cases. This would need
> another config to set the Comparable implementation, say
> hoodie.payload.ordering.comparable.class to allow custom logic for doing
> comparison.
>
> Any suggestions? Thanks.
>


Re: Re: [DISCUSS] New Flink Writer Proposal

2021-01-07 Thread vino yang
+1 on Gary's opinion,

Yes, the public APIs that come from AbstractHoodieWriteClient should be
reusable.

We could try to make the HoodieFlinkWriteClient a common implementation.

IIUC, there is a mapping like this:

SparkRDDWriteClient -> HoodieFlinkWriteClient
HoodieDeltaStreamer -> HoodieFlinkStreamer (it could be multiple?)

Actually, the divergence between me and Danny is whether we need one
HoodieFlinkStreamer or two.

We can maintain one or two, although we are both trying to find a good way
to maintain a single app (entry point).

Correct me if I am wrong.

Best,
Vino


Gary Li  于2021年1月7日周四 下午4:31写道:

> Hi all,
>
> IIUC the current flink writer is like an app, just like the delta
> streamer. If we want to build another Flink writer, we can still share the
> same flink client right? Does the flink client also have to use the new
> feature only available on Flink 1.12?
>
> Thanks,
> Gary Li
> 
> From: Danny Chan 
> Sent: Thursday, January 7, 2021 10:19 AM
> To: dev@hudi.apache.org 
> Subject: Re: Re: [DISCUSS] New Flink Writer Proposal
>
> Thanks vino yang ~
>
> IMO, we should not put too much energy into the current Flink writer;
> it is not production-ready in the long run. There are so many features that
> need to be added/supported for the Flink write/read (MOR write, COW read,
> MOR read, the new index) that we should focus on one version first and make
> it robust.
>
> I really hope that we can work together to make the writer production-ready
> as soon as possible; we are in a competitive space, with competitors like
> Apache Iceberg and Delta Lake, so from this perspective, there is no benefit
> in being compatible with the current version of the writer.
>
> My idea is that I propose the new infrastructure first as quickly as
> possible (the basic pipeline, the test framework), and then we can work
> together on the new version (MOR write, COW read, MOR read, the new
> index); we had better not get distracted by promoting the old writer.
>
> What do you think?
>
> vino yang  于2021年1月6日周三 下午2:14写道:
>
> > Hi Danny,
> >
> > As we discussed in the doc, we should agree on if we should be compatible
> > with the version less than Flink 1.11/1.12.
> >
> > We all know that there are some bottlenecks in the current plan. You
> > proposed some improvements, yes it is great, but it radically uses the
> > newer features provided by Flink. It is a pity that some users of old
> > versions of Flink have no way to benefit from these features.
> >
> > The information I can provide is that some users have already used the
> > current Flink write client or its improved version in a production
> > environment. For example, SF Technology, and the Flink versions they use
> > are 1.8.x and 1.10.x.
> >
> > Therefore, I personally suggest that there are two options:
> >
> > 1) The new design takes into account users of lower versions as much as
> > possible and maintains a client version;
> > 2) The new design is based on the features of the new version and evolves
> > separately from the old version(we also have a plan to optimize the
> current
> > implementation), but the public abstraction can be reused. I think it is
> > not impossible to maintain multiple versions. Flink used to support 4+
> > versions (0.8.2, 0.9, 0.10, 0.11, universal connector) for Kafka
> Connector,
> > but they share the same code base.
> >
> > Any thoughts and opinions are welcome and appreciated.
> >
> > Best,
> > Vino
> >
> > vino yang  于2021年1月6日周三 下午1:37写道:
> >
> > > Hi Danny,
> > >
> > > You should have cwiki edit permission now.
> > > Any problems let me know.
> > >
> > > Best,
> > > Vino
> > >
> > > Danny Chan  于2021年1月6日周三 下午12:05写道:
> > >
> > >> Sorry ~
> > >>
> > >> Forget to say that my Confluence ID is danny0405.
> > >>
> > >> It would be nice if any of you can help on this.
> > >>
> > >> Best,
> > >> Danny Chan
> > >>
> > >> Danny Chan  于2021年1月6日周三 下午12:00写道:
> > >>
> > >> > Hi, can someone give me the CWIKI permission so that i can update
> the
> > >> > design details to that (maybe as a new RFC though ~).
> > >> >
> > >> > wangxianghu  于2021年1月5日周二 下午2:43写道:
> > >> >
> > >> >> + 1, Thanks Danny!
> > >> >> I believe this new feature OperatorConrdinator in flink-1.11 will
> > help
> > >> >> improve the current implementation
> > >> >>

Re: Apply For Developer Permission

2021-01-06 Thread vino yang
Hi Nieal,

I have given you contributor permission.

Welcome and look forward to your contribution!

Best,
Vino

杨翔  于2021年1月6日周三 下午11:29写道:

> Hi,
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is Nieal-Yang.
> 杨翔
> Email: csu_y...@126.com
>
> (Signature customized by NetEase Mail Master)


Re: Re: [DISCUSS] New Flink Writer Proposal

2021-01-05 Thread vino yang
Hi Danny,

As we discussed in the doc, we should agree on if we should be compatible
with the version less than Flink 1.11/1.12.

We all know that there are some bottlenecks in the current plan. You
proposed some improvements, and yes, they are great, but they depend heavily
on the newer features provided by Flink. It is a pity that users on old
versions of Flink would have no way to benefit from these features.

The information I can provide is that some users have already used the
current Flink write client or its improved version in a production
environment. For example, SF Technology, and the Flink versions they use
are 1.8.x and 1.10.x.

Therefore, I personally suggest that there are two options:

1) The new design takes users of lower versions into account as much as
possible and maintains a single client version;
2) The new design is based on the features of the new version and evolves
separately from the old version (we also have a plan to optimize the current
implementation), but the public abstraction can be reused. I think it is
not impossible to maintain multiple versions. Flink used to support 4+
versions (0.8.2, 0.9, 0.10, 0.11, universal connector) of the Kafka
connector, but they share the same code base.

Any thoughts and opinions are welcome and appreciated.

Best,
Vino

vino yang  于2021年1月6日周三 下午1:37写道:

> Hi Danny,
>
> You should have cwiki edit permission now.
> Any problems let me know.
>
> Best,
> Vino
>
> Danny Chan  wrote on 2021-01-06 (Wed) 12:05:
>
>> Sorry ~
>>
>> Forgot to say that my Confluence ID is danny0405.
>>
>> It would be nice if any of you could help with this.
>>
>> Best,
>> Danny Chan
>>
>> Danny Chan  wrote on 2021-01-06 (Wed) 12:00:
>>
>> > Hi, can someone give me the CWIKI permission so that I can update the
>> > design details there (maybe as a new RFC though ~).
>> >
>> > wangxianghu  wrote on 2021-01-05 (Tue) 14:43:
>> >
>> >> + 1, Thanks Danny!
>> >> I believe this new feature, OperatorCoordinator in Flink 1.11, will help
>> >> improve the current implementation
>> >>
>> >> Best,
>> >>
>> >> XianghuWang
>> >>
>> >> At 2021-01-05 14:17:37, "vino yang"  wrote:
>> >> >Hi,
>> >> >
>> >> >Sharing more details: the OperatorCoordinator is part of the new Data
>> >> >Source API (Beta) covered in the Flink 1.11 release notes [1].
>> >> >
>> >> >Flink 1.11 was released only about half a year ago. The design of
>> RFC-13
>> >> >began at the end of 2019, and most of the implementation was completed
>> >> when
>> >> >Flink 1.11 was released.
>> >> >
>> >> >I believe that the production environment of many large companies has
>> not
>> >> >been upgraded so quickly (As far as our company is concerned, we still
>> >> have
>> >> >some jobs running on flink release packages below 1.9).
>> >> >
>> >> >So, maybe we need to find a mechanism to benefit both new and old
>> users.
>> >> >
>> >> >[1]:
>> >> >
>> >>
>> https://flink.apache.org/news/2020/07/06/release-1.11.0.html#new-data-source-api-beta
>> >> >
>> >> >Best,
>> >> >Vino
>> >> >
>> >> >vino yang  wrote on 2021-01-05 (Tue) 12:30:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> +1, thank you Danny for introducing this new Flink feature
>> >> >> (OperatorCoordinator) [1] from the latest release.
>> >> >> This feature is very helpful for improving the implementation
>> >> >> mechanism of the Flink write client.
>> >> >>
>> >> >> But this feature is only available from Flink 1.11 onwards. Before
>> >> >> that, there was no good way to coordinate upstream and downstream
>> >> >> tasks through the public API provided by Flink.
>> >> >> My only concern is whether we need to take into account users of
>> >> >> earlier versions (below Flink 1.11).
>> >> >>
>> >> >> [1]: https://issues.apache.org/jira/browse/FLINK-15099
>> >> >>
>> >> >> Best,
>> >> >> Vino
>> >> >>
>> >> >> Gary Li  wrote on 2021-01-05 (Tue) 10:40:
>> >> >>
>> >> >>> Hi Danny,
>> >>

Re: Re: [DISCUSS] New Flink Writer Proposal

2021-01-05 Thread vino yang
Hi Danny,

You should have cwiki edit permission now.
Any problems let me know.

Best,
Vino

Danny Chan  wrote on 2021-01-06 (Wed) 12:05:

> Sorry ~
>
> Forgot to say that my Confluence ID is danny0405.
>
> It would be nice if any of you could help with this.
>
> Best,
> Danny Chan
>
> Danny Chan  wrote on 2021-01-06 (Wed) 12:00:
>
> > Hi, can someone give me the CWIKI permission so that I can update the
> > design details there (maybe as a new RFC though ~).
> >
> > wangxianghu  wrote on 2021-01-05 (Tue) 14:43:
> >
> >> + 1, Thanks Danny!
> >> I believe this new feature, OperatorCoordinator in Flink 1.11, will help
> >> improve the current implementation
> >>
> >> Best,
> >>
> >> XianghuWang
> >>
> >> At 2021-01-05 14:17:37, "vino yang"  wrote:
> >> >Hi,
> >> >
> >> >Sharing more details: the OperatorCoordinator is part of the new Data
> >> >Source API (Beta) covered in the Flink 1.11 release notes [1].
> >> >
> >> >Flink 1.11 was released only about half a year ago. The design of
> RFC-13
> >> >began at the end of 2019, and most of the implementation was completed
> >> when
> >> >Flink 1.11 was released.
> >> >
> >> >I believe that the production environment of many large companies has
> not
> >> >been upgraded so quickly (As far as our company is concerned, we still
> >> have
> >> >some jobs running on flink release packages below 1.9).
> >> >
> >> >So, maybe we need to find a mechanism to benefit both new and old
> users.
> >> >
> >> >[1]:
> >> >
> >>
> https://flink.apache.org/news/2020/07/06/release-1.11.0.html#new-data-source-api-beta
> >> >
> >> >Best,
> >> >Vino
> >> >
> >> >vino yang  wrote on 2021-01-05 (Tue) 12:30:
> >> >
> >> >> Hi,
> >> >>
> >> >> +1, thank you Danny for introducing this new Flink feature
> >> >> (OperatorCoordinator) [1] from the latest release.
> >> >> This feature is very helpful for improving the implementation
> >> >> mechanism of the Flink write client.
> >> >>
> >> >> But this feature is only available from Flink 1.11 onwards. Before
> >> >> that, there was no good way to coordinate upstream and downstream
> >> >> tasks through the public API provided by Flink.
> >> >> My only concern is whether we need to take into account users of
> >> >> earlier versions (below Flink 1.11).
> >> >>
> >> >> [1]: https://issues.apache.org/jira/browse/FLINK-15099
> >> >>
> >> >> Best,
> >> >> Vino
> >> >>
> >> >> Gary Li  wrote on 2021-01-05 (Tue) 10:40:
> >> >>
> >> >>> Hi Danny,
> >> >>>
> >> >>> Thanks for the proposal. I'd recommend starting a new RFC. RFC-13
> >> >>> is done and included some refactoring work, so we should mark it as
> >> >>> completed. Looking forward to having further discussion on the RFC.
> >> >>>
> >> >>> Best,
> >> >>> Gary Li
> >> >>> 
> >> >>> From: Danny Chan 
> >> >>> Sent: Tuesday, January 5, 2021 10:22 AM
> >> >>> To: dev@hudi.apache.org 
> >> >>> Subject: Re: [DISCUSS] New Flink Writer Proposal
> >> >>>
> >> >>> Sure, i can update the RFC-13 cwiki if you agree with that.
> >> >>>
> >> >>> Vinoth Chandar  wrote on 2021-01-05 (Tue) 02:58:
> >> >>>
> >> >>> > Overall +1 on the idea.
> >> >>> >
> >> >>> > Danny, could we move this to the apache cwiki if you don't mind?
> >> >>> > That's what we have been using for other RFC discussions.
> >> >>> >
> >> >>> > On Mon, Jan 4, 2021 at 1:22 AM Danny Chan 
> >> wrote:
> >> >>> >
> >> >>> > > The RFC-13 Flink writer has some bottlenecks that make it hard
> >> >>> > > to adapt to production:
> >> >>> > >
> >> >>> > > - 

Re: [DISCUSS] New Flink Writer Proposal

2021-01-04 Thread vino yang
Hi,

Sharing more details: the OperatorCoordinator is part of the new Data
Source API (Beta) covered in the Flink 1.11 release notes [1].

Flink 1.11 was released only about half a year ago. The design of RFC-13
began at the end of 2019, and most of the implementation was completed when
Flink 1.11 was released.

I believe that the production environment of many large companies has not
been upgraded so quickly (As far as our company is concerned, we still have
some jobs running on flink release packages below 1.9).

So, maybe we need to find a mechanism to benefit both new and old users.

[1]:
https://flink.apache.org/news/2020/07/06/release-1.11.0.html#new-data-source-api-beta

Best,
Vino

vino yang  wrote on 2021-01-05 (Tue) 12:30:

> Hi,
>
> +1, thank you Danny for introducing this new Flink feature
> (OperatorCoordinator) [1] from the latest release.
> This feature is very helpful for improving the implementation mechanism of
> the Flink write client.
>
> But this feature is only available from Flink 1.11 onwards. Before that,
> there was no good way to coordinate upstream and downstream tasks through
> the public API provided by Flink.
> My only concern is whether we need to take into account users of earlier
> versions (below Flink 1.11).
>
> [1]: https://issues.apache.org/jira/browse/FLINK-15099
>
> Best,
> Vino
>
> Gary Li  wrote on 2021-01-05 (Tue) 10:40:
>
>> Hi Danny,
>>
>> Thanks for the proposal. I'd recommend starting a new RFC. RFC-13 was
>> done and included some refactoring work, so we should mark it as
>> completed. Looking forward to having further discussion on the RFC.
>>
>> Best,
>> Gary Li
>> 
>> From: Danny Chan 
>> Sent: Tuesday, January 5, 2021 10:22 AM
>> To: dev@hudi.apache.org 
>> Subject: Re: [DISCUSS] New Flink Writer Proposal
>>
>> Sure, i can update the RFC-13 cwiki if you agree with that.
>>
>> Vinoth Chandar  wrote on 2021-01-05 (Tue) 02:58:
>>
>> > Overall +1 on the idea.
>> >
>> > Danny, could we move this to the apache cwiki if you don't mind?
>> > That's what we have been using for other RFC discussions.
>> >
>> > On Mon, Jan 4, 2021 at 1:22 AM Danny Chan  wrote:
>> >
>> > > The RFC-13 Flink writer has some bottlenecks that make it hard to
>> > > adapt to production:
>> > >
>> > > - The InstantGeneratorOperator has parallelism 1, which is a limit for
>> > > high-throughput consumption; because all the split inputs drain to a
>> > > single thread, the network IO also comes under pressure
>> > > - The WriteProcessOperator handles inputs by partition; that means,
>> > > within each partition write process, the BUCKETs are written one by
>> > > one, so the file IO is too limited to adapt to high-throughput inputs
>> > > - It buffers the data by checkpoints, which is too hard to make robust
>> > > for production; the checkpoint function is blocking and should not
>> > > have IO operations
>> > > - The FlinkHoodieIndex is only valid for a per-job scope; it does not
>> > > work for existing bootstrap data or for different Flink jobs
>> > >
>> > > Thus, here I propose a new design for the Flink writer to solve these
>> > > problems[1]. Overall, the new design tries to remove the single
>> > parallelism
>> > > operators and make the index more powerful and scalable.
>> > >
>> > > I plan to solve these bottlenecks incrementally (4 steps), there are
>> > > already some local POCs for these proposals.
>> > >
>> > > I'm looking forward to your feedback. Any suggestions are appreciated
>> ~
>> > >
>> > > [1]
>> > >
>> > >
>> >
>> https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing
>> > >
>> >
>>
>


Re: [DISCUSS] New Flink Writer Proposal

2021-01-04 Thread vino yang
Hi,

+1, thank you Danny for introducing this new feature
(OperatorCoordinator)[1] of Flink in the recently latest version.
This feature is very helpful for improving the implementation mechanism of
Flink write-client.

But this feature is only available after Flink 1.11. Before that, there was
no good way to realize the mechanism of task upstream and downstream
coordination through the public API provided by Flink.
I just have a concern, whether we need to take into account the users of
earlier versions (less than Flink 1.11).

[1]: https://issues.apache.org/jira/browse/FLINK-15099

Best,
Vino

Gary Li  wrote on 2021-01-05 (Tue) 10:40:

> Hi Danny,
>
> Thanks for the proposal. I'd recommend starting a new RFC. RFC-13 was done
> and included some refactoring work, so we should mark it as
> completed. Looking forward to having further discussion on the RFC.
>
> Best,
> Gary Li
> 
> From: Danny Chan 
> Sent: Tuesday, January 5, 2021 10:22 AM
> To: dev@hudi.apache.org 
> Subject: Re: [DISCUSS] New Flink Writer Proposal
>
> Sure, i can update the RFC-13 cwiki if you agree with that.
>
> Vinoth Chandar  wrote on 2021-01-05 (Tue) 02:58:
>
> > Overall +1 on the idea.
> >
> > Danny, could we move this to the apache cwiki if you don't mind?
> > That's what we have been using for other RFC discussions.
> >
> > On Mon, Jan 4, 2021 at 1:22 AM Danny Chan  wrote:
> >
> > > The RFC-13 Flink writer has some bottlenecks that make it hard to
> > > adapt to production:
> > >
> > > - The InstantGeneratorOperator has parallelism 1, which is a limit for
> > > high-throughput consumption; because all the split inputs drain to a
> > > single thread, the network IO also comes under pressure
> > > - The WriteProcessOperator handles inputs by partition; that means,
> > > within each partition write process, the BUCKETs are written one by
> > > one, so the file IO is too limited to adapt to high-throughput inputs
> > > - It buffers the data by checkpoints, which is too hard to make robust
> > > for production; the checkpoint function is blocking and should not
> > > have IO operations
> > > - The FlinkHoodieIndex is only valid for a per-job scope; it does not
> > > work for existing bootstrap data or for different Flink jobs
> > >
> > > Thus, here I propose a new design for the Flink writer to solve these
> > > problems[1]. Overall, the new design tries to remove the single
> > parallelism
> > > operators and make the index more powerful and scalable.
> > >
> > > I plan to solve these bottlenecks incrementally (4 steps), there are
> > > already some local POCs for these proposals.
> > >
> > > I'm looking forward to your feedback. Any suggestions are appreciated ~
> > >
> > > [1]
> > >
> > >
> >
> https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing
> > >
> >
>


Re: dev@hudi.apache.org

2020-12-18 Thread vino yang
Hi chang,

Done, and welcome to the Hudi community!

Best,
Vino

chang li  wrote on 2020-12-18 (Fri) 18:24:

> Hi,
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission?
> My JIRA ID is lichang


Re: 0.7.0 Release planning

2020-12-17 Thread vino yang
+1 for Danny as the RM of 0.7
+1 for xianghu to give the necessary help

Considering it's a major version, @leesf  and I can
also give some help if necessary.

Best,
Vino

wangxianghu  wrote on 2020-12-17 (Thu) 18:00:

> +1
> I can assist Danny in this release :)
>
> > On 2020-12-17 14:08, Danny Chan  wrote:
> >
> > If there are no other release managers, I can be the one, although I'm
> > only a contributor now ~
> >
> > Vinoth Chandar  wrote on 2020-12-16 (Wed) 12:36:
> >
> >> Hello all,
> >>
> >> We are hoping to cut a release candidate by Dec 31. Any volunteers for
> >> being the Release Manager?
> >>
> >> Thanks
> >> Vinoth
> >>
>
>


Re: [DISCUSS] SQL Support using Apache Calcite

2020-12-17 Thread vino yang
+1 for Calcite

Best,
Vino

David Sheard  wrote on 2020-12-17 (Thu) 14:15:

> I agree with Calcite
>
> On Thu, 17 Dec 2020 at 5:04 pm, Danny Chan  wrote:
>
> > Apache Calcite is a good candidate for parsing and executing the SQL,
> > Apache Flink has an extension for the SQL based on the Calcite parser
> [1],
> >
> > > users will write : hudiSparkSession.sql("UPDATE ")
> >
> > Should users still need to instantiate the hudiSparkSession first? My
> > desired use case is users running the Hoodie CLI to execute these SQLs.
> > They can choose which engine to use via a CLI config option.
> >
> > > If we want those expressed in Calcite as well, we need to also invest
> in
> > the full Query side support, which can increase the scope by a lot.
> >
> > That is true; my thought is that we use Calcite to execute only these
> > MERGE SQL statements. For DQL or DML, we would delegate parsing/execution
> > to the underlying engines (Flink or Spark); the Hoodie Calcite parser
> > only parses the query statements and hands them over to the engines. One
> > thing to note is the SQL dialect difference: Spark may have its own
> > syntax (keywords) that Calcite cannot parse/recognize.
> >
> > [1]
> >
> >
> https://github.com/apache/flink/tree/master/flink-table/flink-sql-parser/src/main/codegen
> >
> > Vinoth Chandar  wrote on 2020-12-11 (Fri) 15:58:
> >
> > > Hello all,
> > >
> > > One feature that keeps coming up is the ability to use UPDATE, MERGE
> sql
> > > syntax to support writing into Hudi tables. We have looked into the
> > Spark 3
> > > DataSource V2 APIs as well and found several issues that hinder us in
> > > implementing this via the Spark APIs
> > >
> > > - As of this writing, the UPDATE/MERGE syntax is not really opened up
> > > to external datasources like Hudi; only DELETE is.
> > > - DataSource V2 API offers no flexibility to perform any kind of
> > > further transformations to the dataframe. Hudi supports keys, indexes,
> > > preCombining and custom partitioning that ensures file sizes etc. All
> > this
> > > needs shuffling data, looking up/joining against other dataframes so
> > forth.
> > > Today, the DataSource V1 API allows this kind of further
> > > partitioning/transformation. But the V2 API simply offers
> > > partition-level iteration once the user calls df.write.format("hudi")
> > >
> > > One thought I had is to explore Apache Calcite and write an adapter for
> > > Hudi. This frees us from being very dependent on a particular engine's
> > > syntax support, like Spark's. Calcite is very popular by itself and
> > > supports most of the keywords (and also a more streaming-friendly
> > > syntax). To be
> > > clear, we will still be using Spark/Flink underneath to perform the
> > actual
> > > writing, just that the SQL grammar is provided by Calcite.
> > >
> > > To give a taste of how this will look like.
> > >
> > > A) If the user wants to mutate a Hudi table using SQL
> > >
> > > Instead of writing something like : spark.sql("UPDATE ")
> > > users will write : hudiSparkSession.sql("UPDATE ")
> > >
> > > B) To save a Spark data frame to a Hudi table
> > > we continue to use Spark DataSource V1
> > >
> > > The obvious challenge I see is the disconnect with the Spark DataFrame
> > > ecosystem. Users would write MERGE SQL statements by joining against
> > other
> > > Spark DataFrames.
> > > If we want those expressed in Calcite as well, we need to also invest
> in
> > > the full Query side support, which can increase the scope by a lot.
> > > Some amount of investigation needs to happen, but ideally we should be
> > able
> > > to integrate with the sparkSQL catalog and reuse all the tables there.
> > >
> > > I am sure there are some gaps in my thinking. Just starting this
> thread,
> > so
> > > we can discuss and others can chime in/correct me.
> > >
> > > thanks
> > > vinoth
> > >
> >
>
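
To make the syntax under discussion concrete, a hypothetical example (the
hudiSparkSession entry point and the table/column names exist only for
illustration; the MERGE grammar shown is the standard Calcite-style form):

    // Hypothetical: Calcite supplies the grammar; execution is delegated
    // to Spark underneath.
    hudiSparkSession.sql(
        "MERGE INTO hudi_trips t "
      + "USING trip_updates u ON t.uuid = u.uuid "
      + "WHEN MATCHED THEN UPDATE SET fare = u.fare "
      + "WHEN NOT MATCHED THEN INSERT (uuid, fare) VALUES (u.uuid, u.fare)");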


Re: [Discussion] Speed up CI/CD build

2020-12-11 Thread vino yang
Hi,

Sorry.
Due to a busy workload, I paused this verification work. I will let
@wangxianghu continue to follow up on it.

Best,
Vino

Vinoth Chandar  wrote on 2020-12-11 (Fri) 15:25:

> Thanks for kicking this off, Gary!
>
> Thanks to Raymond, the CI tests are leaner and more parallelized now. But I
> do see that we hit Travis queuing a lot.
> Is that what you are targeting primarily?
>
> We were exploring a move to Azure CI Pipelines to get more compute power
> allocated at some point.
> @vinoyang , IIRC, you were driving this? Could you please share your
> experience?
>
> On Thu, Dec 10, 2020 at 9:53 PM Gary Li  wrote:
>
> > Hi all,
> >
> > I am seeing a surge in PRs recently. It's great news that more
> > developers have joined the Hudi community.
> >
> > On the other hand, the Travis build backlog was growing as well. As we
> > grow, we need to scale the CI/CD pipeline soon. So I'd like to start a
> > discussion here.
> >
> > Do we have any ongoing work on this topic? Anyone from larger open source
> > projects like Spark/Flink etc. who can share experience about how they
> > handle this problem?
> >
> > Any thoughts are appreciated.
> >
> > Best,
> >
> > Gary Li
> >
>


Re: Re: Congrats to our newest committers!

2020-12-03 Thread vino yang
Congrats to both!

Trevor  wrote on 2020-12-04 (Fri) 10:18:

>
> Congratulations to the new committers! Excited about the next release!
>
> Best,
>
> Trevor
>
>
> wowtua...@gmail.com
>
> From: Sivabalan
> Date: 2020-12-04 09:59
> To: dev
> CC: us...@hudi.apache.org
> Subject: Re: Congrats to our newest committers!
Congrats guys! Well deserved and excited for the upcoming release.
>
> On Thu, Dec 3, 2020 at 5:58 PM Gary Li  wrote:
>
> > Congratulations Satish and Prashant! Excited about the next release!
> >
> > Gary Li
> > 
> > From: Mehrotra, Udit 
> > Sent: Friday, December 4, 2020 8:35:07 AM
> > To: dev@hudi.apache.org 
> > Cc: us...@hudi.apache.org 
> > Subject: Re: Congrats to our newest committers!
> >
> > Huge congrats guys ! Well deserved indeed.
> >
> > On 12/3/20, 11:44 AM, "Prashant Wason"  wrote:
> >
> > CAUTION: This email originated from outside of the organization. Do
> > not click links or open attachments unless you can confirm the sender and
> > know the content is safe.
> >
> >
> >
> > Thanks everyone.
> >
> > Over the past one year I have really enjoyed learning and developing
> > with HUDI. Excited to be part of the group.
> >
> > > On Dec 3, 2020, at 11:37 AM, Balaji Varadarajan
> >  wrote:
> > >
> > > Very Well deserved !! Many congratulations to Satish and Prashant.
> > > Balaji.V
> > >On Thursday, December 3, 2020, 11:07:09 AM PST, Bhavani Sudha <
> > bhavanisud...@gmail.com> wrote:
> > >
> > > Congratulations Satish and Prashant!
> > > On Thu, Dec 3, 2020 at 11:03 AM Pratyaksh Sharma <
> > pratyaks...@gmail.com> wrote:
> > >
> > > Congratulations Satish and Prashant!
> > >
> > > On Fri, Dec 4, 2020 at 12:22 AM Vinoth Chandar 
> > wrote:
> > >
> > >> Hi all,
> > >>
> > >> I am really happy to announce our newest set of committers.
> > >>
> > >> *Satish Kotha*: Satish has ramped very quickly across our entire
> > code base
> > >> and contributed bug fixes and also drove large, unique features
> like
> > >> clustering, replace/overwrite which are about to go out in the
> 0.7.0
> > >> release. These efforts largely complete parts of our vision, and it
> > >> could not have happened without Satish.
> > >>
> > >> *Prashant Wason*: In addition to a number of patches, Prashant has
> > been
> > >> shouldering massive responsibility on RFC-15, and thanks to his
> > efforts, we
> > >> have a simplified design and a very solid implementation right now,
> > >> which is being tested for the 0.7.0 release.
> > >>
> > >> Please join me in congratulating them on this great milestone!
> > >>
> > >> Thanks,
> > >> Vinoth
> > >>
> > >
> >
> >
> > --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] 0.7.0 release timelines

2020-12-02 Thread vino yang
+1 for option 2

Gary Li  wrote on 2020-12-02 (Wed) 16:01:

> vote for option 2.
> 
> From: nishith agarwal 
> Sent: Wednesday, December 2, 2020 3:16 PM
> To: dev@hudi.apache.org 
> Subject: Re: [DISCUSS] 0.7.0 release timelines
>
> I vote for option 2 as well.
>
> -Nishith
>
> On Tue, Dec 1, 2020 at 10:05 PM Bhavani Sudha 
> wrote:
>
> > I vote for option 2 too.
> >
> > On Tue, Dec 1, 2020 at 7:36 PM Sivabalan  wrote:
> >
> > > I would vote for Option 2 given that the features are already being
> > > tested. If they were halfway through development, maybe I would have
> > > given it a thought.
> > > But let's hear from the community.
> > >
> > >
> > > On Mon, Nov 30, 2020 at 8:15 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hello all,
> > > >
> > > > We still have a few features to land for the 0.7.0 release.
> > Specifically,
> > > > RFC-15 and Clustering have PRs, undergoing test/production validation
> > at
> > > > the moment.
> > > >
> > > > Based on the JIRAs, I see two options
> > > >
> > > > Option 1:  Cut RC by next week or so, and push out the larger
> features
> > > to a
> > > > (hopefully quick) 0.8.0. We already have a few large features in
> > > > master/pending PRs (spark3, flink, replace/overwrite etc..)
> > > > Option 2:  Wait till December end to cut RC, with all the originally
> > > > planned feature set.
> > > >
> > > > Please chime in with your thoughts.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: JIRA permission apply

2020-11-29 Thread vino yang
Hi,

Done, and welcome to the Hudi community!

Best,
Vino

key lou  wrote on 2020-11-28 (Sat) 20:04:

> *Hi,*
>
> *I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is loukey_j.*
>
> *Thanks.*
>


Re: want to contribute to Apache Hudi.

2020-11-24 Thread vino yang
Hi,

Done! And welcome to the Hudi community!

Best,
Vino

ppqq1121...@163.com  wrote on 2020-11-25 (Wed) 03:59:

>
> Hi,
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is ppqq1121827.
>
>
>
> ppqq1121...@163.com
>


Re: join the community

2020-11-15 Thread vino yang
Hi shikai,

Done! Great to have you, and looking forward to your continued contributions!

shikai wang  wrote on 2020-11-16 (Mon) 11:37:

> Please add me as a contributor to the project.
> jira: karl wang
> github: Karl-WangSK
>


Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-15 Thread vino yang
> So, you are trying to avoid reading the data again from an incremental
> query? If so, I don't know how we can achieve this in Hudi.
Let's say we

a) read the 20 mins of data from Kafka or DFS, into a Spark Dataframe,
b) issue an upsert into a hudi table (at this point the dataframe is lost)

if you want to do more processing, after step b, then we need to just cache
the input data frame, right?

IMO, let's first align on this question:

Why can't we reuse the result of the public write API:


*  public JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records,
final String instantTime);*

so that we can add more processing nodes into the current DAG of the Spark
job in one write cycle?

We do not need to cache the input data frame, because we can apply
transformations before committing.
We try to reuse the written data.
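
As a concrete illustration, a minimal sketch (assuming `writeClient` is an
initialized Hudi Spark write client; the downstream steps are invented for
this example and are not part of any existing API):

    // Sketch: extend the Spark DAG with processing nodes after upsert(),
    // instead of re-reading the committed files from storage.
    JavaRDD<WriteStatus> statuses = writeClient.upsert(records, instantTime);

    // Hypothetical downstream processing on the written batch, pre-commit:
    long errors = statuses.filter(WriteStatus::hasErrors).count();
    if (errors == 0) {
      writeClient.commit(instantTime, statuses);
      // further Map/Filter/Reduce or windowed aggregation can be chained here
    }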


Vinoth Chandar  wrote on 2020-09-15 (Tue) 05:32:

> >I mean we can do more things in one pipeline based on the capabilities
> of the computing frameworks.
> >But the key difference, we can avoid re-reading these data from the
> persistent storage again.
>
> So, you are trying to avoid reading the data again from an incremental query? If
> so, I don't know how we can achieve this in Hudi.
> Let's say we
>
> a) read the 20 mins of data from Kafka or DFS, into a Spark Dataframe,
> b) issue an upsert into a hudi table (at this point the dataframe is lost)
>
> if you want to do more processing, after step b, then we need to just cache
> the input data frame, right?
>
> Please don't get me wrong. I want us to brainstorm more on the incremental
> pipeline scenarios, but not connecting how someone
> would benefit from the proposal against what they are using today. May be
> if we can structure this discussion along lines of
>
> 1) What is the problem?
> 2) How are users solving this today?
> 3) How do we want to solve this tomorrow? why?
>
> without even going into implementation or framework design, that can help
> us?
>
>
>
>
> On Wed, Sep 9, 2020 at 7:52 PM vino yang  wrote:
>
> > >Given how much effort it takes for Beam to solve this, IMHO we cannot do
> > justice to such an undertaking. We should just use Beam :)
> >
> > My original idea is that we only support a very limited processing
> fashion.
> > Agree with you, the decision is very important.
> > Besides Beam, The transport lib[1] contributed by LinkedIn also provided
> an
> > abstraction layer beyond computing frameworks.
> > But, yes, Beam is a better choice.
> >
> > > Sorry, I still cannot understand the uniqueness of the requirement
> here.
> > Windowing/aggregation again just sounds like another spark job to me,
> which
> > is crunching the ingested data.
> > It can incrementally query the source hudi tables already.
> >
> > I mean we can do more things in one pipeline based on the capabilities of
> > the computing frameworks.
> > The "Window" concept here, is not the same semantics in the computing
> > frameworks.
> > It's a more generalized and coarse-grained concept; simply put:
> >
> > *It's a batch of the sync interval of the DeltaStreamer in the continuous
> > mode.*
> >
> > For example, a 20-minute or half-hour batch of data. We need the
> > capability to do some aggregation on such a "window" after the data
> > lands, with ACID semantics.
> > Yes, through a scheduler engine like Apache Airflow, we can read the
> > data back from storage and then process it. But the key difference is
> > that we can avoid re-reading the data from persistent storage.
> >
> > [1] https://github.com/linkedin/transport
> >
> > Best,
> > Vino
> >
Vinoth Chandar  wrote on 2020-09-10 (Thu) 05:29:
> >
> > > Apologies. Long weekend, and lost track of this :/
> > >
> > > >We must touch the field that Beam trying to solve.
> > > Given how much effort it takes for Beam to solve this, IMHO we cannot
> do
> > > justice to such an undertaking. We should just use Beam :)
> > >
> > > >I just want to introduce a utility class (or a utility module/library
> if
> > > we try to define some engine-independent transformation class)
> > > > However, there are some scenarios that are not about table
> processing.
> > > e.g. metrics calculating, quickly aggregate and calculate windows
> > >
> > > Sorry, I still cannot understand the uniqueness of the requirement
> here.
> > > Windowing/aggregation again just sounds like another spark job to me,
> > which
> > > is crunching the ingested data.
It can incrementally query the source hudi tables already.

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-09 Thread vino yang
>Given how much effort it takes for Beam to solve this, IMHO we cannot do
justice to such an undertaking. We should just use Beam :)

My original idea is that we only support a very limited processing fashion.
Agree with you, the decision is very important.
Besides Beam, the transport lib [1] contributed by LinkedIn also provides an
abstraction layer beyond computing frameworks.
But, yes, Beam is a better choice.

> Sorry, I still cannot understand the uniqueness of the requirement here.
Windowing/aggregation again just sounds like another spark job to me, which
is crunching the ingested data.
It can incrementally query the source hudi tables already.

I mean we can do more things in one pipeline based on the capabilities of
the computing frameworks.
The "Window" concept here, is not the same semantics in the computing
frameworks.
It's a more generalized and coarse-grained concept or simply says:

*It's a batch of the sync interval of the DeltaStreamer in the continuous
mode.*

For example, 20 mins batch data, half-hour batch data. We need some
capabilities to do some aggregation on such a "window" after the data
landing with the ACID semantic.
Yes, through the scheduler engine like Apache Airflow, we can read these
data from the storage then process them. But the key difference, we can
avoid re-reading these data from the persistent storage again.

[1] https://github.com/linkedin/transport
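
As an illustration, the kind of per-batch aggregation meant above (a sketch
only; `recordsJustWritten` is assumed to be the JavaRDD<HoodieRecord> handed
back by the write path, and the grouping key is made up):

    // Sketch: aggregate the just-written "window" (one sync batch) in
    // memory, instead of re-reading it from persistent storage.
    JavaPairRDD<String, Long> countsPerPartition = recordsJustWritten
        .mapToPair(r -> new Tuple2<>(r.getPartitionPath(), 1L))
        .reduceByKey(Long::sum);
    countsPerPartition.collectAsMap(); // hand the aggregates to reporting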

Best,
Vino

Vinoth Chandar  wrote on 2020-09-10 (Thu) 05:29:

> Apologies. Long weekend, and lost track of this :/
>
> >We must touch the field that Beam trying to solve.
> Given how much effort it takes for Beam to solve this, IMHO we cannot do
> justice to such an undertaking. We should just use Beam :)
>
> >I just want to introduce a utility class (or a utility module/library if
> we try to define some engine-independent transformation class)
> > However, there are some scenarios that are not about table processing.
> e.g. metrics calculating, quickly aggregate and calculate windows
>
> Sorry, I still cannot understand the uniqueness of the requirement here.
> Windowing/aggregation again just sounds like another spark job to me, which
> is crunching the ingested data.
> It can incrementally query the source hudi tables already.
>
> My 2c is that we have a lot higher value in integrating well with Apache
> Airflow/DolphinScheduler, to trigger dependent jobs. By and large, that's
> how people write batch jobs today.
> We have already built these commit callback mechanisms for e.g.
>
>
>
>
> On Tue, Sep 1, 2020 at 8:23 PM vino yang  wrote:
>
> > Hi vc,
> >
> > Thanks for your feedback.
> >
> > > Today, if I am a Spark developer, I can write a little program to do a
> > Hudi
> > upsert and then trigger some other transformation conditionally based on
> > whether upsert/insert happened, right?
> >
> > Yes, technically it can be implemented.
> >
> > > and I could do that without losing any of the existing transformation
> > methods I know in Spark.
> >
> > Yes
> >
> > > I am not quite clear on how much value this
> > library adds on top and in fact, bit concerned
> > that we set ourselves up for solving engine-independent problems that
> > Apache Beam for e.g has already solved.
> >
> > Introducing these APIs can bring more fluent API usage experience without
> > using chained program as you mentioned above. We can directly define the
> > logic of how to process the committed data after they land to the fs.
> >
> > Yes, if hudi's goal is an engine-independent data lake library. And if we
> > want
> > to introduce some abilities of transformation. We must touch the field
> that
> > Beam trying to solve.
> >
> > > I also have doubts on whether coupling the incremental processing after
> > commit into a single process itself is desirable.
> >
> > Actually, I just want to introduce a utility class (or a utility
> > module/library if
> > we try to define some engine-independent transformation class) that gives
> > users another choice to build a data processing pipeline, not only a data
> > ingesting library.
> >
> > The current functions stay the same, e.g. reading and writing are
> > completely decoupled.
> >
> > And I am thinking if it's a good idea naming it's an incremental
> processing
> > relevant proposal.
> > Based on it just try to process the recently commit, it could only
> provide
> > a limited function
> > compared with the current provided incremental view.
> >
> > Maybe it would be better to define it to be a pipeline relevant proposal?
> > Linking ingestion and

Re: may I be added as a contributor

2020-09-07 Thread vino yang
Hi kai,

Welcome and done!
Looking forward to your contributions!

Best,
Vino

wrote on 2020-09-07 (Mon) 21:16:

> Hi, may I be added as a contributor? My ASF Jira username is: likai.yu
>
> Thanks
>
>


Re: Congrats to our newest committers!

2020-09-04 Thread vino yang
Congrats to all 3!

Best,
Vino

Balaji Varadarajan  wrote on 2020-09-04 (Fri) 10:25:

>  Udit, Gary, Raymond and Pratyaksh,
> Many congratulations :) Well deserved. Looking forward to your continued
> contributions.
> Balaji.V
> On Thursday, September 3, 2020, 07:19:45 PM PDT, Sivabalan <
> n.siv...@gmail.com> wrote:
>
>  Congrats to all 3. Much deserved and really excited to see more committers
> 😊
>
> On Thu, Sep 3, 2020 at 9:23 PM leesf  wrote:
>
> > Congrats everyone, well deserved !
> >
> >
> >
> > selvaraj periyasamy  wrote on 2020-09-04 (Fri) 05:05:
> >
> >
> >
> > > Congrats everyone !
> >
> > >
> >
> > > On Thu, Sep 3, 2020 at 1:59 PM Vinoth Chandar 
> wrote:
> >
> > >
> >
> > > > Hi all,
> >
> > > >
> >
> > > > I am really excited to share the good news about our new committers
> on
> >
> > > the
> >
> > > > project!
> >
> > > >
> >
> > > > *Udit Mehrotra *: Udit has travelled with the project since sept/oct
> > last
> >
> > > > year and immensely helped us making Hudi work well with the AWS
> >
> > > ecosystem.
> >
> > > > His most notable contributions are towards driving large parts of the
> >
> > > > implementation of RFC-12, Hive/Spark integration points. He has also
> >
> > > helped
> >
> > > > our users in various tricky issues.
> >
> > > >
> >
> > > > *Gary Li:* Gary is a great success story for the project, starting
> out
> > as
> >
> > > > an early user and steadily grown into a strong contributor, who has
> >
> > > > demonstrated the ability to take up challenging implementations (e.g
> >
> > > Impala
> >
> > > > support, MOR snapshot query impl on Spark), as well as patiently
> >
> > > > iterate through feedback and evolve the design/code. He has also been
> >
> > > > helping users on Slack and mailing lists
> >
> > > >
> >
> > > > *Raymond Xu:* Raymond has also been a consistent feature on our
> mailing
> >
> > > > lists, slack and github. He has been proposing immensely valuable
> >
> > > > test/tooling improvements. He has contributed a great deal of code as
> >
> > > well,
> >
> > > > towards the same. Many many users thank Raymond for the generous help
> > on
> >
> > > > Slack.
> >
> > > >
> >
> > > > *Pratyaksh Sharma:* This is yet another great example of user ->
> >
> > > > contributor -> committer. Pratyaksh has been a great champion for the
> >
> > > > project, over the past year or so, steadily contributing many
> >
> > > improvements
> >
> > > > around the Delta Streamer tool.
> >
> > > >
> >
> > > > Please join me in, congratulating them on this well deserved
> milestone!
> >
> > > >
> >
> > > > Onwards and upwards,
> >
> > > > Vinoth
> >
> > > >
> >
> > >
> >
> > --
> Regards,
> -Sivabalan


Re: Coding guidelines

2020-09-02 Thread vino yang
+1 to have the coding guidelines.

Left some comments.

Best,
Vino

Vinoth Chandar  wrote on 2020-09-02 (Wed) 09:51:

> Hello all,
>
> Put together a list to formalize the things we follow in code review
> process today. Please chime in on the PR review, for comments.
>
> https://github.com/apache/hudi/pull/2061
>
>
> Thanks
> Vinoth
>


Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-01 Thread vino yang
> incrementally queries table A and kicks another ETL to build table B. Job A
> and B are typically different and written by different developers.
> If you could help me understand the use-case, that would be awesome.
>
> All that said, there are pains around "triggering" job B (downstream
> computations incrementally) and we could solve that by for e.g supporting
> an Apache Airflow operator that can trigger workflows
> when commits arrive on its upstream tables. What I am trying to say is -
> there is definitely gaps we would like to improve upon to make incremental
> processing mainstream, not sure if the proposed
> APIs are the highest on that list.
>
> Apologies if I am missing something. Please help me understand if so.
>
> Thanks
> Vinoth
>
>
>
>
> On Tue, Sep 1, 2020 at 4:26 AM vino yang  wrote:
>
> > Hi,
> >
> > Does anyone have ideas or disagreements?
> >
> > I think the introduction of these APIs will greatly enhance Hudi's data
> > processing capabilities and eliminate the performance overhead of reading
> > data for processing after writing.
> >
> > Best,
> > Vino
> >
> > wangxianghu  wrote on 2020-08-31 (Mon) 15:44:
> >
> > > +1
> > > This will give hudi more capabilities besides data ingestion and
> writing,
> > > and make hudi-based data processing more timely!
> > > Best,
> > > wangxianghu
> > >
> > > From: Abhishek Modi
> > > Sent: 2020-08-31 15:01
> > > To: dev@hudi.apache.org
> > > Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
> > >
> > > +1
> > >
> > > This sounds really interesting! I like that this implicitly gives Hudi
> > the
> > > ability to do transformations on ingested data :)
> > >
> > > On Sun, Aug 30, 2020 at 10:59 PM vino yang 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > For a long time, in the field of big data, people hope that the tools
> > > they
> > > > use can give greater play to the processing and analysis capabilities
> > of
> > > > big data. At present, from the perspective of API, Hudi mostly
> provides
> > > > APIs related to data ingestion, and relies on various big data query
> > > > engines on the query side to release capabilities, but does not
> > provide a
> > > > more convenient API for data processing after transactional writing.
> > > >
> > > > Currently, if a user wants to process the incremental data of a
> commit
> > > that
> > > > has just recently taken. It needs to go through three steps:
> > > >
> > > >
> > > >1.
> > > >
> > > >Write data to a hudi table;
> > > >2.
> > > >
> > > >Query or check completion of commit;
> > > >3.
> > > >
> > > >After the data is committed, the data is found out through
> > incremental
> > > >query, and then the data is processed;
> > > >
> > > >
> > > > If you want a quick link here, you may use Hudi's recent written
> commit
> > > > callback function to simplify it into two steps:
> > > >
> > > >
> > > >1.
> > > >
> > > >Write data to a hudi table;
> > > >2.
> > > >
> > > >Based on the written commit callback function to trigger an
> > > incremental
> > > >query to find out the data, and then perform data processing;
> > > >
> > > >
> > > > However, it is still very troublesome to split into two steps for
> > > scenarios
> > > > that want to perform more timely and efficient data analysis on the
> > data
> > > > ingest pipeline. Therefore, I propose to merge the entire process
> into
> > > one
> > > > step and provide a set of incremental(or saying Pipelined) processing
> > API
> > > > based on this:
> > > >
> > > > Write the data to a hudi table, after obtaining the data through
> > > > JavaRDD<WriteStatus>, directly apply the user-defined function (UDF)
> > > > to
> > > > process the data. The processing behavior can be described via these
> > two
> > > > steps:
> > > >
> > > >
> > > >1.
> > > >
> > > >Conventional conversion such as Map/Filter/Reduce;
> > > >2.
>

Re: [DISCUSS] Introduce incremental processing API in Hudi

2020-09-01 Thread vino yang
Hi,

Does anyone have ideas or disagreements?

I think the introduction of these APIs will greatly enhance Hudi's data
processing capabilities and eliminate the performance overhead of reading
data for processing after writing.

Best,
Vino

wangxianghu  wrote on 2020-08-31 (Mon) 15:44:

> +1
> This will give hudi more capabilities besides data ingestion and writing,
> and make hudi-based data processing more timely!
> Best,
> wangxianghu
>
> From: Abhishek Modi
> Sent: 2020-08-31 15:01
> To: dev@hudi.apache.org
> Subject: Re: [DISCUSS] Introduce incremental processing API in Hudi
>
> +1
>
> This sounds really interesting! I like that this implicitly gives Hudi the
> ability to do transformations on ingested data :)
>
> On Sun, Aug 30, 2020 at 10:59 PM vino yang  wrote:
>
> > Hi everyone,
> >
> >
> > For a long time, in the field of big data, people hope that the tools
> they
> > use can give greater play to the processing and analysis capabilities of
> > big data. At present, from the perspective of API, Hudi mostly provides
> > APIs related to data ingestion, and relies on various big data query
> > engines on the query side to release capabilities, but does not provide a
> > more convenient API for data processing after transactional writing.
> >
> > Currently, if a user wants to process the incremental data of a commit
> that
> > has just recently taken. It needs to go through three steps:
> >
> >
> >1.
> >
> >Write data to a hudi table;
> >2.
> >
> >Query or check completion of commit;
> >3.
> >
> >After the data is committed, the data is found out through incremental
> >query, and then the data is processed;
> >
> >
> > If you want a quick link here, you may use Hudi's recent written commit
> > callback function to simplify it into two steps:
> >
> >
> >1.
> >
> >Write data to a hudi table;
> >2.
> >
> >Based on the written commit callback function to trigger an
> incremental
> >query to find out the data, and then perform data processing;
> >
> >
> > However, it is still very troublesome to split into two steps for
> scenarios
> > that want to perform more timely and efficient data analysis on the data
> > ingest pipeline. Therefore, I propose to merge the entire process into
> one
> > step and provide a set of incremental(or saying Pipelined) processing API
> > based on this:
> >
> > Write the data to a hudi table, after obtaining the data through
> > JavaRDD<WriteStatus>, directly apply the user-defined function (UDF) to
> > process the data. The processing behavior can be described via these two
> > steps:
> >
> >
> >1.
> >
> >Conventional conversion such as Map/Filter/Reduce;
> >2.
> >
> >Aggregation calculation based on fixed time window;
> >
> >
> > And these calculation functions should be engine independent. Therefore,
> I
> > plan to introduce some new APIs that allow users to directly define
> > incremental processing capabilities after each writing operation.
> >
> > The preliminary idea is that we can introduce a tool class, for example,
> > named: IncrementalProcessingBuilder or PipelineBuilder, which can be used
> > like this:
> >
> > IncrementalProcessingBuilder builder = new
> IncrementalProcessingBuilder();
> >
> > builder.source() // source table
> >
> > .transform()
> >
> > .sink()  //derived table
> >
> > .build();
> >
> > IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<WriteStatus>
> > records, HudiMapFunction mapFunction);
> >
> > IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<WriteStatus>
> > records, HudiMapFunction mapFunction);
> >
> > IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<WriteStatus>
> > records, HudiFilterFunction filterFunction);
> >
> > //window function
> >
> >
> IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<WriteStatus>
> > records, HudiAggregateFunction aggFunction);
> >
> > It is suitable for scenarios where the commit interval (window) is
> moderate
> > and data ingestion latency is not a major concern.
> >
> >
> > What do you think? Looking forward to your thoughts and opinions.
> >
> >
> > Best,
> >
> > Vino
> >
>
>
>


[DISCUSS] Introduce incremental processing API in Hudi

2020-08-30 Thread vino yang
Hi everyone,


For a long time, people in the big data field have hoped that their tools
could fully unleash big data's processing and analysis capabilities. At
present, from an API perspective, Hudi mostly provides APIs related to data
ingestion and relies on the various big data query engines to unlock
capabilities on the query side, but it does not provide a convenient API
for processing data after a transactional write.

Currently, if a user wants to process the incremental data of a commit that
has just completed, they need to go through three steps:

1. Write data to a hudi table;
2. Query or check completion of the commit;
3. After the data is committed, fetch it via an incremental query, then
process it (a sketch of steps 2 and 3 follows below).
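
For concreteness, a minimal sketch of steps 2 and 3 as they look today with
Spark's incremental query (the option keys follow the commonly documented
Hudi datasource read options; treat the exact keys, the `lastInstant`
bookkeeping, and the column name as assumptions):

    // Steps 2-3 today: incremental query once a commit completes.
    // Assumes an active SparkSession `spark`, the table's `basePath`, and
    // `lastInstant`, the last commit time that was already processed.
    Dataset<Row> newData = spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", lastInstant)
        .load(basePath);

    // Step 3: process the freshly committed batch, e.g. a simple aggregation.
    newData.groupBy("driver").count().show();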


If you want a quicker link here, you may use Hudi's recently added write
commit callback to simplify this into two steps:

1. Write data to a hudi table;
2. From the write commit callback, trigger an incremental query to fetch
the data, then perform the processing (see the callback sketch below).
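
A sketch of the callback hook-up (the interface and message names follow the
write commit callback feature added around Hudi 0.6.0; verify the exact API
against your version):

    // Sketch: react to a finished commit without polling the timeline.
    public class TriggerIncrementalProcessing implements HoodieWriteCommitCallback {
      @Override
      public void call(HoodieWriteCommitCallbackMessage message) {
        // the message carries the commit time and table info of the
        // completed commit; kick off the incremental query + processing
        submitDownstreamJob(message.getCommitTime(), message.getBasePath());
      }

      private void submitDownstreamJob(String commitTime, String basePath) {
        // user-defined scheduling/launch logic (intentionally left abstract)
      }
    }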


However, splitting the flow into two steps is still very troublesome for
scenarios that want more timely and efficient analysis on the data
ingestion pipeline. Therefore, I propose to merge the entire process into
one step and provide a set of incremental (or, say, pipelined) processing
APIs based on this:

Write the data to a hudi table; after obtaining the data through
JavaRDD<WriteStatus>, directly apply the user-defined function (UDF) to
process the data. The processing behavior can be described via these two
steps:


1. Conventional transformations such as Map/Filter/Reduce;
2. Aggregations calculated over a fixed time window;


And these calculation functions should be engine independent. Therefore, I
plan to introduce some new APIs that allow users to directly define
incremental processing capabilities after each writing operation.

The preliminary idea is that we can introduce a tool class, for example,
named: IncrementalProcessingBuilder or PipelineBuilder, which can be used
like this:

IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();

builder.source() // source table

.transform()

.sink()  //derived table

.build();

IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<WriteStatus>
records, HudiMapFunction mapFunction);

IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<WriteStatus>
records, HudiMapFunction mapFunction);

IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<WriteStatus>
records, HudiFilterFunction filterFunction);

// window function

IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<WriteStatus>
records, HudiAggregateFunction aggFunction);

It is suitable for scenarios where the commit interval (window) is moderate
and data ingestion latency is not a major concern.


What do you think? Looking forward to your thoughts and opinions.


Best,

Vino


Re: [DISCUSS] Release 0.6.0 timelines

2020-08-24 Thread vino yang
> (0.6.0 burn-down board:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=397&projectKey=HUDI&view=detail&selectedIssue=HUDI-69)
>
> - Spark Datasource/MOR https://github.com/apache/hudi/pull/1848 needs to
>   be tested by gary/balaji (About to land)
> - Hive Sync restructuring (Review done, about to land)
> - Bootstrap
>   - Vinoth working on code review, tests for PR 1876,
>   - then udit will rework PR 1702 (In Code review)
>   - then we will review, land PR 1870, 1869
> - Bulk insert V2 PR 1834, lower risk, independent PR, well tested already
>   - Dependent PR 1149 to be landed,
>   - and modes to be respected in V2 impl as well (At risk)
> - Upgrade Downgrade Hooks, PR 1858 (In Code review)
> - HUDI-1054: Marker list perf improvement, Udit has a PR out
> - HUDI-115: Overwrite with... ordering issue, Sudha has a PR nearing
>   landing
> - HUDI-1098: Marker file issue with non-existent files (In Code review)
> - Spark Streaming + Async Compaction, test complete,

Re: [ANNOUNCE] Apache Hudi 0.6.0 released

2020-08-24 Thread vino yang
Great news!

Thanks to Bhavani Sudha for driving the release! And thanks to everyone in
the community!

Best,
Vino

Bhavani Sudha  wrote on 2020-08-25 (Tue) 11:37:

> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.6.0.
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> Incrementals. Apache Hudi manages storage of large analytical datasets on
> DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage) and
> provides the ability to query them.
>
> This release comes 2 months after 0.5.3. It includes more than 200
> resolved issues, comprising new features, perf improvements, as well as
> general improvements and bug-fixes. Hudi 0.6.0 introduces mechanisms to
> efficiently bootstrap large datasets into Hudi without having to copy the
> data (experimental feature), via both Spark datasource writer and
> DeltaStreamer tool. A new index (HoodieSimpleIndex) is added that can be
> faster than bloom index for cases where updates/deletes spread across a
> large portion of the table. With this version, rollbacks are done using
> marker files and a supporting upgrade and downgrade infrastructure is
> provided to users for smooth transition. HoodieMultiDeltaStreamer tool
> (experimental feature) is added in this version to support ingesting
> multiple kafka streams in a single DeltaStreamer deployment for enhancing
> operational experience. Bulk inserts are further improved by avoiding any
> dataframe-rdd conversions, accompanied with configurable sorting modes.
> While this conversion of dataframe to rdd, is not a bottleneck for
> upsert/deletes, subsequent releases will expand this to other write
> operations. Other performance improvements include supporting async
> compaction for spark streaming writes.
>
> For details on how to use Hudi, please look at the quick start page
> located at:
> https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.6.0
>
> You can read more about the release (including release notes) here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346663
>
> We would like to thank all contributors, the community, and the Apache
> Software Foundation for enabling this release and we look forward to
> continued collaboration. We welcome your help and feedback. For more
> information on how to report problems, and to get involved, visit the
> project website at:
> http://hudi.apache.org/
>
> Thanks to everyone involved!
> - Bhavani Sudha
>
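
As one example of the new knobs mentioned above, opting into the new simple
index when building a write config might look like this (a sketch; the
builder and enum names follow the 0.6.0-era code base as best recalled, so
verify against the release):

    // Sketch: choose HoodieSimpleIndex instead of the default bloom index.
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withIndexConfig(HoodieIndexConfig.newBuilder()
            .withIndexType(HoodieIndex.IndexType.SIMPLE)
            .build())
        .build();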


Re: [DISCUSS] Codestyle: force multiline indentation

2020-08-22 Thread vino yang
Hi vc,

Yes, different developers may have different preferences in this part of the
practice. I have never enabled the IDE's automatic formatting, nor invoked
the IDE's formatting functions manually. Because I have participated in
multiple open source communities, and each community has its own code style
conventions, I simply learn each community's style; after changing code I
compile locally, checkstyle identifies and reports any problems, and I fix
them until the build passes.

I admit this is my personal habit, and everything has two sides. IDE
automatic formatting makes it more convenient for developers to deal with
code style. On the other hand, it also makes things more complicated for
the community, which must weigh more factors when settling on conventions.

Best,
Vino

Vinoth Chandar  wrote on 2020-08-22 (Sat) 14:25:

> >But, IMO, we can ignore the IDE here, if it breaks the code style,
> checkstyle will stop building and spotless will work.
>
> I differ here slightly. Most people reformat code using the "format code"
> in the IDE. And IDEs also can reorganize the code when you save etc.
> We need a solid way to not be fighting the IDE all the time :). So it may
> be okay to not go with how IDE formats things, but we need to ensure IDE
> does not get in the way.
>
> thoughts?
>
> Thanks
> Vinoth
>
> On Fri, Aug 21, 2020 at 1:26 PM Nishith  wrote:
>
> > +1 for spotless, automating the formatting will definitely help
> > productivity and turn around time for PRs.
> >
> > -Nishith
> >
> > Sent from my iPhone
> >
> > > On Aug 21, 2020, at 11:53 AM, Sivabalan  wrote:
> > >
> > > totally +1 for spotless.
> > >
> > >
> > >> On Thu, Aug 20, 2020 at 8:53 AM leesf  wrote:
> > >>
> > >> +1 on using mvn spotless:apply to fix the codestyle.
> > >>
> > >>> Bhavani Sudha  wrote on 2020-08-19 (Wed) 12:31:
> > >>
> > >>> +1 on auto code formatting. I also think it should be okay to be even
> > >> more
> > >>> restrictive by failing builds when the code format is not adhered to (in
> > any
> > >>> environment). That way everyone is forced to use the same formatting.
> > >>>
> > >>>> On Tue, Aug 18, 2020 at 8:47 PM vino yang 
> > wrote:
> > >>>
> > >>>>> the key challenge has been keeping checkstyle, IDE and spotless
> > >>> agreeing
> > >>>> on the same thing.
> > >>>>
> > >>>> Yes, it's the key thing. But, IMO, we can ignore the IDE here: if
> > >>>> someone breaks the code style, checkstyle will stop the build and
> > >>>> spotless will fix it.
> > >>>>
> > >>>> Vinoth Chandar  wrote on 2020-08-19 (Wed) 07:49:
> > >>>>
> > >>>>> the key challenge has been keeping checkstyle, IDE and spotless
> > >>> agreeing
> > >>>> on
> > >>>>> the same thing.
> > >>>>>
> > >>>>> your understanding is correct. CI will enforce in a similar
> fashion.
> > >>>>> Spotless just makes us productive by auto fixing all the checkstyle
> > >>>>> violations, without having to manually fix by hand.
> > >>>>>
> > >>>>> On Tue, Aug 18, 2020 at 4:42 PM Shiyan Xu <
> > >> xu.shiyan.raym...@gmail.com
> > >>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> I think adding spotless as a tooling command to auto fix code is
> > >>>>> beneficial
> > >>>>>> and nothing harmful.
> > >>>>>> People are recommended to run it before commit or configure it in
> a
> > >>>>>> pre-commit hook.
> > >>>>>> From the CI point of view, it does not change the existing way of
> > >>>>> guarding
> > >>>>>> code style, does it? It'll still just run Checkstyle to report
> > >>> issues.
> > >>>>>> @Vinoth, am I understanding this correctly? Will Spotless be based
> > >> on
> > >>>> the
> > >>>>>> same style configured via Checkstyle?
> > >>>>>>
> > >>>>>> On Tue, Aug 18, 2020 at 4:16 PM vbal...@apache.org <
> > >>> vbal...@apache.org
> > >>>>>
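
For concreteness, wiring spotless into the build could look roughly like
this: a minimal sketch assuming an Eclipse-formatter-based setup, where the
plugin version and formatter file path are placeholders rather than Hudi's
actual pom configuration.

```
<!-- Minimal sketch: spotless-maven-plugin driven by one shared Eclipse
     formatter definition, so IDE, checkstyle and spotless can agree.
     Version and file path are illustrative placeholders. -->
<plugin>
  <groupId>com.diffplug.spotless</groupId>
  <artifactId>spotless-maven-plugin</artifactId>
  <version>2.0.1</version>
  <configuration>
    <java>
      <eclipse>
        <file>${project.basedir}/style/eclipse-java-formatter.xml</file>
      </eclipse>
      <removeUnusedImports/>
    </java>
  </configuration>
</plugin>
```

With something like this in place, `mvn spotless:apply` rewrites sources to
the agreed style, while `mvn spotless:check` can gate CI exactly the way
checkstyle already does.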

Re: [VOTE] Release 0.6.0, release candidate #1

2020-08-21 Thread vino yang
+1 from my side

I checked:

- ran `mvn clean package` [OK]
- ran `mvn test` locally [OK]
- signature [OK]

BTW, where is the link to the release blog?

Best,
Vino

Bhavani Sudha  于2020年8月20日周四 下午12:03写道:

> Hi everyone,
> Please review and vote on the release candidate #1 for the version 0.6.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 7F66CD4CE990983A284672293224F200E1FC2172 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-0.6.0-rc1" [5],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12346663
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.6.0-rc1/
> [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1025/
> [5] https://github.com/apache/hudi/tree/release-0.6.0-rc1
>


Re: Request Contributor Access to JIRA

2020-08-19 Thread vino yang
Hi Jack,

Done and welcome to Hudi community!

Best,
Vino

Jack Ye  于2020年8月20日周四 上午8:34写道:

> Hi,
>
> I would like to request contributor access to the Hudi JIRA, my username is
> jackye.
>
> Thank you very much,
>
> Best,
> Jack Ye
>


Re: JIRA contributor permission

2020-08-19 Thread vino yang
Hi Guoguang,

Done and welcome to Hudi community!

Best,
Vino

Guoguang.Wang  于2020年8月19日周三 下午6:47写道:

> Hi,
> I want to contribute to Apache Hudi.
> Would you please give me the contributor permission?
> My JIRA ID is guoguang.wang


Re: [DISCUSS] Codestyle: force multiline indentation

2020-08-18 Thread vino yang
> the key challenge has been keeping checkstyle, IDE and spotless agreeing
on the same thing.

Yes, that's the key thing. But, IMO, we can ignore the IDE here: if it
breaks the code style, checkstyle will fail the build and spotless will
fix it.

Vinoth Chandar  于2020年8月19日周三 上午7:49写道:

> the key challenge has been keeping checkstyle, IDE and spotless agreeing on
> the same thing.
>
> your understanding is correct. CI will enforce in a similar fashion.
> Spotless just makes us productive by auto fixing all the checkstyle
> violations, without having to manually fix by hand.
>
> On Tue, Aug 18, 2020 at 4:42 PM Shiyan Xu 
> wrote:
>
> > I think adding spotless as a tooling command to auto fix code is
> beneficial
> > and nothing harmful.
> > People are recommended to run it before commit or configure it in a
> > pre-commit hook.
> > From the CI point of view, it does not change the existing way of
> guarding
> > code style, does it? It'll still just run Checkstyle to report issues.
> > @Vinoth, am I understanding this correctly? Will Spotless be based on the
> > same style configured via Checkstyle?
> >
> > On Tue, Aug 18, 2020 at 4:16 PM vbal...@apache.org 
> > wrote:
> >
> > >  +1 on standardizing code formatting. On Tuesday, August 18, 2020,
> > > 03:58:42 PM PDT, Vinoth Chandar  wrote:
> > >
> > >  can more people please chime in?  This will affect all of us on a
> daily
> > > basis :)
> > >
> > > On Thu, Aug 13, 2020 at 8:25 AM Gary Li 
> > wrote:
> > >
> > > > Vote for mvn spotless:apply to do the auto fix.
> > > >
> > > > On Thu, Aug 13, 2020 at 1:13 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Anyone has thoughts on this?
> > > > >
> > > > > esp leesf/vinoyang, given you both drove much of the initial
> > cleanups.
> > > > >
> > > > > On Mon, Aug 10, 2020 at 7:16 PM Shiyan Xu <
> > xu.shiyan.raym...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > in that case, yes, all for automation.
> > > > > >
> > > > > > On Mon, Aug 10, 2020 at 7:12 PM Vinoth Chandar <
> vin...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Overall, I think we should standardize this across the project.
> > > > > > > But most importantly, may be revive the long dormant spotless
> > > effort
> > > > > > first
> > > > > > > to enable autofixing of checkstyle issues, before we add more
> > > > checking?
> > > > > > >
> > > > > > > On Mon, Aug 10, 2020 at 7:04 PM Shiyan Xu <
> > > > xu.shiyan.raym...@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I noticed that throughout the codebase, when method arguments
> > > wrap
> > > > > to a
> > > > > > > new
> > > > > > > > line, there are cases where indentation is 4 and other cases
> > > align
> > > > > the
> > > > > > > > wrapped line to the previous line of argument.
> > > > > > > >
> > > > > > > > The latter is caused by intelliJ settings of "Align when
> > > multiline"
> > > > > > > > enabled. This won't be flagged by checkstyle due to not
> setting
> > > > > > > > *forceStrictCondition* to *true*
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://checkstyle.sourceforge.io/config_misc.html#Indentation_Properties
> > > > > > > >
> > > > > > > > I'm suggesting setting this to true to avoid the discrepancy
> > and
> > > > > > > redundant
> > > > > > > > diffs in PR caused by individual IDE settings. People who
> have
> > > set
> > > > > > "Align
> > > > > > > > when multiline" will need to disable it to pass the
> checkstyle
> > > > > > > validation.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Raymond
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
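
For reference, the change Raymond proposes would be a one-property tweak to
the Indentation module in checkstyle.xml. A sketch is below; the offset
values are illustrative, and only forceStrictCondition is the point:

```
<module name="Indentation">
  <!-- Illustrative offsets; the actual values live in Hudi's checkstyle.xml. -->
  <property name="basicOffset" value="2"/>
  <property name="lineWrappingIndentation" value="4"/>
  <!-- Reject wrapped arguments that merely align with the previous line. -->
  <property name="forceStrictCondition" value="true"/>
</module>
```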


Re: [ANNOUNCE] Hudi Community Weekly Update(2020-08-02 ~ 2020-08-09)

2020-08-09 Thread vino yang
Thanks to leesf for continuously publishing the Hudi weekly update.

It is great to see that more and more improvements are being proposed in
the community.

Best,
Vino

leesf  于2020年8月9日周日 下午11:24写道:

> Dear community,
>
> Nice to share Hudi community weekly update for 2020-08-02 ~ 2020-08-09
> with updates on features and bug fixes.
>
> ===
> Features
>
> [Writer Core] Support for RFC-12/Bootstrapping of external datasets to
> hudi [1]
> [Writer Core] Spark Streaming with async compaction support [2]
> [Spark Integration] Speedup spark read queries by caching metaclient in
> HoodieROPathFilter [3]
> [Metrics] Added a console metrics reporter and associated unit tests. [4]
> [Hive Integration] Abstract hudi-sync-common, and support hudi-hive-sync,
> hudi-dla-sync [5]
> [Writer Core] Parallelize fetching of source data files/partitions [6]
> [Spark Integration] Support Spark Datasource for MOR table - RDD approach
> [7]
> [Writer Core] Implement CLI support for performing bootstrap [8]
> [Metrics] Hudi Supports Prometheus Pushgateway [9]
>
> ===
> Bugs
>
> [Writer Core] lack of insert info in delta_commit inflight [10]
> [DeltaStreamer] Fix Jcommander issue for --hoodie-conf in DeltaStreamer
> [11]
> [DeltaStreamer] Fix NPE when no new data in kafka using
> HoodieDeltaStreamer [12]
>
> [1] https://issues.apache.org/jira/browse/HUDI-242
> [2] https://issues.apache.org/jira/browse/HUDI-575
> [3] https://issues.apache.org/jira/browse/HUDI-1144
> [4] https://issues.apache.org/jira/browse/HUDI-1149
> [5] https://issues.apache.org/jira/browse/HUDI-875
> [6] https://issues.apache.org/jira/browse/HUDI-999
> [7] https://issues.apache.org/jira/browse/HUDI-69
> [8] https://issues.apache.org/jira/browse/HUDI-971
> [9] https://issues.apache.org/jira/browse/HUDI-210
> [10] https://issues.apache.org/jira/browse/HUDI-525
> [11] https://issues.apache.org/jira/browse/HUDI-1140
> [12] https://issues.apache.org/jira/browse/HUDI-1151
>
>
>
> Best,
> Leesf
>


Re: Pushing changes to PRs

2020-08-08 Thread vino yang
Thanks, vc. Great work!

Inspired by this wiki page, I just tried using IntelliJ IDEA to access
GitHub features.


leesf  于2020年8月8日周六 下午5:54写道:

> helpful and thanks for writing up.
>
> Vinoth Chandar  于2020年8月8日周六 下午12:53写道:
>
> > Hello all,
> >
> > A few people have asked me this on separate occasions, so I thought I'd
> > add a wiki page on how to check out and push changes to PRs. It would be
> > useful for all committers.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Resources#Resources-PushingChangesToPRs
> >
> >
> > Thanks
> > vinoth
> >
>


Re: Need contributor permissions

2020-08-06 Thread vino yang
Hi Cheshta,

Welcome, and I have given you contributor permission.

Best,
Vino

Cheshta Sharma  于2020年8月6日周四 下午5:55写道:

> Hi,
>
> I want to contribute to Apache Hudi. Please give me relevant jira
> permissions.
>
> Jira user - cheshta2904.
>


Re: Recording link to Apache Hudi - Design/Code Walkthrough Session

2020-08-04 Thread vino yang
Hi Sivabalan,

Thanks for sharing, great job!

Best,
Vino

Pratyaksh Sharma  于2020年8月4日周二 下午1:28写道:

> Great. Thank you for sharing.
>
> On Mon, Aug 3, 2020 at 7:50 PM Sivabalan  wrote:
>
> > Hey folks,
> > Last week we had a design and code walkthrough session for Apache Hudi
> > from Vinoth Chandar. Here is the recording link.
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Release 0.6.0 timelines

2020-07-29 Thread vino yang
+1 on Sudha being RM for the release. And looking forward to 0.6.0.

Best,
Vino

leesf  于2020年7月30日周四 上午9:15写道:

> +1 on Sudha on being RM, and PR#1810
> https://github.com/apache/hudi/pull/1810 (abstract hive sync module) would
> also go into this release.
>
> Sivabalan  于2020年7月30日周四 上午2:18写道:
>
> > +1 on Sudha being RM for the release. Makes sense to push the release by
> a
> > week.
> >
> > On Wed, Jul 29, 2020 at 1:35 AM vbal...@apache.org 
> > wrote:
> >
> > >  +1 on Sudha on being RM for this release. Also agree on pushing the
> > > release date by a week.
> > > Balaji.V
> > > On Tuesday, July 28, 2020, 10:08:41 PM PDT, Bhavani Sudha <
> > > bhavanisud...@gmail.com> wrote:
> > >
> > >  Thanks Vinoth for the update. I can volunteer to RM this release.
> > >
> > > Understand the 0.6.0 release is delayed compared to what we originally
> > > discussed. Q2 has been really hard with COVID and everything going on.
> > > Given that we are at this point, if delaying the RC by a week or so
> > > more lets us get some of the 'At risk' items in, I would vote for
> > > that. That is just my personal opinion. I'll let others chime in.
> > >
> > > Thanks,
> > > Sudha
> > >
> > > On Tue, Jul 28, 2020 at 9:48 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hello all,
> > > >
> > > > Just wanted to kickstart a thread to firm up the RC cut date for
> 0.6.0
> > > and
> > > > pick a RM. (any volunteers?, if not I self nominate myself)
> > > >
> > > > Here's an update on where we are at with the remaining release
> > blockers.
> > > I
> > > > have marked items as "At risk" assuming we cut RC sometime next week.
> > > > Please chime in with your thoughts. Ideally, we don't take any more
> > > > blockers. If we also want to knock off the at risk items, then we
> would
> > > > at-least push dates by another week (my guess).
> > > >
> > > > 0.6.0 Release blocker status (board
> > > > <
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=397&projectKey=HUDI&view=detail&selectedIssue=HUDI-69
> > > > >)
> > > > ,
> > > >
> > > >- Spark Datasource/MOR https://github.com/apache/hudi/pull/1848
> > needs
> > > > to
> > > >be tested by gary/balaji
> > > >- Bootstrap
> > > >  - Vinoth working on code review, tests for PR 1876,
> > > >  - then udit will rework PR 1702
> > > >  - then we will review, land PR 1870, 1869
> > > >  - Also need to fix HUDI-999, HUDI-1021
> > > >- Bulk insert V2 PR 1834, lower risk, independent PR, well tested
> > > > already
> > > >  - Dependent PR 1149 to be landed,
> > > >  - and modes to be respected in V2 impl as well (At risk)
> > > >- Upgrade Downgrade Hooks, PR 1858 : Siva has a PR out, code
> > > completing
> > > >this week
> > > >- HUDI-1054- Marker list perf improvement, Udit has a PR out
> > > >- HUDI-115 : Overwrite with... ordering issue, Sudha has a PR
> > nearing
> > > >landing
> > > >- HUDI-1098 : Marker file issue with non-existent files. Siva to
> > begin
> > > >impl
> > > >- Spark Streaming + Async Compaction , test complete, code review
> > > >comments and land PR 1752
> > > >- Spark DataSource/Hive MOR Incremental Query HUDI-920 (At risk)
> > > >- Flink/Multi Engine refactor, will need a large rebase and
> rework,
> > > >review, land (At risk for 0.6.0, high scope, may not have enough
> > time)
> > > >- BloomIndex V2 - Global index implementation. (At risk)
> > > >- HUDI-845 : Parallel writing i.e allow multiple writers (At risk)
> > > >- HUDI-860 : Small File Handling without memory caching (At risk)
> > > >
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Adding Metrics to Hudi Common

2020-07-27 Thread vino yang
Hi Modi,

+1 for this proposal.

I agree with your opinion that metrics reporting should not be limited to
the client's metrics.

And we should decouple the implementation of metrics from the client module
so that it could be developed independently.

Best,
Vino

Abhishek Modi  于2020年7月28日周二 上午4:17写道:

> Hi Everyone!
>
> I'm hoping to have a discussion around adding a lightweight metrics class
> to Hudi Common. There are parts of Hudi Common that have large performance
> implications, and I think adding metrics to these parts will help us track
> Hudi's health in production and help us understand the performance
> implications of changes we make.
>
> I've opened a Jira on this topic -
> https://issues.apache.org/jira/browse/HUDI-1025. This jira
> specifically suggests adding HoodieWrapperFileSystem as this class has
> performance implications not just for Hudi, but also for the underlying
> DFS.
>
> Looking forward to everyone's opinions on this :)
>
> Best,
> Modi
>
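
As a strawman for what a "lightweight metrics class" could mean here, a
hypothetical sketch; the class and method names are invented for
illustration and are not the design attached to HUDI-1025:

```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical lightweight registry: lock-free counters keyed by name,
// cheap enough to sit on hot paths like HoodieWrapperFileSystem calls.
public class SimpleMetricsRegistry {
  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  public void increment(String name) {
    counters.computeIfAbsent(name, k -> new LongAdder()).increment();
  }

  public long sum(String name) {
    LongAdder adder = counters.get(name);
    return adder == null ? 0L : adder.sum();
  }
}
```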


Re: Implement an asynchronous write commit callback

2020-07-23 Thread vino yang
+1 as well

We can introduce an async mode to provide a pub-sub model so that we can
integrate with other systems.

Best,
Vino

957029...@qq.com <957029...@qq.com> 于2020年7月23日周四 下午9:05写道:

> +1.
> it's great.
>
>
>
> 957029...@qq.com
>
> From: wangxianghu
> Date: 2020-07-23 21:02
> To: dev@hudi.apache.org
> Subject: Implement an asynchronous write commit callback
> Hi all,
> Currently, a write callback service implemented over HTTP has been merged;
> it issues an HTTP request to inform downstream systems that a write commit
> has been executed successfully. Users can leverage this notification to
> drive cascaded incremental processing.
>
> But this implementation is synchronous: users must consume the
> notification immediately, which might not be what they want.
>
> So, I'd like to implement an asynchronous one, backed by Kafka for
> example, so users can consume the callback message whenever they want,
> with no need to act immediately.
>
> WDYT? Any feedback is appreciated :)
>
> Best
> Mathieu
>
>
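
To make the idea concrete, a hypothetical sketch of a Kafka-backed
callback; the class name and shape are invented for discussion, not the
eventual Hudi API:

```
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical async callback: publish the committed instant to a topic so
// downstream consumers can react whenever they choose.
public class KafkaWriteCommitCallback {
  private final KafkaProducer<String, String> producer;
  private final String topic;

  public KafkaWriteCommitCallback(Properties kafkaProps, String topic) {
    this.producer = new KafkaProducer<>(kafkaProps);
    this.topic = topic;
  }

  // Fire-and-forget publish keyed by table name.
  public void onCommit(String tableName, String commitTime) {
    producer.send(new ProducerRecord<>(topic, tableName, commitTime));
  }
}
```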


Re: Incremental Query missing Deletions

2020-07-16 Thread vino yang
Hi Adam,

Good question; someone has asked a similar question before. [1]

You can view all the comments under that ticket; @Vinoth Chandar has
outlined a low-level solution there.

IMO, this is a good feature that should be supported.

Best,
Vino

[1]: https://issues.apache.org/jira/browse/HUDI-480

Adam Feldman  于2020年7月17日周五 上午2:58写道:

> Hi,
> When querying a table using an incremental query, the result is a table of
> records that have been added or updated between the beginInstantTime and
> the END_INSTANTTIME_OPT_KEY, but deletions are missing. How can we get the
> incremental query to return the full incremental difference, including
> records that were marked as deleted?
>
> Thanks,
>
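
For readers hitting the same question, an incremental read looks roughly
like this. It is a sketch: the table path and instant times are made-up
placeholders, and the option keys follow recent Hudi releases (exact
constants vary by version):

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalReadExample {
  public static Dataset<Row> readIncremental(SparkSession spark) {
    // Pull records added/updated between two instants; deleted records
    // are what the thread above notes as missing from the result.
    return spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20200716000000")
        .option("hoodie.datasource.read.end.instanttime", "20200717000000")
        .load("s3://my-bucket/path/to/hudi_table");
  }
}
```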


Re: request permissions to the contributions group

2020-07-16 Thread vino yang
Hi Younggyu,

Done and welcome!

Best,
Vino

Younggyu Chun  于2020年7月17日周五 上午3:10写道:

> My user name is: younggyuchun
>
> On 2020/07/16 15:34:57, Younggyu Chun  wrote:
> > Hello,
> >
> > Can anyone add me to the contributions group so that I can be assigned
> issues.
> >
> > thank you,
> > Younggyu
> >
>


Re: DISCUSS code, config, design walk through sessions

2020-07-06 Thread vino yang
+1

Adam Feldman  于2020年7月6日周一 下午9:55写道:

> Interested
>
> On Mon, Jul 6, 2020, 08:29 Sivabalan  wrote:
>
> > +1 for sure
> >
> > On Mon, Jul 6, 2020 at 4:42 AM Gurudatt Kulkarni 
> > wrote:
> >
> > > +1
> > > Really a great idea. Will help in understanding the project better.
> > >
> > > On Mon, Jul 6, 2020 at 1:35 PM Pratyaksh Sharma  >
> > > wrote:
> > >
> > > > This is a great idea and really helpful one.
> > > >
> > > > On Mon, Jul 6, 2020 at 1:09 PM  wrote:
> > > >
> > > > > +1
> > > > > It can also attract more partners to join us.
> > > > >
> > > > >
> > > > >
> > > > > On 07/06/2020 15:34, Ranganath Tirumala wrote:
> > > > > +1
> > > > >
> > > > > On Mon, 6 Jul 2020 at 16:59, David Sheard <
> > > > > david.she...@datarefactory.com.au>
> > > > > wrote:
> > > > >
> > > > > > Perfect
> > > > > >
> > > > > > On Mon, 6 Jul. 2020, 1:30 pm Vinoth Chandar, 
> > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > As we scale the community, its important that more of us are
> able
> > > to
> > > > > help
> > > > > > > users, users becoming contributors.
> > > > > > >
> > > > > > > In the past, we have drafted faqs, trouble shooting guides.
> But I
> > > > feel
> > > > > > > sometimes, more hands on walk through sessions over video could
> > > help.
> > > > > > >
> > > > > > > I am happy to spend 2 hours each on code/configs,
> > > > > > design/perf/architecture.
> > > > > > > Have the session be recorded as well for future.
> > > > > > >
> > > > > > > What does everyone think?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Ranganath Tirumala
> > > > >
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: Apply for Confluence permission

2020-07-02 Thread vino yang
Hi Yajun,

Done and welcome!

You should have the Confluence permission.

Best,
Vino

Yajun Luo  于2020年7月3日周五 下午2:02写道:

> Hi,
>
> I want to contribute to Apache Hudi.
> Would you please give me the contributor permission?
> My Confluence ID is luoyajun.
>
> Best,
> Yajun Luo
>


Re: request the contributor permission

2020-07-02 Thread vino yang
Hi linshan,

I have checked it. You should have the jira contributor permission.

Best,
Vino

linshan  于2020年7月2日周四 下午5:43写道:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is linshan.


Re: request the contributor permission

2020-06-28 Thread vino yang
Hi,

Done and welcome!

Best,
Vino

胡宪洋  于2020年6月28日周日 下午10:31写道:

> Hi,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is hxysea.
>
>
>
>
> thank you
>
>
>
>
>
>


Re: [DISCUSS] Introduce a write committed callback hook

2020-06-22 Thread vino yang
Hi everyone,

Thanks for sharing your thoughts.

We have created a Jira issue to track this work.[1]

Best,
Vino

[1]: https://issues.apache.org/jira/browse/HUDI-1037

Vinoth Chandar  于2020年6月23日周二 上午6:38写道:

> Great, looks like a JIRA is in order? :), given we all agree
> enthusiastically
>
> On Sun, Jun 21, 2020 at 8:10 PM Gary Li  wrote:
>
> > +1.
> > That would be great to have a communication mechanism between downstream
> > CDC applications chain.
> > e.g. A->B->C->D. Right now I am using the commit timestamp to identify
> > whether a new commit has come in. But if I need to recompute app B,
> > it’s difficult for C and D to be aware they have to recompute as well,
> > especially when the triggering frequencies are different.
> >
> > On Sun, Jun 21, 2020 at 6:11 PM hddong  wrote:
> > +1. a great feature.
> >
> > Sivabalan  于2020年6月22日周一 上午7:50写道:
> >
> > > +1. would be a nice addition.
> > >
> > > On Sun, Jun 21, 2020 at 12:02 PM vbal...@apache.org  wrote:
> > >
> > > >
> > > > +1. This would be a really good feature to have when building
> dependent
> > > > ETL pipelines.
> > > >
> > > > On Friday, June 19, 2020, 05:13:45 PM PDT, vino yang <
> > > > vinoy...@apache.org> wrote:
> > > >
> > > >  Hi all,
> > > >
> > > > Currently, we have a need to incrementally process and build a new
> > table
> > > > based on an original hoodie table. We expect that after a new commit
> is
> > > > completed on the original hoodie table, it could be retrieved ASAP,
> so
> > > that
> > > > it can be used for incremental view queries. Based on the existing
> > > > capabilities, one approach we can use is to continuously poll
> Hoodie's
> > > > Timeline to check for new commits. This is a very common approach,
> > > > but it wastes resources unnecessarily.
> > > >
> > > > We expect to introduce a proactive notification(event callback)
> > > mechanism.
> > > > For example, a hook can be introduced after a successful commit.
> > External
> > > > processors interested in the commit, such as scheduling systems, can
> > use
> > > > the hook as their own trigger. When a certain commit is completed,
> the
> > > > scheduling system can pull up the task of obtaining incremental data
> > > > through the API in the callback. Thereby completing the processing of
> > > > incremental data.
> > > >
> > > > There is currently a `postCommit` method in Hudi's client module,
> > > > whose existing implementation is mainly used for compaction and
> > > > cleanup after a commit. However, it is triggered a little too early:
> > > > it runs before everything is processed, and we found that an
> > > > exception at that point may still cause the commit to be rolled
> > > > back. We need to find a new place to trigger this hook to ensure
> > > > that the commit is deterministic.
> > > >
> > > > This comes from one of our own scenarios; combined with incremental
> > > > queries, it will be a very useful feature that makes incremental
> > > > processing more timely.
> > > >
> > > > We hope to hear what the community thinks of this proposal. Any
> > comments
> > > > and opinions are appreciated.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: [DISCUSS] Publishing benchmarks for releases

2020-06-21 Thread vino yang
+1 as well,

it would be helpful for comparing performance across different versions.

Shiyan Xu  于2020年6月22日周一 上午8:37写道:

> +1 definitely useful info.
>
> On Sun, Jun 21, 2020 at 4:56 PM Sivabalan  wrote:
>
> > Hey folks,
> > Is it a common practise to publish benchmarks for releases? I have
> put
> > up an initial PR  to add jmh
> > benchmark support to a couple of Hudi operations. If the community feels
> > positive on publishing benchmarks, we can add support for more operations
> > and for every release, we could publish some benchmark numbers.
> >
> > --
> > Regards,
> > -Sivabalan
> >
>
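
For anyone unfamiliar with JMH, a minimal sketch of what such a benchmark
looks like; the benchmarked operation here is a stand-in, not one of the
operations in the PR above:

```
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ExampleBenchmark {
  // Stand-in workload; a real Hudi benchmark would exercise write/read paths.
  @Benchmark
  public String generateRecordKey() {
    return UUID.randomUUID().toString();
  }
}
```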


Re: [DISCUSS] Regarding nightly builds

2020-06-21 Thread vino yang
+1 as well,

Currently, I am waiting for the hudi-test-suite to be merged into the
master branch, so that whenever a new PR is merged into master, the
"hudi-test-suite" on master can be triggered on Azure Pipelines more
easily.

Sharing more information here:

Now, there is a repository for hudi-ci, which is being used to experiment
with connecting to Azure Pipelines. [1]

And our reference example is Flink's Azure Pipelines migration [2].

Best,
Vino

[1]: https://github.com/apachehudi-ci
[2]:
https://cwiki.apache.org/confluence/display/FLINK/2020/03/22/Migrating+Flink%27s+CI+Infrastructure+from+Travis+CI+to+Azure+Pipelines

Vinoth Chandar  于2020年6月21日周日 下午10:27写道:

> Hi Sudha,
>
> Thanks for getting this kicked off..  +1 on a new nightly build process..
> This will help us more easily make the bleeding edge testable..
>
> My initial thoughts here are
>
> - Figure out a way to get Azure Pipelines enabled for Hudi
> - Set up the nightly there (this will also help us transition off travis
> slowly over time)
> - We can leverage the hudi-test-suite that nishith/vinoyang have been
> working on, add tons of more scenarios to test every night
>
> Knowing the software is stable on a daily basis and having warning flags
> would help us make smoother releases as well.
>
> Others, please chime in as well..
>
> thanks
> vinoth
>
>
>
>
>
> On Thu, Jun 18, 2020 at 10:10 PM Bhavani Sudha 
> wrote:
>
> > Hello all,
> >
> > Should we have nightly builds that way we can point users to those builds
> > for the latest features introduced, instead of being blocked on the next
> > release. Also this kind of gives an early feedback on new features or
> fixes
> >  if any further improvements are needed.  Does anyone know if and how
> other
> > Apache projects handle nightly builds?
> >
> > Thanks,
> > Sudha
> >
>


[DISCUSS] Introduce a write committed callback hook

2020-06-19 Thread vino yang
Hi all,

Currently, we have a need to incrementally process and build a new table
based on an original hoodie table. We expect that after a new commit is
completed on the original hoodie table, it could be retrieved ASAP, so that
it can be used for incremental view queries. Based on the existing
capabilities, one approach we can use is to continuously poll Hoodie's
Timeline to check for new commits. This is a very common approach, but it
wastes resources unnecessarily.

We expect to introduce a proactive notification(event callback) mechanism.
For example, a hook can be introduced after a successful commit. External
processors interested in the commit, such as scheduling systems, can use
the hook as their own trigger. When a certain commit is completed, the
scheduling system can pull up the task of obtaining incremental data
through the API in the callback. Thereby completing the processing of
incremental data.

There is currently a `postCommit` method in Hudi's client module, whose
existing implementation is mainly used for compaction and cleanup after a
commit. However, it is triggered a little too early: it runs before
everything is processed, and we found that an exception at that point may
still cause the commit to be rolled back. We need to find a new place to
trigger this hook to ensure that the commit is deterministic.

This comes from one of our own scenarios; combined with incremental
queries, it will be a very useful feature that makes incremental
processing more timely.

We hope to hear what the community thinks of this proposal. Any comments
and opinions are appreciated.

Best,
Vino
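
To illustrate the proposal, a hypothetical sketch of such a post-commit
hook; the interface and names are invented for discussion and are not an
agreed API:

```
// Hypothetical hook: invoked only once a commit is durable on the timeline
// and can no longer be rolled back.
public interface WriteCommitCallback {
  void onCommitCompleted(String tableName, String commitInstant);
}

// Example consumer: a scheduler that kicks off a downstream incremental job.
class SchedulerTriggerCallback implements WriteCommitCallback {
  @Override
  public void onCommitCompleted(String tableName, String commitInstant) {
    // Enqueue an incremental-pull job that reads data since commitInstant.
    System.out.println("Trigger incremental job: " + tableName
        + " @ " + commitInstant);
  }
}
```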


Re: [ANNOUNCE] Apache Hudi 0.5.3 released

2020-06-17 Thread vino yang
Great job!

Thanks for your hard work, Siva and Sudha!

Best,
Vino

nishith agarwal  于2020年6月18日周四 上午11:09写道:

> Great job Siva and Sudha, thanks for driving this!
>
> -Nishith
>
> On Wed, Jun 17, 2020 at 7:16 PM  wrote:
>
> > Super news :)  The very first release after graduation. Awesome job Siva
> > and Sudha for spearheading the release of 0.5.3.
> > Balaji.V
> >
> > Sent from Yahoo Mail for iPhone
> >
> >
> > On Wednesday, June 17, 2020, 5:50 PM, Sivabalan 
> > wrote:
> >
> > The Apache Hudi community is pleased to announce the release of Apache
> Hudi
> > 0.5.3.
> >
> >
> >
> > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> > Incrementals. Apache Hudi manages storage of large analytical datasets on
> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage) and
> > provides the ability to update/delete records as well as capture changes.
> >
> >
> >
> > 0.5.3 is a bug fix release and is the first release after graduating as
> > TLP. It includes more than 35 resolved issues, comprising general
> > improvements and bug-fixes. Hudi 0.5.3 enables Embedded Timeline Server
> and
> > Incremental Cleaning by default for both delta-streamer and spark
> > datasource writes. Apart from multiple bug fixes, this release also
> > improves write performance like avoiding unnecessary loading of data
> after
> > writes and improving parallelism while searching for existing files for
> > writing new records.
> >
> >
> >
> > For details on how to use Hudi, please look at the quick start page
> located
> > at https://hudi.apache.org/docs/quick-start-guide.html
> >
> > If you'd like to download the source release, you can find it here:
> >
> > https://github.com/apache/hudi/releases/tag/release-0.5.3
> >
> > You can read more about the release (including release notes) here:
> >
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12348256
> >
> >
> >
> > We would like to thank all contributors, the community, and the Apache
> > Software Foundation for enabling this release and we look forward to
> > continued collaboration. We welcome your help and feedback. For more
> > information on how to report problems, and to get involved, visit the
> > project website at: http://hudi.apache.org/
> >
> >
> > Kind regards,
> >
> > Sivabalan Narayanan (Hudi 0.5.3 Release Manager)
> >
> > On behalf of the Apache Hudi
> >
> >
> >
> >
>


Re: [VOTE] Release 0.5.3, release candidate #2

2020-06-11 Thread vino yang
+1  (binding)

* compiled the source code [OK]
* ran all the tests (except IT tests) locally [OK]
* checked release note [OK]

Best,
Vino

hddong  于2020年6月12日周五 上午8:58写道:

> +1
>
> Mehrotra, Udit  于2020年6月12日周五 上午7:06写道:
>
> > +1 (non-binding)
> >
> > - Integration tests succeeded locally
> > - Release validation script succeeded:
> > Checking Signature
> > -e  Signature Check - [OK]
> >
> > Checking for binary files in source release
> > -e  No Binary Files in Source Release? - [OK]
> >
> > Checking for DISCLAIMER
> > -e  DISCLAIMER file exists ? [OK]
> >
> > Checking for LICENSE and NOTICE
> > -e  License file exists ? [OK]
> > -e  Notice file exists ? [OK]
> >
> > Performing custom Licensing Check
> > -e  Licensing Check Passed [OK]
> >
> > Running RAT Check
> > -e  RAT Check Passed [OK]
> > - Ran some basic COW/MOR tests on EMR cluster with this release's jars.
> >
> > Thanks,
> > Udit
> >
> > On 6/11/20, 1:51 PM, "nishith agarwal"  wrote:
> >
> > CAUTION: This email originated from outside of the organization. Do
> > not click links or open attachments unless you can confirm the sender and
> > know the content is safe.
> >
> >
> >
> > +1 (binding)
> >
> > - Ran tests locally
> > - Release script successful
> >
> > Checking Signature
> >
> > Signature Check - [OK]
> >
> >
> > Checking for binary files in source release
> >
> > No Binary Files in Source Release? - [OK]
> >
> >
> > Checking for DISCLAIMER
> >
> > DISCLAIMER file exists ? [OK]
> >
> >
> > Checking for LICENSE and NOTICE
> >
> > License file exists ? [OK]
> >
> > Notice file exists ? [OK]
> >
> >
> > Performing custom Licensing Check
> >
> > Licensing Check Passed [OK]
> >
> >
> > Running RAT Check
> >
> > RAT Check Passed [OK]
> >
> > Thanks,
> > Nishith
> >
> > On Thu, Jun 11, 2020 at 10:55 AM vbal...@apache.org <
> > vbal...@apache.org>
> > wrote:
> >
> > >
> > > +1(binding)
> > > 1. Ran integration tests locally
> > > 2. Manually reviewed the commits landing into 0.5.3 by comparing
> > > against 0.5.2
> > > 3. Ran deltastreamer in continuous mode with async compaction for a
> > > couple of hours on test data and verified no errors
> > > 4. Release validation script passed locally:
> > > ```
> > > varadarb-C02SH0P1G8WL:scripts varadarb$
> > > ./release/validate_staged_release.sh --release=0.5.3 --rc_num=2
> > >
> > > /tmp/validation_scratch_dir_001 ~/projects/new_ws/hudi/scripts
> > >
> > > Checking Checksum of Source Release
> > >
> > >   Checksum Check of Source Release - [OK]
> > >
> > >
> > >
> > >
> > >   % Total% Received % Xferd  Average Speed   TimeTime
> >  Time
> > > Current
> > >
> > >  Dload  Upload   Total   Spent
> > Left
> > > Speed
> > >
> > > 100 26722  100 267220 0  48234  0 --:--:-- --:--:--
> > --:--:--
> > > 48234
> > >
> > > Checking Signature
> > >
> > >   Signature Check - [OK]
> > >
> > >
> > >
> > >
> > > Checking for binary files in source release
> > >
> > >   No Binary Files in Source Release? - [OK]
> > >
> > >
> > >
> > >
> > > Checking for DISCLAIMER
> > >
> > >   DISCLAIMER file exists ? [OK]
> > >
> > >
> > >
> > >
> > > Checking for LICENSE and NOTICE
> > >
> > >   License file exists ? [OK]
> > >
> > >   Notice file exists ? [OK]
> > >
> > >
> > >
> > >
> > > Performing custom Licensing Check
> > >
> > >   Licensing Check Passed [OK]
> > >
> > >
> > >
> > >
> > > Running RAT Check
> > >
> > >   RAT Check Passed [OK]
> > >
> > >
> > >
> > >
> > > ~/projects/new_ws/hudi/scripts
> > >
> > > varadarb-C02SH0P1G8WL:scripts varadarb$ echo $?
> > >
> > > 0
> > > ```
> > > On Wednesday, June 10, 2020, 02:57:35 PM PDT, Sivabalan <
> > > n.siv...@gmail.com> wrote:
> > >
> > >  Hi everyone,
> > >
> > > Please review and vote on the release candidate #2 for the version
> > 0.5.3,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > >
> > >  The complete staging area is available for your review, which
> > includes:
> > >
> > > * JIRA release notes [1],
> > > * the official Apache source release and binary convenience
> releases
> > to be
> > > deployed to dist.apache.org [2], which are signed with the key
> with
> > > fingerprint 001B66FA2B2543C151872CCC29A4FD82F1508833 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> 

Re: Apply for contributor permission.

2020-06-07 Thread vino yang
Hi Dillon,

Done and welcome!

Best,
Vino

Dillon Zhang  于2020年6月7日周日 下午9:46写道:

> *Hi,*
>
> *I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA ID is "*ZhangZhanchun*"**.*
>
> *Best regards.*
>


Re: want to develop

2020-06-07 Thread vino yang
Hi xiaofeng,

Done and welcome!

Best,
Vino

黄晓峰  于2020年6月7日周日 下午1:15写道:

> jira :hanchen168482
>


Re: TLP Announcement

2020-06-04 Thread vino yang
Great news!

Thanks to the whole community!

Best,
Vino

Pratyaksh Sharma  于2020年6月4日周四 下午11:23写道:

> That is great news.
>
> On Thu, Jun 4, 2020 at 7:58 PM Vinoth Chandar  wrote:
>
> > Hello all,
> >
> > The ASF press release announcing Apache Hudi as TLP is live! Thanks for
> all
> > your contributions! We could not have been achieved that without such a
> > great community effort!
> >
> > Please help spread the word!
> >
> > - GlobeNewswire
> >
> >
> http://www.globenewswire.com/news-release/2020/06/04/2043732/0/en/The-Apache-Software-Foundation-Announces-Apache-Hudi-as-a-Top-Level-Project.html
> >  - ASF "Foundation" blog https://s.apache.org/odtwv
> >  - @TheASF twitter feed
> > https://twitter.com/TheASF/status/1268528110959497217
> >  - The ASF on LinkedIn
> > https://www.linkedin.com/company/the-apache-software-foundation
> >
> > Thanks
> > Vinoth
> >
>

