[ANNOUNCE] Apache HUDI 0.14.0 released

2023-10-04 Thread Prashant Wason
The Apache Hudi team is pleased to announce the release of Apache Hudi
0.14.0.

Apache Hudi is a transactional data lake platform that brings database and
data warehouse capabilities to the data lake. Hudi reimagines slow
old-school batch data processing with a powerful new incremental processing
framework for low latency minute-level analytics.

This release comes 4 months after 0.13.1. It includes more than 405
resolved issues, new features, as well as general improvements and
bug fixes. Some of the exciting features include:

   1. The introduction of the Record Level Index
   2. Support for Hudi tables with auto-generated keys
   3. The hudi_table_changes function for incremental reads (see the sketch
   after this list)
   4. Support for Spark 3.4 and Flink 1.17
   5. New feature support for Flink - consistent hashing index, Update and
   Delete statement support
   6. Spark read and write side enhancements
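
As a quick illustration of item 3, here is a minimal sketch of an incremental
read through the new hudi_table_changes table-valued function. Spark 3.4 with
the Hudi 0.14.0 Spark bundle is assumed, and hudi_db.trips is a placeholder
table name; see the release notes for the authoritative syntax.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TableChangesExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-table-changes-demo")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
        .getOrCreate();

    // 'latest_state' returns the latest state of records changed since the
    // given instant; 'earliest' starts from the beginning of the timeline.
    Dataset<Row> changes = spark.sql(
        "SELECT * FROM hudi_table_changes('hudi_db.trips', 'latest_state', 'earliest')");
    changes.show(false);
  }
}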


Please review the release notes
<https://hudi.apache.org/releases/release-0.14.0> for details on release
highlights, breaking changes, and behavior changes before adopting the
0.14.0 release. If you'd like to download the source release, you can find
it here: https://github.com/apache/hudi/releases/tag/release-0.14.0

For details on how to use Hudi, please look at the quick start page located
at https://hudi.apache.org/docs/quick-start-guide.html

We welcome your help and feedback. For more information on how to report
problems, and to get involved, visit the project website at
https://hudi.apache.org/


Thanks to everyone involved!
Prashant Wason


[RESULT] [VOTE] Release 0.14.0, release candidate #3

2023-09-25 Thread Prashant Wason
Hello devs,

I'm happy to announce that we have unanimously approved this release.

There are 14 approving votes, 7 of which are binding:

(binding)

Vinoth Chandar
Bhavani Sudha
Y Ethan Guo
Balaji Varadarajan
Udit Mehrotra
Nishith Agarwal
Shiyan Xu

(non-binding)
Aditya Goenka
Sagar Sumit
Jonathan Vexler
Lokesh Jain
Amrish Lal
Hussein Awala
Shawn Chang


There are no disapproving votes.

Voting Thread:
https://lists.apache.org/thread/s3cshvlmg01rqpow80qoot6ndg4jgxwc


Thanks everyone!
Prashant Wason


[VOTE] Release 0.14.0, release candidate #3

2023-09-19 Thread Prashant Wason
Hi everyone,

Please review and vote on the *release candidate #3* for the version
0.14.0, as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org
<https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc3/> [2], which
are signed with the key with
fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "0.14.0-rc3" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Prashant Wason



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc3/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1127/

[5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc3
<https://github.com/apache/hudi/releases/tag/release-0.14.0-rc3>


Re: [VOTE] Release 0.14.0, release candidate #2

2023-09-14 Thread Prashant Wason
Hello Everyone,

Since sending out the RC2, a critical issue has been reported and fixed.
https://github.com/apache/hudi/pull/9711

I will be working on RC3 to incorporate the above fix. I encourage everyone
to test RC2 and report any other issues that you encounter.

Thanks for helping us release 0.14.

Prashant



On Thu, Sep 14, 2023 at 7:47 AM Vinoth Chandar  wrote:

> For all, link [2] should be
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc2/
>
> On Wed, Sep 13, 2023 at 11:53 AM Prashant Wason 
> wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the *release candidate #2* for the version
> > 0.14.0, as follows:
> >
> > [ ] +1, Approve the release
> >
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> >
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org
> > <https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc2/> [2],
> which
> > are signed with the key with
> > fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],
> >
> > * all artifacts to be deployed to the Maven Central Repository [4],
> >
> > * source code tag "0.14.0-rc2" [5],
> >
> >
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> >
> >
> > Thanks,
> >
> > Prashant Wason
> >
> >
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700
> >
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/
> >
> > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> >
> > [4]
> https://repository.apache.org/content/repositories/orgapachehudi-1126/
> >
> > [5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc2
> > <https://github.com/apache/hudi/releases/tag/release-0.14.0-rc2>
> >
>


[VOTE] Release 0.14.0, release candidate #2

2023-09-13 Thread Prashant Wason
Hi everyone,

Please review and vote on the *release candidate #2* for the version
0.14.0, as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org
<https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc2/> [2], which
are signed with the key with
fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "0.14.0-rc2" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Prashant Wason



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1126/

[5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc2
<https://github.com/apache/hudi/releases/tag/release-0.14.0-rc2>


Re: [VOTE] Release 0.14.0, release candidate #1

2023-08-28 Thread Prashant Wason
Thanks for your feedback, Danny. I will work on RC2 with these critical
fixes today.

Thanks
Prashant


On Thu, Aug 24, 2023, 2:41 PM Danny Chan  wrote:

> -1 for some critical fixes:
>
> I saw some critical fixes on the master:
>
> 1. https://github.com/apache/hudi/pull/9483
> 2. https://github.com/apache/hudi/pull/9499 (very critical)
> 3. https://github.com/apache/hudi/pull/9467
> 4. https://github.com/apache/hudi/pull/9511
>
> Number 2 is very critical; it fixes the class-not-found error for the bundle jar.
> And I'm not sure whether https://github.com/apache/hudi/pull/9477
> should be included. I also see some open PRs tagged with
> "release-0.14.0" and "blocker";
> should we clear those too?
>
> Best,
> Danny
>
> Prashant Wason wrote on Thu, Aug 24, 2023 at 23:51:
> >
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> 0.14.0,
> > as follows:
> >
> > [ ] +1, Approve the release
> >
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> >
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org
> > <https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/> [2],
> which
> > are signed with the key with
> > fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],
> >
> > * all artifacts to be deployed to the Maven Central Repository [4],
> >
> > * source code tag "0.14.0-rc1" [5],
> >
> >
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> >
> >
> > Thanks,
> >
> > Prashant Wason
> >
> >
> >
> > [1]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700
> >
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/
> >
> > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> >
> > [4]
> https://repository.apache.org/content/repositories/orgapachehudi-1125/
> >
> > [5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc1
>


[VOTE] Release 0.14.0, release candidate #1

2023-08-24 Thread Prashant Wason
Hi everyone,

Please review and vote on the release candidate #1 for the version 0.14.0,
as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org
<https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/> [2], which
are signed with the key with
fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "0.14.0-rc1" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Prashant Wason



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1125/

[5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc1


About 0.14 Release Timeline

2023-08-01 Thread Prashant Wason
Hello Everyone,

I wanted to update you all on the 0.14 release progress. We passed the
earlier deadline of July 15 but since then we have fixed a lot of issues
found in testing and merged many open PRs.

We are now aiming for this coming weekend (Aug 5) as the feature freeze
date and to have the RC build ready in 2 weeks.

Thanks
Prashant Wason
RM for 0.14.0


Re: Record level index with not unique keys

2023-07-13 Thread Prashant Wason
Hi Nicolas,

The RLI feature is designed for maximum performance, as it operates at
record-count scale. Hence, the schema is simplified and minimized.

With non-unique keys, how would tagging of records (for updates / deletes)
work? How would the record index know which mapping in the array to return
for a given record key?
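
To make the ambiguity concrete, a hypothetical sketch (the class and field
names are illustrative only, not Hudi's actual metadata-table types):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FileLocation {
  String partition;
  long fileIdHighBits;
  long fileIdLowBits;
  int fileIndex;
  long instantTime;
}

class RecordIndexSketch {
  // With globally unique keys, one record key maps to exactly one file
  // slice, so an incoming upsert/delete can be tagged deterministically.
  Map<String, FileLocation> uniqueIndex = new HashMap<>();

  // With non-unique keys the mapping becomes one-to-many; for an incoming
  // update there is no way to pick the right location without extra
  // disambiguating state (e.g. partition path or an event timestamp).
  Map<String, List<FileLocation>> nonUniqueIndex = new HashMap<>();
}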

Thanks
Prashant



On Wed, Jul 12, 2023 at 2:02 AM nicolas paris 
wrote:

> hi there,
>
> Just tested preview of RLI (rfc-08), amazing feature. Soon the fast COW
> (rfc-68) will be based on RLI to get the parquet offsets and allow
> targeting parquet row groups.
>
> RLI is a global index, therefore it assumes the hudi key is present in
> at most one parquet file. As a result in the MDT, the RLI is of type
> struct, and there is a 1:1 mapping w/ a given file.
>
> Type:
>|-- recordIndexMetadata: struct (nullable = true)
>||-- partition: string (nullable = false)
>||-- fileIdHighBits: long (nullable = false)
>||-- fileIdLowBits: long (nullable = false)
>||-- fileIndex: integer (nullable = false)
>||-- instantTime: long (nullable = false)
>
> Content:
>|event_id:1|{part=3, -6811947225812876253,
> -7812062179961430298, 0, 1689147210233}|
>
> We would love to use both RLI and FCOW features, but I'm afraid our
> keys are not unique in our kafka archives. Same key might be present
> in multiple partitions, and even in multiple slices within partitions.
>
> I wonder if, in the future, RLI could support multiple parquet files (by
> storing an array of structs, for example). This would enable leveraging RLI
> in more contexts
>
> Thx
>
>
>
>
>


About 0.14.0 Release Timeline

2023-06-21 Thread Prashant Wason
Hello Everyone,

I would like to start the discussion on the 0.14.0 release timeline. How
about Jun 30 for feature freeze and July 15 for creating the release
branch?


Thanks
Prashant Wason
RM for 0.14.0


Re: Calling for 0.14.0 Release Manager

2023-05-03 Thread Prashant Wason
I volunteer to drive the 0.14.0.

Thanks
Prashant


On Wed, May 3, 2023 at 1:28 PM Sivabalan  wrote:

> It's been a few months since we released 0.13.0. It's time to start
> preparing for the next major release. Can we have a volunteer to
> drive the 0.14.0 release?
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Prashant Wason
Could be useful. It may also be useful for a backup / replication scenario
(keeping a copy of data in an alternate/cloud DC).

HoodieDeltaStreamer already has the concept of "sources". This can be
implemented as a "sink" concept.
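
A hypothetical sketch of what such a sink abstraction might look like (every
name below is invented for illustration; no such API exists in Hudi today):

import org.apache.avro.generic.GenericRecord;
import org.apache.spark.api.java.JavaRDD;

// Mirror image of DeltaStreamer's Source: each implementation writes one
// incrementally pulled batch to a target system (Kafka, JDBC, DFS, ...).
public abstract class ReverseStreamSink {

  // 'batch' holds the records pulled since the last checkpoint; 'checkpoint'
  // is the Hudi instant to persist on success so the next run can resume the
  // incremental pull from there.
  public abstract void sink(JavaRDD<GenericRecord> batch, String checkpoint);
}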

On Thu, Mar 30, 2023 at 8:12 PM Vinoth Chandar  wrote:

> Essentially.
>
> Old architecture :(operational database) ==> some tool ==> (data
> warehouse raw data) ==> SQL ETL ==> (data warehouse derived data)
>
> New architecture : (operational database) ==> Hudi delta Streamer ==> (Hudi
> raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi Reverse
> Streamer ==> (Data Warehouse/Kafka/Operational Database)
>
> On Thu, Mar 30, 2023 at 8:09 PM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > Any interest in building a reverse streaming tool, that does the reverse
> > of what the DeltaStreamer tool does? It will read Hudi table
> incrementally
> > (only source) and write out the data to a variety of sinks - Kafka, JDBC
> > Databases, DFS.
> >
> > This has come up many times with data warehouse users. Oftentimes, they
> > want to use Hudi to speed up or reduce costs on their data ingestion and
> > ETL (using Spark/Flink), but want to move the derived data back into a
> data
> > warehouse or an operational database for serving.
> >
> > What do you all think?
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Prashant Wason
+1 for incremental builds with a build cache, which will be a huge
productivity boost, especially when working with multiple branches at the
same time.

Prashant


On Mon, Oct 3, 2022 at 11:42 PM Alexey Kudinkin  wrote:

> I think full project build slowly gravitates towards 15min already (it’s
> about 12-14min on my 2021 Macbook).
>
> @Vinoth the most important aspect that Maven couldn’t provide us with are
> local incremental builds. Currently you have to build full dependency
> hierarchy of the project whenever you’re changing even a single file.
> There’re some limited workarounds but they aren’t really a replacement for
> fully incremental builds.
>
> Fully incremental builds will be a huge boost to Dev productivity.
>
> On Sun, Oct 2, 2022 at 11:40 PM Pratyaksh Sharma 
> wrote:
>
> > My two cents. I have seen open source projects take more than 20-25
> minutes
> > for building on maven, so I guess we are fine for now. But we can
> > definitely investigate and try to optimize if we can.
> >
> > On Sun, Oct 2, 2022 at 9:33 AM Shiyan Xu 
> > wrote:
> >
> > > Yes, Vinoth, agree on the efforts and impact being big.
> > >
> > > Some perf comparison on gradle vs maven can be found in
> > > https://gradle.org/gradle-vs-maven-performance/ where it claims
> > multi-fold
> > > build time reduction. I'd estimate maybe 2-4 min for a full build based
> > > on that.
> > >
> > > I mainly hope to collect some feedback on if build time is a dev
> > experience
> > > concern or if it's okay for people in general. If it's the latter case,
> > > then no need to investigate further at this point.
> > >
> > > On Sat, Oct 1, 2022 at 1:52 PM Vinoth Chandar 
> wrote:
> > >
> > > > Hi Raymond.
> > > >
> > > > This would be a large undertaking and a big change for everyone.
> > > >
> > > > What does the build time look like if we switch to gradle or bazel?
> And
> > > do
> > > > we know why it takes 10 min to build and why is that not okay? Given
> we
> > > all
> > > > use IDEs mostly anyway
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Fri, Sep 30, 2022 at 22:48 Shiyan Xu  >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to raise a discussion around the build tool for Hudi.
> > > > >
> > > > > Maven has been a mature yet slow (10min to build on 2021 macbook
> pro)
> > > > build
> > > > > tool compared to modern ones like gradle or bazel. We all want
> faster
> > > > > builds, however, we also need to consider the efforts and risks to
> > > > upgrade,
> > > > > and the developers' feedback on usability.
> > > > >
> > > > > What do you all think about upgrading to gradle or bazel? Please
> > share
> > > > your
> > > > > thoughts. Thanks.
> > > > >
> > > > > --
> > > > > Best,
> > > > > Shiyan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>


Re: [VOTE] Move content off cWiki

2021-07-19 Thread Prashant Wason
+1 - Approve the move

On Mon, Jul 19, 2021 at 3:44 PM Vinoth Chandar  wrote:

> Hi all,
>
> Starting a vote based on the DISCUSS thread here [1], to consolidate
> content from cWiki into Github wiki and project's master branch (for design
> docs)
>
> Please chime with a
>
> +1 - Approve the move
> -1  - Disapprove the move (please state your reasoning)
>
> The vote will use lazy consensus, needing three +1s to pass, remaining open
> for 72 hours.
>
> Thanks
> Vinoth
>
> [1]
>
> https://lists.apache.org/thread.html/rb0a96bc10788c9635cc1a35ade7d5d42997a5c9591a5ec5d5a99adf0%40%3Cdev.hudi.apache.org%3E
>


Re: [HELP] unstable tests in the travis CI

2021-06-23 Thread Prashant Wason
Sure. I will take a look today. I wonder how the CI passed during the merge.
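
In the meantime, disabling a flaky case (as Danny suggests below) could be a
one-line change; a minimal sketch, assuming the JUnit 5 used by Hudi tests
(the annotation message is illustrative):

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

public class TestHoodieBackedMetadata {

  @Disabled("Flaky on Travis CI; re-enable once the regression is fixed")
  @Test
  public void testOnlyValidPartitionsAdded() {
    // ... original test body ...
  }
}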


On Wed, Jun 23, 2021 at 7:57 AM pzwpzw 
wrote:

> Hi @Prashant Wason, I found that after [HUDI-1717] (commit hash:
> 11e64b2db0ddf8f816561f8442b373de15a26d71) was merged yesterday, the test
> case TestHoodieBackedMetadata#testOnlyValidPartitionsAdded will always
> crash:
>
> org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve
> files in partition
> /var/folders/my/841b2c052038ppns0csrf8g8gn/T/junit3095347769583879437/dataset/p1
> from metadata
>
> at
> org.apache.hudi.metadata.BaseTableMetadata.getAllFilesInPartition(BaseTableMetadata.java:129)
> at
> org.apache.hudi.metadata.TestHoodieBackedMetadata.testOnlyValidPartitionsAdded(TestHoodieBackedMetadata.java:210)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> Can you take a look at this,  Thanks~
>
>
>
> On Jun 23, 2021, at 1:49 PM, Danny Chan wrote:
>
> Hi, fellows, there are two test cases in the Travis CI that fail very
> often, which blocks our coding too many times. Please, if these tests are
> not stable, can we disable them first?
> They are annoying ~
>
>
> TestHoodieBackedMetadata.testOnlyValidPartitionsAdded[1]
> HoodieSparkSqlWriterSuite: schema evolution for ... [2]
>
> [1] https://travis-ci.com/github/apache/hudi/jobs/518067391
> [2] https://travis-ci.com/github/apache/hudi/jobs/518067393
>
> Best,
> Danny Chan
>
>


Re: Welcome new committers and PMC Members!

2021-05-11 Thread Prashant Wason
Congratulations Gary and Wenning!

On Tue, May 11, 2021 at 3:59 PM Raymond Xu 
wrote:

> Big congrats to Gary and Wenning!
>
> On Tue, May 11, 2021 at 1:14 PM vbal...@apache.org 
> wrote:
>
> >  Many Congratulations Gary Li and Wenning Ding. Well deserved !!
> > Balaji.V
> > On Tuesday, May 11, 2021, 01:06:47 PM PDT, Bhavani Sudha <
> > bhavanisud...@gmail.com> wrote:
> >
> >  Congratulations @Gary Li and @Wenning Ding!
> > On Tue, May 11, 2021 at 12:42 PM Vinoth Chandar 
> wrote:
> >
> > Hello all,
> > Please join me in congratulating our newest set of committers and PMCs.
> > Wenning Ding (Committer) Wenning has been a consistent contributor to
> > Hudi, over the past year or so. He has added some critical bug fixes,
> lots
> > of good contributions around Spark!
> > Gary Li (PMC Member) Gary is a regular feature on all our support
> > channels. He has contributed numerous features to Hudi, and evangelized
> > across many companies including Bosch/Bytedance. Most of all, he is a
> solid
> > team player and an asset to the project.
> > Thanks so much for your continued contributions, to make Hudi better and
> > better!
> > ThanksVinoth
> >
> >
>


Re: Congrats to our newest committers!

2021-01-27 Thread Prashant Wason
Congratulations to both of you!

On Wed, Jan 27, 2021 at 2:01 PM Udit Mehrotra  wrote:

> Congratulations to both ! Well deserved..
>
> - Udit
>
> On Wed, Jan 27, 2021 at 1:18 PM nishith agarwal 
> wrote:
>
> > Congratulations to both!
> >
> > -Nishith
> >
> > On Wed, Jan 27, 2021 at 11:49 AM Sivabalan  wrote:
> >
> > > Congratulations folks !
> > >
> > > On Wed, Jan 27, 2021 at 12:48 PM Pratyaksh Sharma <
> pratyaks...@gmail.com
> > >
> > > wrote:
> > >
> > > > Congratulations both of you!
> > > >
> > > > On Wed, Jan 27, 2021 at 8:43 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Congrats both! Well deserved indeed! Glad to have you on the
> > community.
> > > > >
> > > > > On Wed, Jan 27, 2021 at 7:00 AM Shi ShaoFeng <
> shaofeng...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Congratulations, Wang Xianghu and Li Wei!
> > > > > >
> > > > > > On 2021/1/27 at 9:17 PM, "vino yang" wrote:
> > > > > >
> > > > > > Congrats to both of them!
> > > > > > Well deserved!
> > > > > >
> > > > > > Best,
> > > > > > Vino
> > > > > >
> > > > > > > Trevor Zhang wrote on Wed, Jan 27, 2021 at 7:20 PM:
> > > > > >
> > > > > > > Congratulations to  Wang Xianghu and  Li Wei.
> > > > > > >
> > > > > > > Best ,
> > > > > > >
> > > > > > > Trevor
> > > > > > >
> > > > > > > > leesf wrote on Wed, Jan 27, 2021 at 7:16 PM:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I am very happy to announce our newest committers.
> > > > > > > >
> > > > > > > > Wang Xianghu: Xianghu has done a great job in decoupling Hudi
> > > > > > > > from Spark and implemented the first version of the Flink
> > > > > > > > integration, and contributed bug fixes; he is also very active
> > > > > > > > in answering user questions in the China WeChat group.
> > > > > > > >
> > > > > > > > Li Wei: Liwei has also done a great job in driving major
> > > > > > > > features like RFC-19 together with Satish, and has contributed
> > > > > > > > many features and bug fixes in core modules.
> > > > > > > >
> > > > > > > > Please join me in congratulating them!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Leesf
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: Congrats to our newest committers!

2020-12-03 Thread Prashant Wason
Thanks everyone. 

Over the past one year I have really enjoyed learning and developing with HUDI. 
Excited to be part of the group. 

> On Dec 3, 2020, at 11:37 AM, Balaji Varadarajan  
> wrote:
> 
> Very Well deserved !! Many congratulations to Satish and Prashant.
> Balaji.V
>On Thursday, December 3, 2020, 11:07:09 AM PST, Bhavani Sudha 
>  wrote:  
> 
> Congratulations Satish and Prashant!
> On Thu, Dec 3, 2020 at 11:03 AM Pratyaksh Sharma  
> wrote:
> 
> Congratulations Satish and Prashant!
> 
> On Fri, Dec 4, 2020 at 12:22 AM Vinoth Chandar  wrote:
> 
>> Hi all,
>> 
>> I am really happy to announce our newest set of committers.
>> 
>> *Satish Kotha*: Satish has ramped very quickly across our entire code base
>> and contributed bug fixes and also drove large, unique features like
>> clustering, replace/overwrite which are about to go out in the 0.7.0
>> release. These efforts largely complete parts of our vision and it could
>> not have happened without Satish.
>> 
>> *Prashant Wason*: In addition to a number of patches, Prashant has been
>> shouldering massive responsibility on RFC-15, and thanks to his efforts, we
>> have a simplified design, very solid implementation right now, that is
>> being tested now for 0.7.0 release again.
>> 
>> Please join me in congratulating them on this great milestone!
>> 
>> Thanks,
>> Vinoth
>> 
> 



Re: HUDI Table Primary Key - UUID or Custom For Better Performance

2020-10-16 Thread Prashant Wason
Hi Tanu,

Some points to consider:
1. UUID is fixed size compared to domain_object_keys (don't know the size).
Smaller keys will reduce the storage requirements (see the sketch after this
list).
2. UUIDs don't compress. Your domain object keys may compress better.
3. From the bloom filter perspective, I don't think there is any difference
unless the size difference of keys is very large.
4. If the domain object keys are already unique, what is the use of
suffixing the create_date?
5. If you query by "primary key minus timestamp", the entire record key
column will have to be read to match it. So bloom filters won't be useful
here.
6. What do the domain object keys look like? Are they going to be included
in any other field in the record? Would you ever want to query on domain
object keys?
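
A small sketch of the size trade-off behind points 1 and 2 (the composite
key format below is a made-up example):

import java.util.UUID;

public class KeyShapes {
  public static void main(String[] args) {
    // A random UUID key: fixed at 36 characters, high entropy, so it
    // compresses poorly.
    String uuidKey = UUID.randomUUID().toString();

    // A hypothetical composite domain key suffixed with a create date; such
    // keys can be longer but often share prefixes, so they may compress
    // better on disk.
    String domainKey = "order-000123_2020-10-15T20:21:00Z";

    System.out.println(uuidKey.length() + " vs " + domainKey.length());
  }
}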

Thanks
Prashant


On Thu, Oct 15, 2020 at 8:21 PM tanu dua  wrote:

> read query pattern will be (partition key + primary key minus timestamp)
> where my primary key is domain keys + timestamp.
>
> Read/write queries vary per dataset, but mostly all the tables are read
> and written frequently and equally.
>
> Read will be mostly done by providing the partitions and not by blanket
> query.
>
> If we have to choose between read and write I will choose write but I want
> to stick only with COW table.
>
> Please let me know if you need more information.
>
>
> On Thu, 15 Oct 2020 at 5:48 PM, Sivabalan  wrote:
>
> > Can you give us a sense of how your read workload looks like? Depending
> on
> > that read perf could vary.
> >
> > On Thu, Oct 15, 2020 at 4:06 AM Tanuj  wrote:
> >
> > > Hi all,
> > > We don't have an "UPDATE" use case and all ingested rows will be
> "INSERT"
> > > so what is the best way to define PRIMARY key. As of now we have
> designed
> > > primary key as per domain object with create_date which is -
> > > ,,
> > >
> > > Since it's always an INSERT for us, I can potentially use UUID as well.
> > >
> > > We use keys for the Bloom Index in HUDI, so I just wanted to know if I
> > > get better performance in writing if I have a UUID vs composite domain
> > > keys.
> > >
> > > I believe reads are not impacted by the choice of primary key, as it's
> > > not being considered?
> > >
> > > Please suggest
> > >
> > >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: schema compatibility check and change column type

2020-09-07 Thread Prashant Wason
Yes, the schema change looks fine. That would mean it's an issue with the
schema compatibility checker. There are explicit checks for such cases, so I
can't say where the issue lies.
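
For reference, plain Avro treats the int-to-long promotion as compatible; a
minimal sketch using Avro's own compatibility API (standard Avro, not Hudi's
internal checker):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class IntToLongCheck {
  public static void main(String[] args) {
    // Old data was written with field "a" as int ...
    Schema writer = SchemaBuilder.record("foo_record").fields()
        .requiredInt("a").endRecord();
    // ... and the evolved schema reads it as long.
    Schema reader = SchemaBuilder.record("foo_record").fields()
        .requiredLong("a").endRecord();

    // int -> long is a promotable, backward-compatible change in Avro, so
    // this should print COMPATIBLE.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(reader, writer).getType());
  }
}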

I am out on a vacation this week. I will look into this as soon as I am
back.

Thanks
Prashant

On Sun, Sep 6, 2020, 11:18 AM Vinoth Chandar  wrote:

> That does sound like a backwards compatible change.
> @prashant , any ideas here? (since you have the best context on the schema
> validation checks)
>
> On Thu, Sep 3, 2020 at 8:12 PM cadl  wrote:
>
> > Hi All,
> >
> > I want to change the type of one column in my COW table, from int to
> long.
> > When I set “hoodie.avro.schema.validate = true” and upsert new data with
> > long type, I got a "Failed upsert schema compatibility check" error.
> > Does it break backwards compatibility? If I disable
> hoodie.avro.schema.validate,
> > I can upsert and read normally.
> >
> >
> > code demo: https://gist.github.com/cadl/be433079747aeea88c9c1f45321cc2eb
> >
> > stacktrace:
> >
> >
> > org.apache.hudi.exception.HoodieUpsertException: Failed upsert schema
> > compatibility check.
> >   at
> >
> org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:572)
> >   at
> >
> org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:190)
> >   at
> >
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:260)
> >   at
> >
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
> >   at
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
> >   at
> >
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> >   at
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> >   at
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> >   at
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> >   at
> >
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> >   at
> >
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> >   at
> >
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> >   at
> >
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> >   at
> >
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> >   at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> >   at
> >
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> >   at
> >
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> >   at
> >
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> >   at
> >
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> >   at
> >
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> >   at
> >
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> >   at
> >
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> >   at
> >
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> >   at
> >
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> >   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> >   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> >   ... 69 elided
> > Caused by: org.apache.hudi.exception.HoodieException: Failed schema
> > compatibility check for writerSchema
> >
> :{"type":"record","name":"foo_record","namespace":"hoodie.foo","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"__row_key","type":"int"},{"name":"__row_version","type":"int"}]},
> > table schema
> >
> 

Re: Hudi - Concurrent Writes

2020-07-09 Thread Prashant Wason
With a large number of tables, you also run into the following potential
issues:
1. Consistency: There is no single timeline so different tables (per
partition) expose data from different times of ingestion. If the data
within partitions is inter-dependent then the queries may see
inconsistent results.

2. Complicated error handling / debugging: If some of the pipelines fail
then data in some partitions may not have been updated for some time. This
may lead to data consistency issues on the query side. Debugging any issue
when 1000 separate datasets are involved is much more complicated than a
single dataset (e.g. hudi-cli connects to one dataset at a time).

3. (Possibly minor) Excess load on the infra: With several parallel
operations, the worst case load on the Namenode may go up N times (N=number
of parallel pipelines). An under-provisioned NameNode may lead to
out-of-resource errors.

4. Adding new partitions would be complicated: Assuming you would want a
new partition in future, the steps would be more involved.

If only a few partitions are having the load issues, you can also look into
the partitioning scheme.
1. Maybe invent a new column in the schema which is more uniformly
distributed (see the sketch after this list)
2. Maybe split the loaded partitions into two partitions (range based or
something like that)
3. If possible (depending on the ingestion source), prioritize ingestion
for particular partitions (partition priority queue)
4. Limit the number of records ingested at a time to limit maximum job
time
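
A concrete sketch of suggestion 1 above: derive a more uniformly distributed
column by hashing an existing key into a fixed number of buckets (the names
are placeholders):

public class PartitionSalting {
  // Map a skewed natural key into one of N uniform buckets; the bucket id
  // then serves as (part of) the partition path instead of the skewed column.
  static String bucketedPartition(String recordKey, int numBuckets) {
    int bucket = Math.floorMod(recordKey.hashCode(), numBuckets);
    return "bucket=" + bucket;
  }

  public static void main(String[] args) {
    System.out.println(bucketedPartition("customer-42", 16)); // e.g. bucket=7
  }
}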

Thanks
Prashant


On Thu, Jul 9, 2020 at 12:00 AM Shayan Hati  wrote:

> Thanks for your response.
>
> @Mario: So the metastore can be something like a Glue/Hive metastore which
> basically has the metadata about different partitions in a single table.
> One challenge is that a per-partition Hudi table can be queried using the
> Hudi library bundle, but across partitions it has to be queried based on
> the metastore itself.
>
> @Vinoth: The use-case is we have different partitions and the data as well
> as the load is skewed on them. So one partition has to ingest much more
> data than another. Basically, one large partition delta affects the
> ingestion time of a smaller partition as well. Also, failure/corrupt
> data of one partition delta affects others if we have a single write. So we
> wanted these writes to be independent per partition.
>
> Also any timeline when 0.6.0 will be released?
>
> Thanks,
> Shayan
>
>
> On Thu, Jul 9, 2020 at 9:22 AM Vinoth Chandar  wrote:
>
> > We are looking into adding support for parallel writers in 0.6.0. So that
> > should help.
> >
> > I am curious to understand though why you prefer to have 1000 different
> > writer jobs, as opposed to having just one writer. Typical use cases for
> > parallel writing I have seen are related to backfills and such.
> >
> > +1 to Mario’s comment. Can’t think of anything else if your users are
> happy
> > querying 1000 tables.
> >
> > On Wed, Jul 8, 2020 at 7:28 AM Mario de Sá Vera 
> > wrote:
> >
> > > hey Shayan,
> > >
> > > that seems actually a very good approach ... just curious with the glue
> > > metastore you mentioned. Would it be an external metastore for spark to
> > > query over ??? external in terms of not managed by Hudi ???
> > >
> > > that would be my only concern ... how to maintain the sync between all
> > > metadata partitions but , again, a very promising approach !
> > >
> > > regards,
> > >
> > > Mario.
> > >
> > > Em qua., 8 de jul. de 2020 às 15:20, Shayan Hati  >
> > > escreveu:
> > >
> > > > Hi folks,
> > > >
> > > > We have a use-case where we want to ingest data concurrently for
> > > different
> > > > partitions. Currently Hudi doesn't support concurrent writes on the
> > same
> > > > Hudi table.
> > > >
> > > > One of the approaches we were thinking was to use one hudi table per
> > > > partition of data. So let us say we have 1000 partitions, we will
> have
> > > 1000
> > > > Hudi tables which will enable us to write concurrently on each
> > partition.
> > > > And the metadata for each partition will be synced to a single
> > metastore
> > > > table (Assumption here is schema is same for all partitions). So this
> > > > single metastore table can be used for all the spark, hive queries
> when
> > > > querying data. Basically this metastore glues all the different hudi
> > > table
> > > > data together in a single table.
> > > >
> > > > We already tested this approach and its working fine and each
> partition
> > > > will have its own timeline and hudi table.
> > > >
> > > > We wanted to know if there are some gotchas or any other issues with
> > this
> > > > approach to enable concurrent writes? Or if there are any other
> > > approaches
> > > > we can take?
> > > >
> > > > Thanks,
> > > > Shayan
> > > >
> > >
> >
>
>
> --
> Shayan Hati
>


Re: [DISSCUSS] Trigger a Travis-CI rebuild without pushing a commit

2020-05-27 Thread Prashant Wason
I have used a force push (git push -f) to re-trigger the Travis build. I don't
know if force push has any side effects, but it does save an extra commit.

Thanks
Prashant


On Wed, May 27, 2020 at 11:11 AM Lamber Ken  wrote:

> Thanks Sivabalan
>
> For committers / PMCs, they can use these tools to trigger a rebuild
> directly.
> But for contributors, they can open the url, but the retrigger button will
> be hidden.
>
> Best,
> Lamber-Ken
>
>
> On 2020/05/27 13:13:53, Sivabalan  wrote:
> > Not sure if this is a common practice. But can't we trigger via travis-ci
> > directly? You can go here
> > <https://travis-ci.org/github/apache/hudi/pull_requests>
> > or here
> > <https://travis-ci.org/github/apache/hudi/builds>
> > and there you can find an
> > option to restart the build (right-most column in every row) again if need
> > be. Wouldn't this suffice?
> >
> > On Wed, May 27, 2020 at 5:50 AM vino yang  wrote:
> >
> > > Hi Lamber-Ken,
> > >
> > > Thanks for opening this discussion.
> > >
> > > +1 to fix this issue.
> > >
> > > About the solution, can we consider to introduce a "CI Bot" just like
> > > Apache Flink has done?[1]
> > >
> > > Just a thought.
> > >
> > > Best,
> > > Vino
> > >
> > > [1]:
> https://github.com/flink-ci/ci-bot/
> > >
> > > Lamber Ken wrote on Wed, May 27, 2020 at 2:08 PM:
> > >
> > > > Dear community,
> > > >
> > > > Use case: A build fails due to an externality. The source is actually
> > > > correct. It would build OK and pass if simply re-run. Is there some
> way
> > > to
> > > > nudge Travis-CI to do another build, other than pushing a "dummy"
> commit?
> > > >
> > > > The way I have often used is `git commit --allow-empty -m 'trigger
> > > > rebuild'`:
> > > > push a dummy commit, and Travis will rebuild. I also noticed some apache
> > > > projects have supported this feature.
> > > >
> > > > For example:
> > > > 1. Carbondata uses "retest this please"
> > > > https://github.com/apache/carbondata/pull/3387
> > > >
> > > > 2. Bookkeeper uses "run pr validation"
> > > > https://github.com/apache/bookkeeper/pull/2158
> > > >
> > > > But I can't find an effective solution in Github's and Travis's
> > > > documentation [1]. Any thoughts or opinions?
> > > >
> > > > Best,
> > > > Lamber-Ken
> > > >
> > > > [1]
> > > > https://docs.travis-ci.com
> > > > https://support.github.com
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] moving blog from cwiki to website

2020-05-04 Thread Prashant Wason
Cool. Will get this done today.

On Mon, May 4, 2020 at 11:02 AM Vinoth Chandar  wrote:

> Hi Prashant,
>
> We already have a site setup and apache hosting it.
> See
> https://github.com/apache/incubator-hudi/tree/asf-site
> for instructions
> for building locally and making changes etc.
>
> Like I mentioned before, it should be a simple matter of moving the posts
> in proper formatting to
>
> https://github.com/apache/incubator-hudi/tree/asf-site/docs/_posts
> and they
> should show up
>
> Thanks
> Vinoth
>
>
>
>
>
> On Mon, May 4, 2020 at 10:55 AM Prashant Wason 
> wrote:
>
> > Hi Team,
> >
> > I surveyed several Apache projects and this is how they are blogging:
> >
> > 1. Use ASF's blogging platform. Hosted by ASF
> >  Example: https://blogs.apache.org/kafka/
> >
> > 2. Aggregating links to blog posts posted elsewhere
> > Example: https://docs.pinot.apache.org/community-1/blogs
> >
> > 3. Externally hosted blog with a template matching the project website
> > Example: https://blog.couchdb.org/
> >
> > 4. Self hosted blog
> > Example: https://cassandra.apache.org/blog/
> > http://libcloud.apache.org/blog/ http://mesos.apache.org/blog/
> >
> > 5. Blogging on medium.com or wordpress.com. Several Tech companies also
> > host their engineering blogs there (Example:
> > https://medium.com/airbnb-engineering,
> > https://engineering.salesforce.com/).
> >
> > Do we have access to a web-server as part of Apache project to host the
> > blog? If yes, are there any restrictions (e.g. java only)?
> >
> > We don't have a very large number of blog posts so whichever method we
> > choose, it should be quick to move the posts over once the infra is set
> > up.
> >
> > Please chime in with your suggestions and preferences.
> >
> > Thanks
> > Prashant
> >
> >
> > On Fri, May 1, 2020 at 9:06 AM Vinoth Chandar  wrote:
> >
> > > That’d be awesome! Thanks!
> > >
> > > On Fri, May 1, 2020 at 9:06 AM Prashant Wason  >
> > > wrote:
> > >
> > > > Hi Vinoth,
> > > >
> > > > Sure, I will prioritize this. Hope to have something by this weekend.
> > > >
> > > > Thanks
> > > > Prashant
> > > >
> > > >
> > > > On Wed, Apr 29, 2020 at 8:31 PM Vinoth Chandar 
>

Re: [DISCUSS] moving blog from cwiki to website

2020-05-04 Thread Prashant Wason
Hi Team,

I surveyed several Apache projects and this is how they are blogging:

1. Use ASF's blogging platform. Hosted by ASF
 Example: https://blogs.apache.org/kafka/

2. Aggregating links to blog posts posted elsewhere
Example: https://docs.pinot.apache.org/community-1/blogs

3. Externally hosted blog with a template matching the project website
Example: https://blog.couchdb.org/

4. Self hosted blog
Example: https://cassandra.apache.org/blog/
http://libcloud.apache.org/blog/ http://mesos.apache.org/blog/

5. Blogging on medium.com or wordpress.com. Several Tech companies also
host their engineering blogs there (Example:
https://medium.com/airbnb-engineering,   https://engineering.salesforce.com/
).

Do we have access to a web-server as part of Apache project to host the
blog? If yes, are there any restrictions (e.g. java only)?

We don't have a very large number of blog posts so whichever method we
choose, it should be quick to move the posts over once the infra is set up.

Please chime in with your suggestions and preferences.

Thanks
Prashant


On Fri, May 1, 2020 at 9:06 AM Vinoth Chandar  wrote:

> That’d be awesome! Thanks!
>
> On Fri, May 1, 2020 at 9:06 AM Prashant Wason 
> wrote:
>
> > Hi Vinoth,
> >
> > Sure, I will prioritize this. Hope to have something by this weekend.
> >
> > Thanks
> > Prashant
> >
> >
> > On Wed, Apr 29, 2020 at 8:31 PM Vinoth Chandar 
> wrote:
> >
> > > Hi Prashant,
> > >
> > > Have you started on this already? Any rough etas?
> > >  It might be good to have this in place soon so people can start
> working
> > on
> > > the blogs together with major features on the next release.. we have
> > tight
> > > rope to walk in may.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Wed, Apr 22, 2020 at 10:51 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Great!  Just sharing the prior conversation on this.
> > > >
> > > > We were hoping to replace the ill-maintained activity page here
> > > >
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__hudi.apache.org_activity.html=DwIFaQ=r2dcLCtU9q6n0vrtnDw9vg=c89AU9T1AVhM4r2Xi3ctZA=xNirgm0ZMwkyIf_3PGmv5m0CEbnJKgwVgzhe4ruoMSo=goFt0Ixt--F_9tai6CGFoTfiH2FMpzYVQZS91gG1rtU=
> > > with a blog section and move stuff
> > > > there.
> > > > We should already have all the tools/markups for code highlighting
> > etc..
> > > >
> > > > On Wed, Apr 22, 2020 at 10:08 AM Prashant Wason
> >  > > >
> > > > wrote:
> > > >
> > > >> I can help drive this. Let me take a look at some other projects and
> > > >> suggest how to go about it.
> > > >>
> > > >> Thanks
> > > >> Prashant
> > > >>
> > > >>
> > > >> On Wed, Apr 22, 2020, 9:31 AM Vinoth Chandar 
> > wrote:
> > > >>
> > > >> > Any volunteers to drive this? (also may be a small section in
> > > >> contribution
> > > >> > guide for contributing a blog) :)
> > > >> >
> > > >> > On Wed, Apr 22, 2020 at 9:11 AM vbal...@apache.org <
> > > vbal...@apache.org>
> > > >> > wrote:
> > > >> >
> > > >> > >  +1 on moving blogs to website.
> > > >> > > On Wednesday, April 22, 2020, 08:35:02 AM PDT, leesf <
> > > >> > > leesf0...@gmail.com> wrote:
> > > >> > >
> > > >> > >  +1
> > > >> > >
> > > > > > > > vino yang wrote on Wed, Apr 22, 2020 at 1:50 PM:
> > > >> > >
> > > >> > > > +1 from my side.
> > > >> > > >
> > > > > > > > > Pratyaksh Sharma wrote on Wed, Apr 22, 2020 at 1:38 PM:
> > > >> > > >
> > > >> > > > > +1
> > > >> > > > >
> > > >> > > > > I have seen other Apache projects having blogs on their
> > website
> > > >> like
> > > >> > > > Apache
> > > >> > > > > Pinot.
> > > >> > > > >
> > > >> > > > > On Wed, Apr 22, 2020 at 11:05 AM Bhavani Sudha Saktheeswaran
> > > >> > > > >  wrote:
> > > >> > > > >
> > > >> > > > > > +1
> > > >> > > > > >
> > > >> > > > > > On Tue, Apr 21, 2020 at 10:23 PM tison <
> > wander4...@gmail.com>
> > > >> > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi Vinoth,
> > > >> > > > > > >
> > > >> > > > > > > +1 for moving blogs.
> > > >> > > > > > >
> > > >> > > > > > > cwiki looks belong to developer's scope and the first
> > > >> experience
> > > >> > of
> > > >> > > > > users
> > > >> > > > > > > is more likely our website.
> > > >> > > > > > >
> > > >> > > > > > > Best,
> > > >> > > > > > > tison.
> > > >> > > > > > >
> > > >> > > > > > >
> > > > > > > > > > > Vinoth Chandar wrote on Wed, Apr 22, 2020 at 1:09 PM:
> > > >> > > > > > >
> > > >> > > > > > > > Hi community,
> > > >> > > > > > > >
> > > >> > > > > > > > What does everyone feel about moving blogs we have on
> > > cwiki
> > > >> now
> > > >> > > > over
> > > >> > > > > to
> > > >> > > > > > > > site so they are better discovered?
> > > >> > > > > > > >
> > > >> > > > > > > > Thanks
> > > >> > > > > > > > Vinoth
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> >
> > > >>
> > > >
> > >
> >
>


Re: [DISCUSS] moving blog from cwiki to website

2020-05-01 Thread Prashant Wason
Hi Vinoth,

Sure, I will prioritize this. Hope to have something by this weekend.

Thanks
Prashant


On Wed, Apr 29, 2020 at 8:31 PM Vinoth Chandar  wrote:

> Hi Prashant,
>
> Have you started on this already? Any rough etas?
>  It might be good to have this in place soon so people can start working on
> the blogs together with major features on the next release.. we have tight
> rope to walk in may.
>
> Thanks
> Vinoth
>
> On Wed, Apr 22, 2020 at 10:51 AM Vinoth Chandar  wrote:
>
> > Great!  Just sharing the prior conversation on this.
> >
> > We were hoping to replace the ill-maintained activity page here
> > https://hudi.apache.org/activity.html with a blog section and move stuff
> > there.
> > We should already have all the tools/markups for code highlighting etc..
> >
> > On Wed, Apr 22, 2020 at 10:08 AM Prashant Wason  >
> > wrote:
> >
> >> I can help drive this. Let me take a look at some other projects and
> >> suggest how to go about it.
> >>
> >> Thanks
> >> Prashant
> >>
> >>
> >> On Wed, Apr 22, 2020, 9:31 AM Vinoth Chandar  wrote:
> >>
> >> > Any volunteers to drive this? (also may be a small section in
> >> contribution
> >> > guide for contributing a blog) :)
> >> >
> >> > On Wed, Apr 22, 2020 at 9:11 AM vbal...@apache.org <
> vbal...@apache.org>
> >> > wrote:
> >> >
> >> > >  +1 on moving blogs to website.
> >> > > On Wednesday, April 22, 2020, 08:35:02 AM PDT, leesf <
> >> > > leesf0...@gmail.com> wrote:
> >> > >
> >> > >  +1
> >> > >
> >> > > vino yang wrote on Wed, Apr 22, 2020 at 1:50 PM:
> >> > >
> >> > > > +1 from my side.
> >> > > >
> >> > > > Pratyaksh Sharma wrote on Wed, Apr 22, 2020 at 1:38 PM:
> >> > > >
> >> > > > > +1
> >> > > > >
> >> > > > > I have seen other Apache projects having blogs on their website
> >> like
> >> > > > Apache
> >> > > > > Pinot.
> >> > > > >
> >> > > > > On Wed, Apr 22, 2020 at 11:05 AM Bhavani Sudha Saktheeswaran
> >> > > > >  wrote:
> >> > > > >
> >> > > > > > +1
> >> > > > > >
> >> > > > > > On Tue, Apr 21, 2020 at 10:23 PM tison 
> >> > wrote:
> >> > > > > >
> >> > > > > > > Hi Vinoth,
> >> > > > > > >
> >> > > > > > > +1 for moving blogs.
> >> > > > > > >
> >> > > > > > > cwiki looks belong to developer's scope and the first
> >> experience
> >> > of
> >> > > > > users
> >> > > > > > > is more likely our website.
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > tison.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Vinoth Chandar wrote on Wed, Apr 22, 2020 at 1:09 PM:
> >> > > > > > >
> >> > > > > > > > Hi community,
> >> > > > > > > >
> >> > > > > > > > What does everyone feel about moving blogs we have on
> cwiki
> >> now
> >> > > > over
> >> > > > > to
> >> > > > > > > > site so they are better discovered?
> >> > > > > > > >
> >> > > > > > > > Thanks
> >> > > > > > > > Vinoth
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> >
> >>
> >
>


Re: [DISCUSS] moving blog from cwiki to website

2020-04-22 Thread Prashant Wason
I can help drive this. Let me take a look at some other projects and
suggest how to go about it.

Thanks
Prashant


On Wed, Apr 22, 2020, 9:31 AM Vinoth Chandar  wrote:

> Any volunteers to drive this? (also may be a small section in contribution
> guide for contributing a blog) :)
>
> On Wed, Apr 22, 2020 at 9:11 AM vbal...@apache.org 
> wrote:
>
> >  +1 on moving blogs to website.
> > On Wednesday, April 22, 2020, 08:35:02 AM PDT, leesf <
> > leesf0...@gmail.com> wrote:
> >
> >  +1
> >
> > > vino yang wrote on Wed, Apr 22, 2020 at 1:50 PM:
> >
> > > +1 from my side.
> > >
> > > > Pratyaksh Sharma wrote on Wed, Apr 22, 2020 at 1:38 PM:
> > >
> > > > +1
> > > >
> > > > I have seen other Apache projects having blogs on their website like
> > > Apache
> > > > Pinot.
> > > >
> > > > On Wed, Apr 22, 2020 at 11:05 AM Bhavani Sudha Saktheeswaran
> > > >  wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Tue, Apr 21, 2020 at 10:23 PM tison 
> wrote:
> > > > >
> > > > > > Hi Vinoth,
> > > > > >
> > > > > > +1 for moving blogs.
> > > > > >
> > > > > > cwiki looks belong to developer's scope and the first experience
> of
> > > > users
> > > > > > is more likely our website.
> > > > > >
> > > > > > Best,
> > > > > > tison.
> > > > > >
> > > > > >
> > > > > > Vinoth Chandar wrote on Wed, Apr 22, 2020 at 1:09 PM:
> > > > > >
> > > > > > > Hi community,
> > > > > > >
> > > > > > > What does everyone feel about moving blogs we have on cwiki now
> > > over
> > > > to
> > > > > > > site so they are better discovered?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>


Re: Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Prashant Wason
Thanks for the info Vinoth / Balaji.

To me it feels like a split between an easier-to-understand design and the
current implementation. I feel it is simpler to reason (based on how file
systems work in general) that restoreToInstant is a complete point-in-time
shift to the past (like restoring a file system from a snapshot/backup).

If I have restored the Table to commitTime=005, then having any instants
with commitTime > 005 are confusing as it implies that even though my table
is at an older time, some future operations will be applied onto it at some
point.

I will have to read more about incremental timeline syncing and timeline
server to understand how it uses the clean instants. BTW, the comment on
the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
requires all writers (ingest and compact) to a table to be stopped before
proceeding". So probably the embedded timeline server can recreate the view
next time it comes back up?

Thanks
Prashant


On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
 wrote:

>  Prashanth,
> I think we should not be reverting clean operations here. Cleans are done
> on the oldest file slices and a restore/rollback is not completely undoing
> the work of clean that happened before it.
> For incremental timeline syncing, embedded timeline server needs to read
> these clean metadata to sync its cached file-system view.
> Let me know your thoughts.
> Balaji.V
> On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason
>  wrote:
>
>  HI Team,
>
> I noticed that when a table is restored to a previous commit (
> HoodieWriteClient::restoreToInstant
> <
> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L735
> >),
> only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> their corresponding files are deleted from the timeline. If there are some
> CLEAN instants, they are left over.
>
> Is there a reason why CLEANs are not removed? Won't they be referring to
> files which are no longer present and hence not useful?
>
> Thanks
> Prashant
>


Query regarding restoring HUDI tables to older commits

2020-03-18 Thread Prashant Wason
HI Team,

I noticed that when a table is restored to a previous commit
(HoodieWriteClient::restoreToInstant), only the COMMIT, DELTA_COMMIT and
COMPACTION instants are rolled back and their corresponding files are deleted
from the timeline. If there are some CLEAN instants, they are left over.
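
For illustration, the observed behavior amounts to something like the
following; this is rough Java pseudocode, not the actual HoodieWriteClient
implementation:

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class RestoreSketch {
  // Action types that restoreToInstant rolls back; note "clean" is absent.
  static final Set<String> ROLLED_BACK =
      Set.of("commit", "deltacommit", "compaction");

  // Only instants newer than the restore point AND of a rolled-back action
  // type are removed; CLEAN instants after the restore point survive.
  // Each instant is modeled here as {timestamp, action}.
  static List<String[]> instantsToRemove(List<String[]> timeline, String restoreTs) {
    return timeline.stream()
        .filter(i -> i[0].compareTo(restoreTs) > 0)
        .filter(i -> ROLLED_BACK.contains(i[1]))
        .collect(Collectors.toList());
  }
}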

Is there a reason why CLEANs are not removed? Won't they be referring to
files which are no longer present and hence not useful?

Thanks
Prashant


HUDI Metrics for hudi-common components

2020-03-10 Thread Prashant Wason
Hi Team,

I am interested in adding metrics support to HoodieWrapperFileSystem. This
will help track the counts of operations and their latencies and will
provide valuable data to implement and test newer ideas (e.g. RFC 15, which
is proposing consolidated metadata to reduce the number of file system
operations).
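
A rough sketch of the kind of instrumentation intended; the class and metric
names are illustrative, not an existing Hudi API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Counts calls and accumulates latency per file-system operation; a wrapper
// like HoodieWrapperFileSystem would route each delegated call through this.
class FsMetricsSketch {
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> nanos = new ConcurrentHashMap<>();

  <T> T timed(String op, Supplier<T> call) {
    long start = System.nanoTime();
    try {
      return call.get();
    } finally {
      counts.computeIfAbsent(op, k -> new LongAdder()).increment();
      nanos.computeIfAbsent(op, k -> new LongAdder()).add(System.nanoTime() - start);
    }
  }
}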

HUDI metrics are currently implemented in module hudi-client. Modules like
hudi-utilities have hudi-client as their dependency (via pom.xml). But this
cannot be done for hudi-common as this module is itself a dependency for
hudi-client.

Hence, I feel it may be better to move the HUDI metrics code to hudi-common,
as most modules anyway depend on hudi-common.

What do you think about this? Any other ideas of how to approach this?

Thanks
Prashant


Re: Hudi logo not found in apache projects logos

2020-03-10 Thread Prashant Wason
Found this link on that page:
http://apache.org/logos/about.html


On Tue, Mar 10, 2020 at 12:42 PM Sivabalan  wrote:

> Hi folks,
>    Do you guys know how to add the hudi logo to the list of apache logos here
> <http://apache.org/logos/#>. I do see other incubating projects as well.
>
> --
> Regards,
> -Sivabalan
>


Re: Issue related to [HUDI-377] Adding Delete() support to DeltaStreamer

2020-03-10 Thread Prashant Wason
Thanks for the update Sivabalan. I will wait for your fix.

On Tue, Mar 10, 2020 at 12:36 PM Sivabalan  wrote:

> thanks for bringing this to my attention Prashant. Yes, I bumped into the
> bug a couple of days back. I am working on the fix, and the expected number
> of records might have to be fixed as well. I am running into issues debugging
> continuous tests as of now. But I am working on it.
>
>
> On Tue, Mar 10, 2020 at 12:32 PM Prashant Wason 
> wrote:
>
> > Hi Team,
> >
> > While exploring HUDI source code I came across this PR:
> >
> https://github.com/apache/incubator-hudi/pull/1073
> >
> > As part of the above PR, generation of delete records was added
> > to HoodieTestDataGenerator. Within the class HoodieTestDataGenerator, the
> > existingKeys Map maintains the current keys. In the above PR, the
> following
> > code was added to delete from the Map:
> >
> > existingKeys.remove(kp);
> >
> > This is delete by value rather than the key (private final Map<Integer, KeyPartition> existingKeys;)
> >
> > I tried fixing this issue but this leads to unit test failures
> > in TestHoodieDeltaStreamer within the testUpsertsCOWContinuousMode. The
> > code which is failing is this check (bold):
> >
> > TestHelpers.waitTillCondition((r) -> {
> >   if (tableType.equals(HoodieTableType.MERGE_ON_READ)) {
> > TestHelpers.assertAtleastNDeltaCommits(5, tableBasePath, dfs);
> > TestHelpers.assertAtleastNCompactionCommits(2, tableBasePath,
> dfs);
> >   } else {
> > TestHelpers.assertAtleastNCompactionCommits(5, tableBasePath,
> dfs);
> >   }
> >   *TestHelpers.assertRecordCount(totalRecords + 200, tableBasePath +
> > "/*/*.parquet", sqlContext);*
> >   *TestHelpers.assertDistanceCount(totalRecords + 200, tableBasePath
> +
> > "/*/*.parquet", sqlContext);*
> >   return true;
> >
> > I did not understand why a +200 was added in the checks above. Is this
> > related to the existingKeys.remove() which does not remove the records
> > from the Map?
> >
> > I have left these comments on the PR itself so they are easier to read.
> >
> > Thanks
> > Prashant
> >
>
>
> --
> Regards,
> -Sivabalan
>


Issue related to [HUDI-377] Adding Delete() support to DeltaStreamer

2020-03-10 Thread Prashant Wason
Hi Team,

While exploring HUDI source code I came across this PR:
https://github.com/apache/incubator-hudi/pull/1073

As part of the above PR, generation of delete records was added
to HoodieTestDataGenerator. Within the class HoodieTestDataGenerator, the
existingKeys Map maintains the current keys. In the above PR, the following
code was added to delete from the Map:

existingKeys.remove(kp);

This is delete by value rather than the key (private final Map<Integer, KeyPartition> existingKeys;)
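
A small sketch of the bug and one possible fix; Map.remove(Object) looks its
argument up as a key, so passing the value kp is a silent no-op (the map type
follows the declaration above):

import java.util.Map;

class RemoveByValueSketch {
  static class KeyPartition { /* record key + partition, as in the generator */ }

  static void demo(Map<Integer, KeyPartition> existingKeys, KeyPartition kp) {
    // Buggy: remove() treats kp as a *key*; since the keys are Integers,
    // nothing ever matches and the entry stays in the map.
    existingKeys.remove(kp);

    // One possible fix: go through the values view, which removes the first
    // mapping whose value equals kp.
    existingKeys.values().remove(kp);
  }
}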

I tried fixing this issue but this leads to unit test failures
in TestHoodieDeltaStreamer within the testUpsertsCOWContinuousMode. The
code which is failing is this check (bold):

TestHelpers.waitTillCondition((r) -> {
  if (tableType.equals(HoodieTableType.MERGE_ON_READ)) {
TestHelpers.assertAtleastNDeltaCommits(5, tableBasePath, dfs);
TestHelpers.assertAtleastNCompactionCommits(2, tableBasePath, dfs);
  } else {
TestHelpers.assertAtleastNCompactionCommits(5, tableBasePath, dfs);
  }
  *TestHelpers.assertRecordCount(totalRecords + 200, tableBasePath +
"/*/*.parquet", sqlContext);*
  *TestHelpers.assertDistanceCount(totalRecords + 200, tableBasePath +
"/*/*.parquet", sqlContext);*
  return true;

I did not understand why a +200 was added in the checks above. Is this
related to the existingKeys.remove() which does not remove the records from
the Map?

I have left these comments on the PR itself so they are easier to read.

Thanks
Prashant