Re: Dropping support for Spark 2.2 and lower

2019-09-10 Thread Shiyan Xu
+1

On Tue, Sep 10, 2019 at 7:16 AM Vinoth Chandar  wrote:

> Hello all,
>
> I am trying to gauge what Spark version everyone is on. We would like to
> move the Spark version to 2.4 and simplify a whole bunch of stuff. As a
> best effort, we can try to make 2.3 work reliably. Any objections?
>
> Note that if you are using the RDD based hudi-client primarily, this should
> not affect you per se.
>
> Thanks
> Vinoth
>


Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread Shiyan Xu
Thank you all for the +1s! I'll go ahead and add an RFC page then.

On Tue, Nov 12, 2019 at 8:41 AM nishith agarwal  wrote:

> +1 on the exporter tool idea.
>
> -Nishith
>
> On Tue, Nov 12, 2019 at 5:06 AM leesf  wrote:
>
> > > +1, and we can discuss it further when design docs are available.
> >
> > Best,
> > Leesf
> >
> > > On Tue, Nov 12, 2019 at 4:17 PM Balaji Varadarajan  wrote:
> >
> > > +1 on the exporter tool idea.
> > >
> > > On Mon, Nov 11, 2019 at 10:36 PM vino yang 
> > wrote:
> > >
> > > > Hi Shiyan,
> > > >
> > > > +1 for this proposal. Also, it looks like an exporter tool.
> > > >
> > > > @Vinoth Chandar   Any thoughts about where to
> place
> > > it?
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > On Tue, Nov 12, 2019 at 8:58 AM Vinoth Chandar  wrote:
> > > >
> > > > > We can wait for others to chime in as well. :)
> > > > >
> > > > > On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu <
> > xu.shiyan.raym...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Yes, Vinoth, you're right that it is more of an exporter, which
> > > > exports a
> > > > > > snapshot from Hudi dataset.
> > > > > >
> > > > > > It should support MOR too; it shall just leverage on existing
> > > > > > SnapshotCopier logic to find the latest file slices.
> > > > > >
> > > > > > So is it good to create a RFC for further discussion?
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 11, 2019 at 4:31 PM Vinoth Chandar <
> vin...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > What you suggest sounds more like an `Exporter` tool?  I
> imagine
> > > you
> > > > > will
> > > > > > > support MOR as well?  +1 on the idea itself. It could be useful
> > if
> > > > > plain
> > > > > > > parquet snapshot was generated as a backup.
> > > > > > >
> > > > > > > On Mon, Nov 11, 2019 at 4:21 PM Shiyan Xu <
> > > > xu.shiyan.raym...@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > The existing SnapshotCopier under Hudi Utilities is a
> > > Hudi-to-Hudi
> > > > > copy
> > > > > > > and
> > > > > > > > primarily for backup purpose.
> > > > > > > >
> > > > > > > > I would like to start a RFC for a more generic Hudi
> > snapshotter,
> > > > > which
> > > > > > > >
> > > > > > > >- Supports existing SnapshotCopier features
> > > > > > > >- Add option to export a Hudi dataset to plain parquet
> files
> > > > > > > >   - output latest records via Spark dataframe writer
> > > > > > > >   - remove Hudi metadata fields
> > > > > > > >   - support custom repartition requirements
> > > > > > > >
> > > > > > > > Is this a good idea to start an RFC?
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Raymond Xu
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


[DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Shiyan Xu
Hi All,

The existing SnapshotCopier under Hudi Utilities is a Hudi-to-Hudi copy,
primarily for backup purposes.

I would like to start an RFC for a more generic Hudi snapshotter, which

   - Supports existing SnapshotCopier features
   - Adds an option to export a Hudi dataset to plain parquet files (sketched below)
  - output latest records via Spark dataframe writer
  - remove Hudi metadata fields
  - support custom repartition requirements
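For illustration, a minimal sketch of the plain-parquet export step referenced
above (assuming a SparkSession `spark`, the Hudi datasource on the classpath,
and hypothetical paths and partition counts; this is not the actual tool code):

    // Read the latest snapshot of the Hudi dataset.
    val df = spark.read.format("org.apache.hudi")
      .load("s3://bucket/hudi_table/*/*")
    // Remove Hudi metadata fields for a plain parquet output.
    val metaCols = df.columns.filter(_.startsWith("_hoodie_"))
    df.drop(metaCols: _*)
      .repartition(200) // custom repartition requirement
      .write.mode("overwrite")
      .parquet("s3://bucket/hudi_table_snapshot/")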

Is this a good idea to start an RFC?

Thank you.

Regards,
Raymond Xu


Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Shiyan Xu
Yes, Vinoth, you're right that it is more of an exporter, which exports a
snapshot from a Hudi dataset.

It should support MOR too; it can simply leverage the existing
SnapshotCopier logic to find the latest file slices.

So is it good to create an RFC for further discussion?


On Mon, Nov 11, 2019 at 4:31 PM Vinoth Chandar  wrote:

> What you suggest sounds more like an `Exporter` tool?  I imagine you will
> support MOR as well?  +1 on the idea itself. It could be useful if a plain
> parquet snapshot were generated as a backup.
>
> On Mon, Nov 11, 2019 at 4:21 PM Shiyan Xu 
> wrote:
>
> > Hi All,
> >
> > The existing SnapshotCopier under Hudi Utilities is a Hudi-to-Hudi copy
> and
> > primarily for backup purpose.
> >
> > I would like to start a RFC for a more generic Hudi snapshotter, which
> >
> >- Supports existing SnapshotCopier features
> >- Add option to export a Hudi dataset to plain parquet files
> >   - output latest records via Spark dataframe writer
> >   - remove Hudi metadata fields
> >   - support custom repartition requirements
> >
> > Is this a good idea to start an RFC?
> >
> > Thank you.
> >
> > Regards,
> > Raymond Xu
> >
>


Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread Shiyan Xu
Came up with the first draft. Thank you.
https://cwiki.apache.org/confluence/display/HUDI/RFC-9%3A+%28WIP%29+Hudi+Dataset+Snapshotter


On Tue, Nov 12, 2019 at 12:44 PM Shiyan Xu 
wrote:

> Thank you all for the +1s! I'll go ahead add a RFC page then.
>
> On Tue, Nov 12, 2019 at 8:41 AM nishith agarwal 
> wrote:
>
>> +1 on the exporter tool idea.
>>
>> -Nishith
>>
>> On Tue, Nov 12, 2019 at 5:06 AM leesf  wrote:
>>
>> > +1. and we would discuss it further when design docs are available.
>> >
>> > Best,
>> > Leesf
>> >
>> > On Tue, Nov 12, 2019 at 4:17 PM Balaji Varadarajan  wrote:
>> >
>> > > +1 on the exporter tool idea.
>> > >
>> > > On Mon, Nov 11, 2019 at 10:36 PM vino yang 
>> > wrote:
>> > >
>> > > > Hi Shiyan,
>> > > >
>> > > > +1 for this proposal, Also, it looks like an exporter tool.
>> > > >
>> > > > @Vinoth Chandar   Any thoughts about where to
>> place
>> > > it?
>> > > >
>> > > > Best,
>> > > > Vino
>> > > >
>> > > > On Tue, Nov 12, 2019 at 8:58 AM Vinoth Chandar  wrote:
>> > > >
>> > > > > We can wait for others to chime in as well. :)
>> > > > >
>> > > > > On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu <
>> > xu.shiyan.raym...@gmail.com
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Yes, Vinoth, you're right that it is more of an exporter, which
>> > > > exports a
>> > > > > > snapshot from Hudi dataset.
>> > > > > >
>> > > > > > It should support MOR too; it shall just leverage on existing
>> > > > > > SnapshotCopier logic to find the latest file slices.
>> > > > > >
>> > > > > > So is it good to create a RFC for further discussion?
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Nov 11, 2019 at 4:31 PM Vinoth Chandar <
>> vin...@apache.org>
>> > > > > wrote:
>> > > > > >
>> > > > > > > What you suggest sounds more like an `Exporter` tool?  I
>> imagine
>> > > you
>> > > > > will
>> > > > > > > support MOR as well?  +1 on the idea itself. It could be
>> useful
>> > if
>> > > > > plain
>> > > > > > > parquet snapshot was generated as a backup.
>> > > > > > >
>> > > > > > > On Mon, Nov 11, 2019 at 4:21 PM Shiyan Xu <
>> > > > xu.shiyan.raym...@gmail.com
>> > > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi All,
>> > > > > > > >
>> > > > > > > > The existing SnapshotCopier under Hudi Utilities is a
>> > > Hudi-to-Hudi
>> > > > > copy
>> > > > > > > and
>> > > > > > > > primarily for backup purpose.
>> > > > > > > >
>> > > > > > > > I would like to start a RFC for a more generic Hudi
>> > snapshotter,
>> > > > > which
>> > > > > > > >
>> > > > > > > >- Supports existing SnapshotCopier features
>> > > > > > > >- Add option to export a Hudi dataset to plain parquet
>> files
>> > > > > > > >   - output latest records via Spark dataframe writer
>> > > > > > > >   - remove Hudi metadata fields
>> > > > > > > >   - support custom repartition requirements
>> > > > > > > >
>> > > > > > > > Is this a good idea to start an RFC?
>> > > > > > > >
>> > > > > > > > Thank you.
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > > Raymond Xu
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>


RFC process step 1 votes

2019-11-12 Thread Shiyan Xu
Hi all,

As per the RFC process
https://cwiki.apache.org/confluence/display/HUDI/RFC+Process

We usually start with an email thread to raise an idea before step 2:
creating an RFC page.

It'll be good to reach an agreement on how many votes (+1) we need to
proceed to step 2.

The idea is not to make it too easy, otherwise it defeats the purpose of
setting step 1, but also not to make the bar too high.

I would like to propose at least 5 votes (+1) needed to proceed:
- 2 votes from the committers, and
- 3 votes from the community (can be committer or non-committer)

Considering the growing community, we can increase the numbers as
deemed necessary.

Any thoughts? How about 5 votes to pass this meta-voting? :)


- Raymond


[QUESTION] Handle record partition change

2019-12-11 Thread Shiyan Xu
Hi Hudi devs,

Upon upsert operations, does Hudi detect a record's partition path change?
For the same record, the partition path field may get updated while the
record key (the primary ID) stays the same; the upsert would then result in
duplicate records (based on record key) in the dataset. Is there any
relevant logic for this kind of detection and/or clean-up in the codebase?
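To make the scenario concrete, a toy sketch of the write sequence in question
(field names and values hypothetical, assuming a SparkSession `spark`):

    import spark.implicits._
    // Version 1: record lands in partition 1990/01/01.
    val v1 = Seq(("id1", "1990/01/01", "Alice")).toDF("id", "birthday", "name")
    // ... upsert v1 with record key = id, partition path = birthday ...
    // Version 2: same key, corrected partition path field.
    val v2 = Seq(("id1", "1991/01/01", "Alice")).toDF("id", "birthday", "name")
    // ... upsert v2: if lookup is scoped to the partition, the key is not
    // found under 1991/01/01, so it is written as a fresh insert there,
    // leaving two "id1" rows across partitions.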

Best,
Raymond


Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Shiyan Xu
[1] +1; "query" indeed sounds better
[2] +1 on the term "snapshot"; so basically we follow the convention that
when we say "snapshot", it means "give me the most up-to-date facts (lowest
data latency) even if it takes some query time"
[3] Though I agree with the renaming, I have a different perspective to
raise on the table types:
MOR is a superset of COW; I suppose a user can theoretically configure a
Hudi streamer to write to a MOR table and make it behave equivalently to a
COW table, am I right? I imagine that involves scheduling compaction/clean
right after write operations and hence making RO view and RT view close to
each other. So what will be the advantage of defining COW tables if MOR can
do everything? Would be happy to get more insights on the benefits of
defining COW over MOR.
So based on COW being a subset yielded from a special configuration of
MOR, my thoughts are: can we just keep MOR and deprecate COW? In cases where
users don't need the RT view, can we provide a flag like
"--disable-realtime-view/query" to help achieve the original COW features?
So back to the renaming: if COW can be achieved by changing configs of MOR,
then we could potentially save the hassle of renaming and just deprecate
the type.
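To illustrate the thought, a hedged sketch of "COW as a specially configured
MOR" at the datasource level (option keys are as I recall them and should be
treated as assumptions; `df` and `basePath` are hypothetical):

    // Run compaction inline after every delta commit, so the RO and RT
    // views stay (nearly) identical. Other required write configs (table
    // name, record key, etc.) are omitted for brevity.
    df.write.format("org.apache.hudi")
      .option("hoodie.datasource.write.storage.type", "MERGE_ON_READ")
      .option("hoodie.compact.inline", "true")
      .option("hoodie.compact.inline.max.delta.commits", "1")
      .mode("append")
      .save(basePath)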

On Mon, Nov 11, 2019 at 5:05 PM Bhavani Sudha 
wrote:

> +1 on all three rename proposals. I think this would make the concepts
> super easy to follow for new users.
>
> If changing [3] seems to be a stretch, we should definitely do [1] & [2] at
> the least IMO. I will be glad to help out on the renames to whatever extent
> possible should the Hudi community incline to pursue this.
>
> Thanks,
> Sudha
>
>
>
> On Mon, Nov 11, 2019 at 3:46 PM Vinoth Chandar  wrote:
>
> > Hello all,
> >
> > I wanted to raise an important topic with the community around whether we
> > should rename some of our terminologies in code/docs to be more
> > user-friendly and understandable..
> >
> > Let me also provide some context for each, since I am probably guilty of
> > introducing most of them in the first place :).
> >
> > *1. Rename "views" to "query" : *Instead of saying incremental view or
> > read-optimized view, talk about them as "incremental query" and
> > "read-optimized query". The term "view" is very technical, and what I was
> > trying to convey was that we ingest/store the data once and expose views
> on
> > top. But new users (at least half a dozen of them to me) tend to confuse
> this
> > with views/materialized views found in databases. Almost always we talk
> > about views mostly in terms of expected behavior for a query on the
> view. I
> > am proposing to just call these different query types since it's a more
> > universally accepted terminology and IMO clearer.
> >
> > *2. Rename "Read-Optimized/Realtime" views to Snapshot views + Have
> > Read-Optimized view only for MOR storage :* This one is probably the
> > trickiest. Hudi was always designed with MOR in mind, even as we were
> > working on COW storage and consequently we named the pure parquet backed
> > view as Read-Optimized, hoping to name parquet + avro based view as
> > Write-Optimized. However, we opted to name it Realtime to emphasize the
> > data freshness aspect. In retrospect, the views should not have been
> named
> > after their performance characteristics but rather the classes of queries
> > done on them and guarantees for those (point above #1). Moreover, once we
> > have parquet embedded into the log format, then the tradeoffs may not be
> > the same anyways.
> >
> > So combining with the renaming proposed in #1, we would end up with the
> > following..
> >
> > Copy-On-Write :
> > [Old]  Read-Optimized View =>  [New] Snapshot Query
> > [Old]  Incremental View => [New] Incremental Query
> >
> > Merge-On-Read:
> > [Old] Realtime View => [New] Snapshot Query
> > [Old] Incremental View => [New] Incremental Query
> > [Old] Read-Optimized View => [New] Read-Optimized Query (since it is read
> > optimized compared to Snapshot query always, at the cost of staler data)
> >
> > Both changes #1 & #2 could be simpler changes to just code references,
> docs
> > and configs.. we can support both strings for some time and deprecate
> > eventually since queries are stateless.
> >
> > *3. Rename COPY_ON_WRITE to MERGE_ON_WRITE :* Name originated since the
> > design was very similar to https://en.wikipedia.org/wiki/Copy-on-write
> > filesystems
> > & snapshotting and we once hoped to push some of this logic into the
> > storage itself, all in vain. But the name stuck, even though once we had
> > MERGE_ON_READ the focus was often on merge costs etc, which the name
> > COPY_ON_WRITE does not convey directly. I don't feel very strongly about
> this
> > and there is also a cost to changing this since it's persisted inside
> > hoodie.properties and we will support both strings internally in code for
> > backwards compatibility anyway
> >
> > Naming something is very hard (yes, try :)). I believe these changes will
> > make the project simpler to 

Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-18 Thread Shiyan Xu
Thank you @lamber-ken for the work! It is definitely a much better browsing
experience.

On Tue, Dec 17, 2019 at 8:28 PM lamberken  wrote:

>
> Hi, @Vinoth
>
>
>
> I'm glad to hear your thoughts on the new UI, thanks. So we'll keep its
> style as it is now.
> The development of the new UI can be completed in the coming days; any
> questions are welcome.
>
>
> best,
> lamber-ken
>
>
> At 2019-12-18 11:44:27, "Vinoth Chandar" 
> wrote:
> >The case for right navigation for me, is mainly from pages like
> >
> >https://lamber-ken.github.io/docs/docker_demo
> >https://lamber-ken.github.io/docs/querying_data
> >https://lamber-ken.github.io/docs/writing_data
> >
> >which often have commands/text you want to selectively copy paste from a
> >section.
> >For content you read sequentially, it matters less. I agree..
> >
> >BTW the new site looks very sleek.. :)
> >
> >
> >
> >On Tue, Dec 17, 2019 at 4:50 PM lamberken  wrote:
> >
> >>
> >> hi, all
> >>
> >> One more thing that is missing. In the new UI, I put a "BACK TO TOP"
> >> button at the bottom of all pages to help us get back to the top.
> >> We can also discuss whether we need the right navigation at the
> >> community meeting today.
> >>
> >> best,
> >> lamber-ken
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> At 2019-12-18 08:41:49, "lamberken"  wrote:
> >> >
> >> >Hi @Vinoth,
> >> >
> >> >
> >> >Thanks for raising this point, but I have some different views.
> >> >
> >> >
> >> >I've thought about it very seriously before, and I removed the right
> >> >navigation in the end.
> >> >1. I did a deep analysis of the characteristics of our documents: most
> >> >of them have many commands, and if the right navigation exists, it
> >> >gets in the way of reading.
> >> >2. Most documents are short; we can read each of them on a single page.
> >> >3. The max width of the web page is 1280px; the left navigation is
> >> >250px (at least) and the right navigation is 250px (at least). If so,
> >> >only 800px is left for the main content, which may not be suitable
> >> >for readers.
> >> >4. I also analyzed other projects:
> >> >1) flink, spark, zeppelin, kafka, superset, elasticsearch, arrow,
> >> >kudu, hadoop don't have right navigation
> >> >2) druid, kylin, beam have right navigation.
> >> >These are my personal views. Welcome all community members to join in
> >> >the discussion.
> >> >In the end, I will follow our community, thanks.
> >> >
> >> >
> >> >BTW, I have synced most of the documents[1], we can use these documents
> >> as a reference to see
> >> >if we need the navigation bar on the right in the new UI.
> >> >
> >> >
> >> >[1] https://lamber-ken.github.io/docs/admin_guide
> >> >[2] https://lamber-ken.github.io/docs/writing_data
> >> >[3] https://lamber-ken.github.io/docs/quick-start-guide/
> >> >
> >> >
> >> >best,
> >> >lamber-ken
> >> >
> >> >
> >> >
> >> >
> >> >At 2019-12-18 04:44:04, "Vinoth Chandar"  wrote:
> >> >>One more thing that is missing.
> >> >>
> >> >>Current site has a navigation links on the right, which lets you jump
> to
> >> >>the right section directly. This is also a must-have IMHO.
> >> >>I would suggest wait for more folks to come back from vacation,
> before we
> >> >>finalize anything on this, as there could be more feedback
> >> >>
> >> >>
> >> >>
> >> >>On Mon, Dec 16, 2019 at 9:15 PM lamberken  wrote:
> >> >>
> >> >>>
> >> >>> Hi Vinoth,
> >> >>>
> >> >>>
> >> >>> 1, I'll update the site content this week, clean some useless
> >> >>> template code, adjust the content, etc.
> >> >>> It will take a while to sync all the content.
> >> >>> 2, I will adjust the style as much as I can to keep the theming blue
> >> and
> >> >>> white.
> >> >>>
> >> >>>
> >> >>> When the above work is completed, I will notify you all again.
> >> >>> best,
> >> >>> lamber-ken
> >> >>>
> >> >>>
> >> >>> At 2019-12-17 12:49:23, "Vinoth Chandar"  wrote:
> >> >>> >Hi Lamber,
> >> >>> >
> >> >>> >+1 on the look and feel. Definitely feels slick and fast. Love the
> >> syntax
> >> >>> >highlighting.
> >> >>> >
> >> >>> >
> >> >>> >Few things :
> >> >>> >- Can we just update the site content as-is? ( I'd rather change
> just
> >> the
> >> >>> >look-and-feel and evolve the content from there, per usual means)
> >> >>> >- Can we keep the theming blue and white, like now, since it gels
> well
> >> >>> with
> >> >>> >the logo and images.
> >> >>> >
> >> >>> >
> >> >>> >On Mon, Dec 16, 2019 at 8:02 AM lamberken 
> wrote:
> >> >>> >
> >> >>> >>
> >> >>> >>
> >> >>> >> Thanks for your reply @lees @vino @vinoth :)
> >> >>> >>
> >> >>> >>
> >> >>> >> best,
> >> >>> >> lamber-ken
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> On 2019-12-16 12:24:26, "leesf"  wrote:
> >> >>> >> >Hi Lamber,
> >> >>> >> >
> >> >>> >> >Thanks for your work, have gone through the new web ui, looks
> good.
> >> >>> >> >Hence +1 from my side.
> >> >>> >> >
> >> >>> >> >Best,
> >> >>> >> >Leesf
> >> >>> >> >
> >> >>> >> >On Mon, Dec 16, 2019 at 10:17 AM vino yang  wrote:
> >> >>> >> >
> >> >>> >> >> Hi 

Re: [QUESTION] Handle record partition change

2019-12-18 Thread Shiyan Xu
Hi Sivabalan,

Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to be
looked up across different partitions. This is indeed helpful in the
situation where the same record key gets an update to its partition path.

Now I'm thinking that when we "tagLocationBacktoRecords
<https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112>",
we could potentially create a delete operation for the record in the old
partition while keeping the incoming insert operation for it in the new
partition. This is crucial for avoiding duplicate records (with the same
record key) in the Hudi dataset. Is this functionality already
implemented? I might have missed some part of the logic in the codebase.
Please kindly point out any misunderstanding on my part.

Thank you.

Best,
Raymond

On Wed, Dec 11, 2019 at 11:16 AM Sivabalan  wrote:

> Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I know
> which one are you talking about?
>
>
> On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu 
> wrote:
>
> > Hi Hudi devs,
> >
> > Upon upsert operations, does Hudi detect record's partition path change?
> As
> > for the same record, the partition path field may get updated while the
> > record key (the primary id) stays the same, then the insert would result
> in
> > duplicate record (based on record key) in the dataset. Is there any
> > relevant logic of this kind of detection and/or clean-up in the codebase?
> >
> > Best,
> > Raymond
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [QUESTION] Handle record partition change

2019-12-18 Thread Shiyan Xu
Thanks Sivabalan. Exactly, that's what I meant.
I can think of a use case for option 2: a Hudi dataset manages people info
and is partitioned by birthday. In most cases where people's info is updated,
birthdays don't change (that's why we chose it as the partition field). But
in some edge cases, birthday info is input wrongly and we want to manually
fix it, or we allow users to update it occasionally. In these cases, option 2
would be helpful in keeping records in the expected partition, so that a
query like "show me people who were born after 2000" would work.

I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could help
support both options.
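To spell out the two options under such a toggle, a self-contained toy sketch
(all names hypothetical; this is not Hudi's actual index code):

    case class Rec(key: String, partition: String, data: String)

    // `existing` maps record key -> current location; `incoming` is the upsert.
    def tag(existing: Map[String, Rec], incoming: Rec,
            migratePartition: Boolean): Seq[(String, Rec)] =
      existing.get(incoming.key) match {
        case Some(old) if old.partition != incoming.partition =>
          if (migratePartition)
            // Option 2: delete from the old partition, insert into the new one.
            Seq("DELETE" -> old, "INSERT" -> incoming)
          else
            // Option 1: ignore the new partition value; update in place.
            Seq("UPDATE" -> incoming.copy(partition = old.partition))
        case Some(_) => Seq("UPDATE" -> incoming)
        case None    => Seq("INSERT" -> incoming)
      }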

On Wed, Dec 18, 2019 at 10:32 AM Sivabalan  wrote:

> Raymond,
>  The patch <https://github.com/apache/incubator-hudi/pull/1091> which
> I
> have put up works differently. If initial record is in Partition1, and
> updates are sent to Partition2, we silently update the record in
> Partition1. Guess you are asking for opposite, i.e. insert in Partition2
> and delete record in Partition1. I am not sure about the usability of this
> in general. Let's ask our experts in our group.
>
> @vinoth, balaji and others:
> Do we support both functionality or just one. If we plan to support both,
> then it might incur api changes. or we could tackle with a config as well.
>
> Here is the use-case.
> - Insert record1 to partition1 with global bloom.
> - Update record1 with partition set to partition2(different partition
> compared to where the record is present as of now).
>
> Option1:
> Update record1 to Partition1 and do nothing in Partition2.
>- Since with global bloom, the primary key is just the record key and
> hence partition is ignored.
>
> Option2:
> Insert a new record, record1 to Partition2. and Delete record1 from
> Partition1.
>
> I have already put up a patch for Option1. but looks like Raymond is
> looking for Option2.
>
>
>
>
>
> On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu 
> wrote:
>
> > Hi Sivabalan,
> >
> > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records to
> be
> > looked up in different partitions. This is indeed helpful in the
> situation
> > where the same record key gets updated on its partition path.
> >
> > Now I'm thinking when we "tagLocationBacktoRecords
> > <
> >
> https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112
> > >",
> > we could potentially create a delete operation for the record in the old
> > partition while keeping the incoming insert operation for it in the new
> > partition. This is crucial for avoiding duplicate records (with the same
> > record keys) in the Hudi dataset. Is this some functionality already
> > implemented? I might have missed some part of the logic from the
> codebase.
> > Please kindly point out if I got any misunderstanding.
> >
> > Thank you.
> >
> > Best,
> > Raymond
> >
> > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan  wrote:
> >
> > > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM. May I
> > know
> > > which one are you talking about?
> > >
> > >
> > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu  >
> > > wrote:
> > >
> > > > Hi Hudi devs,
> > > >
> > > > Upon upsert operations, does Hudi detect record's partition path
> > change?
> > > As
> > > > for the same record, the partition path field may get updated while
> the
> > > > record key (the primary id) stays the same, then the insert would
> > result
> > > in
> > > > duplicate record (based on record key) in the dataset. Is there any
> > > > relevant logic of this kind of detection and/or clean-up in the
> > codebase?
> > > >
> > > > Best,
> > > > Raymond
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [QUESTION] Handle record partition change

2019-12-18 Thread Shiyan Xu
Sure. I can create a JIRA and note down the discussion points there.

On Wed, Dec 18, 2019 at 7:14 PM Vinoth Chandar  wrote:

> Interesting discussion. We can file a JIRA for option 2? It seems to also
> make the semantics  simpler.
>
> On Wed, Dec 18, 2019 at 11:21 AM Shiyan Xu 
> wrote:
>
> > Thanks Sivabalan. Exactly, that's what I meant.
> > I can think of a usecase for option 2: a Hudi dataset manages people info
> > and partitioned by birthday. In most cases, where people info are
> updated,
> > birthdays are not to be changed (that's why we choose it as partition
> > field). But in some edge cases where birthday info are input wrongly and
> we
> > want to manually fix it or allow user to updated it occasionally. In this
> > case, option 2 would be helpful in keeping records in the expected
> > partition, so that a query like "show me people who were born after 2000"
> > would work.
> >
> > I guess a configuration like "MIGRATE_RECORD_PARTITION=true" could help
> > achieve both options.
> >
> > On Wed, Dec 18, 2019 at 10:32 AM Sivabalan  wrote:
> >
> > > Raymond,
> > >  The patch <https://github.com/apache/incubator-hudi/pull/1091>
> > which
> > > I
> > > have put up works differently. If initial record is in Partition1, and
> > > updates are sent to Partition2, we silently update the record in
> > > Partition1. Guess you are asking for opposite, i.e. insert in
> Partition2
> > > and delete record in Partition1. I am not sure about the usability of
> > this
> > > in general. Let's ask our experts in our group.
> > >
> > > @vinoth, balaji and others:
> > > Do we support both functionality or just one. If we plan to support
> both,
> > > then it might incur api changes. or we could tackle with a config as
> > well.
> > >
> > > Here is the use-case.
> > > - Insert record1 to partition1 with global bloom.
> > > - Update record1 with partition set to partition2(different partition
> > > compared to where the record is present as of now).
> > >
> > > Option1:
> > > Update record1 to Partition1 and do nothing in Partition2.
> > >- Since with global bloom, the primary key is just the record key
> and
> > > hence partition is ignored.
> > >
> > > Option2:
> > > Insert a new record, record1 to Partition2. and Delete record1 from
> > > Partition1.
> > >
> > > I have already put up a patch for Option1. but looks like Raymond is
> > > looking for Option2.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 18, 2019 at 8:48 AM Shiyan Xu  >
> > > wrote:
> > >
> > > > Hi Sivabalan,
> > > >
> > > > Sorry for the late reply. I now see that GLOBAL_BLOOM allows records
> to
> > > be
> > > > looked up in different partitions. This is indeed helpful in the
> > > situation
> > > > where the same record key gets updated on its partition path.
> > > >
> > > > Now I'm thinking when we "tagLocationBacktoRecords
> > > > <
> > > >
> > >
> >
> https://github.com/apache/incubator-hudi/blob/2745b7552f2f2ee7a61d3ea49139ef2af3ffe13f/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L112
> > > > >",
> > > > we could potentially create a delete operation for the record in the
> > old
> > > > partition while keeping the incoming insert operation for it in the
> new
> > > > partition. This is crucial for avoiding duplicate records (with the
> > same
> > > > record keys) in the Hudi dataset. Is this some functionality already
> > > > implemented? I might have missed some part of the logic from the
> > > codebase.
> > > > Please kindly point out if I got any misunderstanding.
> > > >
> > > > Thank you.
> > > >
> > > > Best,
> > > > Raymond
> > > >
> > > > On Wed, Dec 11, 2019 at 11:16 AM Sivabalan 
> wrote:
> > > >
> > > > > Depends on whether you are using regular BLOOM or GLOBAL_BLOOM.
> May I
> > > > know
> > > > > which one are you talking about?
> > > > >
> > > > >
> > > > > On Wed, Dec 11, 2019 at 9:12 AM Shiyan Xu <
> > xu.shiyan.raym...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Hudi devs,
> > > > > >
> > > > > > Upon upsert operations, does Hudi detect record's partition path
> > > > change?
> > > > > As
> > > > > > for the same record, the partition path field may get updated
> while
> > > the
> > > > > > record key (the primary id) stays the same, then the insert would
> > > > result
> > > > > in
> > > > > > duplicate record (based on record key) in the dataset. Is there
> any
> > > > > > relevant logic of this kind of detection and/or clean-up in the
> > > > codebase?
> > > > > >
> > > > > > Best,
> > > > > > Raymond
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > -Sivabalan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: HudiDeltaStreamer on EMR

2020-02-24 Thread Shiyan Xu
It's likely that the source parquet data has a column of Spark Timestamp
type (stored as INT96), which is not convertible to Avro.
By the way, ParquetDFSSource is not available in 0.5.0; it was only added in
0.5.1. You'll probably need to add a custom class that follows its
existing implementation, and get rid of it once EMR upgrades the Hudi version.
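If you control the upstream job that produces the source parquet, one way to
sidestep INT96 is to write timestamps as INT64 micros, which the Avro
converter can handle. A sketch (`sourceDf` is assumed to be the upstream
DataFrame; the path is taken from your command):

    // Spark 2.3+: emit TIMESTAMP_MICROS (INT64) instead of INT96.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    sourceDf.write.mode("overwrite").parquet("s3://xxx/Hoodi/")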

On Mon, Feb 24, 2020 at 10:41 AM Raghvendra Dhar Dubey
 wrote:

> Hi Team,
>
> I was trying to use HudiDeltaStreamer on EMR, which reads parquet data from
> S3 and writes data into a Hudi dataset, but I am running into an issue
> where AvroSchemaConverter is not able to convert INT96 ("INT96 not yet
> implemented"). The spark-submit command that I am using:
>
> spark-submit --class
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
> org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
> /usr/lib/hudi/hudi-utilities-bundle-0.5.0-incubating.jar --storage-type
> COPY_ON_WRITE --source-ordering-field action_date --source-class
> org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path
> s3://xxx/hudi_table --target-table hudi_table --payload-class
> org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
>
> hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://xxx/Hoodi/
>
> The error I am getting is:
>
> exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
> Lost task 0.3 in stage 0.0 (TID 3, ip-172-30-37-9.ec2.internal, executor
> 1): java.lang.IllegalArgumentException: INT96 not yet implemented. at
>
> org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
> at
>
> org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
> at
>
> org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:297)
> at
>
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263)
> at
>
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241)
> at
>
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
> at
>
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
> at
>
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
> at
>
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
> at
>
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> at
>
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:199)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:196)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:151) at
> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
> org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at
> org.apache.spark.scheduler.Task.run(Task.scala:123) at
>
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> Please help me into this.
>
> Thanks
> Raghvendra
>


Re: Weekly sync notes 20200225

2020-02-25 Thread Shiyan Xu
link
https://cwiki.apache.org/confluence/display/HUDI/20200225+Weekly+Sync+Minutes

On Tue, Feb 25, 2020 at 9:39 PM vbal...@apache.org 
wrote:

> Please find the weekly sync notes here
> 20200225 Weekly Sync Minutes - HUDI - Apache Software Foundation
>
> Thanks,
> Balaji.V


Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-24 Thread Shiyan Xu
+1 great read and valuable points!
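For readers skimming the thread: the core of the proposal below is a
persistent key-to-location map. A toy sketch of the contract (names
hypothetical, not the RFC's actual API):

    trait RecordLevelIndex {
      // recordKey -> (partitionPath, fileId), per the mapping in the RFC.
      def lookup(recordKey: String): Option[(String, String)]
      def update(recordKey: String, partitionPath: String, fileId: String): Unit
    }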

On Mon, 24 Feb 2020, 15:31 nishith agarwal,  wrote:

> +100
> - Reduces index lookup time hence improves job runtime
> - Paves the way for streaming style ingestion
> - Eliminates dependency on Hbase (alternate "global index" support at the
> moment)
>
> -Nishith
>
> On Mon, Feb 24, 2020 at 10:56 AM Vinoth Chandar  wrote:
>
> > +1 from me as well. This will be a product defining feature, if we can do
> > it/
> >
> > On Sun, Feb 23, 2020 at 6:27 PM vino yang  wrote:
> >
> > > Hi Sivabalan,
> > >
> > > Thanks for your proposal.
> > >
> > > Big +1 from my side, indexing for record granularity is really good for
> > > performance. It is also towards the streaming processing.
> > >
> > > Best,
> > > Vino
> > >
> > > > On Sun, Feb 23, 2020 at 12:52 AM Sivabalan  wrote:
> > >
> > > > As Apache Hudi is getting widely adopted, performance has become the
> > need
> > > > of the hour. This RFC focuses on improving performance of the Hudi
> > index
> > > > by introducing record level index. The proposal is to implement a new
> > > index
> > > > format that is a mapping of (recordKey <-> partition, fileId) or
> > > > ((recordKey, partitionPath) → fileId). This mapping will be stored
> and
> > > > maintained by Hudi as another implementation of HoodieIndex. This
> > record
> > > > level indexing will definitely give a boost to both read and write
> > > > performance.
> > > >
> > > > Here
> > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+08+%3A+Record+level+indexing+mechanisms+for+Hudi+datasets
> > > > >
> > > > is the link to RFC.
> > > >
> > > > Appreciate your review and thoughts.
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>


Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Thanks. After reading the discussion in HUDI-561, I just realized that the
previously-mentioned built-in partition transformer is better suited as a
custom key generator. Hopefully other suitable ideas for built-in
transformers will come up later.

On Sun, Feb 23, 2020 at 6:34 PM vino yang  wrote:

> Hi Shiyan,
>
> Really sorry, I forgot to attach the reference, the relevant Jira ID is
> HUDI-561: https://issues.apache.org/jira/browse/HUDI-561
>
> It seems both of you faced the same issue. While the solution is not the
> same. Never mind, you can move the discussion to that issue.
>
> Best,
> Vino
>
>
> On Mon, Feb 24, 2020 at 10:21 AM Shiyan Xu  wrote:
>
> > Thanks Vino. Are you referring to HUDI-613? How about making it an
> umbrella
> > task due to its big scope? (btw it is stated as "bug", which should be
> > fixed too). I can create another specific task under it for the idea of
> > datetime -> partition path transformer, if it makes sense.
> >
> > On Sun, Feb 23, 2020 at 5:57 PM vino yang  wrote:
> >
> > > Hi Shiyan,
> > >
> > > Thanks for rasing this thread up again and sharing your thoughts. They
> > are
> > > valuable.
> > >
> > > Regarding the date-time specific transform, there is an issue[1] that
> > > describes this business requirement.
> > >
> > > Best,
> > > Vino
> > >
> > > On Mon, Feb 24, 2020 at 7:22 AM Shiyan Xu  wrote:
> > >
> > > > Late to the party. :P
> > > >
> > > > I really favor the idea of built-in support enrichment. It is a very
> > > common
> > > > case where we want to set datetime fields for partition path. We
> could
> > > have
> > > > a built-in support to normalize ISO format / unix timestamp. For
> > example
> > > > `HourlyPartitionTransformer` will normalize whatever field user
> > specified
> > > > as partition path. Let's say user set `create_ts` as partition path
> > > field,
> > > > the transfromer will apply change create_ts => _hoodie_partition_path
> > > >
> > > >
> > > >- 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> > > >- 1582497702.123456789 => 2020/02/23/22
> > > >
> > > > Does that make sense? If so, I may file a jira for this.
> > > >
> > > > As for FilterTransformer or FlatMapTransformer which is designed for
> > > > generic purpose, they seem to belong to Spark or Flink's realm.
> > > > You can do these 2 transformation with Spark Dataset now. Or once
> > > > decoupled from Spark, you'll probably have an abstract Dataset class
> > > > to perform engine-agnostic transformation
> > > >
> > > > My understanding of transformer in HUDI is more specifically
> purposed,
> > > > where the underlying transformation is handled by the actual
> > > > processing engine (Spark or Flink)
> > > >
> > > >
> > > > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Thanks Hamid and Vinoyang for the great discussion
> > > > >
> > > > > On Fri, Feb 14, 2020 at 5:18 AM vino yang 
> > > wrote:
> > > > >
> > > > > > I have filed a Jira issue[1] to track this work.
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > > > >
> > > > > > On Thu, Feb 13, 2020 at 9:51 PM vino yang  wrote:
> > > > > >
> > > > > > > Hi hamid,
> > > > > > >
> > > > > > > Agree with your opinion.
> > > > > > >
> > > > > > > Let's move forward step by step.
> > > > > > >
> > > > > > > Will file an issue to track refactor about Transformer.
> > > > > > >
> > > > > > > Best,
> > > > > > > Vino
> > > > > > >
> > > > > > >> On Thu, Feb 13, 2020 at 6:38 PM hamid pirahesh  wrote:
> > > > > > >
> > > > > > >> I think it is a good idea to decouple  the transformer from
> > spark
> > > so
> > > > > > that
> > > > > > >> it can be used with other flow engines.
> > > > > > >> Once you do that, then it is worth considering a much bigger
> > play
> > > > > rather
> > > > > > >> than another incremental pl

Re: Apache Hudi on AWS EMR

2020-02-27 Thread Shiyan Xu
 Hudi DeltaStreamer supports parquet files. You can do a bulkInsert
> > for the
> > > > first job, then use DeltaStreamer for the upsert job.
> > > >
> > > > 3 - What should be the parquet file size and row group size for
> > better
> > > > performance on querying Hudi Dataset?
> > > > --
> > > > That depends on the query engine you are using and it should be
> > documented
> > > > somewhere. For impala, the optimal size for query performance is
> > 256MB, but
> > > > the larger file size will make upsert more expensive. The size I
> > personally
> > > > choose is 100MB to 128MB.
> > > >
> > > > Thanks,
> > > > Gary
> > > >
> > > >
> > > >
> > > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> > 
> > > > wrote:
> > > >
> > > > > Athena is indeed Presto inside, but there is lot of custom code
> > which has
> > > > > gone on top of Presto there.
> > > > > Couple months back I tried running a glue crawler to catalog a
> > Hudi data
> > > > > set and then query it from Athena. The results were not same as
> > what I
> > > > > would get with running the same query using spark SQL on EMR.
> > Did not try
> > > > > Presto on EMR, but assuming it will work fine on EMR.
> > > > >
> > > > > Athena integration with Hudi data set is planned shortly, but
> > not sure of
> > > > > the date yet.
> > > > >
> > > > > However, recently Athena started supporting integration to a
> > Hive catalog
> > > > > apart from Glue. What that means is in Athena, if I connect to
> > the Hive
> > > > > catalog on EMR, which is able to provide the Hudi views
> > correctly, I
> > > > should
> > > > > be able to get correct results on Athena. Have not tested it
> > though. The
> > > > > feature is in Preview already.
> > > > >
> > > > > Thanks
> > > > > Raghu
> > > > > -Original Message-
> > > > > From: Shiyan Xu 
> > > > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > > > To: dev@hudi.apache.org
> > > > > Cc: Mehrotra, Udit ; Raghvendra Dhar Dubey
> > > > > 
> > > > > Subject: Re: Apache Hudi on AWS EMR
> > > > >
> > > > > For 2) I think running presto on EMR is able to let you run
> > > > read-optimized
> > > > > queries.
> > > > > I don't quite understand how exactly Athena not support Hudi as
> > it is
> > > > > Presto underlying.
> > > > > Perhaps @Udit could give some insights from AWS?
> > > > >
> > > > > As @Raghvendra you mentioned, another option is to export Hudi
> > dataset to
> > > > > plain parquet files for Athena to query on
> > > > > RFC-9 is for this usecase
> > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > > > The task is inactive now. Feel free to pick up if this is
> > something you'd
> > > > > like to work on. I'd be happy to help with that.
> > > > >
> > > > >
> > > > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <
> > vin...@apache.org>
> > > > wrote:
> > > > >
> > > > > > Hi Raghvendra,
> > > > > >
> > > > > > Quick sidebar.. Please subscribe to the mailing list, so your
> > message
> > > > > > get published automatically. :)
> > > > > >
> > > > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> > > > > >  wrote:
> > > > > >
> > > > > > > Hi Udit,
> > > > > > >
> > > > > > > Thanks for information.
> > > > > > > Actually I am struggling on following points
> > > > > > > 1 - How can we process 

Re: Apache Hudi on AWS EMR

2020-02-17 Thread Shiyan Xu
For 2) I think running Presto on EMR lets you run read-optimized
queries.
I don't quite understand how exactly Athena does not support Hudi, as it is
Presto underneath.
Perhaps @Udit could give some insights from AWS?

As you mentioned, @Raghvendra, another option is to export the Hudi dataset
to plain parquet files for Athena to query.
RFC-9 is for this use case:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
The task is inactive now. Feel free to pick it up if this is something you'd
like to work on. I'd be happy to help with that.


On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar  wrote:

> Hi Raghvendra,
>
> Quick sidebar.. Please subscribe to the mailing list, so your messages get
> published automatically. :)
>
> On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
>  wrote:
>
> > Hi Udit,
> >
> > Thanks for the information.
> > Actually I am struggling with the following points:
> > 1 - How can we process S3 parquet files (hourly partitioned) through
> > Apache Hudi? Is there any streaming layer we need to introduce?
> > 2 - Is there any workaround to query a Hudi dataset from Athena? We are
> > thinking of dumping the resulting Hudi dataset to S3, and then querying
> > it from Athena.
> > 3 - What should be the parquet file size and row group size for better
> > performance when querying a Hudi dataset?
> >
> > Thanks
> > Raghvendra
> >
> >
> > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit 
> wrote:
> >
> > > Hi Raghvendra,
> > >
> > > You would have to re-write you Parquet Dataset in Hudi format. Here are
> > > the links you can follow to get started:
> > >
> > >
> >
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > >  wrote:
> > >
> > > Hi Team,
> > >
> > > I want to setup incremental view of my AWS S3 parquet data through
> > > Apache
> > > Hudi, and want to query this data through Athena, but currently
> > Athena
> > > not
> > > supporting Hudi Dataset.
> > >
> > > so there are few questions which I want to understand here
> > >
> > > 1 - How to stream s3 parquet file to Hudi dataset running on EMR.
> > >
> > > 2 - How to query Hudi Dataset running on EMR
> > >
> > > Please help me to understand this.
> > >
> > > Thanks
> > >
> > > Raghvendra
> > >
> > >
> > >
> >
>


Re: Snapshot from cold storage store and continues with latest data from biglog

2020-02-17 Thread Shiyan Xu
Hi Syed, as Vinoth mentioned, the HoodieSnapshotCopier is meant for this
purpose.

You may also read more on RFC-9, which plans to introduce a
backward-compatible tool to supersede HoodieSnapshotCopier:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
Unfortunately I'm not actively working on this. If you're interested, feel
free to pick it up. I'd be happy to help with that.

On Wed, Feb 12, 2020 at 7:25 PM Vinoth Chandar  wrote:

> Hi Syed,
>
> Apologies for the delay.  If you are using copy-on-write, you can look into
> savepoints (although I realize it's only exposed at the RDD API level).. We
> do have a tool called HoodieSnapshotCopier in hoodie-utilities, to take
> periodic copies/snapshots of a table for backup purposes, as of a given
> commit. Raymond (if you are here) has an RFC to enhance that even..
> Running the copier (please test it first, since it's not used in OSS that
> much IIUC) periodically, say every day, would achieve your goals I believe..
>
>
> https://github.com/apache/incubator-hudi/blob/c2c0f6b13d5b72b3098ed1b343b0a89679f854b3/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java
>
> Any issues in the tool would be simple to fix. Tool itself is couple
> hundred lines, that all.
>
> Thanks
> Vinoth
>
> On Mon, Feb 10, 2020 at 3:56 AM Syed Abdul Kather 
> wrote:
>
> > Yes. Also for restoring the data from cold storage.
> >
> > Use case here :
> > We stream data using Debezium and push to Kafka; we have retention in
> > Kafka of 7 days. In case the destination table created using Hudi gets
> > corrupted, or we need to repopulate it, we need a way that can help us
> > restore the data.
> >
> > Thanks and Regards,
> > S SYED ABDUL KATHER
> > *Data platform Lead @ Tathastu.ai*
> >
> > *+91 - 7411011661*
> >
> >
> > On Mon, Jan 13, 2020 at 10:17 PM Vinoth Chandar 
> wrote:
> >
> > > Hi Syed,
> > >
> > > If I follow correctly, are you asking how to do a bulk load first and
> > then
> > > use delta streamer on top of that dataset to apply binlogs from Kafka?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Mon, Jan 13, 2020 at 12:39 AM Syed Abdul Kather  >
> > > wrote:
> > >
> > > > Hi Team,
> > > >
> > > > We have onboarded a few tables that have a really huge number of
> > > > records (100M records). The plan is to enable the binlog for the
> > > > database; that is no issue, as the stream can handle the load. But
> > > > for loading the snapshot, we have used Sqoop to import the whole
> > > > table to S3.
> > > >
> > > > What do we require here?
> > > > Can we load the whole Sqooped dump into a Hudi table, and then use
> > > > the stream (binlog data comes via Kafka)?
> > > >
> > > > Thanks and Regards,
> > > > S SYED ABDUL KATHER
> > > >  *Bigdata l...@tathastu.ai*
> > > > *   +91-7411011661*
> > > >
> > >
> >
>


Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Late to the party. :P

I really favor the idea of enriching the built-in support. It is a very
common case where we want to set datetime fields as the partition path. We
could have built-in support to normalize ISO format / unix timestamps. For
example, an `HourlyPartitionTransformer` would normalize whatever field the
user specified as the partition path. Let's say the user sets `create_ts` as
the partition path field; the transformer will apply the change create_ts =>
_hoodie_partition_path (a sketch follows the examples):


   - 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
   - 1582497702.123456789 => 2020/02/23/22
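A hedged sketch of such a transformer (function and column names hypothetical;
the regex dispatch is just one way to tell epoch values from ISO strings):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Normalize `field` (ISO-8601 string or unix timestamp) into a
    // yyyy/MM/dd/HH partition path column.
    def hourlyPartition(df: DataFrame, field: String): DataFrame = {
      val ts = when(col(field).rlike("^[0-9]+(\\.[0-9]+)?$"),
                    col(field).cast("double").cast("timestamp"))
        .otherwise(to_timestamp(col(field)))
      df.withColumn("_hoodie_partition_path", date_format(ts, "yyyy/MM/dd/HH"))
    }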

Does that make sense? If so, I may file a jira for this.

As for FilterTransformer or FlatMapTransformer, which are designed for
generic purposes, they seem to belong to Spark or Flink's realm.
You can do these 2 transformations with a Spark Dataset now (see the
one-liner below). Or, once decoupled from Spark, you'll probably have an
abstract Dataset class to perform engine-agnostic transformations.
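For instance, the generic filter case is already a one-liner on a Spark
Dataset (column name hypothetical; `df` is assumed to be the input):

    import org.apache.spark.sql.functions.col
    // What a generic FilterTransformer would boil down to:
    val filtered = df.filter(col("event_type") === "click")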

My understanding is that a transformer in Hudi is more specifically purposed,
where the underlying transformation is handled by the actual
processing engine (Spark or Flink).


On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar  wrote:

> Thanks Hamid and Vinoyang for the great discussion
>
> On Fri, Feb 14, 2020 at 5:18 AM vino yang  wrote:
>
> > I have filed a Jira issue[1] to track this work.
> >
> > [1]: https://issues.apache.org/jira/browse/HUDI-613
> >
> > On Thu, Feb 13, 2020 at 9:51 PM vino yang  wrote:
> >
> > > Hi hamid,
> > >
> > > Agree with your opinion.
> > >
> > > Let's move forward step by step.
> > >
> > > Will file an issue to track refactor about Transformer.
> > >
> > > Best,
> > > Vino
> > >
> > >> On Thu, Feb 13, 2020 at 6:38 PM hamid pirahesh  wrote:
> > >
> > >> I think it is a good idea to decouple  the transformer from spark so
> > that
> > >> it can be used with other flow engines.
> > >> Once you do that, then it is worth considering a much bigger play
> rather
> > >> than another incremental play.
> > >> Given the scale of Hudi, we need to look at airflow, particularly in
> the
> > >> context of what google is doing with Composer, addressing autoscaling,
> > >> scheduleing, monitoring, etc.
> > >> You need all of that to manage a serious tetl/elt flow.
> > >>
> > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang 
> wrote:
> > >>
> > >> > Currently, Hudi has a component that has not been widely used:
> > >> Transformer.
> > >> > As we all know, before the original data fell into the data lake, a
> > very
> > >> > common operation is data preprocessing and ETL. This is also the
> most
> > >> > common use scenario of many computing engines, such as Flink and
> > Spark.
> > >> Now
> > >> > that Hudi has taken advantage of the power of the computing engine,
> it
> > >> can
> > >> > also naturally take advantage of its ability of data preprocessing.
> We
> > >> can
> > >> > refactor the Transformer to make it become more flexible. To
> > summarize,
> > >> we
> > >> > can refactor from the following aspects:
> > >> >
> > >> >- Decouple Transformer from Spark
> > >> >- Enrich the Transformer and provide built-in transformer
> > >> >- Support Transformer-chain
> > >> >
> > >> > For the first point, the Transformer interface is tightly coupled
> with
> > >> > Spark in design, and it contains a Spark-specific context. This
> makes
> > it
> > >> > impossible for us to take advantage of the transform capabilities
> > >> provided
> > >> > by other engines (such as Flink) after supporting multiple engines.
> > >> > Therefore, we need to decouple it from Spark in design.
> > >> >
> > >> > For the second point, we can enhance the Transformer and provide
> some
> > >> > out-of-the-box Transformers, such as FilterTransformer,
> > >> FlatMapTrnasformer,
> > >> > and so on.
> > >> >
> > >> > For the third point, the most common pattern for data processing is
> > the
> > >> > pipeline model, and the common implementation of the pipeline model
> is
> > >> the
> > >> > responsibility chain model, which can be compared to the Apache
> > commons
> > >> > chain[1], combining multiple Transformers can make data-processing
> > >> become
> > >> > more flexible and expandable.
> > >> >
> > >> > If we enhance the capabilities of Transformer components, Hudi will
> > >> provide
> > >> > richer data processing capabilities based on the computing engine.
> > >> >
> > >> > What do you think?
> > >> >
> > >> > Any opinions and feedback are welcome and appreciated.
> > >> >
> > >> > Best,
> > >> > Vino
> > >> >
> > >> > [1]: https://commons.apache.org/proper/commons-chain/
> > >> >
> > >>
> > >
> >
>


Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella
task due to its big scope? (BTW it is labeled as a "bug", which should be
fixed too). I can create another specific task under it for the idea of
datetime -> partition path transformer, if it makes sense.

On Sun, Feb 23, 2020 at 5:57 PM vino yang  wrote:

> Hi Shiyan,
>
> Thanks for rasing this thread up again and sharing your thoughts. They are
> valuable.
>
> Regarding the date-time specific transform, there is an issue[1] that
> describes this business requirement.
>
> Best,
> Vino
>
> On Mon, Feb 24, 2020 at 7:22 AM Shiyan Xu  wrote:
>
> > Late to the party. :P
> >
> > I really favor the idea of built-in support enrichment. It is a very
> common
> > case where we want to set datetime fields for partition path. We could
> have
> > a built-in support to normalize ISO format / unix timestamp. For example
> > `HourlyPartitionTransformer` will normalize whatever field user specified
> > as partition path. Let's say user set `create_ts` as partition path
> field,
> > the transfromer will apply change create_ts => _hoodie_partition_path
> >
> >
> >- 2020-02-23T22:41:42.123456789Z => 2020/02/23/22
> >- 1582497702.123456789 => 2020/02/23/22
> >
> > Does that make sense? If so, I may file a jira for this.
> >
> > As for FilterTransformer or FlatMapTransformer which is designed for
> > generic purpose, they seem to belong to Spark or Flink's realm.
> > You can do these 2 transformation with Spark Dataset now. Or once
> > decoupled from Spark, you'll probably have an abstract Dataset class
> > to perform engine-agnostic transformation
> >
> > My understanding of transformer in HUDI is more specifically purposed,
> > where the underlying transformation is handled by the actual
> > processing engine (Spark or Flink)
> >
> >
> > On Tue, Feb 18, 2020 at 11:00 AM Vinoth Chandar 
> wrote:
> >
> > > Thanks Hamid and Vinoyang for the great discussion
> > >
> > > On Fri, Feb 14, 2020 at 5:18 AM vino yang 
> wrote:
> > >
> > > > I have filed a Jira issue[1] to track this work.
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/HUDI-613
> > > >
> > > > On Thu, Feb 13, 2020 at 9:51 PM vino yang  wrote:
> > > >
> > > > > Hi hamid,
> > > > >
> > > > > Agree with your opinion.
> > > > >
> > > > > Let's move forward step by step.
> > > > >
> > > > > Will file an issue to track refactor about Transformer.
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > On Thu, Feb 13, 2020 at 6:38 PM hamid pirahesh  wrote:
> > > > >
> > > > >> I think it is a good idea to decouple  the transformer from spark
> so
> > > > that
> > > > >> it can be used with other flow engines.
> > > > >> Once you do that, then it is worth considering a much bigger play
> > > rather
> > > > >> than another incremental play.
> > > > >> Given the scale of Hudi, we need to look at airflow, particularly
> in
> > > the
> > > > >> context of what google is doing with Composer, addressing
> > autoscaling,
> > > > >> scheduleing, monitoring, etc.
> > > > >> You need all of that to manage a serious tetl/elt flow.
> > > > >>
> > > > >> On Thu, Feb 6, 2020 at 8:25 PM vino yang 
> > > wrote:
> > > > >>
> > > > >> > Currently, Hudi has a component that has not been widely used:
> > > > >> Transformer.
> > > > >> > As we all know, before the original data falls into the data lake, a
> > > > >> > very common operation is data preprocessing and ETL. This is also the
> > > > >> > most common use scenario of many computing engines, such as Flink
> > > > >> > and Spark. Now that Hudi has taken advantage of the power of the
> > > > >> > computing engine, it can also naturally take advantage of its data
> > > > >> > preprocessing ability. We can refactor the Transformer to make it more
> > > > >> > flexible. To summarize, w

Re: Please welcome our new PPMCs and Committer

2020-02-14 Thread Shiyan Xu
Congrats! Very well deserved!

On Fri, 14 Feb 2020, 13:11 vbal...@apache.org,  wrote:

>  Congratulations to Leesf, Vino Yang and Siva.
> +1 Very well deserved :) Looking forward to your continued contributions.
> Balaji.V
> On Friday, February 14, 2020, 12:11:18 PM PST, Bhavani Sudha <
> bhavanisud...@gmail.com> wrote:
>
>  Hearty congratulations to all of you - @leesf 
> @vinoyang
> and @Sivabalan . Very well deserved.
>
> Thanks,
> Sudha
>
> On Fri, Feb 14, 2020 at 11:58 AM Vinoth Chandar  wrote:
>
> > Hello all,
> >
> > I am incredibly excited to share that we have two new PPMC members :
> > *leesf*
> > and *vinoyang*, who have been doing such sustained, great work on the
> > project over a good part of the last year! I and rest of the PPMC, do
> hope
> > there a bigger and better things to come!
> >
> > We also have a new committer : *Sivabalan*, who has stepped up to own the
> > indexing component in the past few months, and has already delivered
> > several key contributions and currently driving some foundational work on
> > record level indexing.
> >
> > Please join me in congratulating them!
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Delay code freeze date for next release until Jan 19th (Sunday)

2020-01-15 Thread Shiyan Xu
+1

I assume you meant UTC-8 

On Wed, 15 Jan 2020, 11:26 nishith agarwal,  wrote:

> +1, sunday sounds good.
>
> -Nishith
>
> On Wed, Jan 15, 2020 at 9:08 AM Balaji Varadarajan
>  wrote:
>
> >  +1 Sunday should give breathing space to fix the blockers.
> > Balaji.V
> > On Wednesday, January 15, 2020, 06:50:28 AM PST, Vinoth Chandar <
> > vin...@apache.org> wrote:
> >
> >  +1 from me. I feel sunday is good in general, because the weekend gives
> > enough time for taking care of last minute things
> >
> > On Wed, Jan 15, 2020 at 2:11 AM leesf  wrote:
> >
> > > Dear Community,
> > >
> > > As discussed in the weekly sync meeting, we marked that there are 5
> > > blockers[1] that should be resolved before cutting the next release, and
> > > you are kindly welcome to review these PRs[2]. Regarding timing, Jan 15th is
> > > a bit tight to get them all landed. I propose to delay the code freeze date
> > > until Jan 19th, thus Sunday this week. What do you think? Thanks.
> > >
> > > Best,
> > > Leesf
> > >
> > > [1]
> > > https://issues.apache.org/jira/browse/HUDI-537
> > > https://issues.apache.org/jira/browse/HUDI-535
> > > https://issues.apache.org/jira/browse/HUDI-509
> > > https://issues.apache.org/jira/browse/HUDI-403
> > > https://issues.apache.org/jira/browse/HUDI-238
> > >
> > > [2]
> > > https://github.com/apache/incubator-hudi/pull/1226
> > > https://github.com/apache/incubator-hudi/pull/1212
> > > https://github.com/apache/incubator-hudi/pull/1229
> > >
> >
>


Re: [DISCUSS] Unify Hudi code cleanup and improvement

2020-01-21 Thread Shiyan Xu
The clean-up work can actually be split by modules.

Though it is generally a good practice to follow, my concern is the
clean-up is likely to cause conflicts with some on-going changes. If I may
suggest, the dedicated clean-up tasks should avoid
- modules that are undergoing multiple feature changes/PRs
- modules that are planned to have major refactoring due to design changes
(since clean-up can be done altogether during refactoring)

On Tue, Jan 21, 2020 at 4:17 AM Vinoth Chandar  wrote:

> Not sure if I fully agree with sweeping statements being made. But,  +1 for
> structuring this work via Jiras and having some committer “accept” the
> issue first.  Some of these tend to be subjective and we do need to make
> different tradeoffs.
>
> On Tue, Jan 21, 2020 at 1:28 AM vino yang  wrote:
>
> > Hi Pratyaksh,
> >
> > Thanks for your thought.
> >
> > Let's listen to others' comments. If there is no objection, we will
> follow
> > this way.
> >
> > Best,
> > Vino
> >
> >
> > Pratyaksh Sharma  于2020年1月21日周二 下午4:56写道:
> >
> > > Hi Vino,
> > >
> > > Big +1 for this initiative. I have done this code cleanup for test
> > classes
> > > in the past and strongly feel there is a need to do the same at other
> > > places as well. I would definitely like to volunteer for this.
> > >
> > > On Tue, Jan 21, 2020 at 1:52 PM vino yang 
> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > Currently, the code quality of some Hudi module is not very well. As
> > many
> > > > developers have seen, the Intellij IDEA has shown many intellisense
> > about
> > > > cleanup and improvement. The community does not object to doing the
> > > cleanup
> > > > and improvement work and the work has been started via some direct
> > > "minor"
> > > > PRs by some volunteers. The current way is unorganized and hard to
> > > manage.
> > > > For tracking this work, I prefer to manage this work with the Jira
> > issue.
> > > > We can create an umbrella issue. Then, split the work into several
> > > > subtasks.
> > > >
> > > > Since those "bad smell" lays anywhere in the whole project. It's
> > > difficult
> > > > to give a standard to split the subtasks. For example, some files
> have
> > a
> > > > lot while some modules have few. So I suggest the standard would
> depend
> > > on
> > > > the volume of the changes. Before working, any subtask should find a
> > > > committer as a mentor who would judge and approve the scope is
> > suitable.
> > > >
> > > > What do you think?
> > > >
> > > > Any comments and suggestions would be appreciated.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > >
> >
>


Re: [DISCUSS] Code freeze date for next release(0.5.1)

2020-01-08 Thread Shiyan Xu
+1. Good idea for testing phase.

On Wed, 8 Jan 2020, 08:26 Vinoth Chandar,  wrote:

> +1 one More week to land two weeks of testing is a good plan
>
> On Wed, Jan 8, 2020 at 2:41 AM leesf  wrote:
>
> > Dear Community,
> >
> > As discussed before[1], the proposed release date of *end of Jan* for
> Hudi
> > 0.5.1 is getting closer. And we have many bug fixes and features[2] since
> > the first release about three months ago.
> >
> > To make the release version more stable, I would suggest a bug fixing and
> > testing period of two weeks to be on the safe side. Given the testing
> > period, I would propose to do the code freeze on the 15th of Jan 23:59
> PST
> > in order to keep the release date. It means that we would cut the Hudi
> > 0.5.1 release branch on this date and no more feature contributions would
> > be accepted for this branch. And the uncompleted features would be
> shipped
> > with next release.
> >
> > There are still 25 unfinished jira issues[3], and you can still pick up
> > the issues you are interested in. Among them, the major ones would be the
> > update to Spark 2.4[4], replacing Databricks spark-avro with native
> > spark-avro[5], and migrating to Scala 2.12[6], which are important for
> > running Hudi on Google Cloud, and we are inclined to get them landed.
> >
> > What do you think about the proposed code freeze date? Glad to hear your
> > thoughts.
> >
> > Best,
> > Leesf
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/HUDI/20191210+Weekly+Sync+Minutes
> > [2]
> >
> >
> https://issues.apache.org/jira/browse/HUDI-507?jql=project%20%3D%20HUDI%20AND%20fixVersion%20%3D%200.5.1
> > [3]
> >
> >
> https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-printable/temp/SearchRequest.html?jqlQuery=project+%3D+HUDI+AND+fixVersion+%3D+0.5.1+AND+status+%21%3D+Fixed++AND+status+%21%3D+Resolved+AND+status+%21%3D+Closed+=1000
> > [4] https://issues.apache.org/jira/browse/HUDI-12
> > [5] https://issues.apache.org/jira/browse/HUDI-91
> > [6] https://issues.apache.org/jira/browse/HUDI-238
> >
>


Re: running Hudi in AWS Glue Spark

2020-03-06 Thread Shiyan Xu
I can answer this as my team faces exactly the same problems.
We recently sync'ed up with AWS EMR team and got some directions.

Hudi dataset <> Glue
An interim approach is needed: configure an S3 notification to detect the new
commit file after each compaction, and upon the notification update a manifest
file for Glue to pick up.
This is a workaround until Athena officially supports Hudi datasets.

Athena support
This is planned but no definite timeline was given. The high-level approach is
to use Athena's Hive external metadata store, but Athena needs some changes to
adapt to Hudi datasets.

The consideration from my team is: the interim approach should work nicely
but requires additional operational effort.
We have an alternative plan of using the new Hudi snapshot exporter feature
(https://issues.apache.org/jira/browse/HUDI-344), which is about to be merged.
It helps export Hudi datasets to plain parquet files that work natively
with Athena or Glue. We don't have very low latency requirements at the
moment, so a periodic export works for us.
The feature should be available in 6.0, but the class can be used as a
standalone tool.
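
For reference, the export invocation looks roughly like this (option names
are as I recall them from RFC-9, and the jar/bucket paths are purely
illustrative, so please double-check against the merged code):

spark-submit --class org.apache.hudi.utilities.HoodieSnapshotExporter \
  hudi-utilities-bundle.jar \
  --source-base-path s3://bucket/hudi/table \
  --target-output-path s3://bucket/exports/table \
  --output-format parquet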

On Fri, Mar 6, 2020 at 6:26 PM Sanchez, Jorge
 wrote:

> Hi Vinoth,
>
> Thanks for the reply, our design is to utilize Glue for ETL processing. We
> would have to support both real time IOT data and batch ETL flows ( jdbc
> source and static files like csv ).
> The access layer would be through the presto cluster which would be
> running on EC2 within AWS environment.
>
> We would like to utilize the historization of the data as it is one of the
> requirements. My impression is that the Hudi is getting lot of attention
> from AWS as it is now mainstreamed into EMR, what I don't see is the use
> cases using the Glue environment - all the documentation mentions the EMR.
>
> My questions would be:
> * how difficult would it be to have Hudi integrated with AWS Glue
> * is the Glue metadata catalog fully supported for Hudi tables
> * is the Glue crawler able to crawl and catalog the Hudi tables
> * is there any plan for Athena to support access to Hudi tables in the
> future
>
> I understand that these question should be addressed to the AWS guys,
> hoping that there are some of them on this channel.
>
> Regards,
>
> Jorge
>
> -Original Message-
> From: Vinoth Chandar 
> Sent: Friday, March 6, 2020 6:43 PM
> To: dev@hudi.apache.org
> Subject: Re: running Hudi in AWS Glue Spark
>
> EXTERNAL EMAIL – Use caution with any links or file attachments.
>
> https://aws.amazon.com/emr/features/hudi/ mentions that its integrated
> with the glue catalog.
>
> It should be similar to other datasources you use on Glue IIUC.. I have
> seen users talk about this on slack (IIRC)..
> Are you running into specific issues we can help with? May be the AWS
> folks here can chime in more?
>
> On Fri, Mar 6, 2020 at 3:47 AM Sanchez, Jorge 
> 
> wrote:
>
> > Hello,
> >
> > Did anybody tried to run Hudi within AWS Glue job, I searched the JIRA
> > issues but did not find anybody mentioning that.
> >
> >
> > Thanks,
> >
> > Jorge
> > Notice:  This e-mail message, together with any attachments, contains
> > information of Merck & Co., Inc. (2000 Galloping Hill Road,
> > Kenilworth, New Jersey, USA 07033), and/or its affiliates Direct
> > contact information for affiliates is available at
> > http://www.merck.com/contact/contacts.html) that may be confidential,
> > proprietary copyrighted and/or legally privileged. It is intended
> > solely for the use of the individual or entity named on this message.
> > If you are not the intended recipient, and have received this message
> > in error, please notify us immediately by reply e-mail and then delete
> > it from your system.
> >


Re: New PPMC Member : Bhavani Sudha

2020-04-08 Thread Shiyan Xu
Congrats Sudha! Well deserved!

On Tue, Apr 7, 2020 at 8:46 PM vino yang  wrote:

> Congrats sudha, well deserved!
>
> Best,
> Vino
>
> leesf  于2020年4月8日周三 上午9:31写道:
>
> > Congrats sudha, well deserved!
> >
> > Balaji Varadarajan  于2020年4月8日周三 上午6:55写道:
> >
> > >  Congratulations Sudha :) Well deserved.  Welcome to PPMC.
> > > Balaji.V
> > >
> > > On Tuesday, April 7, 2020, 03:04:37 PM PDT, Gary Li <
> > > yanjia.gary...@gmail.com> wrote:
> > >
> > >  Congrats Sudha! Appreciated all the work you have done!
> > >
> > > On Tue, Apr 7, 2020 at 2:57 PM Y Ethan Guo 
> > > wrote:
> > >
> > > > Congrats!!!
> > > >
> > > > On Tue, Apr 7, 2020 at 2:55 PM Vinoth Chandar 
> > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I am very excited to share that we have new PPMC member - Sudha.
> She
> > > has
> > > > > been a great champion for the project for almost couple years now,
> > > > driving
> > > > > a lot of presto/query engine facing changes and most of all being
> the
> > > > face
> > > > > of our community to new users on Slack, over the past few months.
> > > > >
> > > > > Please join me in congratulating her!
> > > > >
> > > > > On behalf of Hudi PPMC,
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Upgrade unit test: Junit 5 & AssertJ

2020-04-08 Thread Shiyan Xu
Thank you all for the feedback.

> This increases the scope to an overhaul of tests across the project..
Wonder if we can do an RFC for this?
Indeed it is an overhaul type of change. IMO an RFC is needed specifically for
the test utility re-design part. Guess it can be created when it's good to
start? Since it'll be a long-running task, shall we have an umbrella ticket for
this topic first? @vinoth


On Sat, Apr 4, 2020 at 2:47 AM leesf  wrote:

> +1 to upgrade the unit test. junit5 combine better with java8, and there
> are some migration guides already, and we maybe could upgrade by module.
>
> vino yang  于2020年4月2日周四 下午4:38写道:
>
> > Hi Shiyan,
> >
> > +1 from my side.
> >
> > Best,
> > Vino
> >
> > Vinoth Chandar  于2020年3月30日周一 下午11:00写道:
> >
> > > Hi Raymond,
> > >
> > > Sounds good to me. This increases the scope to an overhaul of tests
> across
> > > the project.. Wonder if we can do an RFC for this? But overall +1 from
> me.
> > >
> > > I would like to call upon the community to chime in more though :) .
> > let's
> > > give it a few days..
> > >
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Fri, Mar 27, 2020 at 5:18 PM Shiyan Xu  >
> > > wrote:
> > >
> > > > Understand Vinoth. To me AssertJ is nice-to-have. I agree with the
> > > learning
> > > > overhead.
> > > >
> > > > The current CI time is too long and we do need to use more mocking
> and
> > > > optimize spark jobs setup.
> > > >
> > > > Based on your points, I imagine the path forward can be planned as
> this
> > > >
> > > > 1. An initial PR to add Junit 5 to co-exist with 4 in the project
> with
> > a
> > > > simple testcase converted to 5 as a working proof
> > > > 2. A design task to refactor test utilities (create new utilities
> with
> > > > Junit 5 for easy switch-over of affected testcases)
> > > > 3. Track all test improvement PRs (using Junit 5). Each PR should aim
> > to
> > > > solve 1 of the problems below
> > > >   - test can be improved with mocking
> > > >   - test can be optimized on spark job setup
> > > > 4. Clean unused test utilities (from step 2)
> > > >
> > > > We should recognize these steps to be carried out in a long-running
> > > ongoing
> > > > fashion.
> > > >
> > > > Any thoughts or feedback?
> > > >
> > > > On Wed, Mar 25, 2020 at 7:52 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > +1 on Junit5.
> > > > >  does seem nicer with support for lambdas. assuming we do a gradual
> > > > > rollout. At any point, we cannot have any of the core tests
> disabled
> > :)
> > > > > May be we can use the vintage framework for now, do minimal changes
> > > > migrate
> > > > > and then proceed to redoing the tests
> > > > >
> > > > > On AssertJ type frameworks, I wonder if there is a cost to this
> type
> > of
> > > > > framework for new devs.
> > > > > They already need to learn junit 5, mockito, all the TestUtils and
> > like
> > > > one
> > > > > more framework for asserting
> > > > >
> > > > > Orthogonally, I will be thrilled if you also took upon a large
> > > > > restructuring on tests cleanly into
> > > > > - unit tests that test class functionality using mocks
> > > > > - functional tests that bring up a spark context and actually run
> the
> > > job
> > > > > (we have a lot of these tests masquerading as unit tests)
> > > > > - Clean redesign of the test utility classes
> > > > >
> > > > > Sorry to expand scope, but when someone is going to take a look at
> > > every
> > > > > test, I could not pass up an opportunity to sneak this in :)
> > > > >
> > > > > Love to hear others thoughts.. any one with experience working with
> > > > > Junit5/Assertj-Hamcrest?
> > > > >
> > > > > On Tue, Mar 24, 2020 at 9:36 PM Shiyan Xu <
> > xu.shiyan.raym...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Some references
> > > > > > https://junit.org/junit5/docs/current/user-guide/
> > > > > > https://joel-cost

Re: New Committer: lamber-ken

2020-04-08 Thread Shiyan Xu
Congrats Lamber-ken! Well deserved!

On Wed, Apr 8, 2020 at 4:52 AM Sivabalan  wrote:

> Congrats Lamber! Well deserved.
>
> On Wed, Apr 8, 2020 at 5:21 AM Pratyaksh Sharma 
> wrote:
>
> > Congratulations lamberken!
> >
> > On Wed, Apr 8, 2020 at 11:10 AM Jiayi Liao 
> > wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Jiayi Liao
> > >
> > > On Wed, Apr 8, 2020 at 12:15 PM tison  wrote:
> > >
> > > > Congrats lamber!
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > >
> > > > vino yang  于2020年4月8日周三 上午11:45写道:
> > > >
> > > > > Congrats lamber! Well deserved!
> > > > >
> > > > > Best,
> > > > > Vino
> > > > >
> > > > > leesf  于2020年4月8日周三 上午9:30写道:
> > > > >
> > > > > > Congrats lamber-ken, well deserved!
> > > > > >
> > > > > > Balaji Varadarajan  于2020年4月8日周三
> > > 上午6:45写道:
> > > > > >
> > > > > > >  Many Congratulations Lamber-Ken.  Well deserved !!
> > > > > > > Balaji.V
> > > > > > > On Tuesday, April 7, 2020, 02:23:51 PM PDT, Y Ethan Guo <
> > > > > > > ethan.guoyi...@gmail.com> wrote:
> > > > > > >
> > > > > > >  Congrats!!!
> > > > > > >
> > > > > > > On Tue, Apr 7, 2020 at 2:22 PM Gary Li <
> yanjia.gary...@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Congrats lamber! Well deserved!
> > > > > > > >
> > > > > > > > On Tue, Apr 7, 2020 at 2:18 PM Vinoth Chandar <
> > vin...@apache.org
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hello Apache Hudi Community,
> > > > > > > > >
> > > > > > > > > The Podling Project Management Committee (PPMC) for Apache
> > > > > > > > > Hudi (Incubating) has invited lamber-ken (Xie Lei) to
> become
> > a
> > > > > > > committer
> > > > > > > > > and we are pleased to announce that he has accepted.
> > > > > > > > >
> > > > > > > > > lamber-ken has had a large impact by in hudi, with some
> > > sustained
> > > > > > > efforts
> > > > > > > > > in the past several months. He has rebuilt our site ground
> > up,
> > > > > > > automated
> > > > > > > > > doc workflows, helped fixed a lot of bugs and also been
> super
> > > > > helpful
> > > > > > > for
> > > > > > > > > the community at large.
> > > > > > > > >
> > > > > > > > > Congratulations lamber-ken !! Please join me in recognizing
> > his
> > > > > > > efforts!
> > > > > > > > >
> > > > > > > > > On behalf of PPMC,
> > > > > > > > > Vinoth
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Upgrade unit test: Junit 5 & AssertJ

2020-04-09 Thread Shiyan Xu
Filed!
https://issues.apache.org/jira/browse/HUDI-779

On Wed, Apr 8, 2020 at 11:05 PM Vinoth Chandar  wrote:

> +1 on an umbrella task..
>
> We can do the RFC for overhaul of tests (mocking more tests, cleaning up
> test data gen and so on)..
> For adding junit5 itself and doing the initial work, we could just begin
> with JIRA?
>
> On Wed, Apr 8, 2020 at 12:56 PM Shiyan Xu 
> wrote:
>
> > Thank you all for the feedback.
> >
> > > This increases the scope to an overhaul of tests across the project..
> > Wonder if we can do an RFC for this?
> > Indeed it is an overhaul type of change. IMO an RFC is needed specifically for
> > the test utility re-design part. Guess it can be created when it's good
> to
> > start? Since it'll be a long-running task, have an umbrella ticket for
> this
> > topic first? @vinoth
> >
> >
> > On Sat, Apr 4, 2020 at 2:47 AM leesf  wrote:
> >
> > > +1 to upgrade the unit test. junit5 combine better with java8, and
> there
> > > are some migration guides already, and we maybe could upgrade by
> module.
> > >
> > > vino yang  于2020年4月2日周四 下午4:38写道:
> > >
> > > > Hi Shiyan,
> > > >
> > > > +1 from my side.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > Vinoth Chandar  于2020年3月30日周一 下午11:00写道:
> > > >
> > > > > Hi Raymond,
> > > > >
> > > > > Sounds good to me. This increases the scope to an overhaul of tests
> > > across
> > > > > the project.. Wonder if we can do an RFC for this? But overall +1
> from
> > > me.
> > > > >
> > > > > I would like to call upon the community to chime in more though :)
> .
> > > > let's
> > > > > give it a few days..
> > > > >
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Fri, Mar 27, 2020 at 5:18 PM Shiyan Xu <
> > xu.shiyan.raym...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Understand Vinoth. To me AssertJ is nice-to-have. I agree with
> the
> > > > > learning
> > > > > > overhead.
> > > > > >
> > > > > > The current CI time is too long and we do need to use more
> mocking
> > > and
> > > > > > optimize spark jobs setup.
> > > > > >
> > > > > > Based on your points, I imagine the path forward can be planned
> as
> > > this
> > > > > >
> > > > > > 1. An initial PR to add Junit 5 to co-exist with 4 in the project
> > > with
> > > > a
> > > > > > simple testcase converted to 5 as a working proof
> > > > > > 2. A design task to refactor test utilities (create new utilities
> > > with
> > > > > > Junit 5 for easy switch-over of affected testcases)
> > > > > > 3. Track all test improvement PRs (using Junit 5). Each PR should
> > aim
> > > > to
> > > > > > solve 1 of the problems below
> > > > > >   - test can be improved with mocking
> > > > > >   - test can be optimized on spark job setup
> > > > > > 4. Clean unused test utilities (from step 2)
> > > > > >
> > > > > > We should recognize these steps to be carried out in a
> long-running
> > > > > ongoing
> > > > > > fashion.
> > > > > >
> > > > > > Any thoughts or feedback?
> > > > > >
> > > > > > On Wed, Mar 25, 2020 at 7:52 AM Vinoth Chandar <
> vin...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > +1 on Junit5.
> > > > > > >  does seem nicer with support for lambdas. assuming we do a
> > gradual
> > > > > > > rollout. At any point, we cannot have any of the core tests
> > > disabled
> > > > :)
> > > > > > > May be we can use the vintage framework for now, do minimal
> > changes
> > > > > > migrate
> > > > > > > and then proceed to redoing the tests
> > > > > > >
> > > > > > > On AssertJ type frameworks, I wonder if there is a cost to this
> > > type
> > > > of
> > > > > > > framework for new devs.
> > > > &

Re: [HELP WANTED] Codecov report skips JUnit 5 test cases

2020-04-14 Thread Shiyan Xu
Thank you for the responses. It was resolved by upgrading the surefire
plugin version. Thanks to @prashantwason !
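
For anyone hitting the same symptom: surefire 2.22.0 and later understand the
JUnit Platform natively, so bumping the plugin version is usually enough. A
minimal pom fragment (the exact version below is illustrative):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.22.2</version>
</plugin>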

On Tue, Apr 14, 2020 at 2:51 PM Ramachandran Madras Subramaniam
 wrote:

> Hi,
>
> AFAIK codecov doesn't really work with the test framework; we only upload
> the cobertura reports collected locally in the build system.
>
> Can you please verify that the local code coverage reports pick up JUnit 5
> changes?
>
> -Ram
>
> On Tue, Apr 14, 2020 at 2:49 PM Vinoth Chandar  wrote:
>
> > This one is probably worth flagging with codecov support as well?
> > Does seem weird.. :/
> >
> > On Mon, Apr 13, 2020 at 11:14 PM Shiyan Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > We're migrating all test cases to JUnit 5.
> > >
> > > This PR, as an initial step to enable JUnit 5, has migrated quite a few
> > > test cases. The test cases pass in green with no issue.
> > >
> >
> https://github.com/apache/incubator-hudi/pull/1504
> > >
> > > However, as you can see in the Codecov comment, the coverage decreased
> > > quite a bit. It's been observed that the test cases ran by JUnit 5 were
> > > somehow omitted in the report and resulted in the decrease.
> > >
> > > Is there anyone familiar with the Codecov settings? Or happen to know
> > > Codecov experts?
> > > I created this ticket for investigation. Any hint/suggestion on
> > > troubleshooting this will be highly appreciated. Please feel free to
> > > respond to the thread or comment in the ticket.
> > >
> >
> https://jira.apache.org/jira/browse/HUDI-792
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Raymond
> > >
> >
>


[HELP WANTED] Codecov report skips JUnit 5 test cases

2020-04-14 Thread Shiyan Xu
Hi all,

We're migrating all test cases to JUnit 5.

This PR, as an initial step to enable JUnit 5, has migrated quite a few
test cases. The test cases pass in green with no issue.
https://github.com/apache/incubator-hudi/pull/1504

However, as you can see in the Codecov comment, the coverage decreased
quite a bit. It's been observed that the test cases ran by JUnit 5 were
somehow omitted in the report and resulted in the decrease.

Is there anyone familiar with the Codecov settings? Or happen to know
Codecov experts?
I created this ticket for investigation. Any hint/suggestion on
troubleshooting this will be highly appreciated. Please feel free to
respond to the thread or comment in the ticket.
https://jira.apache.org/jira/browse/HUDI-792

Thank you.

Regards,
Raymond


[DISCUSS] Support popular metrics reporter

2020-04-20 Thread Shiyan Xu
Hi all,

I'd like to raise the topic of supporting multiple metrics reporters.

Currently hudi supports graphite and JMX. And there are 2 proposed reporter
types: CSV and Prometheus
https://jira.apache.org/jira/browse/HUDI-210
https://jira.apache.org/jira/browse/HUDI-361
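
For reference, wiring up the existing Graphite reporter is just a handful of
configs (the host/port values below are illustrative):

hoodie.metrics.on=true
hoodie.metrics.reporter.type=GRAPHITE
hoodie.metrics.graphite.host=localhost
hoodie.metrics.graphite.port=4756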

I think supporting multiple metrics backends gives Hudi a competitive
advantage in user expansion. It reduces the friction for different
organizations to adopt Hudi. And we only need to support a few popular ones
to achieve that.

In terms of determining the list, as mentioned by @vinoyang, flink has a
nice list of supported ones:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter
which can be used as a reference.

From that list, I'd like to propose supporting Datadog as well, due to its
popularity. May I get +1 on this?

Thank you.

Regards,
Raymond


Re: [DISSCUSS] Troubleshooting flow

2020-04-06 Thread Shiyan Xu
+1. Agree to the process.

On Mon, 6 Apr 2020, 19:19 Balaji Varadarajan, 
wrote:

>  Agree. The triaging process makes sense to me.
> Balaji.V
> On Monday, April 6, 2020, 09:54:24 AM PDT, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hi,
>
> I feel there are couple of action items here..
>
> a) JIRA to track work for slack-ML integration
> b) Document the support triaging process : Slack (level 1) -> Github Issues
> (level 2 , triage, root cause) -> JIRA (level 3, file bug, get resolution)
> .
>
> P.S: Mailing List is very similar to Slack as well IMO.. i.e mostly level 1
> things (w.r.t. triaging issues). Do you all agree?
>
> Thanks
> Vinoth
>
> On Sat, Apr 4, 2020 at 3:03 AM leesf  wrote:
>
> > Sorry to chime in so late; in fact we did discuss integrating slack with
> > the dev ML before [1], but it seems like it needs some other work before
> working,
> > in order to reduce repetitive workload, I am +1 to move some debugging
> > question to GH issues, which could be easily searched.
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/r0575d916663f826a5078363ec913c53360afb372471061aa60fd380c%40%3Cdev.hudi.apache.org%3E
> >
> > lamber-ken  于2020年4月4日周六 上午12:47写道:
> >
> > >
> > >
> > > Thanks you all,
> > >
> > >
> > > Agree with Sudha, it's ok to answer simple questions and move debugging
> > > type of questions to GH issues.
> > > So, let's try to guide users who ask debugging questions to use GH
> > > issues if possible.
> > >
> > >
> > > Thanks,
> > > Lamber-Ken
> > >
> > >
> > >
> > >
> > >
> > > At 2020-04-03 07:19:26, "Bhavani Sudha" 
> wrote:
> > > >Also one thing I wanted to note. I feel it should be okay to answer
> > simple
> > > >`what does this mean` type of questions in slack and move debugging
> type
> > > of
> > > >questions to GH issues. What do you all think?
> > > >
> > > >Thanks,
> > > >Sudha
> > > >
> > > >On Thu, Apr 2, 2020 at 11:45 AM Bhavani Sudha <
> bhavanisud...@gmail.com>
> > > >wrote:
> > > >
> > > >> Agree on using GH issues to post code snippets or debugging issues.
> > > >>
> > > >> Regarding mirroring slack to commits, the last time I checked there
> > was
> > > no
> > > >> options that was readily available ( there were one or two paid
> > > products).
> > > >> It looked like we can possibly develop our own IFTT/ web hook on
> > slack.
> > > Not
> > > >> sure how much of work that is.
> > > >>
> > > >>
> > > >> Thanks,
> > > >> Sudha
> > > >>
> > > >>
> > > >> On Thu, Apr 2, 2020 at 8:40 AM Vinoth Chandar 
> > > wrote:
> > > >>
> > > >>> Hello all,
> > > >>>
> > > >>> Actually that's how we have been using GH issues.. Both slack/ml
> are
> > > >>> inconvenient for sharing code and having long threaded
> conversations.
> > > >>> (same
> > > >>> issues raised here).
> > > >>>
> > > >>> That said, we could definitely formalize this and look to move
> slack
> > > >>> threads into GH issue for triaging (then follow up with JIRA, if
> real
> > > bug)
> > > >>> before they get too long.
> > > >>>
> > > >>> >>slack has some answerbot to auto reply and promote users to
> create
> > GH
> > > >>> issues.
> > > >>> Worth looking into.. There was also a conversation around mirroring
> > > >>> #general into commits or something for indexing/searching.. ?
> > > >>>
> > > >>>
> > > >>> On Thu, Apr 2, 2020 at 1:36 AM vino yang 
> > wrote:
> > > >>>
> > > >>> > Hi Lamber-Ken,
> > > >>> >
> > > >>> > Thanks for raising this problem.
> > > >>> >
> > > >>> > >> 3. threads can't be indexed by search engines
> > > >>> >
> > > >>> > Yes, I always thought that it would be better to have a "users"
> ML,
> > > but
> > > >>> it
> > > >>> > is not clear whether only the 

HoodieSnapshotExporter

2020-03-27 Thread Shiyan Xu
Hi all,

We recently merged a utility class HoodieSnapshotExporter (RFC-9
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter>)
into master with a goal to enhance exporting capabilities. Many thanks
to @openopened (sorry I only know your GitHub handle) for the initial
implementation and @leesf @vinoth for the reviews.

I wrote a blog to illustrate the feature and usage.
https://cwiki.apache.org/confluence/display/HUDI/2020/03/22/Export+Hudi+datasets+as+a+copy+or+as+different+formats

Thanks.


Cheers,
Raymond


Re: [DISCUSS] Upgrade unit test: Junit 5 & AssertJ

2020-03-27 Thread Shiyan Xu
Understand Vinoth. To me AssertJ is nice-to-have. I agree with the learning
overhead.

The current CI time is too long and we do need to use more mocking and
optimize spark jobs setup.

Based on your points, I imagine the path forward can be planned as this

1. An initial PR to add Junit 5 to co-exist with 4 in the project with a
simple testcase converted to 5 as a working proof
2. A design task to refactor test utilities (create new utilities with
Junit 5 for easy switch-over of affected testcases)
3. Track all test improvement PRs (using Junit 5). Each PR should aim to
solve 1 of the problems below
  - test can be improved with mocking
  - test can be optimized on spark job setup
4. Clean unused test utilities (from step 2)

We should recognize these steps to be carried out in a long-running ongoing
fashion.

Any thoughts or feedback?

On Wed, Mar 25, 2020 at 7:52 AM Vinoth Chandar  wrote:

> +1 on Junit5.
>  does seem nicer with support for lambdas. assuming we do a gradual
> rollout. At any point, we cannot have any of the core tests disabled :)
> May be we can use the vintage framework for now, do minimal changes migrate
> and then proceed to redoing the tests
>
> On AssertJ type frameworks, I wonder if there is a cost to this type of
> framework for new devs.
> They already need to learn junit 5, mockito, all the TestUtils and like one
> more framework for asserting
>
> Orthogonally, I will be thrilled if you also took upon a large
> restructuring on tests cleanly into
> - unit tests that test class functionality using mocks
> - functional tests that bring up a spark context and actually run the job
> (we have a lot of these tests masquerading as unit tests)
> - Clean redesign of the test utility classes
>
> Sorry to expand scope, but when someone is going to take a look at every
> test, I could not pass up an opportunity to sneak this in :)
>
> Love to hear others thoughts.. any one with experience working with
> Junit5/Assertj-Hamcrest?
>
> On Tue, Mar 24, 2020 at 9:36 PM Shiyan Xu 
> wrote:
>
> > Some references
> > https://junit.org/junit5/docs/current/user-guide/
> > https://joel-costigliola.github.io/assertj/
> >
> > On Tue, Mar 24, 2020 at 9:27 PM Shiyan Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > I'd like to gather some feedback about
> > > 1. upgrading Junit 4 to 5
> > > 2. adopt AssertJ as preferred assertion statement style
> > >
> > > IMO 1) will give many benefits for writing better unit tests. A google
> > > search of "junit 4 vs 5" could lead to many good points. And the
> > > migration can be done piece by piece (keeping both 4 and 5 during the
> > > upgrade and enforcing 5 for new tests)
> > >
> > > 2) is to spice things up and bring the test readability to another
> level,
> > > though I'll treat it as nice-to-have.
> > >
> > > Would you +1 or -1 on either or both?
> > >
> > > Knowing that it'll be a long way to go due to the large number of
> tests,
> > > this needs to be planned and tracked carefully.
> > >
> > > Thank you.
> > >
> > > Best,
> > > Raymond
> > >
> > >
> >
>


Re: HoodieSnapshotExporter

2020-03-28 Thread Shiyan Xu
Sure Vinoth, please feel free to make edits.

On Sat, 28 Mar 2020, 15:33 Vinoth Chandar,  wrote:

> +1 great contribution everyone.
>
> Thanks for the blog raymond. Will make some minor edits if you dont mind
> and tweet from our handle.:)
>
> On Sat, Mar 28, 2020 at 12:50 AM vino yang  wrote:
>
> > Hi Raymond,
> >
> > Thanks for driving this valuable feature! Having this tool, it would be
> > easier for backup purposes!
> >
> > Best,
> > Vino
> >
> >
> >
> > Shiyan Xu  于2020年3月28日周六 上午8:21写道:
> >
> > > Hi all,
> > >
> > > We recently merged a utility class HoodieSnapshotExporter (RFC-9
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > >)
> > > into master with a goal to enhance exporting capabilities. Many thanks
> > > to @openopened (sorry I only know your GitHub handle) for the initial
> > > implementation and @leesf @vinoth for the reviews.
> > >
> > > I wrote a blog to illustrate the feature and usage.
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/2020/03/22/Export+Hudi+datasets+as+a+copy+or+as+different+formats
> > >
> > > Thanks.
> > >
> > >
> > > Cheers,
> > > Raymond
> > >
> >
>


Re: [DISSCUSS] Troubleshooting flow

2020-03-31 Thread Shiyan Xu
Good idea to use GH issues as triage.

Not sure if slack has some answer bot to auto-reply and prompt users to
create GH issues. If it can be configured that way, that'd be great for
this purpose :)

On Tue, 31 Mar 2020, 10:03 lamberken,  wrote:

> Hi team,
>
>
>
>
> Many users use slack ask for support when they met bugs / problems
> currently.
>
> but there are some disadvantages we need to consider:
>
> 1. code snippet display is not friendly.
>
> 2. we may miss some questions when questions come up at the same time.
>
> 3. threads can't be indexed by search engines
>
> ...
>
>
>
>
> So, I suggest we should guide users to use GitHub issues as much as we can.
>
> step1: guide users use GitHub issues to report their questions
>
> step2: developers can pick up some issues which they are interested in.
>
> step3: raise a related JIRA if needed
>
> step4: add some useful notes to troubleshooting guide
>
>
>
> Any thoughts are welcome, thanks : )
>
>
> Best,
> Lamber-Ken


Re: [DISCUSS] Upgrade unit test: Junit 5 & AssertJ

2020-03-24 Thread Shiyan Xu
Some references
https://junit.org/junit5/docs/current/user-guide/
https://joel-costigliola.github.io/assertj/

On Tue, Mar 24, 2020 at 9:27 PM Shiyan Xu 
wrote:

> Hi all,
>
> I'd like to gather some feedback about
> 1. upgrading Junit 4 to 5
> 2. adopt AssertJ as preferred assertion statement style
>
> IMO 1) will give many benefits for writing better unit tests. A google
> search of "junit 4 vs 5" could lead to many good points. And the
> migration can be done piece by piece (keeping both 4 and 5 during the upgrade
> and enforcing 5 for new tests)
>
> 2) is to spice things up and bring the test readability to another level,
> though I'll treat it as nice-to-have.
>
> Would you +1 or -1 on either or both?
>
> Knowing that it'll be a long way to go due to the large number of tests,
> this needs to be planned and tracked carefully.
>
> Thank you.
>
> Best,
> Raymond
>
>


[DISCUSS] Upgrade unit test: Junit 5 & AssertJ

2020-03-24 Thread Shiyan Xu
Hi all,

I'd like to gather some feedback about
1. upgrading Junit 4 to 5
2. adopt AssertJ as preferred assertion statement style

IMO 1) will give many benefits for writing better unit tests. A google
search of "junit 4 vs 5" could lead to many good points. And the
migration can be done piece by piece (keeping both 4 and 5 during the upgrade
and enforcing 5 for new tests)

2) is to spice things up and bring the test readability to another level,
though I'll treat it as nice-to-have.

Would you +1 or -1 on either or both?

Knowing that it'll be a long way to go due to the large number of tests,
this needs to be planned and tracked carefully.

Thank you.

Best,
Raymond


Re: [ATTN] JUnit 5 adoption

2020-04-23 Thread Shiyan Xu
Also thanks to @vinoyang for taking on the reviews!

> [Wondering if there is a way to stick a checkstyle rule to this effect.
guess it won't check for new changes alone, rather complain about existing
junit 4 tests?]

Yes, checkstyle complains about all of them... will add the rule after the API migration.

On Thu, Apr 23, 2020 at 6:27 AM Sivabalan  wrote:

> Good job Raymond ! and thanks for the reminder.
>
> On Wed, Apr 22, 2020 at 11:42 AM leesf  wrote:
>
> > Thanks for the reminder, I upgraded to junit5 for the PR
> > https://github.com/apache/incubator-hudi/pull/1536 and will keep an eye
> on this
> > when reviewing PRs.
> >
> > Bhavani Sudha  于2020年4月22日周三 下午3:31写道:
> >
> > > +1. Thanks for the update Raymond and great work on the migration.
> > >
> > > -Sudha
> > >
> > > On Tue, Apr 21, 2020 at 10:39 PM Vinoth Chandar 
> > wrote:
> > >
> > > > +1 Appreciate the efforts, Raymond!
> > > >
> > > > [Wondering if there is a way to stick a checkstyle rule to this
> effect.
> > > > guess it won't check for new changes alone, rather complain about
> > > existing
> > > > junit 4 tests?]
> > > >
> > > > On Tue, Apr 21, 2020 at 5:10 PM Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We're in progress with JUnit 5 migration for all test classes. So
> far
> > > the
> > > > > JUnit 5 dependencies (including Mockito) have been added to all
> > > modules.
> > > > > The APIs/modules migration status is shown here
> > > > > https://github.com/apache/incubator-hudi/pull/1530#issue-405575235
> > > > >
> > > > > I would like to kindly ask for support from the community in these
> 2
> > > > > aspects
> > > > >
> > > > > - To PR submitters: for newly added test classes, please start
> using
> > > > JUnit
> > > > > 5 APIs (org.junit.jupiter.*)
> > > > > - To PR reviewers: please help look out for the JUnit adopt in the
> > new
> > > > test
> > > > > classes
> > > > >
> > > > > Really appreciate the coordination efforts on this matter.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Raymond
> > > > >
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Bug bash?

2020-04-23 Thread Shiyan Xu
+1 would like to participate

On Thu, Apr 23, 2020 at 5:51 PM Dongdong Hong  wrote:

> +1 sounds great!
>
> Sivabalan  于2020年4月23日周四 下午9:30写道:
>
> > +1
> >
> > On Wed, Apr 22, 2020 at 7:29 PM lamber-ken  wrote:
> >
> > >
> > >
> > >
> > > Wow, challenging job, +1
> > >
> > >
> > > Best,
> > > Lamber-Ken
> > >
> > > At 2020-04-23 04:51:01, "Vinoth Chandar"  wrote:
> > > >Just floating a very random idea here. :)
> > > >
> > > >Would there be interest in doing a bug bash for a week, where we
> > > >aggressively close out some pesky bugs that have been lingering
> around..
> > > If
> > > >enough committers and contributors are around, we can move the needle.
> > We
> > > >could time this a week before cutting RC for next release.
> > > >
> > > >Thanks
> > > >Vinoth
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Support popular metrics reporter

2020-04-23 Thread Shiyan Xu
Thank you all for the approval! Filed
https://issues.apache.org/jira/browse/HUDI-836

On Thu, Apr 23, 2020 at 5:40 PM dongdong hong  wrote:

> +1
>
>


[ATTN] JUnit 5 adoption

2020-04-21 Thread Shiyan Xu
Hi all,

We're in progress with JUnit 5 migration for all test classes. So far the
JUnit 5 dependencies (including Mockito) have been added to all modules.
The APIs/modules migration status is shown here
https://github.com/apache/incubator-hudi/pull/1530#issue-405575235

I would like to kindly ask for support from the community in these 2 aspects

- To PR submitters: for newly added test classes, please start using JUnit
5 APIs (org.junit.jupiter.*); a minimal example follows below
- To PR reviewers: please help look out for JUnit 5 adoption in the new test
classes
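
For illustration, here is a minimal test in the requested style (the class
and assertion are placeholders, not real Hudi code):

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class TestExample {
  @Test
  void shouldAddNumbers() {
    // plain JUnit 5: no public modifiers or runners needed
    assertEquals(4, 2 + 2);
  }
}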

Really appreciate the coordination efforts on this matter.

Thank you.

Regards,
Raymond


[DISCUSS] Return schema provider as optional?

2020-05-02 Thread Shiyan Xu
Hi all,

In the case of reading a schema-inferable source like parquet, when no new data
is found then, if I understand correctly, no schema can be inferred, and none
needs to be.

Seeing that the
method org.apache.hudi.utilities.sources.InputBatch#getSchemaProvider
requires a non-null schemaProvider, and that
org.apache.hudi.utilities.deltastreamer.DeltaSync#readFromSource calls
getSchemaProvider() in all cases, including the no-new-data case, an
exception will be thrown asking to set a schema provider, even when reading
from a schema-inferable parquet source. I think this is not ideal.

I had a short draft PR to accept null schema provider in case of no new data
https://github.com/apache/incubator-hudi/pull/1584/files
I actually prefer another approach of returning Option<SchemaProvider> from
getSchemaProvider().
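
A sketch of what I mean, using Hudi's own Option type (the surrounding
InputBatch code is simplified here):

// InputBatch exposes the provider as optional instead of throwing:
public Option<SchemaProvider> getSchemaProvider() {
  return Option.ofNullable(schemaProvider);
}
// Callers like DeltaSync#readFromSource would then treat Option.empty()
// as the no-new-data case and skip schema resolution entirely.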

In case I have misunderstood the logic or use case, I'd like to ask for
some feedback on this change.

Thank you.

Regards,
Raymond


Re: Unable to run hudi-cli integration tests

2020-05-17 Thread Shiyan Xu
Hi Pratyaksh,

I have the same setup as yours. I would normally tend to clean up my local
deps

mvn dependency:purge-local-repository

mvn clean install -DskipTests -DskipITs

mvn -Dtest=ITTestRepairsCommand#testDeduplicateWithReal
-DfailIfNoTests=false test

Though I was able to run the test, it failed to pass... sharing this to
check if I'm not alone :)

[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed:
6.822 s <<< FAILURE! - in org.apache.hudi.cli.integ.ITTestRepairsCommand
[ERROR]
org.apache.hudi.cli.integ.ITTestRepairsCommand.testDeduplicateWithReal
 Time elapsed: 6.014 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected:  but was: 
at
org.apache.hudi.cli.integ.ITTestRepairsCommand.testDeduplicateWithReal(ITTestRepairsCommand.java:171)

If it's the same for others, we'd need to investigate why CI is passing.
(hope it's just my setup)

My local setup
➜ mvn -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /usr/local/Cellar/maven/3.6.3_1/libexec
Java version: 1.8.0_242, vendor: AdoptOpenJDK, runtime:
/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.15.4", arch: "x86_64", family: "mac"

➜ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

On Sun, May 17, 2020 at 5:42 AM Pratyaksh Sharma 
wrote:

> Hi hddong,
>
> Strange but nothing seems to work for me. I tried doing mvn clean and then
> run travis tests. Also I tried running the command `mvn clean package
> -DskipTests -DskipITs -Pspark-shade-unbundle-avro` first and then run the
> test using `mvn -Dtest=ITTestRepairsCommand#testDeduplicateWithReal
> -DfailIfNoTests=false test`. But both of them did not work. I have spark
> installation and I am setting the SPARK_HOME to
> /usr/local/Cellar/apache-spark/2.4.5.
>
> On Sun, May 17, 2020 at 9:00 AM hddong  wrote:
>
> > Hi Pratyaksh,
> >
> > run_travis_tests.sh does not run `mvn clean`. You can try to run `mvn
> > clean` manually
> > before the integration test.
> >
> > BTW, if you use IDEA, you can do
> > `mvn clean package -DskipTests -DskipITs -Pspark-shade-unbundle-avro`
> > first,
> > then just run integration test in IDEA like unit test does.
> >
> > But, there is something to notice here: you need a runnable spark, and
> > SPARK_HOME should be in the env.
> >
> > Regards
> > hddong
> >
>


Re: [DISCUSS] should we do a 0.5.3 patch set release ?

2020-05-06 Thread Shiyan Xu
+1 for 0.5.3 as well

On Wed, May 6, 2020 at 1:55 PM Sivabalan  wrote:

> sounds good Sudha. Let's have a good list of projects/features to be done
> for 0.6.0 and not end up in a similar situation. I am ok to go with 0.5.3.
>
> On Wed, May 6, 2020 at 4:31 PM Vinoth Chandar  wrote:
>
> > Hi Sudha,
> >
> > +1 on the overall idea.. I tried to pick out few of these PRs that are
> >
> >  - Small enough to apply easily
> >  - Have limited scope, fixing pointed problems
> >  - Have high impact on performance or usability
> >
> > [HUDI-799] Use appropriate FS when loading configs
> >
> >
> https://github.com/apache/incubator-hudi/commit/acb1ada2f756b49d9f9a0aa152f99fcc9e86dde7
> >
> > [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
> >
> >
> https://github.com/apache/incubator-hudi/commit/ce0a4c64d07d6eea926d1bfb92b69ae387b88f50
> >
> > [HUDI-656][Performance] Return a dummy Spark relation after writing
> >
> >
> https://github.com/apache/incubator-hudi/commit/c40a0d4e91896dece51969f5308016ecb3aa635c
> >
> > [HUDI-850] Avoid unnecessary listings in incremental cleaning mode
> >
> >
> https://github.com/apache/incubator-hudi/commit/506447fd4fde4cd922f7aa8f4e17a7f0dc97
> >
> > [HUDI-724] Parallelize getSmallFiles for partitions
> >
> >
> https://github.com/apache/incubator-hudi/commit/1f5b0c77d6c87a936f2d34287ec6a1df1cb18b33
> >
> > [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned
> >
> >
> https://github.com/apache/incubator-hudi/commit/2d040145810b8b14c59c5882f9115698351039d1
> >
> > Add constructor to HoodieROTablePathFilter
> >
> >
> https://github.com/apache/incubator-hudi/commit/418f9bb2e91ed6c02077d36e49a47f0c8d08303a
> >
> > [HUDI-539] Make ROPathFilter conf member serializable
> >
> >
> https://github.com/apache/incubator-hudi/commit/e3019031d8fff60df4fec82eac3fd5c044011635
> >
> > Add changes for presto mor queries
> >
> >
> https://github.com/apache/incubator-hudi/commit/e21441ad8317f302fed947c414e059a332e4d1ef
> >
> > [HUDI-782] Add support of Aliyun object storage service.
> >
> >
> https://github.com/apache/incubator-hudi/commit/5d717a28f45137bea71dffa31b0ae7ccbf1bda00
> >
> >
> > Please chime in with your thoughts, as well.
> >
> > I think there are some bug fixes in the pending PRs as well. esp from
> Alex
> > and Pratyaksh .
> >
> > Thanks
> > Vinoth
> >
> >
> > On Tue, May 5, 2020 at 9:33 PM Bhavani Sudha 
> > wrote:
> >
> > > Hello all,
> > >
> > > I am wondering if we should do a 0.5.3 release by backporting all minor
> > to
> > > medium bug fixes (that are in master already) to 0.5.2 and do a minor
> > > release ? That way we can use some time to reserve 0.6.0 release for
> all
> > > major features that are upcoming and/or almost there. Please share your
> > > thoughts. If you agree also please share the list of fixes that you
> know
> > of
> > > that can go into 0.5.3.
> > >
> > > Thanks,
> > > Sudha
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [VOTE] Apache Hudi graduation to top level project

2020-05-06 Thread Shiyan Xu
+1

On Wed, May 6, 2020 at 2:49 PM Sivabalan  wrote:

> +1 :)
>
> On Wed, May 6, 2020 at 5:30 PM Gary Li  wrote:
>
> > +1
> >
> > On Wed, May 6, 2020 at 2:28 PM Suneel Marthi  wrote:
> >
> > > +1
> > >
> > > On Wed, May 6, 2020 at 5:01 PM Bhavani Sudha 
> > > wrote:
> > >
> > > > +1
> > > >
> > > >
> > > >
> > > > On Wed, May 6, 2020 at 1:58 PM Vinoth Chandar 
> > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Per our discussion on the dev mailing list (
> > > > >
> > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/rc98303d9f09665af90ab517ea0baeb7c374e9a5478d8424311e285cd%40%3Cdev.hudi.apache.org%3E
> > > > > )
> > > > >
> > > > > I would like to call a VOTE for Apache Hudi graduating as a top
> level
> > > > > project.
> > > > >
> > > > > If this vote passes, the next step would be to submit the
> resolution
> > > > below
> > > > > to the Incubator PMC, who would vote on sending it on to the Apache
> > > > Board.
> > > > >
> > > > > Vote:
> > > > > [ ] +1 - Recommend graduation of Apache Hudi as a TLP
> > > > > [ ] -1 - Do not recommend graduation of Apache Hudi because...
> > > > >
> > > > > The VOTE is open for a minimum of 72 hours.
> > > > >
> > > > > Establish the Apache Hudi Project
> > > > >
> > > > > WHEREAS, the Board of Directors deems it to be in the best
> interests
> > of
> > > > the
> > > > > Foundation and consistent with the Foundation's purpose to
> establish
> > a
> > > > > Project Management Committee charged with the creation and
> > maintenance
> > > of
> > > > > open-source software, for distribution at no charge to the public,
> > > > related
> > > > > to providing atomic upserts and incremental data streams on Big
> Data.
> > > > >
> > > > > NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> > > > (PMC),
> > > > > to be known as the "Apache Hudi Project", be and hereby is
> > established
> > > > > pursuant to Bylaws of the Foundation; and be it further
> > > > >
> > > > > RESOLVED, that the Apache Hudi Project be and hereby is responsible
> > for
> > > > the
> > > > > creation and maintenance of software related to providing atomic
> > > upserts
> > > > > and incremental data streams on Big Data; and be it further
> > > > >
> > > > > RESOLVED, that the office of "Vice President, Apache Hudi" be and
> > > hereby
> > > > is
> > > > > created, the person holding such office to serve at the direction
> of
> > > the
> > > > > Board of Directors as the chair of the Apache Hudi Project, and to
> > have
> > > > > primary responsibility for management of the projects within the
> > scope
> > > of
> > > > > responsibility of the Apache Hudi Project; and be it further
> > > > >
> > > > > RESOLVED, that the persons listed immediately below be and hereby
> are
> > > > > appointed
> > > > > to serve as the initial members of the Apache Hudi Project:
> > > > >
> > > > >  * Anbu Cheeralan   
> > > > >
> > > > >  * Balaji Varadarajan
> > > > >
> > > > >  * Bhavani Sudha Saktheeswaran   
> > > > >
> > > > >  * Luciano Resende 
> > > > >
> > > > >  * Nishith Agarwal >
> > > > >
> > > > >  * Prasanna Rajaperumal
> > > > >
> > > > >  * Shaofeng Li   
> > > > >
> > > > >  * Steve Blackmon  
> > > > >
> > > > >  * Suneel Marthi  
> > > > >
> > > > >  * Thomas Weise
> > > > >
> > > > >  * Vino Yang   <
> vinoy...@apache.org>
> > > > >
> > > > >  * Vinoth Chandar
> > > > >
> > > > > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Vinoth Chandar be
> > > appointed
> > > > to
> > > > > the office of Vice President, Apache Hudi, to serve in accordance
> > with
> > > > and
> > > > > subject to the direction of the Board of Directors and the Bylaws
> of
> > > the
> > > > > Foundation until death, resignation, retirement, removal of
> > > > > disqualification, or until a successor is appointed; and
> > > > >
> > > > > be it further
> > > > >
> > > > > RESOLVED, that the Apache Hudi Project be and hereby is tasked with
> > the
> > > > > migration and rationalization of the Apache Incubator Hudi podling;
> > and
> > > > >
> > > > > be it further
> > > > >
> > > > > RESOLVED, that all responsibilities pertaining to the Apache
> > Incubator
> > > > Hudi
> > > > > podling encumbered upon the Apache Incubator PMC are hereafter
> > > > discharged.
> > > > >
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Why add unit tests for hudi-cli module

2020-05-12 Thread Shiyan Xu
Hi, the tests in hudi-cli are more functional tests. They are conducive
to verifying that features in the cli module are working. Though they don't
cover all options, it is always better to have some passing tests than none,
isn't it? :)

On Tue, May 12, 2020 at 8:31 AM hmantu  wrote:

> hi all,
>
>
>
> I cannot understand why we add so many tests for the hudi-cli module? We know
> that each command has many options, unit tests cannot cover all of these
> options, and each module has its own unit tests, so I think they are
> redundant. Any thoughts?
>
>
>
>
>
> Thanks


Re: Question on DeltaStreamer

2020-03-18 Thread Shiyan Xu
To answer your question regarding the properties file:
it is a way to manage a bunch of hoodie configurations; those confs will be
merged with other confs passed via --hoodie-conf (see the config merging
logic in DeltaSync).
So any hoodie conf can be put there. Usually we put "configurations for
hoodie client, schema provider, key generator and data source" (per the
docs).
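
For example, a minimal properties file might look like the sketch below; the
keys are standard hoodie configs, but the values and paths are purely
illustrative:

# record key and partition path for the target table
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=order_date
# where the DFS source picks up new input files
hoodie.deltastreamer.source.dfs.root=s3://bucket/dms-output/orders
# source schema, if you use a file-based schema provider
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://bucket/schemas/orders.avsc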

On Wed, Mar 18, 2020 at 6:50 AM Syed Zaidi 
wrote:

> Hi,
>
> I hope things are good. We are planning on using DeltaStreamer as a client
> for hudi. Our plan is to use AWS DMS for initial load & CDC. The question I
> have is around the documentation for the properties file that I need for
> dfs, source & target. Where can I find more information on the properties
> files needed for the client.
>
> Lets say if I have a source table in Oracle in the format below, will my
> avro schema for source and target will be same.
>
> CREATE TABLE orders
>   (​
> order_id NUMBER GENERATED BY DEFAULT AS IDENTITY START WITH 106
> PRIMARY KEY,​
> customer_id NUMBER( 6, 0 ) NOT NULL, ​
> status  VARCHAR( 20 ) NOT NULL ,​
> salesman_id NUMBER( 6, 0 ) , ​
> order_date   TIMESTAMP NOT NULL​
>   );
>
> I would appreciate your help in this regard.
>
> We are on this stack:
>
> EMR : emr-5.29.0
> Spark: Spark 2.4.4, spark-avro_2.11:2.4.4
>
> Thanks
> Syed Zaidi
>


Re: Sequence of Transformers

2020-03-23 Thread Shiyan Xu
Seems like an abstract class would be good enough for generic use?
The user can provide a list of `Transformer`s, and the abstract class just
applies them one after another through the list.
The implementation can be minimal for this approach; a rough sketch below.
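
A rough sketch of that idea, assuming the existing Transformer interface (the
class name is illustrative, and package locations may vary across versions):

import java.util.List;
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ChainedTransformer implements Transformer {
  private final List<Transformer> chain;

  public ChainedTransformer(List<Transformer> chain) {
    this.chain = chain;
  }

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    Dataset<Row> current = rowDataset;
    // apply each transformer in the declared order, feeding the output
    // of one into the next
    for (Transformer t : chain) {
      current = t.apply(jsc, sparkSession, current, properties);
    }
    return current;
  }
}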

On Mon, Mar 23, 2020 at 4:12 PM Vinoth Chandar  wrote:

> sg. Filed https://issues.apache.org/jira/browse/HUDI-731
>
> Someone looking to pick this? :). Its an nice feature to implement, that
> fits a good template..
>
> ofc we can discuss this more here in parallel
>
> On Mon, Mar 23, 2020 at 8:31 AM FO O  wrote:
>
> > Thank you Vinoth.
> >
> > >"If you are talking about implementing support for chained calling of
> > multiple Transformers, within DeltaStreamer itself"
> >
> > Yes, chained calling support for transformers would be super helpful, if
> > this discussion can be  revived it would be great.
> >
> > I see this useful for folks using DMS transformer and that need some kind
> > of transformation before the DMS transformer adds the op field for
> initial
> > load or when loading the CDC. In the meantime, I will create a custom
> > transformer.
> >
> > Thanks again,
> > -F.
> >
> >
> > Vinoth Chandar  wrote on Sunday, 22/03/2020 at
> > 20:58:
> >
> > > Hi F,
> > >
> > > The Transformer interface allows you to basically plugin anything that
> > > takes a DataFrame and returns a transformed DataFrame. Does that help?
> > > If you are talking about implementing support for chained calling of
> > > multiple Transformers, within DeltaStreamer itself..It has been
> discussed
> > > before.
> > > And we can revive that conversation.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Sat, Mar 21, 2020 at 5:14 PM FO O  wrote:
> > >
> > > > Hi team!
> > > >
> > > > My use case would benefit from running a SQL transformer followed by
> > the
> > > > DMS transformer.
> > > >
> > > > It seems my best option is to create a new transformer that is based
> > on
> > > > the current DMS transformer and add the additional transformation I
> > need
> > > > (add new columns, concatenate fields).
> > > >
> > > > Wanted to see if there are additional recommendations that I should
> > > > consider instead of this one.
> > > >
> > > > Thank you,
> > > > F
> > > >
> > >
> >
>


Re: [Discussion] hudi support log append scenario with better write and asynchronous compaction

2020-05-19 Thread Shiyan Xu
Hi Wei,

+1 on the proposal; append-only is a commonly seen use case.

IIUC, the main concern is that Hudi, by default, generates small files
internally in COW tables, and setting `hoodie.parquet.small.file.limit` can
reduce the number of small files but slows down the pipeline (by doing
compaction).

Regarding the option you mentioned: when writing to parquet directly, did
you consider setting params for bulk insert? It should be possible to make
bulk writes bounded by time and size so that you always have a reasonable
size for the output.
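
As a sketch of what I mean (same datasource options as the quick start; the
parallelism value is illustrative and would need tuning):

df.write.format("hudi").
  option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
  // bounds the number/size of output files written per batch
  option("hoodie.bulkinsert.shuffle.parallelism", "200").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)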

I agree with Vinoth's point
> The main blocker for us to send inserts into logs, is having the ability
to
do log indexing (we wanted to support someone who may want to do inserts
and suddenly wants to upsert the table)

Logs are most of the time append-only, but due to GDPR or other compliance
requirements we may have to scrub some fields later.
Looks like we could phase the support: phase 1 is to write parquet as log
files; phase 2 is to support upsert on demand. This seems to be a different
table type (neither COW nor MOR; sounds like merge-on-demand?)



On Sun, May 17, 2020 at 10:10 AM wei li  wrote:

> Thanks, Vinoth Chandar
> Just like  https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112 ,
> we need a mechanism to solve two issues.
> 1. On the write side: do not compact, for faster writes (merge on read can
> solve this problem now).
> 2. Compaction and read: also a mechanism to collapse older smaller files
> into larger ones while keeping the query cost low (with merge on read, if
> we do not compact, the real-time read will be slow).
>
> We have an option:
> 1. On the write side: just write parquet, no compaction.
> 2. Compaction and read: because the small files are parquet, the real-time
> read can be fast; also, users can run asynchronous compaction to collapse
> older smaller parquet files into larger parquet files.
>
> Best Regards,
> Wei Li.
>
> On 2020/05/14 16:54:24, Vinoth Chandar  wrote:
> > Hi Wei,
> >
> > Thanks for starting this thread. I am trying to understand your concern -
> > which seems to be that for inserts, we write parquet files instead of
> > logging?  FWIW Hudi already supports asynchronous compaction... and a
> > record reader flag that can avoid merging for cases where there are only
> > inserts..
> >
> > The main blocker for us to send inserts into logs, is having the ability
> to
> > do log indexing (we wanted to support someone who may want to do inserts
> > and suddenly wants to upsert the table).. If we can sacrifice on that
> > initially, it's very doable.
> >
> > Will wait for others to chime in as well.
> >
> > On Thu, May 14, 2020 at 9:06 AM wei li  wrote:
> >
> > > The business scenarios of the data lake mainly include analysis of
> > > databases, logs, and files.
> > >
> > > At present, hudi can support the scenario where database cdc is
> > > incrementally written to hudi well, and it is also working on bulk
> > > loading files into hudi.
> > >
> > > However, there is no good native support for log scenarios (requiring
> > > high-throughput writes, no updates or deletions, and focusing on
> > > small-file scenarios); today one can write through inserts without
> > > deduplication, but they will still merge on the write side.
> > >
> > >- In copy-on-write mode, when "hoodie.parquet.small.file.limit" is
> > >100MB, every small batch will cost some time for merging, which
> > >reduces write throughput.
> > >- This scenario is not suitable for merge on read.
> > >- The actual scenario only needs to write parquet in batches when
> > >writing, and then provide compaction afterwards (similar to Delta
> > >Lake).
> > >
> > >
> > > I created an RFC with more details
> > >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > >
> > >
> > > Best Regards,
> > > Wei Li.
> > >
> > >
> > >
> >
>


Re: Apache Hudi Graduation vote on general@incubator

2020-05-22 Thread Shiyan Xu
Great news. Congratulations!

On Fri, May 22, 2020 at 5:40 PM wangxianghu  wrote:

> Congratulations, great job!
>
> Sent from my iPhone
>
> > On May 23, 2020, at 05:59, Sivabalan  wrote:
> >
> > Congrats :) Kudos to Vinoth and the community :)
> >
> >
> >> On Fri, May 22, 2020 at 5:57 PM Mehrotra, Udit
> 
> >> wrote:
> >>
> >> Congrats Vinoth and to this amazing community on a major milestone !
> >>
> >> On 5/22/20, 10:11 AM, "Pratyaksh Sharma" 
> wrote:
> >>
> >>CAUTION: This email originated from outside of the organization. Do
> >> not click links or open attachments unless you can confirm the sender
> and
> >> know the content is safe.
> >>
> >>
> >>
> >>That is a great news! Congratulations to the entire community. :)
> >>
> >>On Fri, May 22, 2020 at 10:03 PM Gary Li 
> >> wrote:
> >>
> >>> Huge congrats to the Hudi community! Great job!
> >>>
> >>> On Fri, May 22, 2020 at 9:30 AM Vinoth Chandar 
> >> wrote:
> >>>
>  Folks, I am very happy share that the graduation resolution has
> >> been
>  approved by the Apache board!
>  Congratulations everyone! :)
> 
>  More to come, as we prepare to exit the incubator. Stay tuned!!
> 
>  On Tue, May 19, 2020 at 7:33 PM Balaji Varadarajan
>   wrote:
> 
> > Terrific job :) We are marching on !!
> > Balaji.V
> >On Tuesday, May 19, 2020, 05:16:57 PM PDT, Sivabalan <
> > n.siv...@gmail.com> wrote:
> >
> > wow ! 19 binding votes. Great :)
> >
> >
> > On Tue, May 19, 2020 at 1:55 AM lamber-ken 
> >> wrote:
> >
> >>
> >>
> >>
> >> Gread job! and good luck for apache hudi project.
> >>
> >>
> >>
> >>
> >> Best,
> >> Lamber-Ken
> >>
> >> At 2020-05-19 13:35:11, "Vinoth Chandar" 
> >> wrote:
> >>> Folks,
> >>>
> >>> the vote has passed!
> >>>
> >>
> >
> 
> >>>
> >>
> https://lists.apache.org/thread.html/r86278a1a69bbf340fa028aca784869297bd20ab50a71f4006669cdb5%40%3Cgeneral.incubator.apache.org%3E
> >>>
> >>>
> >>> I will follow up with the next step [1], which is to submit
> >> the
> > resolution
> >>> to the board.
> >>>
> >>> [1]
> >>>
> >>
> >
> 
> >>>
> >>
> https://incubator.apache.org/guides/graduation.html#submission_of_the_resolution_to_the_board
> >>>
> >>> On Sun, May 17, 2020 at 7:14 PM 岳伟  wrote:
> >>>
>  +1 Graduate Apache Hudi from the Incubator
> 
> 
> 
> 
>  Harvey Yue
> 
> 
>  On 05/16/2020 22:49,hamid pirahesh
> >> wrote:
>  [x ] +1 Graduate Apache Hudi from the Incubator.>
> 
>  On Fri, May 15, 2020 at 7:06 PM Vinoth Chandar <
> >> vin...@apache.org
> 
> >> wrote:
> 
>  Hello all,
> 
>  Just started the VOTE on the IPMC general list [1]
> 
>  If you are an IPMC member, you do a *binding *vote
>  If you are not, you can still do a *non-binding* vote
> 
>  Please take a moment to vote.
> 
>  [1]
> 
> 
> 
> >>
> >
> 
> >>>
> >>
> https://lists.apache.org/thread.html/r8039c8eece636df8c81a24c26965f5c1556a3c6404de02912d6455b4%40%3Cgeneral.incubator.apache.org%3E
> 
>  Thanks
>  Vinoth
> 
> 
> >>
> >
> >
> > --
> > Regards,
> > -Sivabalan
> 
> >>>
> >>
> >>
> >>
> >
> > --
> > Regards,
> > -Sivabalan
> >
>
>


[DISCUSS] Write failed records

2020-05-22 Thread Shiyan Xu
Hi all,

I'd like to bring up this discussion around handling errors in Hudi write
paths.
https://issues.apache.org/jira/browse/HUDI-648

Trying to gather some feedback on the implementation details:
1. Error location
I'm thinking of writing the failed records to `.hoodie/errors/`, to
a) encapsulate the data within the Hudi table for ease of management
b) make use of an existing dedicated directory

2. Write path
org.apache.hudi.client.HoodieWriteClient#postWrite
org.apache.hudi.client.HoodieWriteClient#completeCompaction
These 2 methods should be the places to persist failed records in
`org.apache.hudi.table.action.HoodieWriteMetadata#writeStatuses`
to the designated location (see the sketch after this list)

3. Format
Records should be written as log files (avro format)

4. Metric
After writing failed records, we should emit a metric with a basic count of
errors written, making it easier for monitoring systems to pick up and alert on.
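
For points 2 and 3, a rough sketch of what the persisting step could look
like (assuming `writeStatuses` is a JavaRDD[WriteStatus] and the instant
time and base path are in scope; purely illustrative, not a final API):

import scala.collection.JavaConverters._

// keep only the statuses that recorded per-record failures
val errorRows = writeStatuses.rdd
  .filter(_.hasErrors)
  .flatMap(_.getErrors.asScala.map { case (key, t) =>
    (key.getRecordKey, key.getPartitionPath, t.getMessage)
  })

spark.createDataFrame(errorRows)
  .toDF("record_key", "partition_path", "error")
  .write.format("avro") // requires spark-avro on the classpath
  .save(s"$basePath/.hoodie/errors/$instantTime")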

Foreseeably, some details may need to be adjusted throughout the
development. To begin with, we may agree on a feasible plan at a high level.

Please kindly share thoughts and feedback. Thank you.



Regards,
Raymond


Re: hudi dependency conflicts for test

2020-05-21 Thread Shiyan Xu
Hi Lian, it appears that you need to have spark-avro_2.11:2.4.4 in your
classpath.
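
For example, if your test project builds with Gradle, something like this
in build.gradle should work:

    testImplementation "org.apache.spark:spark-avro_2.11:2.4.4"

or, when submitting to a cluster, pass
`--packages org.apache.spark:spark-avro_2.11:2.4.4` to spark-submit/spark-shell.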



On Thu, May 21, 2020 at 10:04 AM Lian Jiang  wrote:

> Thanks Balaji.
>
> My unit test failed due to dependency incompatibility. Any idea will be
> highly appreciated!
>
>
> The test is copied from hudi quick start:
>
> import org.apache.hudi.QuickstartUtils._
>
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
>
> class InputOutputTest extends HudiBaseTest{
>
> val config = new SparkConf()
>   config.set("spark.driver.allowMultipleContexts", "true")
>   config.set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>   config.setMaster("local[*]").setAppName("Local Test")
>   val executionContext =
> SparkSession.builder().config(config).getOrCreate()
>
> val tableName = "hudi_trips_cow"
>   val basePath = "file:///tmp/hudi_trips_cow"
>   val dataGen = new DataGenerator
>
>   override def beforeAll(): Unit = {
>   }
>
>   test("Can create a hudi dataset") {
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = executionContext.read.json(
>   executionContext.sparkContext.parallelize(inserts, 2))
>
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   mode(Overwrite).
>   save(basePath)
>   }
> }
>
>
> The exception is:
>
> java.lang.NoClassDefFoundError: org/apache/spark/sql/avro/SchemaConverters$
> at
> org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:87)
> at
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:93)
> at
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at
> com.zillow.dataforce_storage_poc.hudi.InputOutputTest$$anonfun$1.apply$mcV$sp(InputOutputTest.scala:34)
> at
> com.zillow.dataforce_storage_poc.hudi.InputOutputTest$$anonfun$1.apply(InputOutputTest.scala:22)
> at
> com.zillow.dataforce_storage_poc.hudi.InputOutputTest$$anonfun$1.apply(InputOutputTest.scala:22)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at 

Re: hudi dependency conflicts for test

2020-05-21 Thread Shiyan Xu
That was a close one. :)

On Thu, May 21, 2020 at 10:46 AM Vinoth Chandar  wrote:

> Wow.. Race condition :) ..
>
> Thanks for racing , Raymond!
>
> On Thu, May 21, 2020 at 10:08 AM Shiyan Xu 
> wrote:
>
> > Hi Lian, it appears that you need to have spark-avro_2.11:2.4.4 in your
> > classpath.
> >
> >
> >
> > On Thu, May 21, 2020 at 10:04 AM Lian Jiang 
> wrote:
> >
> > > Thanks Balaji.
> > >
> > > My unit test failed due to dependency incompatibility. Any idea will be
> > > highly appreciated!
> > >
> > >
> > > The test is copied from hudi quick start:
> > >
> > > import org.apache.hudi.QuickstartUtils._
> > >
> > > import scala.collection.JavaConversions._
> > > import org.apache.spark.sql.SaveMode._
> > > import org.apache.hudi.DataSourceReadOptions._
> > > import org.apache.hudi.DataSourceWriteOptions._
> > > import org.apache.hudi.config.HoodieWriteConfig._
> > >
> > > class InputOutputTest extends HudiBaseTest{
> > >
> > > val config = new SparkConf().setAppName(name)
> > >   config.set("spark.driver.allowMultipleContexts", "true")
> > >   config.set("spark.serializer",
> > > "org.apache.spark.serializer.KryoSerializer")
> > >   config.setMaster("local[*]").setAppName("Local Test")
> > >   val executionContext =
> > > SparkSession.builder().config(config).getOrCreate()
> > >
> > > val tableName = "hudi_trips_cow"
> > >   val basePath = "file:///tmp/hudi_trips_cow"
> > >   val dataGen = new DataGenerator
> > >
> > >   override def beforeAll(): Unit = {
> > >   }
> > >
> > >   test("Can create a hudi dataset") {
> > > val inserts = convertToStringList(dataGen.generateInserts(10))
> > > val df = executionContext.sparkSession.read.json(
> > >   executionContext.sparkContext.parallelize(inserts, 2))
> > >
> > > df.write.format("hudi").
> > >   options(getQuickstartWriteConfigs).
> > >   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
> > >   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
> > >   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> > >   option(TABLE_NAME, tableName).
> > >   mode(Overwrite).
> > >   save(basePath)
> > >   }
> > > }
> > >
> > >
> > > The exception is:
> > >
> > > java.lang.NoClassDefFoundError:
> > org/apache/spark/sql/avro/SchemaConverters$
> > > at
> > >
> >
> org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:87)
> > > at
> > >
> >
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:93)
> > > at
> > > org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> > > at
> > >
> >
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> > > at
> > >
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> > > at
> > >
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> > > at
> > >
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> > > at
> > >
> >
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> > > at
> > >
> >
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> > > at
> > >
> >
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> > > at
> > >
> >
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> > > at
> > >
> >
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> > > at
> > > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> > > at
> > >
> >
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> > > at
> > >
> >
> org.apache.spar

Re: hudi dependency conflicts for test

2020-05-21 Thread Shiyan Xu
> at
> org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
> at
> org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at
> org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
> at java.base/java.lang.Thread.run(Thread.java:835)
>
>
> On Thu, May 21, 2020 at 10:46 AM Vinoth Chandar  wrote:
>
> > Wow.. Race condition :) ..
> >
> > Thanks for racing , Raymond!
> >
> > On Thu, May 21, 2020 at 10:08 AM Shiyan Xu 
> > wrote:
> >
> > > Hi Lian, it appears that you need to have spark-avro_2.11:2.4.4 in your
> > > classpath.
> > >
> > >
> > >
> > > On Thu, May 21, 2020 at 10:04 AM Lian Jiang 
> > wrote:
> > >
> > > > Thanks Balaji.
> > > >
> > > > My unit test failed due to dependency incompatibility. Any idea will
> be
> > > > highly appreciated!
> > > >
> > > >
> > > > The test is copied from hudi quick start:
> > > >
> > > > import org.apache.hudi.QuickstartUtils._
> > > >
> > > > import scala.collection.JavaConversions._
> > > > import org.apache.spark.sql.SaveMode._
> > > > import org.apache.hudi.DataSourceReadOptions._
> > > > import org.apache.hudi.DataSourceWriteOptions._
> > > > import org.apache.hudi.config.HoodieWriteConfig._
> > > >
> > > > class InputOutputTest extends HudiBaseTest{
> > > >
> > > > val config = new SparkConf().setAppName(name)
> > > >   config.set("spark.driver.allowMultipleContexts", "true")
> > > >   config.set("spark.serializer",
> > > > "org.apache.spark.serializer.KryoSerializer")
> > > >   config.setMaster("local[*]").setAppName("Local Test")
> > > >   val executionContext =
> > > > SparkSession.builder().config(config).getOrCreate()
> > > >
> > > > val tableName = "hudi_trips_cow"
> > > >   val basePath = "file:///tmp/hudi_trips_cow"
> > > >   val dataGen = new DataGenerator
> > > >
> > > >   override def beforeAll(): Unit = {
> > > >   }
> > > >
> > > >   test("Can create a hudi dataset") {
> > > > val inserts = convertToStringList(dataGen.generateInserts(10))
> > > > val df = executionContext.sparkSession.read.json(
> > > >   executionContext.sparkContext.parallelize(inserts, 2))
> > > >
> > > > df.write.format("hudi").
> > > >   options(getQuickstartWriteConfigs).
> > > >   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
> > > >   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
> > > >   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> > > >   option(TABLE_NAME, tableName).
> > > >   mode(Overwrite).
> > > >   save(basePath)
> > > >   }
> > > > }
> > > >
> > > >
> > > > The exception is:
> > > >
> > > > java.lang.NoClassDefFoundError:
> > > org/apache/spark/sql/avro/SchemaConverters$
> > > > at
> > > >
> > >
> >
> org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:87)
> > > > at
> > > >
> > >
> >
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:93)
> > > > at
> > > > org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> > > > at
> > > >
> > >
> >
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> > > > at
> > > >
> > >
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> > > > at
> > > >
> > >
> >
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> > > > at
> > > >

Re: [VOTE] Release 0.6.0, release candidate #1

2020-08-21 Thread Shiyan Xu
I should have documented this... (which I will do soon)

When running from the terminal, could you please try running with a Maven
profile, like
`mvn -Punit-tests test`
`mvn -Pfunctional-tests test`
which should work.

Best,
Raymond


On Fri, Aug 21, 2020 at 9:44 PM Gary Li  wrote:

> +1 (non binding)
> - Complied successfully
> - Ran validation script successfully
> - Ran tests from IntelliJ successfully
>
> Seeing the same issue as Siva. The tests were passed in IDE.
>
> Best Regards,
> Gary Li
>
>
> On 8/21/20, 2:29 PM, "Sivabalan"  wrote:
>
> +1 (non binding)
> - Compilation successful
> - Ran validation script which verifies checksum, keys, license, etc.
> - Ran quick start
> - Ran some tests from intellij.
>
> JFYI: when I ran mvn test, I encountered some test failures due to
> multiple
> spark contexts. Have raised a ticket here
> . But all tests are
> succeeding in CI and I could run from within intellij. So, not
> blocking the
> RC.
>
> Checking Checksum of Source Release-e Checksum Check of Source Release
> -
> [OK]
>   % Total% Received % Xferd  Average Speed   TimeTime Time
> Current
>Dload  Upload
> Total   SpentLeft  Speed
> 100 30225  100 302250 0   106k  0 --:--:-- --:--:--
> --:--:--
> 106k
> Checking Signature
> -e Signature Check - [OK]
> Checking for binary files in source release
> -e No Binary Files in Source Release? - [OK]
> Checking for DISCLAIMER
> -e DISCLAIMER file exists ? [OK]
> Checking for LICENSE and NOTICE
> -e License file exists ? [OK]-
> e Notice file exists ? [OK]
> Performing custom Licensing Check
> -e Licensing Check Passed [OK]
> Running RAT Check
> -e RAT Check Passed [OK]
>
>
>
> On Fri, Aug 21, 2020 at 12:37 PM Bhavani Sudha <
> bhavanisud...@gmail.com>
> wrote:
>
> > Vino yang,
> >
> > I am working on the release blog. While the RC is in progress, the
> doc and
> > site updates are happening this week.
> >
> > Thanks,
> > Sudha
> >
> > On Fri, Aug 21, 2020 at 4:23 AM vino yang 
> wrote:
> >
> > > +1 from my side
> > >
> > > I checked:
> > >
> > > - ran `mvn clean package` [OK]
> > > - ran `mvn test` in my local [OK]
> > > - signature [OK]
> > >
> > > BTW, where is the link of the release blog?
> > >
> > > Best,
> > > Vino
> > >
> > > Bhavani Sudha  wrote on Thu, Aug 20, 2020 at 12:03 PM:
> > >
> > > > Hi everyone,
> > > > Please review and vote on the release candidate #1 for the
> version
> > 0.6.0,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience
> releases to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key
> with
> > > > fingerprint 7F66CD4CE990983A284672293224F200E1FC2172 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository
> [4],
> > > > * source code tag "release-0.6.0-rc1" [5],
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12346663
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.6.0-rc1/
> > > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > > > [4]
> > >
> https://repository.apache.org/content/repositories/orgapachehudi-1025/
> > > > [5] https://github.com/apache/hudi/tree/release-0.6.0-rc1
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Codestyle: force multiline indentation

2020-08-22 Thread Shiyan Xu
It can be up to the individual whether to use the IDE formatter or not, as
long as there is a tool to help enforce Checkstyle rules.
For people who use the IDE formatter, importing Checkstyle.xml as a format
scheme does not fully control the formatter's behavior, which is why the IDE
sometimes gets in the way. But most of the time it serves us well.
Guess we can close the thread, as we're all in favor of spotless?
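
For reference, wiring spotless into the parent pom could look roughly like
this (standard plugin coordinates; the formatter config path is a placeholder):

<plugin>
  <groupId>com.diffplug.spotless</groupId>
  <artifactId>spotless-maven-plugin</artifactId>
  <configuration>
    <java>
      <eclipse>
        <!-- placeholder path to a shared Eclipse formatter config -->
        <file>${project.basedir}/style/eclipse-java-formatter.xml</file>
      </eclipse>
    </java>
  </configuration>
</plugin>

Then `mvn spotless:apply` auto-fixes formatting, and `mvn spotless:check`
can enforce it in CI.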



On Sat, Aug 22, 2020 at 6:08 AM vino yang  wrote:

> Hi vc,
>
> Yes, this part of the practice may have different preferences for different
> developers. I have never opened the IDE's automatic formatting, nor have I
> used the IDE's formatting functions artificially. Because I have
> participated in multiple open source communities, each open source
> community has its own conventions on code style. So, I just understand the
> style of each community, after changing the code, and then compiling
> locally, checkstyle will identify the related problems, and then report,
> and then I will modify until the compilation is passed.
>
> I admit that this is my personal behavior, and everything has its two
> sides. IDE automatic formatting will make it more convenient for developers
> to deal with code styles. On the other hand, it will also make the
> community more complicated when considering related conventions and weigh
> more factors.
>
> Best,
> Vino
>
> Vinoth Chandar  wrote on Sat, Aug 22, 2020 at 2:25 PM:
>
> > >But, IMO, we can ignore the IDE here, if it breaks the code style,
> > checkstyle will stop building and spotless will work.
> >
> > I differ here slightly. Most people reformat code using the "format code"
> > in the IDE. And IDEs also can reorganize the code when you save etc.
> > We need a solid way to not be fighting the IDE all the time :). So it may
> > be okay to not go with how IDE formats things, but we need to ensure IDE
> > does not get in the way.
> >
> > thoughts?
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Aug 21, 2020 at 1:26 PM Nishith  wrote:
> >
> > > +1 for spotless, automating the formatting will definitely help
> > > productivity and turnaround time for PRs.
> > >
> > > -Nishith
> > >
> > > Sent from my iPhone
> > >
> > > > On Aug 21, 2020, at 11:53 AM, Sivabalan  wrote:
> > > >
> > > > totally +1 for spotless.
> > > >
> > > >
> > > >> On Thu, Aug 20, 2020 at 8:53 AM leesf  wrote:
> > > >>
> > > >> +1 on using mvn spotless:apply to fix the codestyle.
> > > >>
> > > >> Bhavani Sudha  wrote on Wed, Aug 19, 2020 at 12:31 PM:
> > > >>
> > > >>> +1 on auto code formatting. I also think it should be okay to be
> even
> > > >> more
> > > >>> restrictive by failing builds when the code format is not adhered
> (in
> > > any
> > > >>> environment). That way everyone is forced to use the same
> formatting.
> > > >>>
> > > >>>> On Tue, Aug 18, 2020 at 8:47 PM vino yang 
> > > wrote:
> > > >>>
> > > >>>>> the key challenge has been keeping checkstyle, IDE and spotless
> > > >>> agreeing
> > > >>>> on the same thing.
> > > >>>>
> > > >>>> Yes, it's the key thing. But, IMO, we can ignore the IDE here, if
> it
> > > >>> breaks
> > > >>>> the code style, checkstyle will stop building and spotless will
> > work.
> > > >>>>
> > > >>>>> Vinoth Chandar  wrote on Wed, Aug 19, 2020 at 7:49 AM:
> > > >>>>
> > > >>>>> the key challenge has been keeping checkstyle, IDE and spotless
> > > >>> agreeing
> > > >>>> on
> > > >>>>> the same thing.
> > > >>>>>
> > > >>>>> your understanding is correct. CI will enforce in a similar
> > fashion.
> > > >>>>> Spotless just makes us productive by auto fixing all the
> checkstyle
> > > >>>>> violations, without having to manually fix by hand.
> > > >>>>>
> > > >>>>> On Tue, Aug 18, 2020 at 4:42 PM Shiyan Xu <
> > > >> xu.shiyan.raym...@gmail.com
> > > >>>>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> I think adding spotless as a tooling command to auto fix code is
> > > >>>>> beneficial
> > >

Re: Sequence of Transformers

2020-07-15 Thread Shiyan Xu
Hi James, glad that this could be helpful. Would it be possible for you to
build the jar off the master branch and use it in the meantime? You could
install the jars on the EMR nodes during bootstrap.

On Wed, Jul 15, 2020 at 8:11 AM James Walter 
wrote:

> Hello! +1 on this feature. I see that it is targeted to be shipped with
> release 0.6.0. I know there are other priorities but if the community could
> consider shipping this with earlier releases it would be great (my
> customers don't like me messing with what EMR provides by default).
>
> Thank you, James.
> On 2020/03/26 00:39:41, FO O  wrote:
> > Thank you folks for the fast response and work.
> >
> > Vinoth Chandar  wrote on Wednesday, 25/03/2020 at
> > 11:22:
> >
> > > btw Raymond already has a PR up here for this :)
> > > https://github.com/apache/incubator-hudi/pull/1440
> > >
> > > On Mon, Mar 23, 2020 at 5:32 PM Shiyan Xu  >
> > > wrote:
> > >
> > > > Seems like an abstract class would be good enough for generic use?
> > > > User can provide a list of `Transformer` then the abstract class just
> > > apply
> > > > all the way through the list.
> > > > The implementation can be minimal for this approach.
> > > >
> > > > On Mon, Mar 23, 2020 at 4:12 PM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > sg. Filed https://issues.apache.org/jira/browse/HUDI-731
> > > > >
> > > > > Someone looking to pick this? :). Its an nice feature to implement,
> > > that
> > > > > fits a good template..
> > > > >
> > > > > ofc we can discuss this more here in parallel
> > > > >
> > > > > On Mon, Mar 23, 2020 at 8:31 AM FO O  wrote:
> > > > >
> > > > > > Thank you Vinoth.
> > > > > >
> > > > > > >"If you are talking about implementing support for chained
> calling
> > > of
> > > > > > multiple Transformers, within DeltaStreamer itself"
> > > > > >
> > > > > > Yes, chained calling support for transformers would be super
> helpful,
> > > > if
> > > > > > this discussion can be  revived it would be great.
> > > > > >
> > > > > > I see this useful for folks using DMS transformer and that need
> some
> > > > kind
> > > > > > of transformation before the DMS transformer adds the op field
> for
> > > > > initial
> > > > > > load or when loading the CDC. In the meantime, I will create a
> custom
> > > > > > transformer.
> > > > > >
> > > > > > Thanks again,
> > > > > > -F.
> > > > > >
> > > > > >
> > > > > > Vinoth Chandar  wrote on Sunday, 22/03/2020 at
> > > > > > 20:58:
> > > > > >
> > > > > > > Hi F,
> > > > > > >
> > > > > > > The Transformer interface allows you to basically plugin
> anything
> > > > that
> > > > > > > takes a DataFrame and returns a transformed DataFrame. Does
> that
> > > > help?
> > > > > > > If you are talking about implementing support for chained
> calling
> > > of
> > > > > > > multiple Transformers, within DeltaStreamer itself..It has been
> > > > > discussed
> > > > > > > before.
> > > > > > > And we can revive that conversation.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > > On Sat, Mar 21, 2020 at 5:14 PM FO O 
> wrote:
> > > > > > >
> > > > > > > > Hi team!
> > > > > > > >
> > > > > > > > My use case would benefit from running a SQL transformer
> followed
> > > > by
> > > > > > the
> > > > > > > > DMS transformer.
> > > > > > > >
> > > > > > > > It seems my best option is to create a new transformer that
> is
> > > > based
> > > > > > on
> > > > > > > > the current DMS transformer and add the additional
> > > transformation I
> > > > > > need
> > > > > > > > (add new columns, concatenate fields).
> > > > > > > >
> > > > > > > > Wanted to see if there are additional recommendations that I
> > > should
> > > > > > > > consider instead of this one.
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > > F
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: PSA: master integ-tests failing

2020-08-01 Thread Shiyan Xu
Looks like this is caused by the scalatest-maven-plugin, which is controlled
by the skipTests property.
I had a fix changing it to skipUTs:
https://github.com/apache/hudi/pull/1897/files

On Fri, Jul 31, 2020 at 8:21 PM nishith agarwal  wrote:

> All,
>
> I've added new log4j properties to the docker setup to limit the spark logs
> from the spark driver. Master should be stable. One thing I noticed during
> this is that the class `HoodieSparkSqlWriter` also runs as part of the
> integration tests which it should not, thus adding to the logs.
>
> @raymond - any ideas why this is happening? I noticed you made some changes
> to the functional tests and from cursory looks of the parent pom.xml I
> couldn't find anything wrong.
>
> Thanks,
> Nishith
>
> On Fri, Jul 31, 2020 at 8:23 AM Vinoth Chandar  wrote:
>
> > Hello all,
> >
> > integ-tests are currently failing due to exceeding the log limit on
> master
> > branch. Nishith is actively debugging what's going on.
> >
> > I request you to hold off merging more PRs in the meantime, until we
> > resolve this.
> >
> > @ nishith , please update this thread, when master is stable again
> >
> > thanks
> > vinoth
> >
>


Re: Merge upserts across partitions

2020-08-10 Thread Shiyan Xu
Looks like you might need to use GLOBAL_BLOOM and set this to true:
https://hudi.apache.org/docs/configurations.html#bloomIndexUpdatePartitionPath

Note that there is a fix related to this setting in the upcoming 0.6.0;
I recommend using it instead of 0.5.2.
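
Concretely, on the writer that would look something like this (both config
keys are real; the field names mirror the quick-start example and would be
your CDC columns in practice):

df.write.format("hudi").
  option("hoodie.index.type", "GLOBAL_BLOOM").
  // delete the record from its old partition and insert it into the new one
  option("hoodie.bloom.index.update.partition.path", "true").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)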

On Sun, Aug 9, 2020 at 10:28 PM Taher Koitawala  wrote:

> Hi All,
>  We are using Hudi for RDBMS CDC, and our use case is to merge
> upserts. However, if a record with a given record key falls into another
> partition, because the key on which we partition may have changed in the
> RDBMS, then those records are not merged. We are using COW tables and the
> Hudi 0.5.2 release.
>
> I suppose this behaviour is because HoodieKey is a record key plus a
> partition path; however, can someone please tell me how we can handle use
> cases like these? Am I missing something obvious?
>
>
> Regards,
> Taher Koitawala
>


[DISCUSS] Codestyle: force multiline indentation

2020-08-10 Thread Shiyan Xu
Hi all,

I noticed that throughout the codebase, when method arguments wrap to a new
line, in some cases the indentation is 4 spaces, while in others the wrapped
line is aligned with the previous line of arguments.

The latter is caused by intelliJ settings of "Align when multiline"
enabled. This won't be flagged by checkstyle due to not setting
*forceStrictCondition* to *true*

https://checkstyle.sourceforge.io/config_misc.html#Indentation_Properties

I'm suggesting setting this to true to avoid the discrepancy and redundant
diffs in PRs caused by individual IDE settings. People who have enabled "Align
when multiline" will need to disable it to pass the checkstyle validation.

WDYT?

Best,
Raymond


Re: [DISCUSS] Codestyle: force multiline indentation

2020-08-10 Thread Shiyan Xu
in that case, yes, all for automation.

On Mon, Aug 10, 2020 at 7:12 PM Vinoth Chandar  wrote:

> Overall, I think we should standardize this across the project.
> But most importantly, may be revive the long dormant spotless effort first
> to enable autofixing of checkstyle issues, before we add more checking?
>
> On Mon, Aug 10, 2020 at 7:04 PM Shiyan Xu 
> wrote:
>
> > Hi all,
> >
> > I noticed that throughout the codebase, when method arguments wrap to a
> new
> > line, there are cases where indentation is 4 and other cases align the
> > wrapped line to the previous line of argument.
> >
> > The latter is caused by intelliJ settings of "Align when multiline"
> > enabled. This won't be flagged by checkstyle due to not setting
> > *forceStrictCondition* to *true*
> >
> >
> https://checkstyle.sourceforge.io/config_misc.html#Indentation_Properties
> >
> > I'm suggesting setting this to true to avoid the discrepancy and
> redundant
> > diffs in PR caused by individual IDE settings. People who have set "Align
> > when multiline" will need to disable it to pass the checkstyle
> validation.
> >
> > WDYT?
> >
> > Best,
> > Raymond
> >
>


Re: DISCUSS code, config, design walk through sessions

2020-07-06 Thread Shiyan Xu
+1

On Mon, Jul 6, 2020 at 9:27 AM vbal...@apache.org 
wrote:

>  +1.
> On Monday, July 6, 2020, 09:11:47 AM PDT, Bhavani Sudha <
> bhavanisud...@gmail.com> wrote:
>
>  +1 this is a great idea!
>
> On Mon, Jul 6, 2020 at 7:54 AM vino yang  wrote:
>
> > +1
> >
> > Adam Feldman  wrote on Mon, Jul 6, 2020 at 9:55 PM:
> >
> > > Interested
> > >
> > > On Mon, Jul 6, 2020, 08:29 Sivabalan  wrote:
> > >
> > > > +1 for sure
> > > >
> > > > On Mon, Jul 6, 2020 at 4:42 AM Gurudatt Kulkarni <
> guruak...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1
> > > > > Really a great idea. Will help in understanding the project better.
> > > > >
> > > > > On Mon, Jul 6, 2020 at 1:35 PM Pratyaksh Sharma <
> > pratyaks...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > This is a great idea and really helpful one.
> > > > > >
> > > > > > On Mon, Jul 6, 2020 at 1:09 PM  wrote:
> > > > > >
> > > > > > > +1
> > > > > > > It can also attract more partners to join us.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 07/06/2020 15:34, Ranganath Tirumala wrote:
> > > > > > > +1
> > > > > > >
> > > > > > > On Mon, 6 Jul 2020 at 16:59, David Sheard <
> > > > > > > david.she...@datarefactory.com.au>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Perfect
> > > > > > > >
> > > > > > > > On Mon, 6 Jul. 2020, 1:30 pm Vinoth Chandar, <
> > vin...@apache.org>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > As we scale the community, its important that more of us
> are
> > > able
> > > > > to
> > > > > > > help
> > > > > > > > > users, users becoming contributors.
> > > > > > > > >
> > > > > > > > > In the past, we have drafted faqs, troubleshooting guides.
> > > But I
> > > > > > feel
> > > > > > > > > sometimes, more hands on walk through sessions over video
> > could
> > > > > help.
> > > > > > > > >
> > > > > > > > > I am happy to spend 2 hours each on code/configs,
> > > > > > > > design/perf/architecture.
> > > > > > > > > Have the session be recorded as well for future.
> > > > > > > > >
> > > > > > > > > What does everyone think?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Vinoth
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Ranganath Tirumala
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >


GitHub release display issue

2020-07-14 Thread Shiyan Xu
The new GitHub UI displays 0.4.7 as the latest release, which is misleading.
I guess adding release notes to the later releases could resolve it?


Re: DISCUSS code, config, design walk through sessions

2020-07-14 Thread Shiyan Xu
+1

On Tue, Jul 14, 2020, 11:34 AM Vinoth Chandar  wrote:

> Typo: date TBD (not data :))
>
> On Tue, Jul 14, 2020 at 11:20 AM Adam Feldman  wrote:
>
> > +1
> >
> > On Tue, Jul 14, 2020, 14:09 Gary Li  wrote:
> >
> > > +1. 8am works for me.
> > >
> > > On Tue, Jul 14, 2020 at 11:01 AM Vinoth Chandar 
> > wrote:
> > >
> > > > Hello all,
> > > >
> > > > please chime in. We will plan to freeze Tuesday 8AM (data TBD) by EOD
> > PST
> > > > today.
> > > >
> > > > thanks
> > > > Vinoth
> > > >
> > > > On Mon, Jul 13, 2020 at 12:38 AM Pratyaksh Sharma <
> > pratyaks...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > 8 AM PST works for me. This is actually more suitable for me than
> the
> > > > > community sync time.
> > > > >
> > > > > Will wait for others to respond. If 8 AM does not work for majority
> > of
> > > > > people, I will start a new thread for revoting.
> > > > >
> > > > > On Mon, Jul 13, 2020 at 11:55 AM David Sheard <
> > > > > david.she...@datarefactory.com.au> wrote:
> > > > >
> > > > > > That is 01:00 Canberra Australia time. But that is fine
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Mon, 13 Jul. 2020, 11:55 am Vinoth Chandar, <
> vin...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > NO. time/date is not finalized yet until we resolve the time
> zone
> > > > > issues.
> > > > > > > let's
> > > > > > > spend some time confirming the time in the next few days. and a
> > > week
> > > > > for
> > > > > > me
> > > > > > > to prep some slides/docs to run through the course as well.
> > > > > > > Once finalized, we will send an explicit email spelling out the
> > > > > > time/date.
> > > > > > >
> > > > > > > YES on recording and make it available. (I need to find a tool
> > that
> > > > can
> > > > > > > allow me to do that).
> > > > > > >
> > > > > > > On the time zones, would 8AM PST be more amenable (we can also
> > > revote
> > > > > on
> > > > > > > community sync. @pratyaksh , please start a new thread on that
> if
> > > > > > > interested)
> > > > > > >
> > > > > > > Location                                Local Time    Time Zone  UTC Offset
> > > > > > > San Jose (USA - California)             8:00:00 am    PDT        UTC-7 hours
> > > > > > > New York (USA - New York)               11:00:00 am   EDT        UTC-4 hours
> > > > > > > New Delhi (India - Delhi)               8:30:00 pm    IST        UTC+5:30 hours
> > > > > > > Shanghai (China - Shanghai Municipality) 11:00:00 pm  CST        UTC+8 hours
> > > > > > > London (United Kingdom - England)       4:00:00 pm    BST        UTC+1 hour
> > > > > > >
> > > > > > > Please speak up if this does not work for anyone..
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Jul 12, 2020 at 4:12 PM Ranganath Tirumala <
> > > > > > > ranganath.tirum...@gmail.com> wrote:
> > > > > > >
> > > > > > > > So, Is this confirmed for 14th July 9:30pm PST?
> > > > > > > >
> > > > > > > > On Sat, 11 Jul 2020 at 14:32, Gurudatt Kulkarni <
> > > > guruak...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > If possible recoding of these sessions would be great, to
> > fill
> > > > the
> > > > > > > > timezone
> > > > > > > > > gap.
> > > > > > > > >
> > > > > > > > > On Friday, July 10, 2020, Pratyaksh Sharma <
> > > > pratyaks...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > @Vinoth Chandar  Time zones are
> indeed
> > > > > tricky.
> > > > > > > > Maybe
> > > > > > > > > we
> > > > > > > > > > can do a poll again to decide on the time for these
> > sessions
> > > > > given
> > > > > > > the
> > > > > > > > > > community size has increased much more now as compared to
> > > last
> > > > > time
> > > > > > > we
> > > > > > > > > > decided on weekly sync timings? This might help all the
> new
> > > > > members
> > > > > > > of
> > > > > > > > > our
> > > > > > > > > > community as well. :)
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 10, 2020 at 8:45 AM Adam Feldman <
> > > > > afeldm...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Yea, time zones are tough. That's midnight in EST and
> the
> > > > middle
> > > > > > of
> > > > > > > > the
> > > > > > > > > >> night if 

Re: DISCUSS code, config, design walk through sessions

2020-07-08 Thread Shiyan Xu
The time slot works for me, but I guess it may conflict with work hours in
other time zones. Maybe alternating morning and evening sessions in PST
would work better?

On Wed, Jul 8, 2020 at 9:07 PM Vinoth Chandar  wrote:

> Apologies. Should have been more detailed.
>
> It’s Tuesday. Please see here for details
>
> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+Community+Weekly+Sync
>
>
> On Wed, Jul 8, 2020 at 8:55 PM Adam Feldman  wrote:
>
> > Hi, what day will this be?
> >
> > On Tue, Jul 7, 2020, 17:25 Vinoth Chandar  wrote:
> >
> > > Thanks, everyone! There appears to be great interest. let's do it.
> > >
> > > In terms of timing, I was thinking if we can extend one of our existing
> > > community weekly sync meetings for this purpose.
> > > So, timing would be 930-11PM PST. Does that work for everyone here?
> > >
> > > On Mon, Jul 6, 2020 at 10:30 AM Shiyan Xu  >
> > > wrote:
> > >
> > > > +1
> > > >
> > > > On Mon, Jul 6, 2020 at 9:27 AM vbal...@apache.org <
> vbal...@apache.org>
> > > > wrote:
> > > >
> > > > >  +1.
> > > > > On Monday, July 6, 2020, 09:11:47 AM PDT, Bhavani Sudha <
> > > > > bhavanisud...@gmail.com> wrote:
> > > > >
> > > > >  +1 this is a great idea!
> > > > >
> > > > > On Mon, Jul 6, 2020 at 7:54 AM vino yang 
> > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > > Adam Feldman  wrote on Mon, Jul 6, 2020 at 9:55 PM:
> > > > > >
> > > > > > > Interested
> > > > > > >
> > > > > > > On Mon, Jul 6, 2020, 08:29 Sivabalan 
> wrote:
> > > > > > >
> > > > > > > > +1 for sure
> > > > > > > >
> > > > > > > > On Mon, Jul 6, 2020 at 4:42 AM Gurudatt Kulkarni <
> > > > > guruak...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > > Really a great idea. Will help in understanding the project
> > > > better.
> > > > > > > > >
> > > > > > > > > On Mon, Jul 6, 2020 at 1:35 PM Pratyaksh Sharma <
> > > > > > pratyaks...@gmail.com
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > This is a great idea and really helpful one.
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 6, 2020 at 1:09 PM 
> wrote:
> > > > > > > > > >
> > > > > > > > > > > +1
> > > > > > > > > > > It can also attract more partners to join us.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On 07/06/2020 15:34, Ranganath Tirumala wrote:
> > > > > > > > > > > +1
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 6 Jul 2020 at 16:59, David Sheard <
> > > > > > > > > > > david.she...@datarefactory.com.au>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Perfect
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 6 Jul. 2020, 1:30 pm Vinoth Chandar, <
> > > > > > vin...@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > As we scale the community, its important that more
> of
> > > us
> > > > > are
> > > > > > > able
> > > > > > > > > to
> > > > > > > > > > > help
> > > > > > > > > > > > > users, users becoming contributors.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the past, we have drafted faqs, troubleshooting
> > > > guides.
> > > > > > > But I
> > > > > > > > > > feel
> > > > > > > > > > > > > sometimes, more hands on walk through sessions over
> > > video
> > > > > > could
> > > > > > > > > help.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am happy to spend 2 hours each on code/configs,
> > > > > > > > > > > > design/perf/architecture.
> > > > > > > > > > > > > Have the session be recorded as well for future.
> > > > > > > > > > > > >
> > > > > > > > > > > > > What does everyone think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > Vinoth
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Ranganath Tirumala
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > > -Sivabalan
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Introduce a write committed callback hook

2020-06-21 Thread Shiyan Xu
+1. It is a great complement to the pull model; helpful for fan-out scenarios.
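
Purely as an illustration of the shape this could take (not an existing Hudi
API; the names here are made up):

// invoked only after the commit is durably completed on the timeline
trait WriteCommitCallback {
  def onCommitCompleted(tableName: String, instantTime: String): Unit
}

Implementations could then publish to Kafka, hit an HTTP endpoint, etc., and
be plugged in via a config property.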

On Sun, Jun 21, 2020 at 8:07 AM Bhavani Sudha 
wrote:

> +1 . I think this is a valid use case and would be useful in general.
>
> On Sun, Jun 21, 2020 at 7:11 AM Vinoth Chandar  wrote:
>
> > +1 as well
> >
> > > We expect to introduce a proactive notification(event callback)
> > mechanism. For example, a hook can be introduced after a successful
> commit.
> >
> > This would be very useful. We could write to a variety of event bus-es
> and
> > notify new data arrival.
> >
> > On Sat, Jun 20, 2020 at 2:51 AM wangxianghu  wrote:
> >
> > > +1 for this, I think this is a feature worth doing.
> > > Think about it in the field of offline computing: data changes happen
> > > hourly or daily. If there is no notification mechanism to inform the
> > > downstream, then the downstream tasks will keep running all day long,
> > > but the time actually spent processing data may be very short; this
> > > situation will surely waste resources.
> > > > On Jun 20, 2020, at 8:13 AM, vino yang  wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Currently, we have a need to incrementally process and build a new
> > table
> > > > based on an original hoodie table. We expect that after a new commit
> is
> > > > completed on the original hoodie table, it could be retrieved ASAP,
> so
> > > that
> > > > it can be used for incremental view queries. Based on the existing
> > > > capabilities, one approach we can use is to continuously poll
> Hoodie's
> > > > Timeline to check for new commits. This is a very common processing
> > way,
> > > > but it will cause unnecessary waste of resources.
> > > >
> > > > We expect to introduce a proactive notification(event callback)
> > > mechanism.
> > > > For example, a hook can be introduced after a successful commit.
> > External
> > > > processors interested in the commit, such as scheduling systems, can
> > use
> > > > the hook as their own trigger. When a certain commit is completed,
> the
> > > > scheduling system can pull up the task of obtaining incremental data
> > > > through the API in the callback. Thereby completing the processing of
> > > > incremental data.
> > > >
> > > > There is currently a `postCommit` method in Hudi's client module, and
> > > > the existing implementation is mainly used for compaction and cleanup
> > > > after commit. But it is triggered a little early: it does not run
> > > > after everything is processed, and we found that the commit may still
> > > > be rolled back due to an exception. We need to find a new location to
> > > > trigger this hook to ensure that the commit is deterministic.
> > > >
> > > > This is one of our scene requirements, and it will be a very useful
> > > feature
> > > > combined with the incremental query, it can make the incremental
> > > processing
> > > > more timely.
> > > >
> > > > We hope to hear what the community thinks of this proposal. Any
> > comments
> > > > and opinions are appreciated.
> > > >
> > > > Best,
> > > > Vino
> > >
> > >
> >
>


[DISCUSS] Make delete marker configurable?

2020-06-26 Thread Shiyan Xu
Hi all,

A small suggestion: as delta streamer relies on `_hoodie_is_deleted` to do
hard deletes, can we make it configurable? That is, users could specify any
boolean field as the delete marker, with `_hoodie_is_deleted` remaining the
default.
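
For example (the config key below is made up purely to illustrate the idea):

hoodie.deltastreamer.source.delete.marker.field=is_removed

where `is_removed` is whatever boolean column exists in the user's schema;
left unset, it would fall back to `_hoodie_is_deleted`.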

Regards,
Raymond


Re: Re:Re: [DISCUSS] Regarding nightly builds

2020-06-21 Thread Shiyan Xu
+1, very helpful for accelerating adoption.

On Sun, Jun 21, 2020 at 4:51 PM Sivabalan  wrote:

> +1
>
> On Sun, Jun 21, 2020 at 11:58 AM vbal...@apache.org 
> wrote:
>
> >  +1. It is a good idea to run hudi-test-suite on a daily basis with
> > expanded tests.
> > Balaji.V
> > On Sunday, June 21, 2020, 08:16:39 AM PDT, Trevor-zhang <
> > 957029...@qq.com> wrote:
> >
> >  +1 as well.
> >
> > -- Original Message --
> > From: "vino yang"  > Sent: Sunday, Jun 21, 2020 23:04
> > To: "dev"  > Subject: Re: [DISCUSS] Regarding nightly builds
> >
> >
> >
> > +1 as well,
> >
> Currently, I am waiting for hudi-test-suite to be merged into the master
> branch, so that whenever a new PR is merged into the master branch, the
> "hudi-test-suite" that is also on the master branch can be triggered on
> Azure Pipelines more easily.
> >
> > Sharing more information here:
> >
> Now, there is a repository for hudi-ci, which is used to try to connect
> > with Azure Pipeline. [1]
> >
> > And our reference sample is Flink Azure Pipeline [2].
> >
> > Best,
> > Vino
> >
> > [1]: https://github.com/apachehudi-ci
> > [2]:
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/2020/03/22/Migrating+Flink%27s+CI+Infrastructure+from+Travis+CI+to+Azure+Pipelines
> >
> > Vinoth Chandar  >
> >  Hi Sudha,
> > 
> >  Thanks for getting this kicked off.. +1 on a new nightly build
> > process..
> >  This will help us more easily make the bleeding edge testable..
> > 
> >  My initial thoughts here are
> > 
> >  - Figure out a way to get Azure Pipelines enabled for Hudi
> >  - Setup the nightly there (this will also help us transition off
> > travis
> >  slowly over time)
> >  - We can leverage the hudi-test-suite that nishith/vinoyang have
> been
> >  working on, add tons of more scenarios to test every night
> > 
> >  Knowing the software is stable on a daily basis and having warning
> > flags
> >  would help us make smoother releases as well.
> > 
> >  Others, please chime in as well..
> > 
> >  thanks
> >  vinoth
> > 
> > 
> > 
> > 
> > 
> >  On Thu, Jun 18, 2020 at 10:10 PM Bhavani Sudha <
> > bhavanisud...@gmail.com
> >  wrote:
> > 
> >   Hello all,
> >  
> >   Should we have nightly builds? That way we can point users to
> >   those builds for the latest features introduced, instead of being
> >   blocked on the next release. This also kind of gives early feedback
> >   on new features or fixes, if any further improvements are needed.
> >   Does anyone know if and how other Apache projects handle nightly builds?
> >  
> >   Thanks,
> >   Sudha
> >  
> > 
>
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Publishing benchmarks for releases

2020-06-21 Thread Shiyan Xu
+1 definitely useful info.

On Sun, Jun 21, 2020 at 4:56 PM Sivabalan  wrote:

> Hey folks,
> Is it a common practice to publish benchmarks for releases? I have put
> up an initial PR  to add jmh
> benchmark support to a couple of Hudi operations. If the community feels
> positive about publishing benchmarks, we can add support for more operations
> and for every release, we could publish some benchmark numbers.
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Make delete marker configurable?

2020-06-28 Thread Shiyan Xu
Thanks for the +1. Filed https://issues.apache.org/jira/browse/HUDI-1058

On Sat, Jun 27, 2020 at 11:34 PM Pratyaksh Sharma 
wrote:

> The suggestion looks good to me as well.
>
> On Sun, Jun 28, 2020 at 8:17 AM Sivabalan  wrote:
>
> > +1, I just left it as a todo for future patch when I worked on it.
> >
> > On Sat, Jun 27, 2020 at 8:32 PM Bhavani Sudha 
> > wrote:
> >
> > > Hi Raymond,
> > >
> > > I am trying to understand  the use case . Can you please provide more
> > > context on what problem this addresses ?
> > >
> > >
> > > Thanks,
> > > Sudha
> > >
> > > On Fri, Jun 26, 2020 at 9:02 PM Shiyan Xu  >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > A small suggestion: as delta streamer relies on `_hoodie_is_deleted`
> to
> > > do
> > > > hard delete, can we make it configurable? as in users can specify any
> > > > boolean field for delete marker and `_hoodie_is_deleted` remains as
> > > > default.
> > > >
> > > > Regards,
> > > > Raymond
> > > >
> > >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Make delete marker configurable?

2020-06-28 Thread Shiyan Xu
Hi Sudha, making the delete marker configurable gives users more flexibility
when processing delete events; they can designate any boolean field from
their own schema as the marker.
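
To make this concrete, here is a minimal sketch (run from spark-shell, with
illustrative table/field names) of issuing hard deletes with the current
default marker; a configurable marker, as proposed, would simply let any
other boolean column play the `_hoodie_is_deleted` role:

```scala
import org.apache.spark.sql.SaveMode

// Flag the rows to hard-delete using the current default marker field.
val deletes = spark.sql(
  "select uuid, partitionpath, ts, true as _hoodie_is_deleted from staged_deletes")

deletes.write.format("hudi").
  option("hoodie.table.name", "events").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(SaveMode.Append).
  save("/tmp/hudi/events")
```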

On Sat, Jun 27, 2020 at 5:32 PM Bhavani Sudha 
wrote:

> Hi Raymond,
>
> I am trying to understand  the use case . Can you please provide more
> context on what problem this addresses ?
>
>
> Thanks,
> Sudha
>
> On Fri, Jun 26, 2020 at 9:02 PM Shiyan Xu 
> wrote:
>
> > Hi all,
> >
> > A small suggestion: as delta streamer relies on `_hoodie_is_deleted` to
> do
> > hard delete, can we make it configurable? as in users can specify any
> > boolean field for delete marker and `_hoodie_is_deleted` remains as
> > default.
> >
> > Regards,
> > Raymond
> >
>


Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-10 Thread Shiyan Xu
Yes, Vinoth, it does go a bit too far to give this data first-class support.
A global error table can do the job easily. As we discussed yesterday,
parallel per-table error tables with an `_errors` suffix could also benefit
some scenarios, like different product teams managing their own tables, or a
B2B case where customers manage their own data. These would benefit from
good segregation of errors and other related data. Let me note down the
points in RFC-20 for further discussion. Thanks for the feedback!
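
Purely for illustration, a per-table errors layout could be consumed as
sketched below; the `_errors` path suffix and the column names are
hypothetical and subject to the RFC-20 discussion:

```scala
// Hypothetical layout: error records kept in a plain Hudi table that sits
// next to the data table, distinguished by an "_errors" suffix.
val basePath = "/warehouse/hudi/events"
val errors = spark.read.format("hudi").load(basePath + "_errors")

// e.g. summarize error volume per ingestion instant and error type
errors.groupBy("commit_time", "error_type").count().show(false)
```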

On Wed, Jun 3, 2020 at 9:31 PM Vinoth Chandar  wrote:

> Hi Raymond,
>
> I am not sure generalizing this to all metadata like - errors and metrics -
> would be a good idea. We can certainly implement logging errors to a common
> errors hudi table, with a certain schema. But these can be just regular
> “hudi” format tables.
>
> Unlike the timeline metadata, these are really external data, not related
> to a given table’ core functioning.. we don’t necessarily want to keep one
> error table per hudi table..
>
> Thoughts?
>
> On Tue, Jun 2, 2020 at 5:34 PM Shiyan Xu 
> wrote:
>
> > I also encountered use cases where I'd like to programmatically query
> > metadata.
> > +1 on the idea of format(“hudi-timeline”)
> >
> > I also feel that the metadata can be extended further to include more
> info
> > like, errors, metrics/write statistics, etc. Like the newly proposed
> error
> > handling, we could also store all metrics or write stats there too, and
> > relate them to the timeline actions.
> >
> > A potential use case could be, with all these info encapsulated within
> > metadata, we may be able to derive some insightful results (by check
> > against some benchmarks) and answer questions like: does table A need
> more
> > tuning? does table B exceed error budget?
> >
> > Programmatic query to these metadata can help manage many tables in
> > diagnosis and inspection. We may need different read formats like
> > format("hudi-errors") or format("hudi-metrics")
> >
> > Sorry this sidetracked from the original question..These are really rough
> > high-level thoughts, and may have sign of over-engineering. Would like to
> > hear some feedbacks. Thanks.
> >
> >
> >
> >
> > On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha  >
> > wrote:
> >
> > > Got it. I'll look into implementation choices for creating a new data
> > > source. Appreciate all the feedback.
> > >
> > > On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar 
> wrote:
> > >
> > > > >Is it to separate data and metadata access?
> > > > Correct. We already have modes for querying data using
> format("hudi").
> > I
> > > > feel it will get very confusing to mix data and metadata in the same
> > > > source.. for e.g a lot of options we support for data may not even
> make
> > > > sense for the TimelineRelation.
> > > >
> > > > >This class seems like a list of static methods, I'm not seeing where
> > > these
> > > > are accessed from
> > > > That's the public API for obtaining this information for Scala/Java
> > > Spark.
> > > > If you have a way of calling this from python through some bridge
> > without
> > > > painful bridges (e.g jython), might be a tactical solution that can
> > meet
> > > > your needs.
> > > >
> > > > On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha
> >  > > >
> > > > wrote:
> > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > > What is the advantage of doing
> > > > > spark.read.format(“hudi-timeline”).load(basepath) as opposed to
> doing
> > > new
> > > > > relation? Is it to separate data and metadata access?
> > > > >
> > > > > Are you looking for similar functionality as
> HoodieDatasourceHelpers?
> > > > > >
> > > > > This class seems like a list of static methods, I'm not seeing
> where
> > > > these
> > > > > are accessed from. But, I need a way to query metadata details
> easily
> > > > > in pyspark.
> > > > >
> > > > >
> > > > > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar 
> > > wrote:
> > > > >
> > > > > > Also please take a look at
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HUDI-309

Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-12 Thread Shiyan Xu
Yes, tickets linked.

On Thu, Jun 11, 2020 at 10:50 AM Vinoth Chandar  wrote:

> Thanks Raymond!
>
> yes.. we can make this a config and leave it to the user to decide if they
> want to use a global table for all their hudi tables (or) keep
> one error table for each hudi table..
>
> For this effort, does it make sense to  take a dependency on the
> multi-writer jira HUDI-944, that liwei filed?
>
> On Wed, Jun 10, 2020 at 7:49 PM Shiyan Xu 
> wrote:
>
> > Yes, Vinoth, it does go a bit too far with first class support on these
> > data.
> > A global error table can do the job easily. As we discussed yesterday,
> > parallel local error tables with `_errors` suffix could also benefit for
> > some scenarios, like different product teams manage their own tables or
> in
> > 2B case where customers manage their own data. These would prefer good
> > segregation on errors or other related data. Let me note down the points
> in
> > RFC-20 for further discussion. Thanks for the feedback!
> >
> > On Wed, Jun 3, 2020 at 9:31 PM Vinoth Chandar  wrote:
> >
> > > Hi Raymond,
> > >
> > > I am not sure generalizing this to all metadata like - errors and
> > metrics -
> > > would be a good idea. We can certainly implement logging errors to a
> > common
> > > errors hudi table, with a certain schema. But these can be just regular
> > > “hudi” format tables.
> > >
> > > Unlike the timeline metadata, these are really external data, not
> related
> > > to a given table’ core functioning.. we don’t necessarily want to keep
> > one
> > > error table per hudi table..
> > >
> > > Thoughts?
> > >
> > > On Tue, Jun 2, 2020 at 5:34 PM Shiyan Xu 
> > > wrote:
> > >
> > > > I also encountered use cases where I'd like to programmatically query
> > > > metadata.
> > > > +1 on the idea of format(“hudi-timeline”)
> > > >
> > > > I also feel that the metadata can be extended further to include more
> > > info
> > > > like, errors, metrics/write statistics, etc. Like the newly proposed
> > > error
> > > > handling, we could also store all metrics or write stats there too,
> and
> > > > relate them to the timeline actions.
> > > >
> > > > A potential use case could be, with all these info encapsulated
> within
> > > > metadata, we may be able to derive some insightful results (by check
> > > > against some benchmarks) and answer questions like: does table A need
> > > more
> > > > tuning? does table B exceed error budget?
> > > >
> > > > Programmatic query to these metadata can help manage many tables in
> > > > diagnosis and inspection. We may need different read formats like
> > > > format("hudi-errors") or format("hudi-metrics")
> > > >
> > > > Sorry this sidetracked from the original question..These are really
> > rough
> > > > high-level thoughts, and may have sign of over-engineering. Would
> like
> > to
> > > > hear some feedbacks. Thanks.
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha
> >  > > >
> > > > wrote:
> > > >
> > > > > Got it. I'll look into implementation choices for creating a new
> data
> > > > > source. Appreciate all the feedback.
> > > > >
> > > > > On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar 
> > > wrote:
> > > > >
> > > > > > >Is it to separate data and metadata access?
> > > > > > Correct. We already have modes for querying data using
> > > format("hudi").
> > > > I
> > > > > > feel it will get very confusing to mix data and metadata in the
> > same
> > > > > > source.. for e.g a lot of options we support for data may not
> even
> > > make
> > > > > > sense for the TimelineRelation.
> > > > > >
> > > > > > >This class seems like a list of static methods, I'm not seeing
> > where
> > > > > these
> > > > > > are accessed from
> > > > > > That's the public API for obtaining this information for
> Scala/Java
> > > > > Spark.
> > > > > > If you have a way of calling this from python through some bridge

Re: [VOTE] Release 0.5.3, release candidate #2

2020-06-12 Thread Shiyan Xu
+1 (non-binding)

Source compile ... ok
Local UT ... ok
Delta streamer run on EMR ... ok
Release label:emr-5.29.0
Hadoop distribution:Amazon 2.8.5
Applications:Spark 2.4.4, Hive 2.3.6, Tez 0.9.2, Presto 0.227
Upsert to COW table  ... ok
Hive sync ... ok
HiveQL select ... ok

On Fri, Jun 12, 2020 at 5:09 AM Yajun Luo  wrote:

> +1
>
>
>
> From: Sivabalan
> Date: 2020-06-11 05:57
> To: dev
> Subject: [VOTE] Release 0.5.3, release candidate #2
> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 0.5.3,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 001B66FA2B2543C151872CCC29A4FD82F1508833 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-0.5.3-rc2" [5],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
>
> Thanks,
> Release Manager
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12348256
>
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.5.3-rc2/
>
> [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
>
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1023/
>
> [5] https://github.com/apache/hudi/tree/release-0.5.3-rc2
>


Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-02 Thread Shiyan Xu
I also encountered use cases where I'd like to programmatically query
metadata.
+1 on the idea of format(“hudi-timeline”)

I also feel that the metadata can be extended further to include more info
like errors, metrics/write statistics, etc. Like the newly proposed error
handling, we could also store all metrics or write stats there too, and
relate them to the timeline actions.

A potential use case could be: with all this info encapsulated within
metadata, we may be able to derive insightful results (by checking against
some benchmarks) and answer questions like: does table A need more tuning?
Does table B exceed its error budget?

Programmatic query to these metadata can help manage many tables in
diagnosis and inspection. We may need different read formats like
format("hudi-errors") or format("hudi-metrics")

Sorry this sidetracked from the original question.. These are really rough
high-level thoughts and may show signs of over-engineering. Would like to
hear some feedback. Thanks.
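
To sketch what such a metadata source could look like to a Spark user (the
"hudi-timeline" format name comes from this thread; it is not implemented
yet, and the column names are illustrative only):

```scala
// Hypothetical usage of the proposed metadata datasource.
val basePath = "/path/to/hudi/table"
val timeline = spark.read.format("hudi-timeline").load(basePath)

timeline.filter("action = 'commit' and state = 'COMPLETED'").
  select("instant_time", "action", "state").
  orderBy("instant_time").
  show(false)
```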




On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha 
wrote:

> Got it. I'll look into implementation choices for creating a new data
> source. Appreciate all the feedback.
>
> On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar  wrote:
>
> > >Is it to separate data and metadata access?
> > Correct. We already have modes for querying data using format("hudi"). I
> > feel it will get very confusing to mix data and metadata in the same
> > source.. for e.g a lot of options we support for data may not even make
> > sense for the TimelineRelation.
> >
> > >This class seems like a list of static methods, I'm not seeing where
> these
> > are accessed from
> > That's the public API for obtaining this information for Scala/Java
> Spark.
> > If you have a way of calling this from python through some bridge without
> > painful bridges (e.g jython), might be a tactical solution that can meet
> > your needs.
> >
> > On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha  >
> > wrote:
> >
> > > Thanks for the feedback.
> > >
> > > What is the advantage of doing
> > > spark.read.format(“hudi-timeline”).load(basepath) as opposed to doing
> new
> > > relation? Is it to separate data and metadata access?
> > >
> > > Are you looking for similar functionality as HoodieDatasourceHelpers?
> > > >
> > > This class seems like a list of static methods, I'm not seeing where
> > these
> > > are accessed from. But, I need a way to query metadata details easily
> > > in pyspark.
> > >
> > >
> > > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar 
> wrote:
> > >
> > > > Also please take a look at
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/HUDI-309
> > > > .
> > > >
> > > > This was an effort to make the timeline more generalized for querying
> > > (for
> > > > a different purpose).. but good to revisit now..
> > > >
> > > > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org <
> > vbal...@apache.org>
> > > > wrote:
> > > >
> > > > >
> > > > > I strongly recommend using a separate datasource relation (option
> 1)
> > to
> > > > > query timeline. It is elegant and fits well with spark APIs.
> > > > > Thanks.Balaji.VOn Saturday, May 30, 2020, 01:18:45 PM PDT,
> Vinoth
> > > > > Chandar  wrote:
> > > > >
> > > > >  Hi satish,
> > > > >
> > > > > Are you looking for similar functionality as
> HoodieDatasourceHelpers?
> > > > >
> > > > > We have historically relied on cli to inspect the table, which does
> > not
> > > > > lend it self well to programmatic access.. overall in like option
> 1 -
> > > > > allowing the timeline to be queryable with a standard schema does
> > seem
> > > > way
> > > > > nicer.
> > > > >
> > > > > I am wondering though if we should introduce a new view. Instead we
> > can
> > > > use
> > > > > a different data source name -
> > > > > spark.read.format(“hudi-timeline”).load(basepath). We can start by
> > just
> > > > > allowing querying of active timeline and expand this to archive
> > > timeline?
> > > > >
> > > > > What do other Think?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha
> > > >  > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hello folks,
> > > > > >
> > > > > > We have a use case to incrementally generate data for hudi table
> > (say
> > > > > > 'table2')  by transforming data from other hudi table(say,
> table1).
> > > We
> > > > > want
> > > > > > to atomically store commit timestamps read from table1 into
> table2
> > > > commit
> > > > > > metadata.
> > > > > >
> > > > > > This is similar to how DeltaStreamer operates with kafka offsets.
> > > > > However,
> > > > > > DeltaStreamer is java code and can easily query kafka offset
> > > processed
> > > > by
> > > > > > creating metaclient for target table. We want to use pyspark and
> I
> > > > don't
> > > > > > see a 

Re: GitHub release display issue

2020-07-17 Thread Shiyan Xu
+1 to remove github releases

On Fri, Jul 17, 2020 at 6:44 AM Vinoth Chandar  wrote:

> Thanks for flagging this, Raymond!
>
> I think we can just remove the github releases or mark it as old. All
> apache releases are hosted on asf infrastructure.
>
> Anyone?
>
> On Mon, Jul 13, 2020 at 11:42 PM Shiyan Xu 
> wrote:
>
> > The new GitHub UI displays 0.4.7 as the latest, which is misleading.
> > Guess adding notes to the later releases could resolve it?
> > [image: Screen Shot 2020-07-13 at 11.37.16 PM.png]
> >
>


Re: Unit tests in hudi-client module fail due to SparkContext

2020-07-28 Thread Shiyan Xu
Sure... here it is
https://gist.github.com/xushiyan/db4d4067657abe6b8872ef12473b7087

On Tue, Jul 28, 2020 at 9:53 AM Vinoth Chandar  wrote:

> Unfortunately, mailing list does not support images . you could create a
> gist and paste link :)
>
> On Tue, Jul 28, 2020 at 9:49 AM Y Ethan Guo 
> wrote:
>
> > Thanks, Shiyan.  Interesting, maybe the memory config is the issue for
> me.
> > When I tried to run all tests, it fails.  But running a smaller set of
> > tests per run is fine for me.  I'll check that config in my settings.
> >
> > The discrepancy between the intellij test runner and CLI mvn runner may
> be
> > > affected via these settings
> >
> > Somehow I can't see the screenshots...
> >
> > Thanks,
> > - Ethan
> >
> > On Tue, Jul 28, 2020 at 9:32 AM Shiyan Xu 
> > wrote:
> >
> > > At least that's what master build is setting with...
> > >
> > > The discrepancy between the intellij test runner and CLI mvn runner may
> > be
> > > affected via these settings
> > > [image: Screen Shot 2020-07-28 at 9.29.20 AM.png]
> > > [image: Screen Shot 2020-07-28 at 9.29.27 AM.png]
> > >
> > > On Tue, Jul 28, 2020 at 9:19 AM Shiyan Xu  >
> > > wrote:
> > >
> > >> The maven surefire/failsafe plugin is configured with -Xmx2g here
> > >> <https://github.com/apache/hudi/blob/b2763f433b3efb92fdcc0e760a88a43eaa2e5be3/pom.xml#L124>,
> > >> which should be plentiful for all tests so far.
> > >> The OOM looks weird to me.. maybe try checking the maven log see if
> > >> -Xmx2g is indeed applied
> > >>
> > >> On Mon, Jul 27, 2020 at 11:11 PM Y Ethan Guo <
> ethan.guoyi...@gmail.com>
> > >> wrote:
> > >>
> > >>> I see.  I'll check the travis CI setup later.  I'm unblocked now for
> > >>> running the unit tests locally.
> > >>>
> > >>> Thanks,
> > >>> - Ethan
> > >>>
> > >>> On Mon, Jul 27, 2020 at 11:03 PM Vinoth Chandar 
> > >>> wrote:
> > >>>
> > >>> > So, i realized something. With the recent changes, functional tests
> > >>> retain
> > >>> > a single spark session for the entire test suite to speed things
> up.
> > >>> So
> > >>> > thats probably what you were hitting first, when running via IDE
> > >>> > Try following the .travis.yml profiles directly?
> > >>> >
> > >>> >
> > >>> > Not sure about the OOM. Have not seen this before.
> > >>> >
> > >>> > On Mon, Jul 27, 2020 at 11:00 PM Y Ethan Guo <
> > ethan.guoyi...@gmail.com
> > >>> >
> > >>> > wrote:
> > >>> >
> > >>> > > Thanks for the suggestion, Vinoth.
> > >>> > >
> > >>> > > I tried a few things on master:
> > >>> > > - "mvn clean package -DskipITs": It throws the following
> exception:
> > >>> > >
> > >>> > > [*ERROR*] *Tests **run: 3*, Failures: 0, *Errors: 1*, Skipped: 0,
> > >>> Time
> > >>> > > elapsed: 19.286 s* <<< FAILURE!* - in
> > >>> > > org.apache.hudi.table.action.rollback.
> > >>> > > *TestMergeOnReadRollbackActionExecutor*
> > >>> > >
> > >>> > > [*ERROR*]
> > >>> > >
> > >>> > >
> > >>> >
> > >>>
> >
> org.apache.hudi.table.action.rollback.TestMergeOnReadRollbackActionExecutor.testMergeOnReadRollbackActionExecutor(boolean)[2]
> > >>> > > Time elapsed: 4.977 s  <<< ERROR!
> > >>> > >
> > >>> > > org.apache.hudi.exception.HoodieRemoteException: ... failed to
> > >>> respond
> > >>> > >
> > >>> > > at
> > >>> > >
> > >>> > >
> > >>> >
> > >>>
> >
> org.apache.hudi.table.action.rollback.TestMergeOnReadRollbackActionExecutor.testMergeOnReadRollbackActionExecutor(TestMergeOnReadRollbackActionExecutor.java:79)
> > >>> > >
> > >>> > > Caused by: org.apache.http.NoHttpResponseException: ... failed to
> > >>> respond
> > >>> > >
> > >>> > > at

Re: Unit tests in hudi-client module fail due to SparkContext

2020-07-28 Thread Shiyan Xu
The maven surefire/failsafe plugin is configured with -Xmx2g here
<https://github.com/apache/hudi/blob/b2763f433b3efb92fdcc0e760a88a43eaa2e5be3/pom.xml#L124>,
which should be plenty for all tests so far.
The OOM looks weird to me.. maybe try checking the maven log to see if
-Xmx2g is indeed applied.
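
For context on the root error in this thread ("Only one SparkContext may be
running in this JVM"): the functional tests now keep a single session per
suite. A much-simplified sketch of that pattern follows (the real
FunctionalTestHarness does considerably more):

```scala
import org.apache.spark.sql.SparkSession

// Simplified sketch: one lazily-created SparkSession shared by all tests in
// the JVM, instead of each test class building (and leaking) its own context.
object SharedSparkSession {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("hudi-unit-tests")
    .getOrCreate()
}

// Tests then reuse it: val spark = SharedSparkSession.spark
```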

On Mon, Jul 27, 2020 at 11:11 PM Y Ethan Guo 
wrote:

> I see.  I'll check the travis CI setup later.  I'm unblocked now for
> running the unit tests locally.
>
> Thanks,
> - Ethan
>
> On Mon, Jul 27, 2020 at 11:03 PM Vinoth Chandar  wrote:
>
> > So, i realized something. With the recent changes, functional tests
> retain
> > a single spark session for the entire test suite to speed things up.  So
> > thats probably what you were hitting first, when running via IDE
> > Try following the .travis.yml profiles directly?
> >
> >
> > Not sure about the OOM. Have not seen this before.
> >
> > On Mon, Jul 27, 2020 at 11:00 PM Y Ethan Guo 
> > wrote:
> >
> > > Thanks for the suggestion, Vinoth.
> > >
> > > I tried a few things on master:
> > > - "mvn clean package -DskipITs": It throws the following exception:
> > >
> > > [*ERROR*] *Tests **run: 3*, Failures: 0, *Errors: 1*, Skipped: 0, Time
> > > elapsed: 19.286 s* <<< FAILURE!* - in
> > > org.apache.hudi.table.action.rollback.
> > > *TestMergeOnReadRollbackActionExecutor*
> > >
> > > [*ERROR*]
> > >
> > >
> >
> org.apache.hudi.table.action.rollback.TestMergeOnReadRollbackActionExecutor.testMergeOnReadRollbackActionExecutor(boolean)[2]
> > > Time elapsed: 4.977 s  <<< ERROR!
> > >
> > > org.apache.hudi.exception.HoodieRemoteException: ... failed to respond
> > >
> > > at
> > >
> > >
> >
> org.apache.hudi.table.action.rollback.TestMergeOnReadRollbackActionExecutor.testMergeOnReadRollbackActionExecutor(TestMergeOnReadRollbackActionExecutor.java:79)
> > >
> > > Caused by: org.apache.http.NoHttpResponseException: ... failed to
> respond
> > >
> > > at
> > >
> > >
> >
> org.apache.hudi.table.action.rollback.TestMergeOnReadRollbackActionExecutor.testMergeOnReadRollbackActionExecutor(TestMergeOnReadRollbackActionExecutor.java:79)
> > >
> > > [*ERROR*]
> > >
> > >
> >
> org.apache.hudi.index.TestHoodieIndex.testSimpleTagLocationAndUpdate(HoodieIndex$IndexType)
> > > Time elapsed: 1.182 s  <<< ERROR!
> > >
> > > org.apache.spark.SparkException:
> > >
> > > Only one SparkContext may be running in this JVM (see SPARK-2243). To
> > > ignore this error, set spark.driver.allowMultipleContexts = true. The
> > > currently running SparkContext was created at:
> > >
> > > - "mvn test -Punit-tests -pl hudi-client -B" from CI: It encounters
> > > "java.lang.OutOfMemoryError: GC overhead limit exceeded".
> > >
> > > What I do now is running tests under different packages in hudi-client
> > > manually in IntelliJ.  That seems to work for me.
> > >
> > > On Sun, Jul 26, 2020 at 5:18 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi Ethan,
> > > >
> > > > For purposes of unblocking yourself, can you try running them locally
> > via
> > > > mvn command via terminal?
> > > >
> > > > thanks
> > > > vinoth
> > > >
> > > > On Sun, Jul 26, 2020 at 4:12 PM Y Ethan Guo <
> ethan.guoyi...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm working on hudi-client module and I notice that if I run all
> unit
> > > > tests
> > > > > under hudi-client locally in IntelliJ, some tests (54 out of 256)
> are
> > > > > failing due to the following SparkException: "Only one SparkContext
> > may
> > > > be
> > > > > running in this JVM".  Is there any way I can get around this?
> > > > >
> > > > > org.apache.spark.SparkException: Only one SparkContext may be
> running
> > > in
> > > > > this JVM (see SPARK-2243). To ignore this error, set
> > > > > spark.driver.allowMultipleContexts = true. The currently running
> > > > > SparkContext was created at:
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hudi.testutils.FunctionalTestHarness.runBeforeEach(FunctionalTestHarness.java:132)
> > > > > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > >
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > >
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > java.lang.reflect.Method.invoke(Method.java:498)
> > > > >
> > > > >
> > > >
> > >
> >
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:688)
> > > > >
> > > > >
> > > >
> > >
> >
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> > > > >
> > > > >
> > > >
> > >
> >
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
> > > > >
> > > > >
> > > >
> > >
> >
> 

Re: [DISCUSS] Adding Metrics to Hudi Common

2020-07-28 Thread Shiyan Xu
Yeah, makes sense to keep module-specific metrics classes; e.g.,
DeltaStreamer metrics should just reside in hudi-utilities.
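
As a purely illustrative sketch, not the actual Hudi API: a thin facade in
hudi-common, built on the Dropwizard MetricRegistry that Hudi's metrics
reporters already use, could let each module register its own metrics on a
shared registry:

```scala
import com.codahale.metrics.MetricRegistry

// Hypothetical shared facade living in hudi-common; hudi-utilities,
// hudi-client, etc. would define their module-specific metrics on top of it.
object CommonMetrics {
  val registry = new MetricRegistry()

  // Time an arbitrary operation, e.g. a HoodieWrapperFileSystem call.
  def timed[T](name: String)(body: => T): T = {
    val ctx = registry.timer(name).time()
    try body finally ctx.stop()
  }
}

// Example: CommonMetrics.timed("fs.listStatus") { fs.listStatus(path) }
```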


On Tue, Jul 28, 2020 at 9:52 AM Vinoth Chandar  wrote:

> IMO having metrics within each module is probably more maintainable.
> the common metrics interfaces/base classes can just live in hudi-common for
> now?
>
> On Tue, Jul 28, 2020 at 9:06 AM Shiyan Xu 
> wrote:
>
> > +1. It would be very helpful to have more internal
> performance/cost-related
> > metrics (perhaps optionally enabled). Also it does make sense to move
> > metrics classes to common, or even to a separate module (if the scope
> gets
> > extended a lot further)
> >
> > On Tue, Jul 28, 2020 at 8:43 AM vbal...@apache.org 
> > wrote:
> >
> > >  +1. Would love to see observability metrics exposed for file system
> RPC
> > > calls. This would greatly help in figuring out RPC performance and
> > > bottlenecks across varied file-systems that Hudi supports.
> > > On Tuesday, July 28, 2020, 08:24:54 AM PDT, Nishith <
> > > n3.nas...@gmail.com> wrote:
> > >
> > >  +1
> > >
> > > Having the metrics flexibly in common will help in building
> observability
> > > in other modules.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > > On Jul 28, 2020, at 7:28 AM, Vinoth Chandar 
> wrote:
> > > >
> > > > +1 as well.
> > > >
> > > > Given we support many reporters now. Could you please further
> > > > improve/retain modularity.
> > > >
> > > >> On Mon, Jul 27, 2020 at 6:30 PM vino yang 
> > > wrote:
> > > >>
> > > >> Hi Modi,
> > > >>
> > > >> +1 for this proposal.
> > > >>
> > > >> I agree with your opinion that the metric report should not only
> > report
> > > the
> > > >> client's metrics.
> > > >>
> > > >> And we should decouple the implementation of metrics from the client
> > > module
> > > >> so that it could be developed independently.
> > > >>
> > > >> Best,
> > > >> Vino
> > > >>
> > > >> Abhishek Modi  于2020年7月28日周二 上午4:17写道:
> > > >>
> > > >>> Hi Everyone!
> > > >>>
> > > >>> I'm hoping to have a discussion around adding a lightweight metrics
> > > class
> > > >>> to Hudi Common. There are parts of Hudi Common that have large
> > > >> performance
> > > >>> implications, and I think adding metrics to these parts will help
> us
> > > >> track
> > > >>> Hudi's health in production and help us understand the performance
> > > >>> implications of changes we make.
> > > >>>
> > > >>> I've opened a Jira on this topic -
> > > >>> https://issues.apache.org/jira/browse/HUDI-1025. This jira
> > > >>> specifically suggests adding HoodieWrapperFileSystem as this class
> > has
> > > >>> performance implications not just for Hudi, but also for the
> > underlying
> > > >>> DFS.
> > > >>>
> > > >>> Looking forward to everyone's opinions on this :)
> > > >>>
> > > >>> Best,
> > > >>> Modi
> > > >>>
> > > >>
> >
>


Re: [DISCUSS] Adding Metrics to Hudi Common

2020-07-28 Thread Shiyan Xu
+1. It would be very helpful to have more internal performance/cost-related
metrics (perhaps optionally enabled). Also it does make sense to move
metrics classes to common, or even to a separate module (if the scope gets
extended a lot further)

On Tue, Jul 28, 2020 at 8:43 AM vbal...@apache.org 
wrote:

>  +1. Would love to see observability metrics exposed for file system RPC
> calls. This would greatly help in figuring out RPC performance and
> bottlenecks across varied file-systems that Hudi supports.
> On Tuesday, July 28, 2020, 08:24:54 AM PDT, Nishith <
> n3.nas...@gmail.com> wrote:
>
>  +1
>
> Having the metrics flexibly in common will help in building observability
> in other modules.
>
> Thanks,
> Nishith
>
> > On Jul 28, 2020, at 7:28 AM, Vinoth Chandar  wrote:
> >
> > +1 as well.
> >
> > Given we support many reporters now. Could you please further
> > improve/retain modularity.
> >
> >> On Mon, Jul 27, 2020 at 6:30 PM vino yang 
> wrote:
> >>
> >> Hi Modi,
> >>
> >> +1 for this proposal.
> >>
> >> I agree with your opinion that the metric report should not only report
> the
> >> client's metrics.
> >>
> >> And we should decouple the implementation of metrics from the client
> module
> >> so that it could be developed independently.
> >>
> >> Best,
> >> Vino
> >>
> >> Abhishek Modi  于2020年7月28日周二 上午4:17写道:
> >>
> >>> Hi Everyone!
> >>>
> >>> I'm hoping to have a discussion around adding a lightweight metrics
> class
> >>> to Hudi Common. There are parts of Hudi Common that have large
> >> performance
> >>> implications, and I think adding metrics to these parts will help us
> >> track
> >>> Hudi's health in production and help us understand the performance
> >>> implications of changes we make.
> >>>
> >>> I've opened a Jira on this topic -
> >>> https://issues.apache.org/jira/browse/HUDI-1025. This jira
> >>> specifically suggests adding HoodieWrapperFileSystem as this class has
> >>> performance implications not just for Hudi, but also for the underlying
> >>> DFS.
> >>>
> >>> Looking forward to everyone's opinions on this :)
> >>>
> >>> Best,
> >>> Modi
> >>>
> >>


Re: 0.11.0 release timeline

2022-03-27 Thread Shiyan Xu
Hi All, just a reminder on the timeline, as discussed earlier:

- Mar 31 00:00 PST : feature freeze - new features/functionalities won't be
merged to master (3 days from now)
- Apr 03 00:00 PST : cut release branch and start RC voting/testing (6 days
from now)

Thank you.


On Wed, Mar 23, 2022 at 4:42 AM Vinoth Chandar  wrote:

> +1 from me, as long as we don’t push it out more.
>
> On Tue, Mar 22, 2022 at 12:29 Raymond Xu 
> wrote:
>
> > Ok Vinoth, thanks for highlighting this. BigQuery integration is an
> > important feature to add to 0.11.0. I also see some other inflight work
> > from the backlog. To accommodate this and other inflight issues / testing
> > activities, I suggest making a one-time adjustment: pushing 7 days from
> the
> > original timeline, as in Mar 31 feature freeze and Apr 3 for cutting
> > release branch.
> >
> > Sounds good to the community here?
> >
> > --
> > Best,
> > Raymond
> >
> >
> > On Tue, Mar 22, 2022 at 12:50 PM Vinoth Govindarajan <
> > vinoth.govindara...@gmail.com> wrote:
> >
> > > Hi Raymond,
> > > I'm working on the Hudi <> BigQuery Integration RFC-34, I'm trying to
> > wrap
> > > up everything and send out the PR before the end of this week, I was
> > > wondering would it be possible to include this PR as part of the 0.11.0
> > > release, let me know what I need to do to make this part of next
> release,
> > > thanks!
> > >
> > > Best,
> > > Vinoth
> > >
> > >
> > > On Fri, Mar 18, 2022 at 10:58 PM Raymond Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > As we're approaching late March when we intended to release 0.11.0,
> I'd
> > > > like to call out this timeline
> > > >
> > > > - Mar 24 00:00 PST : feature freeze - new features/functionalities
> > won't
> > > be
> > > > merged to master (6 days from now)
> > > > - Mar 27 00:00 PST : cut release branch and start RC voting/testing
> (9
> > > days
> > > > from now)
> > > >
> > > > Please kindly highlight any concerns. Thank you.
> > > >
> > > > Best,
> > > > Raymond
> > > >
> > >
> >
>


-- 
Best,
Shiyan


Re: Permission to contribute

2022-04-03 Thread Shiyan Xu
Done and welcome!

On Sat, Apr 2, 2022 at 8:37 PM wulingqi  wrote:

> Hi Team ,
>
> I want to contribute to Apache Hudi. Would you please give me the
> contributor permission? My JIRA username is  KnightChess
>
> Thanks
> Lingqi
>
>

-- 
Best,
Shiyan


Re: [ANNOUNCE] New Apache Hudi Committer - Zhaojing Yu

2022-03-25 Thread Shiyan Xu
Congrats!

On Fri, Mar 25, 2022 at 1:40 PM Danny Chan  wrote:

> Hi everyone,
>
> On behalf of the PMC, I'm very happy to announce Zhaojing Yu as a new
> Hudi committer.
>
> Zhaojing is very active in Flink Hudi contributions, many cool
> features such as the flink streaming bootstrap, compaction service and
> all kinds of writing modes are contributed by him. He also fixed many
> critical bugs from the Flink side.
>
> Besides that, Zhaojing is also active in use case publicity of Hudi in
> China, he is very active in answering user questions in our Dingtalk
> group. Now he is working in Bytedance for pushing forward the Volcanic
> cloud service Hudi products !
>
> Please join me in congratulating Zhaojing for becoming a Hudi committer!
>
> Cheers,
> Danny
>


-- 
--
Best,
Shiyan


Re: 0.11.0 release timeline

2022-04-05 Thread Shiyan Xu
Hi all,

Apologies about the delays, with the last few blockers just landed, we are
now starting the RC process.

I started by following the release guide here
https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide

As there is some additional setup needed on my end, please expect RC1 to be
ready
the next day. In the meantime, we'll continue landing bug fixes.

Thanks.

On Mon, Mar 28, 2022 at 2:25 AM Shiyan Xu 
wrote:

> Hi All, just a reminder on the timeline, as discussed earlier:
>
> - Mar 31 00:00 PST : feature freeze - new features/functionalities won't
> be merged to master (3 days from now)
> - Apr 03 00:00 PST : cut release branch and start RC voting/testing (6
> days from now)
>
> Thank you.
>
>
> On Wed, Mar 23, 2022 at 4:42 AM Vinoth Chandar  wrote:
>
>> +1 from me, as long as we don’t push it out more.
>>
>> On Tue, Mar 22, 2022 at 12:29 Raymond Xu 
>> wrote:
>>
>> > Ok Vinoth, thanks for highlighting this. BigQuery integration is an
>> > important feature to add to 0.11.0. I also see some other inflight work
>> > from the backlog. To accommodate this and other inflight issues /
>> testing
>> > activities, I suggest making a one-time adjustment: pushing 7 days from
>> the
>> > original timeline, as in Mar 31 feature freeze and Apr 3 for cutting
>> > release branch.
>> >
>> > Sounds good to the community here?
>> >
>> > --
>> > Best,
>> > Raymond
>> >
>> >
>> > On Tue, Mar 22, 2022 at 12:50 PM Vinoth Govindarajan <
>> > vinoth.govindara...@gmail.com> wrote:
>> >
>> > > Hi Raymond,
>> > > I'm working on the Hudi <> BigQuery Integration RFC-34, I'm trying to
>> > wrap
>> > > up everything and send out the PR before the end of this week, I was
>> > > wondering would it be possible to include this PR as part of the
>> 0.11.0
>> > > release, let me know what I need to do to make this part of next
>> release,
>> > > thanks!
>> > >
>> > > Best,
>> > > Vinoth
>> > >
>> > >
>> > > On Fri, Mar 18, 2022 at 10:58 PM Raymond Xu <
>> xu.shiyan.raym...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > As we're approaching late March when we intended to release 0.11.0,
>> I'd
>> > > > like to call out this timeline
>> > > >
>> > > > - Mar 24 00:00 PST : feature freeze - new features/functionalities
>> > won't
>> > > be
>> > > > merged to master (6 days from now)
>> > > > - Mar 27 00:00 PST : cut release branch and start RC voting/testing
>> (9
>> > > days
>> > > > from now)
>> > > >
>> > > > Please kindly highlight any concerns. Thank you.
>> > > >
>> > > > Best,
>> > > > Raymond
>> > > >
>> > >
>> >
>>
>
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


Re: contributor permission

2022-04-11 Thread Shiyan Xu
Done. welcome!

On Mon, Apr 11, 2022 at 7:39 PM 金鱼缸底的秘密 <1715123...@qq.com.invalid> wrote:

> Hi,
>
>
> I want to contribute to Apache Hudi.
> Would you please give me the contributor permission?
>
>
> My JIRA ID is zyp.
>
> My email is1715123...@qq.com.
>
> 张一鹏
> Tel:18162317322
> Email:1715123...@qq.com
>
>
>
>
> 



-- 
Best,
Shiyan


Re: 0.11.0 release timeline

2022-04-06 Thread Shiyan Xu
Hi all,

Status update: the 0.11.0 release branch was cut and 0.11.0-rc1 was tagged
https://github.com/apache/hudi/tree/release-0.11.0-rc1

Please feel free to start testing activities using this tag branch.

Meanwhile, I'll continue the process and soon upload artifacts to
dist.apache.org and then send a separate voting email for RC1.

Thanks.


On Wed, Apr 6, 2022 at 2:20 AM Shiyan Xu 
wrote:

> Hi all,
>
> Apologies about the delays, with the last few blockers just landed, we are
> now starting the RC process.
>
> I started by following the release guide here
>
> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide
>
> As there are some additional setup for me, please expect RC1 to be ready
> the next day. In the meantime, we'll continue landing bug fixes.
>
> Thanks.
>
> On Mon, Mar 28, 2022 at 2:25 AM Shiyan Xu 
> wrote:
>
>> Hi All, just a reminder on the timeline, as discussed earlier:
>>
>> - Mar 31 00:00 PST : feature freeze - new features/functionalities won't
>> be merged to master (3 days from now)
>> - Apr 03 00:00 PST : cut release branch and start RC voting/testing (6
>> days from now)
>>
>> Thank you.
>>
>>
>> On Wed, Mar 23, 2022 at 4:42 AM Vinoth Chandar  wrote:
>>
>>> +1 from me, as long as we don’t push it out more.
>>>
>>> On Tue, Mar 22, 2022 at 12:29 Raymond Xu 
>>> wrote:
>>>
>>> > Ok Vinoth, thanks for highlighting this. BigQuery integration is an
>>> > important feature to add to 0.11.0. I also see some other inflight work
>>> > from the backlog. To accommodate this and other inflight issues /
>>> testing
>>> > activities, I suggest making a one-time adjustment: pushing 7 days
>>> from the
>>> > original timeline, as in Mar 31 feature freeze and Apr 3 for cutting
>>> > release branch.
>>> >
>>> > Sounds good to the community here?
>>> >
>>> > --
>>> > Best,
>>> > Raymond
>>> >
>>> >
>>> > On Tue, Mar 22, 2022 at 12:50 PM Vinoth Govindarajan <
>>> > vinoth.govindara...@gmail.com> wrote:
>>> >
>>> > > Hi Raymond,
>>> > > I'm working on the Hudi <> BigQuery Integration RFC-34, I'm trying to
>>> > wrap
>>> > > up everything and send out the PR before the end of this week, I was
>>> > > wondering would it be possible to include this PR as part of the
>>> 0.11.0
>>> > > release, let me know what I need to do to make this part of next
>>> release,
>>> > > thanks!
>>> > >
>>> > > Best,
>>> > > Vinoth
>>> > >
>>> > >
>>> > > On Fri, Mar 18, 2022 at 10:58 PM Raymond Xu <
>>> xu.shiyan.raym...@gmail.com
>>> > >
>>> > > wrote:
>>> > >
>>> > > > Hi all,
>>> > > >
>>> > > > As we're approaching late March when we intended to release
>>> 0.11.0, I'd
>>> > > > like to call out this timeline
>>> > > >
>>> > > > - Mar 24 00:00 PST : feature freeze - new features/functionalities
>>> > won't
>>> > > be
>>> > > > merged to master (6 days from now)
>>> > > > - Mar 27 00:00 PST : cut release branch and start RC
>>> voting/testing (9
>>> > > days
>>> > > > from now)
>>> > > >
>>> > > > Please kindly highlight any concerns. Thank you.
>>> > > >
>>> > > > Best,
>>> > > > Raymond
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>> --
>> Best,
>> Shiyan
>>
>
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


Re: 0.11.0 release timeline

2022-04-06 Thread Shiyan Xu
Hi all,

Status update:

[VOTE] email for RC1 was sent. Please continue the testing activities using the
artifacts
<https://repository.apache.org/content/repositories/orgapachehudi-1056/> or
build from the 0.11.0-rc1 tag branch
<https://github.com/apache/hudi/releases/tag/release-0.11.0-rc1>.

Please report to the [VOTE] email by casting votes and giving feedback or
test results.

We would love to have as much feedback as possible to help stabilize the
RC. Please don't hesitate to test even if there is a -1 vote.

Thank you for your cooperation.


On Wed, Apr 6, 2022 at 5:46 PM Shiyan Xu 
wrote:

> Hi all,
>
> Status update: the 0.11.0 release branch was cut and 0.11.0-rc1 was tagged
> https://github.com/apache/hudi/tree/release-0.11.0-rc1
>
> Please feel free to start testing activities using this tag branch.
>
> Meanwhile, I'll continue the process and soon upload artifacts to
> dist.apache.org and then send a separate voting email for RC1.
>
> Thanks.
>
>
> On Wed, Apr 6, 2022 at 2:20 AM Shiyan Xu 
> wrote:
>
>> Hi all,
>>
>> Apologies about the delays, with the last few blockers just landed, we
>> are now starting the RC process.
>>
>> I started by following the release guide here
>>
>> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+-+Release+Guide
>>
>> As there are some additional setup for me, please expect RC1 to be ready
>> the next day. In the meantime, we'll continue landing bug fixes.
>>
>> Thanks.
>>
>> On Mon, Mar 28, 2022 at 2:25 AM Shiyan Xu 
>> wrote:
>>
>>> Hi All, just a reminder on the timeline, as discussed earlier:
>>>
>>> - Mar 31 00:00 PST : feature freeze - new features/functionalities won't
>>> be merged to master (3 days from now)
>>> - Apr 03 00:00 PST : cut release branch and start RC voting/testing (6
>>> days from now)
>>>
>>> Thank you.
>>>
>>>
>>> On Wed, Mar 23, 2022 at 4:42 AM Vinoth Chandar 
>>> wrote:
>>>
>>>> +1 from me, as long as we don’t push it out more.
>>>>
>>>> On Tue, Mar 22, 2022 at 12:29 Raymond Xu 
>>>> wrote:
>>>>
>>>> > Ok Vinoth, thanks for highlighting this. BigQuery integration is an
>>>> > important feature to add to 0.11.0. I also see some other inflight
>>>> work
>>>> > from the backlog. To accommodate this and other inflight issues /
>>>> testing
>>>> > activities, I suggest making a one-time adjustment: pushing 7 days
>>>> from the
>>>> > original timeline, as in Mar 31 feature freeze and Apr 3 for cutting
>>>> > release branch.
>>>> >
>>>> > Sounds good to the community here?
>>>> >
>>>> > --
>>>> > Best,
>>>> > Raymond
>>>> >
>>>> >
>>>> > On Tue, Mar 22, 2022 at 12:50 PM Vinoth Govindarajan <
>>>> > vinoth.govindara...@gmail.com> wrote:
>>>> >
>>>> > > Hi Raymond,
>>>> > > I'm working on the Hudi <> BigQuery Integration RFC-34, I'm trying
>>>> to
>>>> > wrap
>>>> > > up everything and send out the PR before the end of this week, I was
>>>> > > wondering would it be possible to include this PR as part of the
>>>> 0.11.0
>>>> > > release, let me know what I need to do to make this part of next
>>>> release,
>>>> > > thanks!
>>>> > >
>>>> > > Best,
>>>> > > Vinoth
>>>> > >
>>>> > >
>>>> > > On Fri, Mar 18, 2022 at 10:58 PM Raymond Xu <
>>>> xu.shiyan.raym...@gmail.com
>>>> > >
>>>> > > wrote:
>>>> > >
>>>> > > > Hi all,
>>>> > > >
>>>> > > > As we're approaching late March when we intended to release
>>>> 0.11.0, I'd
>>>> > > > like to call out this timeline
>>>> > > >
>>>> > > > - Mar 24 00:00 PST : feature freeze - new features/functionalities
>>>> > won't
>>>> > > be
>>>> > > > merged to master (6 days from now)
>>>> > > > - Mar 27 00:00 PST : cut release branch and start RC
>>>> voting/testing (9
>>>> > > days
>>>> > > > from now)
>>>> > > >
>>>> > > > Please kindly highlight any concerns. Thank you.
>>>> > > >
>>>> > > > Best,
>>>> > > > Raymond
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>> --
>>> Best,
>>> Shiyan
>>>
>>
>>
>> --
>> Best,
>> Shiyan
>>
>
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


[VOTE] Release 0.11.0, release candidate #1

2022-04-06 Thread Shiyan Xu
Hi everyone,

Please review and vote on the release candidate #1 for the version 0.11.0,
as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "0.11.0-rc1" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Release Manager



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc1/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1056/

[5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc1


[VOTE] Release 0.11.0, release candidate #2

2022-04-15 Thread Shiyan Xu
Hi everyone,

Please review and vote on the release candidate #2 for the version 0.11.0,
as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "0.11.0-rc2" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Release Manager



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc2/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1060/

[5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc2


Re: [VOTE] Release 0.11.0, release candidate #1

2022-04-12 Thread Shiyan Xu
Based on the feedback, we will cancel RC1 and I'll start preparing RC2
within 1 day. Thank you all.

On Mon, Apr 11, 2022 at 6:07 AM Y Ethan Guo 
wrote:

> -1
>
> During my testing of 0.11.0 RC1 with Deltastreamer, errors have come up due
> to the issues Siva mentioned.
>
> On Sun, Apr 10, 2022 at 11:03 AM Shiyan Xu 
> wrote:
>
> > -1
> >
> > Rat plugin in CI was not working for some time and resulted in some files
> > missing Apache license header. This was fixed in master
> >
> >
> https://github.com/apache/hudi/commit/5e65aefc61e973a959c11f08c2aa9f19fefc09fc
> >
> >
> >
> > On Sat, Apr 9, 2022 at 9:01 PM Sivabalan  wrote:
> >
> > > -1
> > >
> > > a. release validation script failed due to presence of non-text files.
> > > b. We identified a few bugs that are blocker and needs to be added to
> RC.
> > > One of the fixes have already been landed, but two are under review.
> > >
> > > https://issues.apache.org/jira/browse/HUDI-3636 : With deltastreamer
> and
> > > async table services, timeline server is not managed well on
> > re-initiating
> > > write clients. This will fail the pipeline from time to time and thwart
> > > progress.
> > > https://issues.apache.org/jira/browse/HUDI-3807: We are adding new
> > > partitions to MDT in 0.11, and added a new bloom filter index based out
> > of
> > > it. But its not guarded by a flag and by default gets enabled.
> > > https://issues.apache.org/jira/browse/HUDI-3454: Partition name
> parsing
> > > had
> > > bugs with LogRecordScanner in some of the code paths.
> > >
> > > Appendix
> > > a. release validation failures
> > >
> > > ./release/validate_staged_release.sh --release=0.11.0 --rc_num=1
> > > Validation release 0.11.0 for RC 1 and release type dev
> > > /tmp/validation_scratch_dir_001
> > > ~/Documents/personal/projects/apache_hudi_dec/hudi/scripts
> > > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > > Validating hudi-0.11.0-rc1 with release type "dev"
> > > Checking Checksum of Source Release
> > > Checksum Check of Source Release - [OK]
> > >
> > >   % Total% Received % Xferd  Average Speed   TimeTime Time
> > > Current
> > >  Dload  Upload   Total   SpentLeft
> > > Speed
> > > 100 55975  100 559750 0   162k  0 --:--:-- --:--:--
> --:--:--
> > > 162k
> > > Checking Signature
> > > Signature Check - [OK]
> > >
> > > Checking for binary files in source release
> > > There were non-text files in source release. Please check below
> > >
> > > ./docker/push_to_docker_hub.png: image/png; charset=binary
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, 6 Apr 2022 at 15:37, Shiyan Xu 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > > 0.11.0,
> > > > as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > >
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > >
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1],
> > > >
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],
> > > >
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > >
> > > > * source code tag "0.11.0-rc1" [5],
> > > >
> > > >
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Release Manager
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673
> > > >
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc1/
> > > >
> > > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > > >
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapachehudi-1056/
> > > >
> > > > [5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc1
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


-- 
Best,
Shiyan


Re: [VOTE] Release 0.14.0, release candidate #3

2023-09-25 Thread Shiyan Xu
+1 (binding)

- Ran some sanity tests for spark 3.4

On Fri, Sep 22, 2023 at 3:42 PM Shawn Chang  wrote:

> +1 (non-binding)
>
> - Ran integration tests with Hudi 0.14.0-rc3 jars on the latest EMR cluster
>
> On Fri, Sep 22, 2023 at 12:13 PM Hussein Awala  wrote:
>
> > +1 (non-binding) ran some Spark3 jobs to create and read COW tables.
> >
> > On Fri 22 Sep 2023 at 21:08, Nishith  wrote:
> >
> > > +1 (binding)
> > >
> > > - Compile [OK]
> > > - Checksum [OK]
> > > - Validation Script [OK]
> > >
> > > -Nishith
> > >
> > > > On Sep 22, 2023, at 11:54 AM, Udit Mehrotra 
> wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > - Build with Spark 3 [OK]
> > > > - Release validation script [OK]
> > > > - Ran quickstart with Spark 3.4.1 [OK]
> > > >
> > > > Best,
> > > > Udit
> > > >
> > > >> On Fri, Sep 22, 2023 at 11:39 AM Balaji Varadarajan
> > > >>  wrote:
> > > >>
> > > >> +1 (binding)
> > > >> Ran validate stage testChecking Checksum of Source Release
> > > >>  Checksum Check of Source Release - [OK]
> > > >>
> > > >>
> > > >>
> > > >> Checking Signature
> > > >>
> > > >>  Signature Check - [OK]
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Checking for binary files in the source files
> > > >>
> > > >>  No Binary Files in the source files? - [OK]
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Checking for DISCLAIMER
> > > >>
> > > >>  DISCLAIMER file exists ? [OK]
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Checking for LICENSE and NOTICE
> > > >>
> > > >>  License file exists ? [OK]
> > > >>
> > > >>  Notice file exists ? [OK]
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Performing custom Licensing Check
> > > >>
> > > >>  Licensing Check Passed [OK]
> > > >>
> > > >>
> > > >>
> > > >> Running RAT Check.   RAT Check Passed [OK]
> > > >>
> > > >>On Friday, September 22, 2023 at 11:33:54 AM PDT, Amrish Lal <
> > > amrish.k@gmail.com> wrote:
> > > >>
> > > >> +1 (non binding)
> > > >>
> > > >> - Sanity tests using COW/MOR table to create, update, delete, and
> > query
> > > >> records.
> > > >> - Tested use of RLI in snapshot, realtime, time-travel, and
> > incremental
> > > >> queries.
> > > >> - Overall OK, except that use of RLI should be disabled for
> > > >> time-travel (HUDI-6886) and snapshot queries (HUDI-6891)
> > > >>
> > > >>> On Fri, Sep 22, 2023 at 11:26 AM Y Ethan Guo 
> > wrote:
> > > >>>
> > > >>> +1 (binding)
> > > >>> - Ran validate_staged_release.sh [OK]
> > > >>> - Hudi (Delta)streamer with error injection [OK]
> > > >>> - Bundle validation
> > > https://github.com/apache/hudi/actions/runs/6277569953
> > > >>> [OK]
> > > >>>
> > > >>> - Ethan
> > > >>>
> > > >>>
> > >  On Fri, Sep 22, 2023 at 10:29 AM Jonathan Vexler  >
> > > wrote:
> > > >>>
> > >  +1 (non-binding)
> > >  - Tested Spark Datasource and Spark Sql core flow tests
> > >  -Tested reading from bootstrap tables
> > > 
> > > 
> > >  On Fri, Sep 22, 2023 at 12:39 PM sagar sumit 
> > > wrote:
> > > 
> > > > +1 (non-binding)
> > > >
> > > > - Long-running deltastreamer [OK]
> > > > - Hive metastore sync [OK]
> > > > - Query using Presto and Trino [OK]
> > > >
> > > > Regards,
> > > > Sagar
> > > >
> > > > On Fri, Sep 22, 2023 at 9:53 PM Aditya Goenka <
> adi...@onehouse.ai>
> > >  wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >> - Tested Spark Sql workflows , delta streamer , spark structured
> > > > streaming
> > > >> for both types of tables with and without record key.
> > > >> - Meta Sync tests
> > > >> - Tests for data-skipping with both Column stats and RLI.
> > > >>
> > > >> On Fri, Sep 22, 2023 at 9:38 PM Vinoth Chandar <
> vin...@apache.org
> > >
> > > > wrote:
> > > >>
> > > >>> +1 (binding)
> > > >>>
> > > >>>
> > > >>>   - Ran rc checks on RC2 only, but nothing has seemed to
> change.
> > > >>>   - Tested Spark Datasource/SQL flows around new features like
> > > >>> auto
> > > > key
> > > >>>   generation. This is a simpler SQL experience.
> > > >>>
> > > >>>   Thanks to all the contributors !
> > > >>>
> > > >>>
> > > >>> On Tue, Sep 19, 2023 at 11:56 AM Prashant Wason
> > > >  > > >>>
> > > >>> wrote:
> > > >>>
> > >  Hi everyone,
> > > 
> > >  Please review and vote on the *release candidate #3* for the
> > >  version
> > >  0.14.0, as follows:
> > > 
> > >  [ ] +1, Approve the release
> > > 
> > >  [ ] -1, Do not approve the release (please provide specific
> > >  comments)
> > > 
> > > 
> > > 
> > >  The complete staging area is available for your review, which
> > > > includes:
> > > 
> > >  * JIRA release notes [1],
> > > 
> > >  * the 

Re: [VOTE] Release 0.11.0, release candidate #2

2022-04-22 Thread Shiyan Xu
Hi Cheng, thanks for raising this point. It is known that some module jars
are not suffixed with the Scala binary version, and they are not intended
for direct use. We make sure the bundle jars, which serve as the user-facing
artifacts, are built and suffixed properly. We might align this across all
modules in the future.
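
For example, an sbt user would depend on a bundle (which carries the Scala
suffix) rather than on an internal module like hudi-spark3-common; the
coordinates below are illustrative, so check Maven Central for the bundle
matching your Spark/Scala versions:

```scala
// Depend on the user-facing bundle, which carries the Scala suffix,
// instead of internal modules such as hudi-spark3-common.
libraryDependencies += "org.apache.hudi" % "hudi-spark3.2-bundle_2.12" % "0.11.0"
```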

On Wed, Apr 20, 2022 at 11:02 AM Cheng Pan  wrote:

> Hi,
>
> I notice that hudi-spark3-common should be scala dependent but does
> not have a scala binary version suffix like `_2.12`, which may be a
> potential issue.
>
> Thanks,
> Cheng Pan
>
> On Tue, Apr 19, 2022 at 4:15 PM Shiyan Xu 
> wrote:
> >
> > The above fixes are critical. Moving to RC3 now. Thanks everyone.
> >
> > On Tue, Apr 19, 2022 at 10:03 AM Alexey Kudinkin 
> wrote:
> >
> > > -1
> > >
> > > Found pretty substantial perf degradation in 0.11 RC2 as compared to
> > > vanilla Parquet table in Spark (which is being addressed currently).
> > > More details could be found HUDI-3891
> > > <https://issues.apache.org/jira/browse/HUDI-3891>
> > >
> > > On Mon, Apr 18, 2022 at 4:31 PM Y Ethan Guo 
> > > wrote:
> > >
> > > > -1
> > > > The Kafka Connect Sink for Hudi cannot ingest data using
> > > > hudi-kafka-connect-bundle from 0.11.0-rc2 due to
> NoClassDefFoundError.
> > > The
> > > > following fix is put up.
> > > > https://github.com/apache/hudi/pull/5353
> > > >
> > > > Best,
> > > > - Ethan
> > > >
> > > > On Fri, Apr 15, 2022 at 5:20 AM Shiyan Xu <
> xu.shiyan.raym...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Please review and vote on the release candidate #2 for the version
> > > > 0.11.0,
> > > > > as follows:
> > > > >
> > > > > [ ] +1, Approve the release
> > > > >
> > > > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > > > >
> > > > >
> > > > >
> > > > > The complete staging area is available for your review, which
> includes:
> > > > >
> > > > > * JIRA release notes [1],
> > > > >
> > > > > * the official Apache source release and binary convenience
> releases to
> > > > be
> > > > > deployed to dist.apache.org [2], which are signed with the key
> with
> > > > > fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],
> > > > >
> > > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > >
> > > > > * source code tag "0.11.0-rc2" [5],
> > > > >
> > > > >
> > > > >
> > > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > > approval, with at least 3 PMC affirmative votes.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Release Manager
> > > > >
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673
> > > > >
> > > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc2/
> > > > >
> > > > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > > > >
> > > > > [4]
> > > >
> https://repository.apache.org/content/repositories/orgapachehudi-1060/
> > > > >
> > > > > [5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc2
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best,
> > Shiyan
>


-- 
Best,
Shiyan


Re: [VOTE] Release 0.11.0, release candidate #2

2022-04-19 Thread Shiyan Xu
The above fixes are critical. Moving to RC3 now. Thanks everyone.

On Tue, Apr 19, 2022 at 10:03 AM Alexey Kudinkin  wrote:

> -1
>
> Found pretty substantial perf degradation in 0.11 RC2 as compared to
> vanilla Parquet table in Spark (which is being addressed currently).
> More details can be found in HUDI-3891
> <https://issues.apache.org/jira/browse/HUDI-3891>
>
> On Mon, Apr 18, 2022 at 4:31 PM Y Ethan Guo 
> wrote:
>
> > -1
> > The Kafka Connect Sink for Hudi cannot ingest data using
> > hudi-kafka-connect-bundle from 0.11.0-rc2 due to NoClassDefFoundError.
> > The following fix has been put up.
> > https://github.com/apache/hudi/pull/5353
> >
> > Best,
> > - Ethan
> >
> > On Fri, Apr 15, 2022 at 5:20 AM Shiyan Xu 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #2 for the version
> > 0.11.0,
> > > as follows:
> > >
> > > [ ] +1, Approve the release
> > >
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > >
> > >
> > > The complete staging area is available for your review, which includes:
> > >
> > > * JIRA release notes [1],
> > >
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],
> > >
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > >
> > > * source code tag "0.11.0-rc2" [5],
> > >
> > >
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Release Manager
> > >
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673
> > >
> > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc2/
> > >
> > > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> > >
> > > [4]
> > https://repository.apache.org/content/repositories/orgapachehudi-1060/
> > >
> > > [5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc2
> > >
> >
>


-- 
Best,
Shiyan


[VOTE] Release 0.11.0, release candidate #3

2022-04-24 Thread Shiyan Xu
Hi everyone,

Please review and vote on the release candidate #3 for the version 0.11.0,
as follows:

[ ] +1, Approve the release

[ ] -1, Do not approve the release (please provide specific comments)



The complete staging area is available for your review, which includes:

* JIRA release notes [1],

* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],

* all artifacts to be deployed to the Maven Central Repository [4],

* source code tag "0.11.0-rc3" [5],



The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.



Thanks,

Release Manager



[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673

[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc3/

[3] https://dist.apache.org/repos/dist/release/hudi/KEYS

[4] https://repository.apache.org/content/repositories/orgapachehudi-1078/

[5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc3


Re: Permission to contribute

2022-04-26 Thread Shiyan Xu
Done and welcome!

On Tue, Apr 26, 2022 at 7:47 PM Александр Трушев 
wrote:

> Hi,
> I want to contribute to Apache Hudi.
> Would you please grant me contributor permission?
> My Jira username is trushev
>
> Thanks
>


-- 
Best,
Shiyan


Re: [VOTE] Release 0.11.0, release candidate #1

2022-04-10 Thread Shiyan Xu
-1

The Rat plugin in CI was not working for some time, which resulted in some
files missing the Apache license header. This was fixed in master:
https://github.com/apache/hudi/commit/5e65aefc61e973a959c11f08c2aa9f19fefc09fc
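
For reference, this is the standard ASF license header that Rat checks for,
shown in Java/Scala comment syntax:

  /*
   * Licensed to the Apache Software Foundation (ASF) under one
   * or more contributor license agreements.  See the NOTICE file
   * distributed with this work for additional information
   * regarding copyright ownership.  The ASF licenses this file
   * to you under the Apache License, Version 2.0 (the
   * "License"); you may not use this file except in compliance
   * with the License.  You may obtain a copy of the License at
   *
   *      http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */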



On Sat, Apr 9, 2022 at 9:01 PM Sivabalan  wrote:

> -1
>
> a. The release validation script failed due to the presence of non-text files.
> b. We identified a few blocker bugs whose fixes need to be added to the RC.
> One of the fixes has already landed, but two are under review.
>
> https://issues.apache.org/jira/browse/HUDI-3636 : With deltastreamer and
> async table services, the timeline server is not managed well when
> re-initiating write clients. This fails the pipeline from time to time and
> thwarts progress.
> https://issues.apache.org/jira/browse/HUDI-3807: We are adding new
> partitions to the metadata table (MDT) in 0.11 and added a new bloom filter
> index based on it, but it is not guarded by a flag and gets enabled by
> default.
> https://issues.apache.org/jira/browse/HUDI-3454: Partition name parsing had
> bugs with LogRecordScanner in some of the code paths.
>
> Appendix
> a. release validation failures
>
> ./release/validate_staged_release.sh --release=0.11.0 --rc_num=1
> Validation release 0.11.0 for RC 1 and release type dev
> /tmp/validation_scratch_dir_001
> ~/Documents/personal/projects/apache_hudi_dec/hudi/scripts
> Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> Validating hudi-0.11.0-rc1 with release type "dev"
> Checking Checksum of Source Release
> Checksum Check of Source Release - [OK]
>
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 55975  100 55975    0     0   162k      0 --:--:-- --:--:-- --:--:--  162k
> Checking Signature
> Signature Check - [OK]
>
> Checking for binary files in source release
> There were non-text files in source release. Please check below
>
> ./docker/push_to_docker_hub.png: image/png; charset=binary
>
>
>
>
>
>
>
>
> On Wed, 6 Apr 2022 at 15:37, Shiyan Xu 
> wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> 0.11.0,
> > as follows:
> >
> > [ ] +1, Approve the release
> >
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> >
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],
> >
> > * all artifacts to be deployed to the Maven Central Repository [4],
> >
> > * source code tag "0.11.0-rc1" [5],
> >
> >
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> >
> >
> > Thanks,
> >
> > Release Manager
> >
> >
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673
> >
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc1/
> >
> > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> >
> > [4]
> https://repository.apache.org/content/repositories/orgapachehudi-1056/
> >
> > [5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc1
> >
>
>
> --
> Regards,
> -Sivabalan
>


-- 
Best,
Shiyan


[DISCUSS] Diagnostic reporter

2022-05-30 Thread Shiyan Xu
Hi all,

When troubleshooting Hudi jobs in users' environments, we always ask users
to share configs and environment info, check the Spark UI, etc. Here is an
RFC idea: can we extend the Hudi metrics system to build a diagnostic
reporter? It could be turned on like a normal metrics reporter. It should
collect common troubleshooting info and save it to JSON or another
human-readable text format. Users should be able to run with it and share
the resulting diagnosis file. The RFC should discuss what info should / can
be collected; a rough sketch follows.
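
To make this concrete, here is a rough, hypothetical Scala sketch. None of
these class or field names exist in Hudi today; it only illustrates the
kind of info such a reporter could collect and dump for users to share:

  import java.io.PrintWriter
  import java.time.Instant

  // Hypothetical sketch: a real implementation would hook into the Hudi
  // metrics reporter abstraction and also pull sanitized write configs,
  // table type, engine (Spark/Flink) versions, timeline summary, etc.
  object DiagnosticReporterSketch {

    // Cheap, safe-to-share environment facts as a starting point.
    private def collect(): Map[String, String] = Map(
      "timestamp"           -> Instant.now().toString,
      "javaVersion"         -> System.getProperty("java.version"),
      "osName"              -> System.getProperty("os.name"),
      "availableProcessors" -> Runtime.getRuntime.availableProcessors().toString,
      "maxMemoryBytes"      -> Runtime.getRuntime.maxMemory().toString
    )

    // Minimal JSON rendering; a real reporter would use a JSON library.
    private def toJson(entries: Map[String, String]): String =
      entries
        .map { case (k, v) => "  \"" + k + "\": \"" + v + "\"" }
        .mkString("{\n", ",\n", "\n}")

    def main(args: Array[String]): Unit = {
      val out = new PrintWriter("hudi-diagnostics.json")
      try out.println(toJson(collect())) finally out.close()
    }
  }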

Does this make sense? Anyone interested in driving the RFC design and
implementation work?

-- 
Best,
Shiyan

