Re: [DISCUSS] Multi-table transactions

2023-08-30 Thread Vinoth Chandar
+1 Reviewed the RFC. Looks like a promising direction to take.

On Thu, Aug 24, 2023 at 9:26 AM sagar sumit  wrote:

> Hi devs,
>
> RFC-69 proposes some exciting features and in line with that vision,
> I would like to propose support for multi-table transactions in Hudi.
>
> As the name suggests, this would enable transactional consistency
> across multiple tables, i.e. a set of changes to multiple tables either
> completely succeeds or completely fails. This could be helpful for use
> cases such as updating details about a sales order that affects 2 or more
> tables, deleting records for a customer across 2 or more tables, etc.
>
> Hudi already provides ACID guarantees on a single table and tunable
> concurrency control. We would need to build additional orchestration or
> consistency mechanisms on top of the existing ones. I would like to put
> more details in a separate RFC. However, the high-level goal is to
> provide the same guarantees as Hudi provides for a single table, and to
> work with both kinds of concurrency control, OCC and MVCC.
>
> Looking forward to hearing some thoughts from you all.
>
> Regards,
> Sagar
>
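The all-or-nothing behavior described above can be sketched as a toy model. Hudi exposes no multi-table transaction API today; every class and function below is a hypothetical illustration of the contract (stage changes on every table first, then make them all visible, rolling everything back on any failure), not a proposed design:

```python
# Toy sketch of atomic multi-table commit. All names are hypothetical;
# Hudi has no such API today. A real implementation would also need the
# final "flip" itself to be atomic (e.g. one shared metadata entry).

class Table:
    def __init__(self, name):
        self.name = name
        self.committed = {}   # state visible to readers
        self.staged = None    # pending, not yet visible

    def stage(self, changes):
        self.staged = dict(changes)

    def commit(self):
        self.committed.update(self.staged)
        self.staged = None

    def rollback(self):
        self.staged = None

def multi_table_commit(writes):
    """Apply {table: changes} so that all tables change or none do."""
    tables = list(writes)
    try:
        for t in tables:          # phase 1: stage everything
            t.stage(writes[t])
        for t in tables:          # phase 2: make all changes visible
            t.commit()
    except Exception:
        for t in tables:          # any failure: discard all staged changes
            t.rollback()
        raise

orders, customers = Table("orders"), Table("customers")
multi_table_commit({orders: {"o1": "shipped"}, customers: {"c1": "active"}})
print(orders.committed, customers.committed)
```

This mirrors the sales-order example in the proposal: the update to `orders` and `customers` becomes visible together or not at all.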


Re: [DISCUSS] Hudi Reverse Streamer

2023-08-21 Thread Pratyaksh Sharma
Hi Vinoth,

I have raised a PR here - https://github.com/apache/hudi/pull/9492.
Let us continue the discussion there.

On Wed, Aug 16, 2023 at 4:43 PM Vinoth Chandar <
mail.vinoth.chan...@gmail.com> wrote:

> Hi Pratyaksh,
>
> Are you still actively driving this?
>

Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread Vinoth Chandar
Awesome! that was easy. lets go!

On Wed, Aug 16, 2023 at 5:32 AM sagar sumit  wrote:

> Hi Vinoth,
>
> 1.0 seems to be packed with exciting features.
> I would be glad to volunteer as the release manager.
>
> Regards,
> Sagar
>


Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread sagar sumit
Hi Vinoth,

1.0 seems to be packed with exciting features.
I would be glad to volunteer as the release manager.

Regards,
Sagar

On Wed, Aug 16, 2023 at 5:24 PM Vinoth Chandar  wrote:

> Hi PMC/Committers,
>
> We are looking for a volunteer to act as release manager for the 1.0
> release.
> https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning
>
> Anyone interested?
>
> Thanks
> Vinoth
>


Re: DISCUSS Hudi 1.x plans

2023-08-16 Thread Vinoth Chandar
Hello everyone,

We have been doing a lot of foundational design and prototyping work, and I
have outlined an execution plan here.
https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning

Look forward to contributions!


On Wed, May 10, 2023 at 4:14 PM Sivabalan  wrote:

> Great! Left some feedback.
>
> On Wed, 10 May 2023 at 06:56, Vinoth Chandar  wrote:
> >
> > All - the RFC is up here. Please comment on the PR or use the dev list to
> > discuss ideas.
> > https://github.com/apache/hudi/pull/8679/
> >
> > On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar 
> wrote:
> >
> > > I have claimed RFC-69, per our process.
> > >
> > > On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar 
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have been consolidating all our progress on Hudi and putting
> together a
> > >> proposal for Hudi 1.x vision and a concrete plan for the first
> version 1.0.
> > >>
> > >> Will plan to open up the RFC to gather ideas across the community in
> > >> coming days.
> > >>
> > >> Thanks
> > >> Vinoth
> > >>
> > >
>
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-08-16 Thread Vinoth Chandar
Hi Pratyaksh,

Are you still actively driving this?

On Tue, Jul 11, 2023 at 2:18 PM Pratyaksh Sharma 
wrote:

> Update: I will be raising the initial draft of RFC in the next couple of
> days.
>


Re: [DISCUSS] Should we support a service to manage all deltastreamer jobs?

2023-08-16 Thread Vinoth Chandar
+1. There are RFCs on table management services, but nothing specific to
deltastreamer itself.

Are you proposing building something specific to that?

On Wed, Jun 14, 2023 at 8:26 AM Pratyaksh Sharma 
wrote:

> Hi,
>
> Personally I am in favour of creating such a UI where monitoring and
> managing configurations is just a click away. That makes life a lot easier
> for users. So +1 on the proposal.
>
> I remember the work for it had started long back around 2019. You can check
> this RFC
> <
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233
> >
> for your reference. I am not sure why this work could not continue though.
>
> On Wed, Jun 14, 2023 at 4:28 PM 孔维 <18701146...@163.com> wrote:
>
> > Hi, team,
> >
> >
> > Background:
> > More and more Hudi users ingest data with deltastreamer, resulting in a
> > large number of deltastreamer jobs that need to be managed. In our
> > company, we also manage a large number of deltastreamer jobs ourselves,
> > which involves a lot of operations, maintenance, and monitoring work.
> > If we could provide a deltastreamer service to create, manage, and
> > monitor all jobs in a unified manner, it would greatly reduce the
> > management burden of deltastreamer and at the same time lower the
> > threshold for using it, which would help drive its adoption.
> > Also, considering that deltastreamer already supports hot updates of
> > configuration [https://github.com/apache/hudi/pull/8807], we could build
> > on that feature to apply configuration changes without restarting the
> > job.
> >
> >
> > We hope to provide:
> > - a web UI that supports creating, managing, and monitoring
> > deltastreamer jobs
> > - timely configuration changes via the hot-update capability
> >
> >
> > I am not sure whether such a service is in line with the community's
> > direction, and I look forward to your replies!
> >
> >
> > Best Regards
>
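The configuration hot-update capability referenced above (apache/hudi PR 8807) can be illustrated with a minimal, stdlib-only sketch: the job rereads its properties when the config file changes, without restarting. The class name, JSON format, and reload strategy are illustrative assumptions, not Hudi's implementation:

```python
# Illustrative sketch of hot-reloading job configuration; not Hudi's
# actual mechanism. A real implementation would poll the file's mtime or
# use inotify instead of rereading the file on every access.
import json
import os
import tempfile

class HotReloadingConfig:
    def __init__(self, path):
        self.path = path
        self._raw = None     # last seen file content
        self._conf = {}

    def get(self, key, default=None):
        with open(self.path, "rb") as f:
            raw = f.read()
        if raw != self._raw:             # reparse only when content changed
            self._conf = json.loads(raw)
            self._raw = raw
        return self._conf.get(key, default)

path = os.path.join(tempfile.mkdtemp(), "job.json")
with open(path, "w") as f:
    json.dump({"source.limit": 1000}, f)

conf = HotReloadingConfig(path)
print(conf.get("source.limit"))          # reads the initial value: 1000

with open(path, "w") as f:               # operator edits the config in place
    json.dump({"source.limit": 5000}, f)
print(conf.get("source.limit"))          # picked up without a restart: 5000
```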


Re: Discuss fast copy on write rfc-68

2023-07-21 Thread Nicolas Paris
I definitely can't see a benefit to using 30MB row groups over just creating
30MB parquet files.

I would add that stats indexes are at the file level, which also favors
making row group size equal to file size.

The only context where it would help is when clustering is set up and
targets 1GB files, with 128MB row groups.

I would love to be contradicted on this. But so far, a form of fast COW
already exists: reducing the parquet file size for faster writes. It comes
with a drawback on read performance, as smaller row groups would, but it
benefits better from the stats indexes.


On July 20, 2023 9:28:07 PM UTC, Nicolas Paris  wrote:
>Spliting parquet file into 5 row groups, leads to same benefit as creating 5 
>parquet files each 1 row group instead.
>
>Also the later can involve more parallelism for writes.
>
>Am I missing something?
>


Re: Discuss fast copy on write rfc-68

2023-07-20 Thread Nicolas Paris
Splitting a parquet file into 5 row groups leads to the same benefit as
creating 5 parquet files of 1 row group each.

Also, the latter allows more parallelism for writes.

Am I missing something?

On July 20, 2023 12:38:54 PM UTC, sagar sumit  wrote:
>Good questions! The idea is to be able to skip rowgroups based on index.
>But, if we have to do a full snapshot load, then our wrapper should actually
>be doing batch GET on S3. Why incur 5x more calls.
>As for the update, I think this is in the context of COW. So, the footer
>will be
>recomputed anyways, so handling updates should not be that tricky.
>
>Regards,
>Sagar
>


Re: Discuss fast copy on write rfc-68

2023-07-20 Thread sagar sumit
Good questions! The idea is to be able to skip row groups based on the index.
But if we have to do a full snapshot load, then our wrapper should actually
be doing batch GETs on S3, so why incur 5x more calls?
As for the updates, I think this is in the context of COW, so the footer
will be recomputed anyway, and handling updates should not be that tricky.

Regards,
Sagar

On Thu, Jul 20, 2023 at 3:26 PM nicolas paris 
wrote:

> Hi,
>
> Multiple independent initiatives for fast copy-on-write have emerged
> (correct me if I am wrong):
> 1.
>
> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
> 2.
> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>
>
> The idea is to rely on the RLI (record-level index) to target only some
> row groups in a given parquet file, and to serde only those when copying
> the file.
>
> Currently Hudi generates one row group per parquet file (and large row
> groups are what parquet and others advocate).
>
> The FCOW feature would then need several row groups per parquet file to
> provide any benefit, say 30MB each, as mentioned in the RFC-68
> discussion.
>
> I have concerns about using small row groups for read performance, such
> as:
> - more S3 throttling: if we have 5x more row groups in a parquet file,
> it leads to 5x the GET calls
> - worse read performance: since larger row groups lead to better
> performance overall
>
>
> As a side question, I wonder how the writer can keep the statistics in
> the parquet footer correct. If updates occur somewhere, the following
> footer contents must be updated accordingly:
> - parquet row group/page stats
> - parquet dictionary
> - parquet bloom filters
>
> Thanks for your feedback on those
>
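The row-group-level copy-on-write being discussed can be modeled with a small toy: row groups carry min/max key stats, groups that cannot contain the updated key are copied as opaque bytes, and only the matching group is decoded and re-encoded. Everything here (the JSON "encoding", the helper names) is illustrative, not the RFC-68 or Uber design:

```python
# Toy model of row-group-level copy-on-write. Groups whose key range
# excludes the updated key are copied without decoding; only the matching
# group pays the serde cost. Illustrative only.
import json

def make_file(groups):
    """A 'file' is a list of encoded row groups with min/max key stats."""
    return [{"min": min(g), "max": max(g),
             "bytes": json.dumps(g).encode()} for g in groups]

def fast_cow_update(file, key, value):
    out, decoded = [], 0
    for rg in file:
        if rg["min"] <= key <= rg["max"]:     # stats say the key may be here
            rows = json.loads(rg["bytes"])    # decode: the expensive part
            rows[key] = value
            rg = {"min": min(rows), "max": max(rows),
                  "bytes": json.dumps(rows).encode()}
            decoded += 1
        out.append(dict(rg))                  # other groups: raw byte copy
    return out, decoded

f = make_file([{"k01": 1, "k09": 2}, {"k10": 3, "k19": 4}, {"k20": 5}])
f2, decoded = fast_cow_update(f, "k10", 99)
print(decoded)   # 1: only one of the three row groups was re-encoded
```

This also makes the trade-off in the thread concrete: with one row group per file, `decoded` is always 1 out of 1, so splitting into several groups only helps when the file holds multiple groups, at the cost of more (and smaller) reads.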


Re: [DISCUSS] Hudi Reverse Streamer

2023-07-11 Thread Pratyaksh Sharma
Update: I will be raising the initial draft of RFC in the next couple of
days.

On Thu, Jun 15, 2023 at 2:28 AM Rajesh Mahindra  wrote:

> Great. We also need it for use cases of loading data into warehouses, and
> would love to help.
>
> --
> Take Care,
> Rajesh Mahindra
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Rajesh Mahindra
Great. We also need it for use cases of loading data into warehouses, and
would love to help.

On Wed, Jun 14, 2023 at 9:06 AM Pratyaksh Sharma 
wrote:

> Hi,
>
> I missed this email earlier. Sure let me start an RFC this week and we can
> take it from there.
>


-- 
Take Care,
Rajesh Mahindra


Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Pratyaksh Sharma
Hi,

I missed this email earlier. Sure let me start an RFC this week and we can
take it from there.

On Wed, Jun 14, 2023 at 9:20 PM Nicolas Paris 
wrote:

> Hi any rfc/ongoing efforts on the reverse delta streamer ? We have a use
> case to do hudi => Kafka and would enjoy building a more general tool.
>
> However we need a rfc basis to start some effort in the right way
>
> On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar <
> mail.vinoth.chan...@gmail.com> wrote:
> >Cool. lets draw up a RFC for this? @pratyaksh - do you want to start one,
> >given you expressed interest?
> >
> >On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi 
> wrote:
> >
> >> +1
> >> This would be great!
> >>
> >> Cheers,
> >>
> >> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma 
> >> wrote:
> >>
> >> > Hi Vinoth,
> >> >
> >> > I am aligned with the first reason that you mentioned. Better to have
> a
> >> > separate tool to take care of this.
> >> >
> >> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
> >> > mail.vinoth.chan...@gmail.com>
> >> > wrote:
> >> >
> >> > > +1
> >> > >
> >> > > I was thinking that we add a new utility and NOT extend
> DeltaStreamer
> >> by
> >> > > adding a Sink interface, for the following reasons
> >> > >
> >> > > - It will make it look like a generic Source => Sink ETL tool,
> which is
> >> > > actually not our intention to support on Hudi. There are plenty of
> good
> >> > > tools for that out there.
> >> > > - the config management can get bit hard to understand, since we
> >> overload
> >> > > ingest and reverse ETL into a single tool. So break it off at
> use-case
> >> > > level?
> >> > >
> >> > > Thoughts?
> >> > >
> >> > > David:  PMC does not have control over that. Please see unsubscribe
> >> > > instructions here. https://hudi.apache.org/community/get-involved
> >> > > Love to keep this thread about reverse streamer discussion. So
> kindly
> >> > fork
> >> > > another thread if you want to discuss unsubscribing.
> >> > >
> >> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam 
> >> > wrote:
> >> > >
> >> > > > Hello Vinoth,
> >> > > >
> >> > > > Can you please unsubscribe me?  I have been trying to unsubscribe
> for
> >> > > > months without success.
> >> > > >
> >> > > > Kind Regards,
> >> > > > David
> >> > > >
> >> > > > Sent from Outlook for Android
> >> > > > 
> >> > > > From: Vinoth Chandar 
> >> > > > Sent: Friday, March 31, 2023 5:09:52 AM
> >> > > > To: dev 
> >> > > > Subject: [DISCUSS] Hudi Reverse Streamer
> >> > > >
> >> > > > Hi all,
> >> > > >
> >> > > > Any interest in building a reverse streaming tool, that does the
> >> > reverse
> >> > > of
> >> > > > what the DeltaStreamer tool does? It will read Hudi table
> >> incrementally
> >> > > > (only source) and write out the data to a variety of sinks -
> Kafka,
> >> > JDBC
> >> > > > Databases, DFS.
> >> > > >
> >> > > > This has come up many times with data warehouse users. Often
> times,
> >> > they
> >> > > > want to use Hudi to speed up or reduce costs on their data
> ingestion
> >> > and
> >> > > > ETL (using Spark/Flink), but want to move the derived data back
> into
> >> a
> >> > > data
> >> > > > warehouse or an operational database for serving.
> >> > > >
> >> > > > What do you all think?
> >> > > >
> >> > > > Thanks
> >> > > > Vinoth
> >> > > >
> >> > >
> >> >
> >>
> >>
> >> --
> >> *Léo Biscassi*
> >> Blog - https://leobiscassi.com
> >>
> >>-
> >>
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Nicolas Paris
Hi any rfc/ongoing efforts on the reverse delta streamer ? We have a use case 
to do hudi => Kafka and would enjoy building a more general tool. 

However we need a rfc basis to start some effort in the right way

On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar 
 wrote:
>Cool. Let's draw up an RFC for this? @pratyaksh - do you want to start one,
>given you expressed interest?
>
>On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi  wrote:
>
>> +1
>> This would be great!
>>
>> Cheers,
>>
>> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma 
>> wrote:
>>
>> > Hi Vinoth,
>> >
>> > I am aligned with the first reason that you mentioned. Better to have a
>> > separate tool to take care of this.
>> >
>> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
>> > mail.vinoth.chan...@gmail.com>
>> > wrote:
>> >
>> > > +1
>> > >
>> > > I was thinking that we add a new utility and NOT extend DeltaStreamer
>> by
>> > > adding a Sink interface, for the following reasons
>> > >
>> > > - It will make it look like a generic Source => Sink ETL tool, which is
>> > > actually not our intention to support on Hudi. There are plenty of good
>> > > tools for that out there.
>> > > - the config management can get bit hard to understand, since we
>> overload
>> > > ingest and reverse ETL into a single tool. So break it off at use-case
>> > > level?
>> > >
>> > > Thoughts?
>> > >
>> > > David:  PMC does not have control over that. Please see unsubscribe
>> > > instructions here. https://hudi.apache.org/community/get-involved
>> > > Love to keep this thread about reverse streamer discussion. So kindly
>> > fork
>> > > another thread if you want to discuss unsubscribing.
>> > >
>> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam 
>> > wrote:
>> > >
>> > > > Hello Vinoth,
>> > > >
>> > > > Can you please unsubscribe me?  I have been trying to unsubscribe for
>> > > > months without success.
>> > > >
>> > > > Kind Regards,
>> > > > David
>> > > >
>> > > > Sent from Outlook for Android
>> > > > 
>> > > > From: Vinoth Chandar 
>> > > > Sent: Friday, March 31, 2023 5:09:52 AM
>> > > > To: dev 
>> > > > Subject: [DISCUSS] Hudi Reverse Streamer
>> > > >
>> > > > Hi all,
>> > > >
>> > > > Any interest in building a reverse streaming tool, that does the
>> > reverse
>> > > of
>> > > > what the DeltaStreamer tool does? It will read Hudi table
>> incrementally
>> > > > (only source) and write out the data to a variety of sinks - Kafka,
>> > JDBC
>> > > > Databases, DFS.
>> > > >
>> > > > This has come up many times with data warehouse users. Often times,
>> > they
>> > > > want to use Hudi to speed up or reduce costs on their data ingestion
>> > and
>> > > > ETL (using Spark/Flink), but want to move the derived data back into
>> a
>> > > data
>> > > > warehouse or an operational database for serving.
>> > > >
>> > > > What do you all think?
>> > > >
>> > > > Thanks
>> > > > Vinoth
>> > > >
>> > >
>> >
>>
>>
>> --
>> *Léo Biscassi*
>> Blog - https://leobiscassi.com
>>
>>-
>>
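The tool discussed in this thread reduces to a checkpointed incremental-pull loop: read the records committed to the Hudi table since the last checkpoint, push them to the sink, and advance the checkpoint only after the sink write succeeds. A minimal plain-Python sketch of that loop, under assumed names (`read_incremental`, `write_sink`, and the checkpoint callbacks are illustrative stand-ins, not Hudi or DeltaStreamer APIs):

```python
def reverse_stream(read_incremental, write_sink, load_checkpoint, save_checkpoint):
    """One sync round of a reverse streamer (sketch): read records added to
    the table after the last checkpointed commit, push them to the sink,
    then persist the new checkpoint only after the sink write succeeds."""
    last_commit = load_checkpoint()
    records, new_commit = read_incremental(since=last_commit)
    if not records:
        return last_commit
    write_sink(records)          # e.g. produce to Kafka, JDBC batch insert, DFS write
    save_checkpoint(new_commit)  # saved last => at-least-once: sink may see replays
    return new_commit
```

In continuous mode this round would run in a loop; saving the checkpoint after the sink write means a crash in between replays the batch, i.e. at-least-once delivery to the sink.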


Re: [DISCUSS] Should we support a service to manage all deltastreamer jobs?

2023-06-14 Thread Pratyaksh Sharma
Hi,

Personally I am in favour of creating such a UI where monitoring and
managing configurations is just a click away. That makes life a lot easier
for users. So +1 on the proposal.

I remember the work for it had started long back around 2019. You can check
this RFC

for your reference. I am not sure why this work could not continue though.

On Wed, Jun 14, 2023 at 4:28 PM 孔维 <18701146...@163.com> wrote:

> Hi, team,
>
>
> Background:
> More and more hudi accesses use deltastreamer, resulting in a large number
> of deltastreamer jobs that need to be managed. In our company, we also
> manage a large number of deltastreamer jobs by ourselves, and there is a
> lot of operation and maintenance management and monitoring work.
> If we can provide such a deltastreamer service to create, manage, and
> monitor all tasks in a unified manner, it can greatly reduce the management
> pressure of deltastreamer, and at the same time lower the threshold for
> using deltastreamer, which is conducive to the promotion and use of
> deltastreamer.
> At the same time, considering that deltastreamer already supports
> configuration hot update capability [
> https://github.com/apache/hudi/pull/8807], we can offer configuration hot
> update capability based on the feature, and make configuration changes
> without restarting the job.
>
>
> We hope to provide:
> - a web UI that supports creating, managing, and monitoring deltastreamer tasks
> - timely configuration changes, built on the configuration hot-update capability
>
>
> I don't know whether such a service is in line with the evolution of the
> community, and I hope to receive your reply!
>
>
> Best Regards


Re: Re: [DISCUSS] should deltastreamer support configuration hot update?

2023-05-24 Thread Sivabalan
sure. sg. thanks!

On Tue, 23 May 2023 at 23:22, 孔维 <18701146...@163.com> wrote:
>
> Hi, Sivabalan,
>
>
> Great to hear from you. Then I will create a JIRA ticket to track this feature
>
>
> Best Regards
>
> At 2023-05-24 02:22:36, "Sivabalan"  wrote:
> >I could not see the image you have attached. But I do get your ask here.
> >Will definitely benefit continuous deltastreamer use-cases. One possible
> >option is to keep track of the last mod time of the property file that we
> >feed in for deltastreamer top level config and before every batch, we can
> >check if it has changed. Optionally re-instantiate write config and other
> >components (write client etc) if applicable. If not, proceed as usual.
> >Should not be hard to add the support.
> >
> >
> >
> >
> >On Mon, 22 May 2023 at 00:05, 孔维 <18701146...@163.com> wrote:
> >
> >> Hi team,
> >>
> >> I am thinking about whether it is necessary to add the feature of
> >> configuration hot update to deltastreamer.
> >>
> >> In our company, hudi is used as a platform. We provide deltastreamer (run
> >> in continuous mode) to write to a large number of sources (including mysql
> >> & tidb) as a long time service. We often need to update the hudi
> >> configuration, but we don’t want to restart deltastreamer to achieve it,
> >> which may be too heavy for our job scheduler server based on livy/yarn.
> >> Therefore, we provide deltastreamer configuration hot update function. It
> >> is possible to update some common parameters instantly, and these
> >> parameters will take effect at the next sync of deltastreamer. These
> >> parameters include:
> >>
> >>- hoodie.bulkinsert.shuffle.parallelism (used only in bulkinsert)
> >>- hoodie.upsert.shuffle.parallelism
> >>- hoodie.deltastreamer.kafka.source.maxEvents
> >>- hoodie.memory.merge.max.size
> >>- hoodie.memory.compaction.max.size
> >>- hoodie.datasource.hive_sync.*
> >>- hoodie.compact.inline.max.delta.commits
> >>- hoodie.compaction.strategy
> >>- hoodie.compaction.target.io
> >>
> >> The whole flow chart is as follows:
> >>
> >>
> >
> >--
> >Regards,
> >-Sivabalan



-- 
Regards,
-Sivabalan


Re: [DISCUSS] should deltastreamer support configuration hot update?

2023-05-23 Thread Sivabalan
I could not see the image you have attached. But I do get your ask here.
Will definitely benefit continuous deltastreamer use-cases. One possible
option is to keep track of the last mod time of the property file that we
feed in for deltastreamer top level config and before every batch, we can
check if it has changed. Optionally re-instantiate write config and other
components (write client etc) if applicable. If not, proceed as usual.
Should not be hard to add the support.




On Mon, 22 May 2023 at 00:05, 孔维 <18701146...@163.com> wrote:

> Hi team,
>
> I am thinking about whether it is necessary to add the feature of
> configuration hot update to deltastreamer.
>
> In our company, hudi is used as a platform. We provide deltastreamer (run
> in continuous mode) to write to a large number of sources (including mysql
> & tidb) as a long time service. We often need to update the hudi
> configuration, but we don’t want to restart deltastreamer to achieve it,
> which may be too heavy for our job scheduler server based on livy/yarn.
> Therefore, we provide deltastreamer configuration hot update function. It
> is possible to update some common parameters instantly, and these
> parameters will take effect at the next sync of deltastreamer. These
> parameters include:
>
>- hoodie.bulkinsert.shuffle.parallelism (used only in bulkinsert)
>- hoodie.upsert.shuffle.parallelism
>- hoodie.deltastreamer.kafka.source.maxEvents
>- hoodie.memory.merge.max.size
>- hoodie.memory.compaction.max.size
>- hoodie.datasource.hive_sync.*
>- hoodie.compact.inline.max.delta.commits
>- hoodie.compaction.strategy
>- hoodie.compaction.target.io
>
> The whole flow chart is as follows:
>
>

-- 
Regards,
-Sivabalan
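The mtime-based reload check suggested in this thread can be sketched in a few lines: remember the property file's last modification time and re-read it before a batch only when that time changes. A plain-Python illustration (`ConfigWatcher` and `maybe_reload` are assumed names, not Hudi classes):

```python
import os

class ConfigWatcher:
    """Re-read a properties file only when its mtime changes (sketch)."""

    def __init__(self, path):
        self.path = path
        self.last_mtime = None
        self.props = {}

    def maybe_reload(self):
        """Call before each batch; returns True if the config was re-read."""
        mtime = os.path.getmtime(self.path)
        if mtime == self.last_mtime:
            return False
        self.last_mtime = mtime
        with open(self.path) as f:
            # key=value lines; comments and blank lines are skipped
            self.props = dict(
                line.strip().split("=", 1)
                for line in f
                if "=" in line and not line.lstrip().startswith("#")
            )
        return True
```

On a reload, the caller would decide which components (write config, write client, etc.) need re-instantiating before proceeding with the next batch, as described above.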


Re: DISCUSS Hudi 1.x plans

2023-05-10 Thread Sivabalan
Great! Left some feedback.

On Wed, 10 May 2023 at 06:56, Vinoth Chandar  wrote:
>
> All - the RFC is up here. Please comment on the PR or use the dev list to
> discuss ideas.
> https://github.com/apache/hudi/pull/8679/
>
> On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar  wrote:
>
> > I have claimed RFC-69, per our process.
> >
> > On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar  wrote:
> >
> >> Hi all,
> >>
> >> I have been consolidating all our progress on Hudi and putting together a
> >> proposal for Hudi 1.x vision and a concrete plan for the first version 1.0.
> >>
> >> Will plan to open up the RFC to gather ideas across the community in
> >> coming days.
> >>
> >> Thanks
> >> Vinoth
> >>
> >



-- 
Regards,
-Sivabalan


Re: DISCUSS Hudi 1.x plans

2023-05-10 Thread Vinoth Chandar
All - the RFC is up here. Please comment on the PR or use the dev list to
discuss ideas.
https://github.com/apache/hudi/pull/8679/

On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar  wrote:

> I have claimed RFC-69, per our process.
>
> On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar  wrote:
>
>> Hi all,
>>
>> I have been consolidating all our progress on Hudi and putting together a
>> proposal for Hudi 1.x vision and a concrete plan for the first version 1.0.
>>
>> Will plan to open up the RFC to gather ideas across the community in
>> coming days.
>>
>> Thanks
>> Vinoth
>>
>


Re: DISCUSS Hudi 1.x plans

2023-05-08 Thread Vinoth Chandar
I have claimed RFC-69, per our process.

On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar  wrote:

> Hi all,
>
> I have been consolidating all our progress on Hudi and putting together a
> proposal for Hudi 1.x vision and a concrete plan for the first version 1.0.
>
> Will plan to open up the RFC to gather ideas across the community in
> coming days.
>
> Thanks
> Vinoth
>


Re:Re:Re: Re: Re: DISCUSS

2023-04-24 Thread 吕虎
Hi folks,
  I haven't received your reply for a long time. I think you must have 
something more important to do. Am I right? : ) 

At 2023-03-31 21:40:46, "吕虎"  wrote:
>Hi Vinoth, I'm glad to receive your letter. Here are some of my thoughts.
>At 2023-03-31 10:17:52, "Vinoth Chandar"  wrote:
>>I think we can focus more on validating the hash index + bloom filter vs
>>consistent hash index more first. Have you looked at RFC-08, which is a
>>kind of hash index as well, except it stores the key => file group mapping
>>externally.
>
>  The idea of the RFC-08 index (rowKey -> partitionPath, fileID) is very similar
> to the HBase index, but the index is implemented inside Hudi, so there is
> no need to worry about consistency issues. The index can be written to HFiles
> quickly, but reading requires consulting multiple HFiles, so
> index read performance can be a problem. The RFC proposer therefore
> naturally thought of using hash buckets to partially solve this problem.
> HBase's answer to having many HFiles is to add a max/min
> index and a Bloom filter index. In Hudi, you could directly create a max/min
> index and a Bloom filter index for FileGroups, eliminating the
> need to store the index in HFiles; another option is to compact the
> HFiles, but that also adds a burden to Hudi. We need to consider
> HFile read performance carefully when using RFC-08.
>
>Therefore, I believe that hash partition  + bloom filter is still the simplest 
>and most effective solution for predictable data growth in a small range.
>
>At 2023-03-31 10:17:52, "Vinoth Chandar"  wrote:
>>I think we can focus more on validating the hash index + bloom filter vs
>>consistent hash index more first. Have you looked at RFC-08, which is a
>>kind of hash index as well, except it stores the key => file group mapping
>>externally.
>>
>>On Fri, Mar 24, 2023 at 2:14 AM 吕虎  wrote:
>>
>>> Hi Vinoth, I am very happy to receive your reply. Here are some of my
>>> thoughts。
>>>
>>> At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
>>> >>but when it is used for data expansion, it still involves the need to
>>> >redistribute the data records of some data files, thus affecting the
>>> >performance.
>>> >but expansion of the consistent hash index is an optional operation right?
>>>
>>> >Sorry, not still fully understanding the differences here,
>>> I'm sorry I didn't make myself clear. The expansion I mentioned last
>>> time refers to data records increase in hudi table.
>>> The difference between consistent hash index and hash partition with Bloom
>>> filters index is how to deal with  data increase:
>>> A consistent hash index handles growth by splitting files.
>>> Splitting files affects performance, but keeps working indefinitely.
>>> So a consistent hash index suits scenarios where data growth
>>> cannot be estimated, or where the data will grow a lot.
>>> A hash partition with a Bloom filter index handles growth by creating new
>>> files. Adding new files does not affect performance, but if there
>>> are too many files, the probability of false positives in the Bloom filters
>>> will increase. So a hash partition with a Bloom filter index suits
>>> scenarios where data growth can be estimated within a relatively small range.
>>>
>>>
>>> >>Because the hash partition field values under the parquet file in a
>>> >columnar storage format are all equal, the added column field hardly
>>> >occupies storage space after compression.
>>> >Any new meta field added adds other overhead in terms evolving the schema,
>>> >so forth. are you suggesting this is not possible to do without a new meta
>>> >field?
>>>
>>> An implementation without a new meta field is more elegant, but for
>>> me, not yet familiar with the Hudi source code, it is somewhat
>>> difficult; it should not be a problem for experts. If you want to
>>> implement it without adding new meta fields, I hope I can participate in
>>> some simple development and learn how the experts do it.
>>>
>>>
>>> >On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
>>> >
>>> >> Hello,
>>> >>  I feel very honored that you are interested in my views.
>>> >>
>>> >>  Here are some of my thoughts marked with blue font.
>>> >>
>>> >> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>>> >>
>>> >> >Thanks for the proposal! Some first set of questions here.
>>> >> >
>>> >> >>You need to pre-select the number of buckets and use the hash
>>> function to
>>> >> >determine which bucket a record belongs to.
>>> >> >>when building the table according to the estimated amount of data,
>>> and it
>>> >> >cannot be changed after building the table
>>> >> >>When the amount of data in a hash partition is too large, the data in
>>> >> that
>>> >> >partition will be split into multiple files in the way of Bloom index.
>>> >> >
>>> >> >All these issues 
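
The tagging path debated in this thread — hash the record key to a fixed bucket (hash partition), then use per-file Bloom filters to prune candidate files inside that bucket — can be sketched as follows. This is a plain-Python illustration under assumed names (`NUM_BUCKETS`, `candidate_files`, the toy `BloomFilter`); it is not the Hudi implementation:

```python
import hashlib

NUM_BUCKETS = 8  # fixed at table creation in the hash-partition scheme

def bucket_of(record_key, num_buckets=NUM_BUCKETS):
    """Stable hash of the record key -> bucket (hash partition) id."""
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

class BloomFilter:
    """Tiny bloom filter standing in for a per-file-group filter."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False => definitely absent; True => possibly present
        return all(self.bits >> p & 1 for p in self._positions(key))

def candidate_files(record_key, files_by_bucket):
    """Tagging sketch: hash to one bucket, then keep only files in that
    bucket whose bloom filter might contain the key."""
    bucket = bucket_of(record_key)
    return [name for name, bf in files_by_bucket.get(bucket, [])
            if bf.might_contain(record_key)]
```

As the thread notes, the false-positive rate of the per-file filters grows with the number of files per bucket, which is why this scheme fits best when data growth can be estimated up front.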

Re: [DISCUSS] Hudi Reverse Streamer

2023-04-11 Thread Vinoth Chandar
Cool. Let's draw up an RFC for this? @pratyaksh - do you want to start one,
given you expressed interest?

On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi  wrote:

> +1
> This would be great!
>
> Cheers,
>
> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma 
> wrote:
>
> > Hi Vinoth,
> >
> > I am aligned with the first reason that you mentioned. Better to have a
> > separate tool to take care of this.
> >
> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
> > mail.vinoth.chan...@gmail.com>
> > wrote:
> >
> > > +1
> > >
> > > I was thinking that we add a new utility and NOT extend DeltaStreamer
> by
> > > adding a Sink interface, for the following reasons
> > >
> > > - It will make it look like a generic Source => Sink ETL tool, which is
> > > actually not our intention to support on Hudi. There are plenty of good
> > > tools for that out there.
> > > - the config management can get bit hard to understand, since we
> overload
> > > ingest and reverse ETL into a single tool. So break it off at use-case
> > > level?
> > >
> > > Thoughts?
> > >
> > > David:  PMC does not have control over that. Please see unsubscribe
> > > instructions here. https://hudi.apache.org/community/get-involved
> > > Love to keep this thread about reverse streamer discussion. So kindly
> > fork
> > > another thread if you want to discuss unsubscribing.
> > >
> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam 
> > wrote:
> > >
> > > > Hello Vinoth,
> > > >
> > > > Can you please unsubscribe me?  I have been trying to unsubscribe for
> > > > months without success.
> > > >
> > > > Kind Regards,
> > > > David
> > > >
> > > > Sent from Outlook for Android
> > > > 
> > > > From: Vinoth Chandar 
> > > > Sent: Friday, March 31, 2023 5:09:52 AM
> > > > To: dev 
> > > > Subject: [DISCUSS] Hudi Reverse Streamer
> > > >
> > > > Hi all,
> > > >
> > > > Any interest in building a reverse streaming tool, that does the
> > reverse
> > > of
> > > > what the DeltaStreamer tool does? It will read Hudi table
> incrementally
> > > > (only source) and write out the data to a variety of sinks - Kafka,
> > JDBC
> > > > Databases, DFS.
> > > >
> > > > This has come up many times with data warehouse users. Often times,
> > they
> > > > want to use Hudi to speed up or reduce costs on their data ingestion
> > and
> > > > ETL (using Spark/Flink), but want to move the derived data back into
> a
> > > data
> > > > warehouse or an operational database for serving.
> > > >
> > > > What do you all think?
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
>
> --
> *Léo Biscassi*
> Blog - https://leobiscassi.com
>
>-
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-04-10 Thread Léo Biscassi
+1
This would be great!

Cheers,

On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma 
wrote:

> Hi Vinoth,
>
> I am aligned with the first reason that you mentioned. Better to have a
> separate tool to take care of this.
>
> On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
> mail.vinoth.chan...@gmail.com>
> wrote:
>
> > +1
> >
> > I was thinking that we add a new utility and NOT extend DeltaStreamer by
> > adding a Sink interface, for the following reasons
> >
> > - It will make it look like a generic Source => Sink ETL tool, which is
> > actually not our intention to support on Hudi. There are plenty of good
> > tools for that out there.
> > - the config management can get bit hard to understand, since we overload
> > ingest and reverse ETL into a single tool. So break it off at use-case
> > level?
> >
> > Thoughts?
> >
> > David:  PMC does not have control over that. Please see unsubscribe
> > instructions here. https://hudi.apache.org/community/get-involved
> > Love to keep this thread about reverse streamer discussion. So kindly
> fork
> > another thread if you want to discuss unsubscribing.
> >
> > On Fri, Mar 31, 2023 at 1:47 AM Davidiam 
> wrote:
> >
> > > Hello Vinoth,
> > >
> > > Can you please unsubscribe me?  I have been trying to unsubscribe for
> > > months without success.
> > >
> > > Kind Regards,
> > > David
> > >
> > > Sent from Outlook for Android
> > > 
> > > From: Vinoth Chandar 
> > > Sent: Friday, March 31, 2023 5:09:52 AM
> > > To: dev 
> > > Subject: [DISCUSS] Hudi Reverse Streamer
> > >
> > > Hi all,
> > >
> > > Any interest in building a reverse streaming tool, that does the
> reverse
> > of
> > > what the DeltaStreamer tool does? It will read Hudi table incrementally
> > > (only source) and write out the data to a variety of sinks - Kafka,
> JDBC
> > > Databases, DFS.
> > >
> > > This has come up many times with data warehouse users. Often times,
> they
> > > want to use Hudi to speed up or reduce costs on their data ingestion
> and
> > > ETL (using Spark/Flink), but want to move the derived data back into a
> > data
> > > warehouse or an operational database for serving.
> > >
> > > What do you all think?
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>


-- 
*Léo Biscassi*
Blog - https://leobiscassi.com

   -


Re: Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-07 Thread Vinoth Chandar
Pulled in another reviewer as well. Left a comment. We can move the
discussion to the PR?

Thanks for the useful contribution!

On Thu, Apr 6, 2023 at 12:34 AM 孔维 <18701146...@163.com> wrote:

> Hi, vinoth,
>
> I created a PR(https://github.com/apache/hudi/pull/8376) for this
> feature, could you help review it?
>
>
> BR,
> Kong
>
>
>
>
> At 2023-04-05 00:19:20, "Vinoth Chandar"  wrote:
> >Look forward to this! could really help backfill/rebootstrap scenarios.
> >
> >On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar  wrote:
> >
> >> Thinking out loud.
> >>
> >> 1. For insert operations, it should not matter anyway.
> >> 2. For upsert etc, the preCombine would handle the ordering problems.
> >>
> >> Is that what you are saying? I feel we don't want to leak any Kafka
> >> specific logic or force use of special payloads etc. thoughts?
> >>
> >> I assigned the jira to you and also made you a contributor. So in future,
> >> you can self-assign.
> >>
> >> On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:
> >>
> >>> Hi,
> >>>
> >>>
> >>> Yea, we can create multiple spark input partitions per Kafka partition.
> >>>
> >>>
> >>> I think the write operations can handle the potentially out-of-order
> >>> events, because before writing we need to preCombine the incoming events
> >>> using source-ordering-field and we also need to combineAndGetUpdateValue
> >>> with records on storage. From a business perspective, we use the combine
> >>> logic to keep our data correct. And hudi does not require any guarantees
> >>> about the ordering of kafka events.
> >>>
> >>>
> >>> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
> >>> could you help assign the JIRA to me?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
> >>> >Hi,
> >>> >
> >>> >Does your implementation read out offset ranges from Kafka partitions?
> >>> >which means - we can create multiple spark input partitions per Kafka
> >>> >partitions?
> >>> >if so, +1 for overall goals here.
> >>> >
> >>> >How does this affect ordering? Can you think about how/if Hudi write
> >>> >operations can handle potentially out-of-order events being read out?
> >>> >It feels like we can add a JIRA for this anyway.
> >>> >
> >>> >
> >>> >
> >>> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
> >>> >
> >>> >> Hi team, for the kafka source, when pulling data from kafka, the
> >>> default
> >>> >> parallelism is the number of kafka partitions.
> >>> >> There are cases:
> >>> >>
> >>> >> Pulling large amount of data from kafka (eg. maxEvents=1), but
> >>> the
> >>> >> # of kafka partition is not enough, the procedure of the pulling will
> >>> cost
> >>> >> too much of time, even worse cause the executor OOM
> >>> >> There is huge data skew between kafka partitions, the procedure of the
> >>> >> pulling will be blocked by the slowest partition
> >>> >>
> >>> >> to solve those cases, I want to add a parameter
> >>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the
> >>> maxEvents in
> >>> >> one kafka batch; the default Long.MAX_VALUE means the feature is not turned on.
> >>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents  this configuration will
> >>> >> take effect after the hoodie.deltastreamer.kafka.source.maxEvents
> >>> config.
> >>> >>
> >>> >>
> >>> >> Here is my POC of the improvement:
> >>> >> max executor core is 128.
> >>> >> not turn the feature on
> >>> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
> >>> >>
> >>> >>
> >>> >> turn on the feature
> >>> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
> >>> >>
> >>> >>
> >>> >> after turning on the feature, the timing of Tagging dropped from 4.4 mins
> >>> >> to 1.1 mins, and it can be even faster given more cores.
> >>> >>
> >>> >> How do you think? can I file a jira issue for this?
> >>>
> >>
>
>


Re:Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-06 Thread 孔维
Hi, vinoth,


I created a PR(https://github.com/apache/hudi/pull/8376) for this feature, 
could you help review it?




BR,
Kong








At 2023-04-05 00:19:20, "Vinoth Chandar"  wrote:
>Look forward to this! could really help backfill/rebootstrap scenarios.
>
>On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar  wrote:
>
>> Thinking out loud.
>>
>> 1. For insert operations, it should not matter anyway.
>> 2. For upsert etc, the preCombine would handle the ordering problems.
>>
>> Is that what you are saying? I feel we don't want to leak any Kafka
>> specific logic or force use of special payloads etc. thoughts?
>>
>> I assigned the jira to you and also made you a contributor. So in future,
>> you can self-assign.
>>
>> On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> Yea, we can create multiple spark input partitions per Kafka partition.
>>>
>>>
>>> I think the write operations can handle the potentially out-of-order
>>> events, because before writing we need to preCombine the incoming events
>>> using source-ordering-field and we also need to combineAndGetUpdateValue
>>> with records on storage. From a business perspective, we use the combine
>>> logic to keep our data correct. And hudi does not require any guarantees
>>> about the ordering of kafka events.
>>>
>>>
>>> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
>>> could you help assign the JIRA to me?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
>>> >Hi,
>>> >
>>> >Does your implementation read out offset ranges from Kafka partitions?
>>> >which means - we can create multiple spark input partitions per Kafka
>>> >partitions?
>>> >if so, +1 for overall goals here.
>>> >
>>> >How does this affect ordering? Can you think about how/if Hudi write
>>> >operations can handle potentially out-of-order events being read out?
>>> >It feels like we can add a JIRA for this anyway.
>>> >
>>> >
>>> >
>>> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
>>> >
>>> >> Hi team, for the kafka source, when pulling data from kafka, the
>>> default
>>> >> parallelism is the number of kafka partitions.
>>> >> There are cases:
>>> >>
>>> >> Pulling large amount of data from kafka (eg. maxEvents=1), but
>>> the
>>> >> # of kafka partition is not enough, the procedure of the pulling will
>>> cost
>>> >> too much of time, even worse cause the executor OOM
>>> >> There is huge data skew between kafka partitions, the procedure of the
>>> >> pulling will be blocked by the slowest partition
>>> >>
>>> >> to solve those cases, I want to add a parameter
>>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the
>>> maxEvents in
>>> >> one kafka batch; the default Long.MAX_VALUE means the feature is not turned on.
>>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents  this configuration will
>>> >> take effect after the hoodie.deltastreamer.kafka.source.maxEvents
>>> config.
>>> >>
>>> >>
>>> >> Here is my POC of the improvement:
>>> >> max executor core is 128.
>>> >> not turn the feature on
>>> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>>> >>
>>> >>
>>> >> turn on the feature
>>> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>>> >>
>>> >>
>>> >> after turning on the feature, the timing of Tagging dropped from 4.4 mins
>>> >> to 1.1 mins, and it can be even faster given more cores.
>>> >>
>>> >> How do you think? can I file a jira issue for this?
>>>
>>
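The proposal above amounts to slicing each Kafka partition's offset range into fixed-size chunks, so one skewed or oversized partition can be read by several Spark input partitions in parallel. A minimal sketch of just that splitting logic in plain Python (`split_offset_ranges` is an illustrative name, not the code in the linked PR):

```python
def split_offset_ranges(ranges, max_events_per_split):
    """Split each (partition, from_offset, until_offset) range into chunks
    of at most max_events_per_split events, so a single Kafka partition can
    back multiple Spark input partitions."""
    splits = []
    for partition, start, end in ranges:
        offset = start
        while offset < end:
            upper = min(offset + max_events_per_split, end)
            splits.append((partition, offset, upper))
            offset = upper
    return splits
```

With a per-batch cap of 200, a partition holding 500 pending events yields three splits instead of one, which is what removes the skew and OOM pressure described in the thread.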


Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Look forward to this! could really help backfill/rebootstrap scenarios.

On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar  wrote:

> Thinking out loud.
>
> 1. For insert operations, it should not matter anyway.
> 2. For upsert etc, the preCombine would handle the ordering problems.
>
> Is that what you are saying? I feel we don't want to leak any Kafka
> specific logic or force use of special payloads etc. thoughts?
>
> I assigned the jira to you and also made you a contributor. So in future,
> you can self-assign.
>
> On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:
>
>> Hi,
>>
>>
>> Yea, we can create multiple spark input partitions per Kafka partition.
>>
>>
>> I think the write operations can handle the potentially out-of-order
>> events, because before writing we need to preCombine the incoming events
>> using source-ordering-field and we also need to combineAndGetUpdateValue
>> with records on storage. From a business perspective, we use the combine
>> logic to keep our data correct. And hudi does not require any guarantees
>> about the ordering of kafka events.
>>
>>
>> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
>> could you help assign the JIRA to me?
>>
>>
>>
>>
>>
>>
>>
>> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
>> >Hi,
>> >
>> >Does your implementation read out offset ranges from Kafka partitions?
>> >which means - we can create multiple spark input partitions per Kafka
>> >partitions?
>> >if so, +1 for overall goals here.
>> >
>> >How does this affect ordering? Can you think about how/if Hudi write
>> >operations can handle potentially out-of-order events being read out?
>> >It feels like we can add a JIRA for this anyway.
>> >
>> >
>> >
>> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
>> >
>> >> Hi team, for the kafka source, when pulling data from kafka, the
>> >> default parallelism is the number of kafka partitions.
>> >> There are cases:
>> >>
>> >> Pulling a large amount of data from kafka (eg. maxEvents=1), but the
>> >> # of kafka partitions is not enough; the pull can take too much time,
>> >> or even cause an executor OOM
>> >> There is huge data skew between kafka partitions; the pull will be
>> >> blocked by the slowest partition
>> >>
>> >> To solve those cases, I want to add a parameter,
>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents, to control the
>> >> maxEvents in one kafka batch; the default Long.MAX_VALUE means the
>> >> feature is not turned on. This configuration takes effect after the
>> >> hoodie.deltastreamer.kafka.source.maxEvents config.
>> >>
>> >>
>> >> Here is my POC of the improvement:
>> >> max executor cores: 128.
>> >> feature not turned on
>> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>> >>
>> >>
>> >> feature turned on
>> >> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>> >>
>> >>
>> >> After turning on the feature, the time for Tagging dropped from 4.4 mins
>> >> to 1.1 mins, and it can be even faster if given more cores.
>> >>
>> >> What do you think? Can I file a jira issue for this?
>>
>


Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Thinking out loud.

1. For insert operations, it should not matter anyway.
2. For upsert etc, the preCombine would handle the ordering problems.

Is that what you are saying? I feel we don't want to leak any Kafka
specific logic or force use of special payloads etc. thoughts?

I assigned the jira to you and also made you a contributor. So in future,
you can self-assign.

On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:

> Hi,
>
>
> Yea, we can create multiple spark input partitions per Kafka partition.
>
>
> I think the write operations can handle the potentially out-of-order
> events, because before writing we need to preCombine the incoming events
> using source-ordering-field and we also need to combineAndGetUpdateValue
> with records on storage. From a business perspective, we use the combine
> logic to keep our data correct. And hudi does not require any guarantees
> about the ordering of kafka events.
>
>
> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
> could you help assign the JIRA to me?
>
>
>
>
>
>
>
> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
> >Hi,
> >
> >Does your implementation read out offset ranges from Kafka partitions?
> >which means - we can create multiple spark input partitions per Kafka
> >partitions?
> >if so, +1 for overall goals here.
> >
> >How does this affect ordering? Can you think about how/if Hudi write
> >operations can handle potentially out-of-order events being read out?
> >It feels like we can add a JIRA for this anyway.
> >
> >
> >
> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
> >
> >> Hi team, for the kafka source, when pulling data from kafka, the default
> >> parallelism is the number of kafka partitions.
> >> There are cases:
> >>
> >> Pulling a large amount of data from kafka (eg. maxEvents=1), but the
> >> # of kafka partitions is not enough; the pull can take too much time,
> >> or even cause an executor OOM
> >> There is huge data skew between kafka partitions; the pull will be
> >> blocked by the slowest partition
> >>
> >> To solve those cases, I want to add a parameter,
> >> hoodie.deltastreamer.kafka.per.batch.maxEvents, to control the maxEvents
> >> in one kafka batch; the default Long.MAX_VALUE means the feature is not
> >> turned on. This configuration takes effect after the
> >> hoodie.deltastreamer.kafka.source.maxEvents config.
> >>
> >>
> >> Here is my POC of the improvement:
> >> max executor cores: 128.
> >> feature not turned on
> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
> >>
> >>
> >> feature turned on
> >> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
> >>
> >>
> >> After turning on the feature, the time for Tagging dropped from 4.4 mins to
> >> 1.1 mins, and it can be even faster if given more cores.
> >>
> >> What do you think? Can I file a jira issue for this?
>
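The preCombine semantics discussed above can be illustrated with a toy sketch (plain Python, not Hudi's actual HoodieRecordPayload API; record shape and field names are made up for illustration): for each record key, keep the incoming event with the highest source-ordering-field value, so out-of-order Kafka reads still resolve to the latest state.

```python
def pre_combine(incoming, ordering_field):
    """Collapse possibly out-of-order events to one record per key,
    keeping the event with the highest ordering-field value."""
    latest = {}
    for rec in incoming:
        key = rec["key"]
        if key not in latest or rec[ordering_field] > latest[key][ordering_field]:
            latest[key] = rec
    return latest

events = [
    {"key": "a", "ts": 2, "val": "new"},
    {"key": "a", "ts": 1, "val": "old"},  # arrived later but is older
    {"key": "b", "ts": 5, "val": "x"},
]
combined = pre_combine(events, "ts")
print(combined["a"]["val"])  # "new" wins despite arrival order
```

This is why split Kafka reads need no ordering guarantee for upserts: the combine step is order-insensitive by construction.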


Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Pratyaksh Sharma
Hi Vinoth,

I am aligned with the first reason that you mentioned. Better to have a
separate tool to take care of this.

On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar 
wrote:

> +1
>
> I was thinking that we add a new utility and NOT extend DeltaStreamer by
> adding a Sink interface, for the following reasons
>
> - It will make it look like a generic Source => Sink ETL tool, which is
> actually not our intention to support on Hudi. There are plenty of good
> tools for that out there.
> - the config management can get a bit hard to understand, since we overload
> ingest and reverse ETL into a single tool. So break it off at the use-case
> level?
>
> Thoughts?
>
> David:  PMC does not have control over that. Please see unsubscribe
> instructions here. https://hudi.apache.org/community/get-involved
> Love to keep this thread about reverse streamer discussion. So kindly fork
> another thread if you want to discuss unsubscribing.
>
> On Fri, Mar 31, 2023 at 1:47 AM Davidiam  wrote:
>
> > Hello Vinoth,
> >
> > Can you please unsubscribe me?  I have been trying to unsubscribe for
> > months without success.
> >
> > Kind Regards,
> > David
> >
> > Sent from Outlook for Android
> > 
> > From: Vinoth Chandar 
> > Sent: Friday, March 31, 2023 5:09:52 AM
> > To: dev 
> > Subject: [DISCUSS] Hudi Reverse Streamer
> >
> > Hi all,
> >
> > Any interest in building a reverse streaming tool, that does the reverse
> of
> > what the DeltaStreamer tool does? It will read Hudi table incrementally
> > (only source) and write out the data to a variety of sinks - Kafka, JDBC
> > Databases, DFS.
> >
> > This has come up many times with data warehouse users. Often times, they
> > want to use Hudi to speed up or reduce costs on their data ingestion and
> > ETL (using Spark/Flink), but want to move the derived data back into a
> data
> > warehouse or an operational database for serving.
> >
> > What do you all think?
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Vinoth Chandar
+1

I was thinking that we add a new utility and NOT extend DeltaStreamer by
adding a Sink interface, for the following reasons

- It will make it look like a generic Source => Sink ETL tool, which is
actually not our intention to support on Hudi. There are plenty of good
tools for that out there.
- the config management can get a bit hard to understand, since we overload
ingest and reverse ETL into a single tool. So break it off at the use-case
level?

Thoughts?

David:  PMC does not have control over that. Please see unsubscribe
instructions here. https://hudi.apache.org/community/get-involved
Love to keep this thread about reverse streamer discussion. So kindly fork
another thread if you want to discuss unsubscribing.

On Fri, Mar 31, 2023 at 1:47 AM Davidiam  wrote:

> Hello Vinoth,
>
> Can you please unsubscribe me?  I have been trying to unsubscribe for
> months without success.
>
> Kind Regards,
> David
>
> Sent from Outlook for Android
> 
> From: Vinoth Chandar 
> Sent: Friday, March 31, 2023 5:09:52 AM
> To: dev 
> Subject: [DISCUSS] Hudi Reverse Streamer
>
> Hi all,
>
> Any interest in building a reverse streaming tool, that does the reverse of
> what the DeltaStreamer tool does? It will read Hudi table incrementally
> (only source) and write out the data to a variety of sinks - Kafka, JDBC
> Databases, DFS.
>
> This has come up many times with data warehouse users. Often times, they
> want to use Hudi to speed up or reduce costs on their data ingestion and
> ETL (using Spark/Flink), but want to move the derived data back into a data
> warehouse or an operational database for serving.
>
> What do you all think?
>
> Thanks
> Vinoth
>


Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread Vinoth Chandar
Hi,

Does your implementation read out offset ranges from Kafka partitions?
which means - we can create multiple spark input partitions per Kafka
partitions?
if so, +1 for overall goals here.

How does this affect ordering? Can you think about how/if Hudi write
operations can handle potentially out-of-order events being read out?
It feels like we can add a JIRA for this anyway.



On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:

> Hi team, for the kafka source, when pulling data from kafka, the default
> parallelism is the number of kafka partitions.
> There are cases:
>
> Pulling a large amount of data from kafka (eg. maxEvents=1), but the
> # of kafka partitions is not enough; the pull can take too much time, or
> even cause an executor OOM
> There is huge data skew between kafka partitions; the pull will be
> blocked by the slowest partition
>
> To solve those cases, I want to add a parameter,
> hoodie.deltastreamer.kafka.per.batch.maxEvents, to control the maxEvents in
> one kafka batch; the default Long.MAX_VALUE means the feature is not
> turned on. This configuration takes effect after the
> hoodie.deltastreamer.kafka.source.maxEvents config.
>
>
> Here is my POC of the improvement:
> max executor cores: 128.
> feature not turned on
> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>
>
> feature turned on (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>
>
> After turning on the feature, the time for Tagging dropped from 4.4 mins to
> 1.1 mins, and it can be even faster if given more cores.
>
> What do you think? Can I file a jira issue for this?
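The splitting proposed in this thread can be sketched as follows (an illustrative Python sketch; the function name and tuple shapes are assumptions, not Hudi's actual KafkaOffsetGen code): each Kafka partition's [from, until) offset range is cut into chunks of at most the per-batch maxEvents, yielding multiple Spark input partitions per Kafka partition, which addresses both the parallelism cap and the skew case.

```python
def split_offset_ranges(ranges, max_events_per_split):
    """ranges: list of (topic_partition, from_offset, until_offset).
    Returns splits carrying at most max_events_per_split events each."""
    splits = []
    for tp, start, end in ranges:
        lo = start
        while lo < end:
            hi = min(lo + max_events_per_split, end)
            splits.append((tp, lo, hi))
            lo = hi
    return splits

# One skewed partition of 1,000,000 events becomes 5 parallel splits.
splits = split_offset_ranges([("topic-0", 0, 1_000_000)], 200_000)
print(len(splits))  # 5
```

A slow or skewed partition no longer gates the whole pull, since its range is consumed by several tasks concurrently.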


Re: Re: [DISCUSS] Hudi data TTL

2023-03-31 Thread Sivabalan
left some comments. thanks!

On Fri, 31 Mar 2023 at 00:59, 符其军 <18889897...@163.com> wrote:

> Hi community, we have submitted RFC-65 Partition TTL Management in this
> pr: https://github.com/apache/hudi/pull/8062. Let me know if you
> have any questions or concerns with this proposal.
> At 2022-10-21 14:42:10, "stream2000" <18889897...@163.com> wrote:
> >Yes we can have a talk about it. We will try our best to write the RFC,
> maybe publish it in a few weeks.
> >
> >
> >> On Oct 21, 2022, at 10:18, JerryYue <272614...@qq.com.INVALID> wrote:
> >>
> >> Looking forward to the RFC
> >> It's a good idea; we also need hudi data TTL in some cases
> >> Do we have any plan or time to do this? We also had some simple designs
> >> to implement it
> >> Maybe we can have a talk about it
> >>
> >> On 2022/10/20 at 9:47 AM, "Bingeng Huang" qq@hudi.apache.org (on behalf of hbgstc...@gmail.com) wrote:
> >>
> >>Looking forward to the RFC.
> >>We can propose an RFC about supporting TTL config using non-partition
> >> fields afterwards
> >>
> >>
> >>
> >>sagar sumit  于2022年10月19日周三 14:42写道:
> >>
> >>> +1 Very nice idea. Looking forward to the RFC!
> >>>
> >>> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu <
> xu.shiyan.raym...@gmail.com>
> >>> wrote:
> >>>
>  great proposal. Partition TTL is a good starting point. we can extend
> it
> >>> to
>  other TTL strategies like column-based, and make it customizable and
>  pluggable. Looking forward to the RFC!
> 
>  On Wed, Oct 19, 2022 at 11:40 AM Jian Feng
>  
>  wrote:
> 
> > Good idea,
> > this is definitely worth an RFC
> > btw should it only depend on Hudi's partition? I feel it should be a
> > more common feature, since sometimes customers' data cannot be updated
> > across partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com>
> >>> wrote:
> >
> >> Hi all, we have implemented partition-based data TTL management, with
> >> which we can manage TTL for hudi partitions by size, expiry time, and
> >> sub-partition count. When a partition is detected as outdated, we use
> >> the delete partition interface to delete it, which will generate a
> >> replace commit to mark the data as deleted. The real deletion will then
> >> be done by the clean service.
> >>
> >>
> >> If the community is interested in this idea, maybe we can propose an
> >> RFC to discuss it in detail.
> >>
> >>
> >>> On Oct 19, 2022, at 10:06, Vinoth Chandar 
> >>> wrote:
> >>>
> >>> +1 love to discuss this on a RFC proposal.
> >>>
> >>> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> >> wrote:
> >>>
>  That's a very interesting idea.
> 
>  Do you want to take a stab at writing a full proposal (in the form
>  of
> >> RFC)
>  for it?
> 
>  On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <
> >>> hbgstc...@gmail.com
> >
>  wrote:
> 
> > Hi all,
> >
> > Do we have a plan to integrate data TTL into Hudi, so we don't have to
> > schedule an offline spark job to delete outdated data; just set a TTL
> > config, and then the writer or some offline service will delete old data
> > as expected.
> >
> 
> >>
> >>
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
> 
> 
>  --
>  Best,
>  Shiyan
> 
> >>>
> >>
>


-- 
Regards,
-Sivabalan
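The partition TTL policy discussed in this thread (expire partitions by age, then delete them via the delete-partition interface) can be sketched roughly as below. This is an illustrative sketch only; the function name and the `partition -> last-modified` input shape are assumptions, not RFC-65's actual API.

```python
import time

def expired_partitions(partitions, ttl_seconds, now=None):
    """partitions: dict of partition_path -> last-modified epoch seconds.
    Returns the partitions whose age exceeds ttl_seconds; the caller would
    pass these to a delete-partition action (a replace commit), and the
    clean service performs the physical deletion later."""
    now = time.time() if now is None else now
    return [p for p, mtime in partitions.items() if now - mtime > ttl_seconds]

parts = {"dt=2023-01-01": 0, "dt=2023-03-30": 990}
print(expired_partitions(parts, ttl_seconds=100, now=1000))  # ['dt=2023-01-01']
```

Size-based and sub-partition-count policies mentioned in the thread would slot in as additional predicates over the same partition metadata.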


Re:Re: Re: Re: DISCUSS

2023-03-31 Thread 吕虎
Hi Vinoth, I'm glad to receive your letter. Here are some of my thoughts.
At 2023-03-31 10:17:52, "Vinoth Chandar"  wrote:
>I think we can focus more on validating the hash index + bloom filter vs
>consistent hash index more first. Have you looked at RFC-08, which is a
>kind of hash index as well, except it stores the key => file group mapping
>externally.

  The idea of the RFC-08 index (rowKey -> partitionPath, fileID) is very similar 
to the HBase index, but the index is implemented internally in Hudi, so there is 
no need to worry about consistency issues. The index can be written to HFiles 
quickly, but on read it is necessary to read from multiple HFiles, so the 
performance of reading the index can be a problem. Therefore, the RFC proposer 
naturally thought of using hash buckets to partially solve this problem. HBase's 
solution to multiple HFile files is to add a max/min index and a Bloom filter 
index. In Hudi, you can directly create a max/min index and a Bloom filter index 
for file groups, eliminating the need to store the index in HFiles; another 
solution is to do a compaction on the HFile files, but that also adds a burden to 
Hudi. We need to consider the performance of reading HFiles carefully when using 
RFC-08.

Therefore, I believe that hash partition + bloom filter is still the simplest 
and most effective solution for predictable data growth in a small range.



At 2023-03-31 10:17:52, "Vinoth Chandar"  wrote:
>I think we can focus more on validating the hash index + bloom filter vs
>consistent hash index more first. Have you looked at RFC-08, which is a
>kind of hash index as well, except it stores the key => file group mapping
>externally.
>
>On Fri, Mar 24, 2023 at 2:14 AM 吕虎  wrote:
>
>> Hi Vinoth, I am very happy to receive your reply. Here are some of my
>> thoughts.
>>
>> At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
>> >>but when it is used for data expansion, it still involves the need to
>> >redistribute the data records of some data files, thus affecting the
>> >performance.
>> >but expansion of the consistent hash index is an optional operation right?
>>
>> >Sorry, not still fully understanding the differences here,
>> I'm sorry I didn't make myself clear. The expansion I mentioned last
>> time refers to the increase of data records in a hudi table.
>> The difference between the consistent hash index and hash partitioning
>> with a Bloom filter index is how each deals with data increase:
>> For the consistent hash index, files are split. Splitting files affects
>> performance, but the approach keeps working indefinitely. So the
>> consistent hash index is suitable for scenarios where data growth cannot
>> be estimated or the data will grow very large.
>> For hash partitions with a Bloom filter index, new files are created.
>> Adding new files does not affect performance, but if there are too many
>> files, the probability of false positives in the Bloom filters will
>> increase. So hash partitions with a Bloom filter index are suitable for
>> scenarios where data growth can be estimated within a relatively small range.
>>
>>
>> >>Because the hash partition field values under the parquet file in a
>> >columnar storage format are all equal, the added column field hardly
>> >occupies storage space after compression.
>> >Any new meta field added adds other overhead in terms evolving the schema,
>> >so forth. are you suggesting this is not possible to do without a new meta
>> >field?
>>
>> An implementation with no new meta field is more elegant, but for me,
>> who is not yet familiar with the Hudi source code, it is somewhat
>> difficult to implement; it should not be a problem for experts. If we
>> implement it without adding new meta fields, I hope I can participate in
>> some simple development, and I can also learn how the experts do it.
>>
>>
>> >On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
>> >
>> >> Hello,
>> >>  I feel very honored that you are interested in my views.
>> >>
>> >>  Here are some of my thoughts marked with blue font.
>> >>
>> >> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>> >>
>> >> >Thanks for the proposal! Some first set of questions here.
>> >> >
>> >> >>You need to pre-select the number of buckets and use the hash
>> function to
>> >> >determine which bucket a record belongs to.
>> >> >>when building the table according to the estimated amount of data,
>> and it
>> >> >cannot be changed after building the table
>> >> >>When the amount of data in a hash partition is too large, the data in
>> >> that
>> >> >partition will be split into multiple files in the way of Bloom index.
>> >> >
>> >> >All these issues are related to bucket sizing could be alleviated by
>> the
>> >> >consistent hashing index in 0.13? Have you checked it out? Love to hear
>> >> >your thoughts on this.
>> >>
>> >> Hash partitioning is applicable to data tables that cannot give the
>> exact
>> >> capacity of data, but can estimate 
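The claim above, that Bloom-filter false positives grow as the number of files per partition grows, can be made concrete with a quick calculation (illustrative; the 1% per-file rate is an assumption, not a measured Hudi figure): if each file's Bloom filter has false-positive rate p, the chance that at least one of n filters fires spuriously for an absent key is 1 - (1 - p)^n.

```python
def combined_fp_rate(p, n):
    """Probability that at least one of n Bloom filters (each with
    false-positive rate p) reports a spurious hit for an absent key."""
    return 1 - (1 - p) ** n

for n in (1, 10, 100):
    print(n, round(combined_fp_rate(0.01, n), 3))
# roughly: 1 -> 0.01, 10 -> 0.096, 100 -> 0.634
```

This is why the proposal caps well-estimated growth: at a hundred files per hash partition, most lookups for absent keys would open at least one file needlessly.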

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-31 Thread Davidiam
Hello Vinoth,

Can you please unsubscribe me?  I have been trying to unsubscribe for months 
without success.

Kind Regards,
David

Sent from Outlook for Android

From: Vinoth Chandar 
Sent: Friday, March 31, 2023 5:09:52 AM
To: dev 
Subject: [DISCUSS] Hudi Reverse Streamer

Hi all,

Any interest in building a reverse streaming tool, that does the reverse of
what the DeltaStreamer tool does? It will read Hudi table incrementally
(only source) and write out the data to a variety of sinks - Kafka, JDBC
Databases, DFS.

This has come up many times with data warehouse users. Often times, they
want to use Hudi to speed up or reduce costs on their data ingestion and
ETL (using Spark/Flink), but want to move the derived data back into a data
warehouse or an operational database for serving.

What do you all think?

Thanks
Vinoth


Re: [DISCUSS] Hudi Reverse Streamer

2023-03-31 Thread Pratyaksh Sharma
+1 to this.

I can help drive some of this work.

On Fri, Mar 31, 2023 at 10:09 AM Prashant Wason 
wrote:

> Could be useful. Also, may be useful for backup / replication scenario
> (keeping a copy of data in alternate/cloud DC).
>
> HoodieDeltaStreamer already has the concept of "sources". This can be
> implemented as a "sink" concept.
>
> On Thu, Mar 30, 2023 at 8:12 PM Vinoth Chandar  wrote:
>
> > Essentially.
> >
> > Old architecture :(operational database) ==> some tool ==> (data
> > warehouse raw data) ==> SQL ETL ==> (data warehouse derived data)
> >
> > New architecture : (operational database) ==> Hudi delta Streamer ==>
> (Hudi
> > raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi
> Reverse
> > Streamer ==> (Data Warehouse/Kafka/Operational Database)
> >
> > On Thu, Mar 30, 2023 at 8:09 PM Vinoth Chandar 
> wrote:
> >
> > > Hi all,
> > >
> > > Any interest in building a reverse streaming tool, that does the
> reverse
> > > of what the DeltaStreamer tool does? It will read Hudi table
> > incrementally
> > > (only source) and write out the data to a variety of sinks - Kafka,
> JDBC
> > > Databases, DFS.
> > >
> > > This has come up many times with data warehouse users. Often times,
> they
> > > want to use Hudi to speed up or reduce costs on their data ingestion
> and
> > > ETL (using Spark/Flink), but want to move the derived data back into a
> > data
> > > warehouse or an operational database for serving.
> > >
> > > What do you all think?
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Prashant Wason
Could be useful. Also, may be useful for backup / replication scenario
(keeping a copy of data in alternate/cloud DC).

HoodieDeltaStreamer already has the concept of "sources". This can be
implemented as a "sink" concept.

On Thu, Mar 30, 2023 at 8:12 PM Vinoth Chandar  wrote:

> Essentially.
>
> Old architecture :(operational database) ==> some tool ==> (data
> warehouse raw data) ==> SQL ETL ==> (data warehouse derived data)
>
> New architecture : (operational database) ==> Hudi delta Streamer ==> (Hudi
> raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi Reverse
> Streamer ==> (Data Warehouse/Kafka/Operational Database)
>
> On Thu, Mar 30, 2023 at 8:09 PM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > Any interest in building a reverse streaming tool, that does the reverse
> > of what the DeltaStreamer tool does? It will read Hudi table
> incrementally
> > (only source) and write out the data to a variety of sinks - Kafka, JDBC
> > Databases, DFS.
> >
> > This has come up many times with data warehouse users. Often times, they
> > want to use Hudi to speed up or reduce costs on their data ingestion and
> > ETL (using Spark/Flink), but want to move the derived data back into a
> data
> > warehouse or an operational database for serving.
> >
> > What do you all think?
> >
> > Thanks
> > Vinoth
> >
>


Re: Re: Re: DISCUSS

2023-03-30 Thread Vinoth Chandar
I think we can focus more on validating the hash index + bloom filter vs
consistent hash index more first. Have you looked at RFC-08, which is a
kind of hash index as well, except it stores the key => file group mapping
externally.

On Fri, Mar 24, 2023 at 2:14 AM 吕虎  wrote:

> Hi Vinoth, I am very happy to receive your reply. Here are some of my
> thoughts.
>
> At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
> >>but when it is used for data expansion, it still involves the need to
> >redistribute the data records of some data files, thus affecting the
> >performance.
> >but expansion of the consistent hash index is an optional operation right?
>
> >Sorry, not still fully understanding the differences here,
> I'm sorry I didn't make myself clear. The expansion I mentioned last
> time refers to the increase of data records in a hudi table.
> The difference between the consistent hash index and hash partitioning
> with a Bloom filter index is how each deals with data increase:
> For the consistent hash index, files are split. Splitting files affects
> performance, but the approach keeps working indefinitely. So the
> consistent hash index is suitable for scenarios where data growth cannot
> be estimated or the data will grow very large.
> For hash partitions with a Bloom filter index, new files are created.
> Adding new files does not affect performance, but if there are too many
> files, the probability of false positives in the Bloom filters will
> increase. So hash partitions with a Bloom filter index are suitable for
> scenarios where data growth can be estimated within a relatively small range.
>
>
> >>Because the hash partition field values under the parquet file in a
> >columnar storage format are all equal, the added column field hardly
> >occupies storage space after compression.
> >Any new meta field added adds other overhead in terms evolving the schema,
> >so forth. are you suggesting this is not possible to do without a new meta
> >field?
>
> An implementation with no new meta field is more elegant, but for me,
> who is not yet familiar with the Hudi source code, it is somewhat
> difficult to implement; it should not be a problem for experts. If we
> implement it without adding new meta fields, I hope I can participate in
> some simple development, and I can also learn how the experts do it.
>
>
> >On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
> >
> >> Hello,
> >>  I feel very honored that you are interested in my views.
> >>
> >>  Here are some of my thoughts marked with blue font.
> >>
> >> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
> >>
> >> >Thanks for the proposal! Some first set of questions here.
> >> >
> >> >>You need to pre-select the number of buckets and use the hash
> function to
> >> >determine which bucket a record belongs to.
> >> >>when building the table according to the estimated amount of data,
> and it
> >> >cannot be changed after building the table
> >> >>When the amount of data in a hash partition is too large, the data in
> >> that
> >> >partition will be split into multiple files in the way of Bloom index.
> >> >
> >> >All these issues are related to bucket sizing could be alleviated by
> the
> >> >consistent hashing index in 0.13? Have you checked it out? Love to hear
> >> >your thoughts on this.
> >>
> >> Hash partitioning is applicable to data tables that cannot give the
> exact
> >> capacity of data, but can estimate a rough range. For example, if a
> company
> >> currently has 300 million customers in the United States, the company
> will
> >> have 7 billion customers in the world at most. In this scenario, using
> hash
> >> partitioning to cope with data growth within the known range by directly
> >> adding files and establishing  bloom filters can still guarantee
> >> performance.
> >> The consistent hash bucket index is also very valuable, but when it is
> >> used for data expansion, it still involves the need to redistribute the
> >> data records of some data files, thus affecting the performance. When
> it is
> >> completely impossible to estimate the range of data capacity, it is very
> >> suitable to use consistent hashing.
> >> >> you can directly search the data under the partition, which greatly
> >> >reduces the scope of the Bloom filter to search for files and reduces
> the
> >> >false positive of the Bloom filter.
> >> >the bloom index is already partition aware and unless you use the
> global
> >> >version can achieve this. Am I missing something?
> >> >
> >> >Another key thing is - if we can avoid adding a new meta field, that
> would
> >> >be great. Is it possible to implement this similar to bucket index,
> based
> >> >on just table properties?
> >> Add a hash partition field in the table to implement the hash partition
> >> function, which can well reuse the existing partition function, and
> >> involves very few code changes. Because the hash partition field values
> >> under the parquet file in a columnar storage 

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Vinoth Chandar
Essentially.

Old architecture :(operational database) ==> some tool ==> (data
warehouse raw data) ==> SQL ETL ==> (data warehouse derived data)

New architecture : (operational database) ==> Hudi delta Streamer ==> (Hudi
raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi Reverse
Streamer ==> (Data Warehouse/Kafka/Operational Database)

On Thu, Mar 30, 2023 at 8:09 PM Vinoth Chandar  wrote:

> Hi all,
>
> Any interest in building a reverse streaming tool, that does the reverse
> of what the DeltaStreamer tool does? It will read Hudi table incrementally
> (only source) and write out the data to a variety of sinks - Kafka, JDBC
> Databases, DFS.
>
> This has come up many times with data warehouse users. Often times, they
> want to use Hudi to speed up or reduce costs on their data ingestion and
> ETL (using Spark/Flink), but want to move the derived data back into a data
> warehouse or an operational database for serving.
>
> What do you all think?
>
> Thanks
> Vinoth
>
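At its core, the reverse streamer proposed in this thread would repeat one round: incrementally pull commits past a saved checkpoint, push the batch to a sink (Kafka, JDBC, DFS), and persist the new checkpoint only after the write succeeds. A rough, hypothetical sketch of that loop follows; none of these names are actual Hudi or proposal APIs, and the callables stand in for Spark/Flink reads and sink writers.

```python
def reverse_stream(read_incremental, write_sink, load_ckpt, save_ckpt):
    """One sync round: pull commits after the checkpoint, push to the sink,
    then persist the new checkpoint only after the write succeeds (so a
    failed write is retried from the same checkpoint next round)."""
    ckpt = load_ckpt()
    batch, new_ckpt = read_incremental(since=ckpt)
    if batch:
        write_sink(batch)
        save_ckpt(new_ckpt)
    return new_ckpt

# Toy wiring: commits after instant 10 flow to an in-memory sink.
state = {"ckpt": 10}
commits = [(5, "a"), (12, "b"), (15, "c")]
sink = []
new = reverse_stream(
    read_incremental=lambda since: ([v for t, v in commits if t > since],
                                    max(t for t, _ in commits)),
    write_sink=sink.extend,
    load_ckpt=lambda: state["ckpt"],
    save_ckpt=lambda c: state.update(ckpt=c),
)
print(sink, state["ckpt"])  # ['b', 'c'] 15
```

Writing the checkpoint after the sink write gives at-least-once delivery, matching the incremental-pull semantics the DeltaStreamer already relies on in the forward direction.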


Re:Re: Re: DISCUSS

2023-03-24 Thread 吕虎
Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts.

At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
>>but when it is used for data expansion, it still involves the need to
>redistribute the data records of some data files, thus affecting the
>performance.
>but expansion of the consistent hash index is an optional operation right?

>Sorry, not still fully understanding the differences here,
I'm sorry I didn't make myself clear. The expansion I mentioned last time 
refers to the increase of data records in a hudi table.
The difference between the consistent hash index and hash partitioning with a 
Bloom filter index is how each deals with data increase:
For the consistent hash index, files are split. Splitting files affects 
performance, but the approach keeps working indefinitely. So the consistent 
hash index is suitable for scenarios where data growth cannot be estimated or 
the data will grow very large.
For hash partitions with a Bloom filter index, new files are created. Adding 
new files does not affect performance, but if there are too many files, the 
probability of false positives in the Bloom filters will increase. So hash 
partitions with a Bloom filter index are suitable for scenarios where data 
growth can be estimated within a relatively small range.
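The file-splitting behavior contrasted above can be illustrated with a toy consistent-hash sketch (a simplified ring with sorted bucket upper bounds; this is not Hudi's actual consistent hashing implementation): splitting one overloaded bucket redistributes only the keys that hashed into that bucket, leaving every other bucket's data untouched.

```python
import hashlib

def bucket_of(key, boundaries):
    """boundaries: sorted upper bounds on a 32-bit hash ring, ending at
    2**32. Returns the index of the bucket containing hash(key)."""
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    for i, upper in enumerate(boundaries):
        if h < upper:
            return i
    raise AssertionError("boundaries must end at 2**32")

old = [2 ** 31, 2 ** 32]           # two buckets
new = [2 ** 30, 2 ** 31, 2 ** 32]  # bucket 0 split into two
keys = [f"key-{i}" for i in range(200)]
# Keys that lived in bucket 1 still sit in the same ring range (now index 2):
untouched = all(bucket_of(k, new) == 2 for k in keys if bucket_of(k, old) == 1)
print(untouched)  # True: splitting bucket 0 never moves bucket-1 data
```

This localized rewrite is the cost the consistent hash index pays on growth, versus the add-a-file approach that pays instead in Bloom-filter false positives.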


>>Because the hash partition field values under the parquet file in a
>columnar storage format are all equal, the added column field hardly
>occupies storage space after compression.
>Any new meta field added adds other overhead in terms evolving the schema,
>so forth. are you suggesting this is not possible to do without a new meta
>field?

An implementation with no new meta field is more elegant, but for me, who is not 
yet familiar with the Hudi source code, it is somewhat difficult to implement; it 
should not be a problem for experts. If we implement it without adding new meta 
fields, I hope I can participate in some simple development, and I can also 
learn how the experts do it.


>On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
>
>> Hello,
>>  I feel very honored that you are interested in my views.
>>
>>  Here are some of my thoughts marked with blue font.
>>
>> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>>
>> >Thanks for the proposal! Some first set of questions here.
>> >
>> >>You need to pre-select the number of buckets and use the hash function to
>> >determine which bucket a record belongs to.
>> >>when building the table according to the estimated amount of data, and it
>> >cannot be changed after building the table
>> >>When the amount of data in a hash partition is too large, the data in
>> that
>> >partition will be split into multiple files in the way of Bloom index.
>> >
>> >All these issues are related to bucket sizing could be alleviated by the
>> >consistent hashing index in 0.13? Have you checked it out? Love to hear
>> >your thoughts on this.
>>
>> Hash partitioning suits tables whose exact data volume cannot be known in
>> advance but whose rough range can be estimated. For example, if a company
>> currently has 300 million customers in the United States, it will have at
>> most 7 billion customers worldwide. In this scenario, hash partitioning can
>> cope with data growth within the known range by directly adding files and
>> building Bloom filters, while still guaranteeing performance.
>> The consistent hash bucket index is also very valuable, but expanding it
>> still requires redistributing the records of some data files, which affects
>> performance. When the range of the data capacity cannot be estimated at
>> all, consistent hashing is a very good fit.
>> >> you can directly search the data under the partition, which greatly
>> >reduces the scope of the Bloom filter to search for files and reduces the
>> >false positive of the Bloom filter.
>> >the bloom index is already partition aware and, unless you use the global
>> >version, can achieve this. Am I missing something?
>> >
>> >Another key thing is - if we can avoid adding a new meta field, that would
>> >be great. Is it possible to implement this similarly to the bucket index,
>> >based on just table properties?
>> Adding a hash partition field to the table implements the hash partition
>> function by reusing the existing partitioning functionality, with very few
>> code changes. Because the hash partition field values within a Parquet file
>> in columnar storage are all equal, the added column occupies almost no
>> storage space after compression.
>> Alternatively, the hash partition value could be stored in the
>> corresponding metadata rather than as a table field, but that makes it
>> difficult to reuse the existing functionality: establishing hash partitions
>> and pruning partitions at query time would then need more development and
>> testing effort.
>> 

Re: Re: DISCUSS

2023-03-21 Thread Vinoth Chandar
>but when it is used for data expansion, it still involves the need to
redistribute the data records of some data files, thus affecting the
performance.
but expansion of the consistent hash index is an optional operation, right?
Sorry, I'm still not fully understanding the differences here.

>Because the hash partition field values under the parquet file in a
columnar storage format are all equal, the added column field hardly
occupies storage space after compression.
Any new meta field adds additional overhead in terms of evolving the schema,
and so forth. Are you suggesting this is not possible to do without a new meta
field?
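To make the difference being debated concrete, here is a small self-contained sketch (illustrative Python, not Hudi code; the bucket counts and key names are made up) contrasting a plain modulo bucket resize, which remaps roughly half of all records, with a consistent-hash style split of a single overloaded bucket, which only rewrites that bucket's records:

```python
import hashlib

def h(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is salted per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"record-{i}" for i in range(10_000)]

# Plain modulo bucketing: resizing from 8 to 16 buckets remaps roughly half
# of all records, so about half of the file groups must be rewritten.
moved_modulo = sum(h(k) % 8 != h(k) % 16 for k in keys)

# Consistent-hash style expansion: split only bucket 3 into two children.
# Only records that hashed into bucket 3 (~1/8 of the data) are rewritten.
moved_split = sum(h(k) % 8 == 3 for k in keys)

print(f"modulo resize moved {moved_modulo} records, "
      f"bucket split moved {moved_split}")
```

The point of the sketch is that the expansion cost is proportional only to the split bucket's data, and the split itself is optional until a bucket actually grows too large.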

On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:

> Hello,
>  I feel very honored that you are interested in my views.
>
>  Here are some of my thoughts marked with blue font.
>
> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>
> >Thanks for the proposal! Some first set of questions here.
> >
> >>You need to pre-select the number of buckets and use the hash function to
> >determine which bucket a record belongs to.
> >>when building the table according to the estimated amount of data, and it
> >cannot be changed after building the table
> >>When the amount of data in a hash partition is too large, the data in
> that
> >partition will be split into multiple files in the way of Bloom index.
> >
> >All these issues related to bucket sizing could be alleviated by the
> >consistent hashing index in 0.13? Have you checked it out? Love to hear
> >your thoughts on this.
>
> Hash partitioning suits tables whose exact data volume cannot be known in
> advance but whose rough range can be estimated. For example, if a company
> currently has 300 million customers in the United States, it will have at
> most 7 billion customers worldwide. In this scenario, hash partitioning can
> cope with data growth within the known range by directly adding files and
> building Bloom filters, while still guaranteeing performance.
> The consistent hash bucket index is also very valuable, but expanding it
> still requires redistributing the records of some data files, which affects
> performance. When the range of the data capacity cannot be estimated at
> all, consistent hashing is a very good fit.
> >> you can directly search the data under the partition, which greatly
> >reduces the scope of the Bloom filter to search for files and reduces the
> >false positive of the Bloom filter.
> >the bloom index is already partition aware and, unless you use the global
> >version, can achieve this. Am I missing something?
> >
> >Another key thing is - if we can avoid adding a new meta field, that would
> >be great. Is it possible to implement this similarly to the bucket index,
> >based on just table properties?
> Adding a hash partition field to the table implements the hash partition
> function by reusing the existing partitioning functionality, with very few
> code changes. Because the hash partition field values within a Parquet file
> in columnar storage are all equal, the added column occupies almost no
> storage space after compression.
> Alternatively, the hash partition value could be stored in the
> corresponding metadata rather than as a table field, but that makes it
> difficult to reuse the existing functionality: establishing hash partitions
> and pruning partitions at query time would then need more development and
> testing effort.
> >On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:
> >
> >> Hi folks,
> >>
> >> Here is my proposal. Thank you very much for reading it. I am looking
> >> forward to your agreement to create an RFC for it.
> >>
> >> Background
> >>
> >> To avoid rewriting an entire partition when only a small amount of local
> >> data is modified, Hudi divides each partition into multiple File Groups,
> >> and each File Group is identified by a File ID. In this way, when a small
> >> amount of local data is modified, only the data of the corresponding File
> >> Group needs to be rewritten. Hudi consistently maps a given Hudi record
> >> to a File ID through the index mechanism. The mapping between Record Key
> >> and File Group/File ID will not change once the first version of the
> >> record is written.
> >>
> >> At present, Hudi's indexes mainly include the Bloom filter index, the
> >> HBase index, and the bucket index. The Bloom filter index has a
> >> false-positive problem: when a large amount of data results in a large
> >> number of File Groups, the false positives magnify and lead to poor
> >> performance. The HBase index depends on an external HBase database and
> >> may become inconsistent, which ultimately increases operation and
> >> maintenance costs. The bucket index makes each bucket correspond to a
> >> File Group. You need to pre-select the number of 

Re: DISCUSS

2023-03-16 Thread Vinoth Chandar
Thanks for the proposal! Some first set of questions here.

>You need to pre-select the number of buckets and use the hash function to
determine which bucket a record belongs to.
>when building the table according to the estimated amount of data, and it
cannot be changed after building the table
>When the amount of data in a hash partition is too large, the data in that
partition will be split into multiple files in the way of Bloom index.

All these issues related to bucket sizing could be alleviated by the
consistent hashing index in 0.13? Have you checked it out? Love to hear
your thoughts on this.

> you can directly search the data under the partition, which greatly
reduces the scope of the Bloom filter to search for files and reduces the
false positive of the Bloom filter.
the bloom index is already partition aware and, unless you use the global
version, can achieve this. Am I missing something?
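For context on the scale of the false-positive problem being discussed: assuming each file's Bloom filter has false-positive probability p, the chance that a lookup for an absent key needlessly opens at least one of n candidate files grows quickly with n. A toy illustration (plain Python; the value of p is an assumed typical setting, not a Hudi default):

```python
def prob_any_false_positive(p: float, n: int) -> float:
    # For a key absent from the table, each file's Bloom filter answers
    # "maybe" independently with probability p; this is the chance that at
    # least one of n files must be needlessly opened.
    return 1 - (1 - p) ** n

p = 0.01  # per-filter false-positive probability (an assumed typical value)
for n in (10, 100, 1000):
    print(n, round(prob_any_false_positive(p, n), 3))
# Pruning the lookup to a single partition first shrinks n, and with it the
# probability of spurious file reads.
```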

Another key thing is - if we can avoid adding a new meta field, that would
be great. Is it possible to implement this similarly to the bucket index,
based on just table properties?

On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:

> Hi folks,
>
> Here is my proposal. Thank you very much for reading it. I am looking
> forward to your agreement to create an RFC for it.
>
> Background
>
> To avoid rewriting an entire partition when only a small amount of local
> data is modified, Hudi divides each partition into multiple File Groups,
> and each File Group is identified by a File ID. In this way, when a small
> amount of local data is modified, only the data of the corresponding File
> Group needs to be rewritten. Hudi consistently maps a given Hudi record to
> a File ID through the index mechanism. The mapping between Record Key and
> File Group/File ID will not change once the first version of the record is
> written.
>
> At present, Hudi's indexes mainly include the Bloom filter index, the
> HBase index, and the bucket index. The Bloom filter index has a
> false-positive problem: when a large amount of data results in a large
> number of File Groups, the false positives magnify and lead to poor
> performance. The HBase index depends on an external HBase database and may
> become inconsistent, which ultimately increases operation and maintenance
> costs. The bucket index makes each bucket correspond to a File Group: you
> pre-select the number of buckets and use the hash function to determine
> which bucket a record belongs to, so the mapping between the Record Key and
> the File Group/File ID is determined directly by the hash function. To use
> the bucket index, you must fix the number of buckets when building the
> table according to the estimated amount of data, and it cannot be changed
> afterwards; an unreasonable number of buckets will seriously affect
> performance. Unfortunately, the amount of data is often unpredictable and
> will continue to grow.
>
> Hash partition feasibility
>
>  In this context, I put forward the idea of hash partitioning. The
> principle is similar to bucket index, but in addition to the advantages of
> bucket index, hash partitioning can retain the Bloom index. When the amount
> of data in a hash partition is too large, the data in that partition will
> be split into multiple files in the way of Bloom index. Therefore, the
> problem that bucket index depends heavily on the number of buckets does not
> exist in the hash partition. Compared with the Bloom index, when searching
> for a record, you can directly search the data under its partition, which
> greatly reduces the number of files the Bloom filter has to check and
> reduces the Bloom filter's false positives.
>
> Design of a simple hash partition implementation
>
> The idea is to use the capabilities of the ComplexKeyGenerator to
> implement hash partitioning. The hash partition field is one of the
> partition fields of the ComplexKeyGenerator.
>
>When hash.partition.fields is specified and partition.fields contains
> _hoodie_hash_partition, a column named _hoodie_hash_partition will be added
> in this table as one of the partition key.
>
> If predicates of hash.partition.fields appear in the query statement, the
> _hoodie_hash_partition = X predicate will be automatically added to the
> query statement for partition pruning.
>
> Advantages of this design: simple implementation, no modification of core
> functions, so low risk.
>
> The above design has been implemented in pr 7984.
>
> https://github.com/apache/hudi/pull/7984
>
> How to use hash partition in spark data source can refer to
>
>
> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>
> #testHashPartition
>
> Perhaps for experts, the implementation of PR is not elegant enough. I
> also look 
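To visualize the proposed scheme, here is a rough sketch (illustrative Python, not the PR's actual implementation; the field name `user_id` and the partition count are assumptions) of how a `_hoodie_hash_partition` value could be derived from the hash partition field, and how the matching equality predicate could be added automatically for partition pruning:

```python
import hashlib

NUM_HASH_PARTITIONS = 32  # assumed; fixed when the table is created

def hash_partition(value: str) -> int:
    # The value that would be written to the _hoodie_hash_partition column.
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % NUM_HASH_PARTITIONS

def add_pruning_predicate(filters: dict) -> dict:
    # If the query constrains the hash partition field ("user_id" here, an
    # assumed example), add the equivalent partition-pruning predicate.
    out = dict(filters)
    if "user_id" in out:
        out["_hoodie_hash_partition"] = hash_partition(out["user_id"])
    return out

pruned = add_pruning_predicate({"user_id": "u-42"})
print(pruned)
# All rows with this user_id live in exactly one partition, so only the
# Bloom filters of files under that partition need to be consulted.
```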

Re: [DISCUSS] Build tool upgrade

2023-02-13 Thread Vinoth Chandar
This is cool! :)

On Mon, Feb 13, 2023 at 2:02 PM Daniel Kaźmirski 
wrote:

> Hi,
>
> I did try to add the mentioned extension to Hudi pom. Here are the results:
>
> Clean with cache extension disabled
> mvn clean package -DskipTests -Dspark3.3 -Dscala-2.12
> -Dmaven.build.cache.enabled=false
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 10:57 min
>
> With cache after changing HoodieSpark33CatalogUtils
> mvn package -DskipTests -Dspark3.3 -Dscala-2.12
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 03:52 min
>
> With cache no changes
> mvn package -DskipTests -Dspark3.3 -Dscala-2.12
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 3.485 s
>
> If anyone would like to try it out:
> https://github.com/apache/hudi/pull/7935
>
> BR,
> Daniel
>
> On Fri, Feb 10, 2023 at 16:59, Daniel Kaźmirski wrote:
>
> > Hi all,
> >
> > Going back to this topic, Maven 3.9.0 has been released recently along
> > with a new build cache extension that provides incremental builds:
> > https://maven.apache.org/extensions/maven-build-cache-extension/
> > Might be worth considering.
> >
> > On Mon, Oct 24, 2022 at 19:59, Shiyan Xu wrote:
> >
> >> Thank you all for the valuable inputs! I think we can close this topic
> for
> >> now, given the majority is leaning towards continuing with maven.
> >>
> >> On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu 
> wrote:
> >>
> >> > I have experienced some gradle development projects and want to share
> >> some
> >> > thoughts.
> >> >
> >> > The flexibility and faster speed of gradle itself can certainly bring
> >> some
> >> > advantages, but it will also greatly increase the troubleshooting time
> >> due
> >> > to the bugs of gradle itself, and gradle DSL is very different from
> >> that of
> >> > maven. There are also many learning costs for developers in the
> >> community.
> >> >
> >> > I think it does consume too much time on code release, but users or
> >> > developers usually only compile part of the module.
> >> >
> >> > So I think, a certain advantage in build time alone is not enough to
> >> cover
> >> > so much cost.
> >> >
> >> > Best,
> >> > Zhaojing
> >> >
> >> > On Mon, Oct 17, 2022 at 19:22, Gary Li wrote:
> >> >
> >> > > Hi folks,
> >> > >
> >> > > I'd share my thoughts as well. I personally won't build the whole
> >> project
> >> > > too often, only before push to the remote branch or make big changes
> >> in
> >> > > different modules. If I just make some changes and run a test, the
> IDE
> >> > will
> >> > > only build the necessary modules I believe. In addition, each time I
> >> deal
> >> > > with dependency issues, the years of maven experience does help me
> >> locate
> >> > > the issue quickly, especially when the dependency tree is pretty
> >> > > complicated. The learning effort and the new environment setup
> effort
> >> are
> >> > > considerable as well.
> >> > >
> >> > > Happy to learn if there are other benefits gradle or bazel could
> >> bring to
> >> > > us, but if the only benefit is the xx% faster build time, I am a bit
> >> > > unconvinced to make this change.
> >> > >
> >> > > Best,
> >> > > Gary
> >> > >
> >> > > On Mon, Oct 17, 2022 at 2:58 PM Danny Chan 
> >> wrote:
> >> > >
> >> > > > I have a full experience with how Apache Calcite switches from
> Maven
> >> > > > to Gradle, and I want to share some thoughts.
> >> > > >
> >> > > > The gradle build is fast, but it relies heavily on its local
> cache,
> >> > > > usually it needs too much time to download these cache jars
> because
> >> > > > gradle upgrades itself very frequently.
> >> > > >
> >> > > > Gradle is very flexible for building, but it also has many bugs;
> >> > > > you may need more time to debug its bugs compared with building
> >> > > > with maven.
> >> > > >
> >> > > > The gradle DSL for building is a must to learn for all the
> >> developers.
> >> > > >
> >> > > > For all the above reasons, I don't think switching to gradle is
> >> > > > the right decision for Apache Calcite. Julian Hyde, who is the
> >> > > > creator of Calcite, may have more words to say here.
> >> > > >
> >> > > > So I would not suggest we do that for Hudi.
> >> > > >
> >> > > >
> >> > > > Best,
> >> > > > Danny Chan
> >> > > >
> >> > > > On Sat, Oct 1, 2022 at 13:48, Shiyan Xu wrote:
> >> > > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I'd like to raise a discussion around the build tool for Hudi.
> >> > > > >
> >> > > > > Maven has been a mature yet slow (10min to build on 2021 macbook
> >> pro)
> >> > > > build
> >> > > > > tool compared to modern ones like gradle or bazel. We all want
> >> faster
> >> > > > > builds, however, we also need to consider the efforts and risks
> to
> >> > > > upgrade,
> >> > > > > and the developers' feedback on usability.
> >> > > > >
> >> > > > > What do you all think 

Re: [DISCUSS] Build tool upgrade

2023-02-13 Thread Daniel Kaźmirski
Hi,

I did try to add the mentioned extension to Hudi pom. Here are the results:

Clean with cache extension disabled
mvn clean package -DskipTests -Dspark3.3 -Dscala-2.12
-Dmaven.build.cache.enabled=false
[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 10:57 min

With cache after changing HoodieSpark33CatalogUtils
mvn package -DskipTests -Dspark3.3 -Dscala-2.12
[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 03:52 min

With cache no changes
mvn package -DskipTests -Dspark3.3 -Dscala-2.12
[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 3.485 s

If anyone would like to try it out: https://github.com/apache/hudi/pull/7935

BR,
Daniel

On Fri, Feb 10, 2023 at 16:59, Daniel Kaźmirski wrote:

> Hi all,
>
> Going back to this topic, Maven 3.9.0 has been released recently along
> with a new build cache extension that provides incremental builds:
> https://maven.apache.org/extensions/maven-build-cache-extension/
> Might be worth considering.
>
> On Mon, Oct 24, 2022 at 19:59, Shiyan Xu wrote:
>
>> Thank you all for the valuable inputs! I think we can close this topic for
>> now, given the majority is leaning towards continuing with maven.
>>
>> On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu  wrote:
>>
>> > I have experienced some gradle development projects and want to share
>> some
>> > thoughts.
>> >
>> > The flexibility and faster speed of gradle itself can certainly bring
>> some
>> > advantages, but it will also greatly increase the troubleshooting time
>> due
>> > to the bugs of gradle itself, and gradle DSL is very different from
>> that of
>> > maven. There are also many learning costs for developers in the
>> community.
>> >
>> > I think it does consume too much time on code release, but users or
>> > developers usually only compile part of the module.
>> >
>> > So I think, a certain advantage in build time alone is not enough to
>> cover
>> > so much cost.
>> >
>> > Best,
>> > Zhaojing
>> >
>> > On Mon, Oct 17, 2022 at 19:22, Gary Li wrote:
>> >
>> > > Hi folks,
>> > >
>> > > I'd share my thoughts as well. I personally won't build the whole
>> project
>> > > too often, only before push to the remote branch or make big changes
>> in
>> > > different modules. If I just make some changes and run a test, the IDE
>> > will
>> > > only build the necessary modules I believe. In addition, each time I
>> deal
>> > > with dependency issues, the years of maven experience does help me
>> locate
>> > > the issue quickly, especially when the dependency tree is pretty
>> > > complicated. The learning effort and the new environment setup effort
>> are
>> > > considerable as well.
>> > >
>> > > Happy to learn if there are other benefits gradle or bazel could
>> bring to
>> > > us, but if the only benefit is the xx% faster build time, I am a bit
>> > > unconvinced to make this change.
>> > >
>> > > Best,
>> > > Gary
>> > >
>> > > On Mon, Oct 17, 2022 at 2:58 PM Danny Chan 
>> wrote:
>> > >
>> > > > I have a full experience with how Apache Calcite switches from Maven
>> > > > to Gradle, and I want to share some thoughts.
>> > > >
>> > > > The gradle build is fast, but it relies heavily on its local cache,
>> > > > usually it needs too much time to download these cache jars because
>> > > > gradle upgrades itself very frequently.
>> > > >
>> > > > Gradle is very flexible for building, but it also has many bugs;
>> > > > you may need more time to debug its bugs compared with building with
>> > > > maven.
>> > > >
>> > > > The gradle DSL for building is a must to learn for all the
>> developers.
>> > > >
>> > > > For all the above reasons, I don't think switching to gradle is the
>> > > > right decision for Apache Calcite. Julian Hyde, who is the creator
>> > > > of Calcite, may have more words to say here.
>> > > >
>> > > > So I would not suggest we do that for Hudi.
>> > > >
>> > > >
>> > > > Best,
>> > > > Danny Chan
>> > > >
>> > > > On Sat, Oct 1, 2022 at 13:48, Shiyan Xu wrote:
>> > > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I'd like to raise a discussion around the build tool for Hudi.
>> > > > >
>> > > > > Maven has been a mature yet slow (10min to build on 2021 macbook
>> pro)
>> > > > build
>> > > > > tool compared to modern ones like gradle or bazel. We all want
>> faster
>> > > > > builds, however, we also need to consider the efforts and risks to
>> > > > upgrade,
>> > > > > and the developers' feedback on usability.
>> > > > >
>> > > > > What do you all think about upgrading to gradle or bazel? Please
>> > share
>> > > > your
>> > > > > thoughts. Thanks.
>> > > > >
>> > > > > --
>> > > > > Best,
>> > > > > Shiyan
>> > > >
>> > >
>> >
>>
>>
>> --
>> Best,
>> Shiyan
>>
>
>
> --
> Daniel Kaźmirski
>


-- 
Daniel Kaźmirski


Re: [DISCUSS] Build tool upgrade

2023-02-10 Thread Daniel Kaźmirski
Hi all,

Going back to this topic, Maven 3.9.0 has been released recently along with
a new build cache extension that provides incremental builds:
https://maven.apache.org/extensions/maven-build-cache-extension/
Might be worth considering.
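For readers unfamiliar with how such a build cache decides what to rebuild, the core idea is a content hash over a module's inputs: if the hash is unchanged since the last build, the cached output is reused. A minimal sketch of that idea (illustrative Python, not the extension's actual mechanism; module and file names are made up):

```python
import hashlib

def cache_key(sources: dict) -> str:
    # Hash every input of a module in a stable order; any content change
    # produces a different key.
    h = hashlib.sha256()
    for path in sorted(sources):
        h.update(path.encode())
        h.update(sources[path].encode())
    return h.hexdigest()

cache = {}  # module name -> key of the last successful build

def build(module: str, sources: dict) -> str:
    key = cache_key(sources)
    if cache.get(module) == key:
        return "cached"   # inputs unchanged: reuse the previous output
    # ... the expensive compile would happen here ...
    cache[module] = key
    return "rebuilt"

srcs = {"Foo.java": "class Foo {}", "Bar.java": "class Bar {}"}
print(build("hudi-common", srcs))   # first build compiles
print(build("hudi-common", srcs))   # nothing changed: cache hit
srcs["Foo.java"] = "class Foo { int x; }"
print(build("hudi-common", srcs))   # one file changed: recompile
```

This matches the timings above: an unchanged tree hits the cache in seconds, while touching one class triggers a partial rebuild.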

On Mon, Oct 24, 2022 at 19:59, Shiyan Xu wrote:

> Thank you all for the valuable inputs! I think we can close this topic for
> now, given the majority is leaning towards continuing with maven.
>
> On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu  wrote:
>
> > I have experienced some gradle development projects and want to share
> some
> > thoughts.
> >
> > The flexibility and faster speed of gradle itself can certainly bring
> some
> > advantages, but it will also greatly increase the troubleshooting time
> due
> > to the bugs of gradle itself, and gradle DSL is very different from that
> of
> > maven. There are also many learning costs for developers in the
> community.
> >
> > I think it does consume too much time on code release, but users or
> > developers usually only compile part of the module.
> >
> > So I think, a certain advantage in build time alone is not enough to
> cover
> > so much cost.
> >
> > Best,
> > Zhaojing
> >
> > On Mon, Oct 17, 2022 at 19:22, Gary Li wrote:
> >
> > > Hi folks,
> > >
> > > I'd share my thoughts as well. I personally won't build the whole
> project
> > > too often, only before push to the remote branch or make big changes in
> > > different modules. If I just make some changes and run a test, the IDE
> > will
> > > only build the necessary modules I believe. In addition, each time I
> deal
> > > with dependency issues, the years of maven experience does help me
> locate
> > > the issue quickly, especially when the dependency tree is pretty
> > > complicated. The learning effort and the new environment setup effort
> are
> > > considerable as well.
> > >
> > > Happy to learn if there are other benefits gradle or bazel could bring
> to
> > > us, but if the only benefit is the xx% faster build time, I am a bit
> > > unconvinced to make this change.
> > >
> > > Best,
> > > Gary
> > >
> > > On Mon, Oct 17, 2022 at 2:58 PM Danny Chan 
> wrote:
> > >
> > > > I have a full experience with how Apache Calcite switches from Maven
> > > > to Gradle, and I want to share some thoughts.
> > > >
> > > > The gradle build is fast, but it relies heavily on its local cache,
> > > > usually it needs too much time to download these cache jars because
> > > > gradle upgrades itself very frequently.
> > > >
> > > > Gradle is very flexible for building, but it also has many bugs;
> > > > you may need more time to debug its bugs compared with building with
> > > > maven.
> > > >
> > > > The gradle DSL for building is a must to learn for all the
> developers.
> > > >
> > > > For all the above reasons, I don't think switching to gradle is the
> > > > right decision for Apache Calcite. Julian Hyde, who is the creator of
> > > > Calcite, may have more words to say here.
> > > >
> > > > So I would not suggest we do that for Hudi.
> > > >
> > > >
> > > > Best,
> > > > Danny Chan
> > > >
> > > > On Sat, Oct 1, 2022 at 13:48, Shiyan Xu wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to raise a discussion around the build tool for Hudi.
> > > > >
> > > > > Maven has been a mature yet slow (10min to build on 2021 macbook
> pro)
> > > > build
> > > > > tool compared to modern ones like gradle or bazel. We all want
> faster
> > > > > builds, however, we also need to consider the efforts and risks to
> > > > upgrade,
> > > > > and the developers' feedback on usability.
> > > > >
> > > > > What do you all think about upgrading to gradle or bazel? Please
> > share
> > > > your
> > > > > thoughts. Thanks.
> > > > >
> > > > > --
> > > > > Best,
> > > > > Shiyan
> > > >
> > >
> >
>
>
> --
> Best,
> Shiyan
>


-- 
Daniel Kaźmirski


Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-18 Thread Bhavani Sudha
Thank you all. I'll update the calendar invites and the website for
necessary changes.

-Sudha


On Thu, Nov 17, 2022 at 9:01 AM Pratyaksh Sharma 
wrote:

> +1 as well.
>
> On Thu, Nov 17, 2022 at 9:57 PM sagar sumit 
> wrote:
>
> > +1
> >
> > On Thu, Nov 17, 2022 at 9:44 AM Sivabalan  wrote:
> >
> > > +1 makes sense.
> > >
> > > On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo  wrote:
> > >
> > > > +1 on having a single community sync call on Dec 14 during the holiday
> > > > season.
> > > >
> > > > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha <
> bhavanisud...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hello Hudi community,
> > > > >
> > > > > We have monthly community sync calls on the last Wednesday of every
> > > > month.
> > > > > For November and December months these collide with public holidays
> > and
> > > > > cause limited attendance due to the same. For this reason, I am
> > > proposing
> > > > > to merge Nov and Dec sync calls into one and placing them in the
> 2nd
> > > week
> > > > > of December. I am thinking of December 14th for the community sync
> > > call.
> > > > > Please let me know your thoughts.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Sudha
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-17 Thread Pratyaksh Sharma
+1 as well.

On Thu, Nov 17, 2022 at 9:57 PM sagar sumit  wrote:

> +1
>
> On Thu, Nov 17, 2022 at 9:44 AM Sivabalan  wrote:
>
> > +1 makes sense.
> >
> > On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo  wrote:
> >
> > > +1 on having a single community sync call on Dec 14 during the holiday
> > > season.
> > >
> > > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha  >
> > > wrote:
> > >
> > > > Hello Hudi community,
> > > >
> > > > We have monthly community sync calls on the last Wednesday of every
> > > month.
> > > > For November and December months these collide with public holidays
> and
> > > > cause limited attendance due to the same. For this reason, I am
> > proposing
> > > > to merge Nov and Dec sync calls into one and placing them in the 2nd
> > week
> > > > of December. I am thinking of December 14th for the community sync
> > call.
> > > > Please let me know your thoughts.
> > > >
> > > >
> > > > Thanks,
> > > > Sudha
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-17 Thread sagar sumit
+1

On Thu, Nov 17, 2022 at 9:44 AM Sivabalan  wrote:

> +1 makes sense.
>
> On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo  wrote:
>
> > +1 on having a single community sync call on Dec 14 during the holiday
> > season.
> >
> > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha 
> > wrote:
> >
> > > Hello Hudi community,
> > >
> > > We have monthly community sync calls on the last Wednesday of every
> > month.
> > > For November and December months these collide with public holidays and
> > > cause limited attendance due to the same. For this reason, I am
> proposing
> > > to merge Nov and Dec sync calls into one and placing them in the 2nd
> week
> > > of December. I am thinking of December 14th for the community sync
> call.
> > > Please let me know your thoughts.
> > >
> > >
> > > Thanks,
> > > Sudha
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-17 Thread Shiyan Xu
+1

On Thu, Nov 17, 2022 at 12:15 PM Sivabalan  wrote:

> +1 makes sense.
>
> On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo  wrote:
>
> > +1 on having a single community sync call on Dec 14 during the holiday
> > season.
> >
> > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha 
> > wrote:
> >
> > > Hello Hudi community,
> > >
> > > We have monthly community sync calls on the last Wednesday of every
> > month.
> > > For November and December months these collide with public holidays and
> > > cause limited attendance due to the same. For this reason, I am
> proposing
> > > to merge Nov and Dec sync calls into one and placing them in the 2nd
> week
> > > of December. I am thinking of December 14th for the community sync
> call.
> > > Please let me know your thoughts.
> > >
> > >
> > > Thanks,
> > > Sudha
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


-- 
Best,
Shiyan


Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-16 Thread Sivabalan
+1 makes sense.

On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo  wrote:

> +1 on having a single community sync call on Dec 14 during the holiday
> season.
>
> On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha 
> wrote:
>
> > Hello Hudi community,
> >
> > We have monthly community sync calls on the last Wednesday of every
> month.
> > For November and December months these collide with public holidays and
> > cause limited attendance due to the same. For this reason, I am proposing
> > to merge Nov and Dec sync calls into one and placing them in the 2nd week
> > of December. I am thinking of December 14th for the community sync call.
> > Please let me know your thoughts.
> >
> >
> > Thanks,
> > Sudha
> >
>


-- 
Regards,
-Sivabalan


Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-16 Thread Y Ethan Guo
+1 on having a single community sync call on Dec 14 during the holiday
season.

On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha 
wrote:

> Hello Hudi community,
>
> We have monthly community sync calls on the last Wednesday of every month.
> For November and December months these collide with public holidays and
> cause limited attendance due to the same. For this reason, I am proposing
> to merge Nov and Dec sync calls into one and placing them in the 2nd week
> of December. I am thinking of December 14th for the community sync call.
> Please let me know your thoughts.
>
>
> Thanks,
> Sudha
>


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread 冯健
to Raymond: currently combineAndGetUpdateValue can only return one
IndexedRecord, but in the case of SCD-2, both the old and the new records need
to be stored.
to Alexey: yeah, this feature should be designed on top of RFC-46. Can
HoodieRecordMerger return two HoodieRecords in this case?



On Tue, 25 Oct 2022 at 03:55, Alexey Kudinkin  wrote:

> Hey, hey, Fengjian!
>
> With the landing of the RFC-46 we'll be kick-starting a process of phasing
> out HoodieRecordPayload as an abstraction and instead migrating to
> HoodieRecordMerger interface.
> I'd recommend to base your design considerations off the new
> HoodieRecordMerger interface instead of legacy HoodieRecordPayload to make
> sure it's future-proof.
>
> On Thu, Oct 20, 2022 at 10:08 AM 冯健  wrote:
>
> > Hi guys,
> > After reading this article on how to implement SCD-2 with Hudi, "Build
> > Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache
> > Hudi on Amazon EMR"
> > <
> > https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/
> > >
> > I have an idea about implementing embedded SCD-2 support in Hudi by
> > using a new Payload. Users wouldn't need to manually join the data and
> > then update end_date and status.
> >    For example, the record key is 'id,end_date'. Let's say the current
> > record's id is 1 and its end_date is 2099-12-31. When a new record with
> id=1
> > arrives, it will update the current record's end_date to 2022-10-21 and
> > also insert the new record with end_date 2099-12-31. So this Payload
> > will generate two records in combineAndGetUpdateValue. There will be no
> > join cost, and the whole process is transparent to users.
> >
> >Any thoughts?
> >
>


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Alexey Kudinkin
Hey, hey, Fengjian!

With the landing of RFC-46, we'll be kick-starting a process of phasing
out HoodieRecordPayload as an abstraction and instead migrating to the
HoodieRecordMerger interface.
I'd recommend basing your design considerations on the new
HoodieRecordMerger interface instead of the legacy HoodieRecordPayload to
make sure it's future-proof.

On Thu, Oct 20, 2022 at 10:08 AM 冯健  wrote:

> Hi guys,
> After reading this article with respect to how to implement SCD-2 with
> Hudi Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and
> Apache Hudi on Amazon EMR
> <
> https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/
> >
> I have an idea about implementing embedded SCD-2 support in Hudi by
> using a new Payload. Users wouldn't need to manually join the data and
> then update end_date and status.
>    For example, the record key is 'id,end_date'. Let's say the current
> record's id is 1 and its end_date is 2099-12-31. When a new record with id=1
> arrives, it will update the current record's end_date to 2022-10-21 and
> also insert the new record with end_date 2099-12-31. So this Payload
> will generate two records in combineAndGetUpdateValue. There will be no
> join cost, and the whole process is transparent to users.
>
>Any thoughts?
>


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Shiyan Xu
Interesting thoughts. Not sure if I fully understand this part: "generate 2
records in combineAndGetUpdateValue". Isn't the API defined to return just
one record?

On Fri, Oct 21, 2022 at 1:07 AM 冯健  wrote:

> Hi guys,
> After reading this article with respect to how to implement SCD-2 with
> Hudi Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and
> Apache Hudi on Amazon EMR
> <
> https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/
> >
> I have an idea about implementing embedded SCD-2 support in Hudi by
> using a new Payload. Users wouldn't need to manually join the data and
> then update end_date and status.
>    For example, the record key is 'id,end_date'. Let's say the current
> record's id is 1 and its end_date is 2099-12-31. When a new record with id=1
> arrives, it will update the current record's end_date to 2022-10-21 and
> also insert the new record with end_date 2099-12-31. So this Payload
> will generate two records in combineAndGetUpdateValue. There will be no
> join cost, and the whole process is transparent to users.
>
>Any thoughts?
>


-- 
Best,
Shiyan


Re: [DISCUSS] [RFC] Hudi bundle standards

2022-10-24 Thread Shiyan Xu
Thanks Xinyao for raising the problem. Let's align more on the RFC to help
clarify usage. Agree on the importance - the bundle artifacts are the
user-facing components from this project.

On Mon, Oct 10, 2022 at 4:44 PM 田昕峣 (Xinyao Tian) 
wrote:

> Hi Shiyan,
>
>
> Having carefully read the RFC-63.md on the PR, I really think this feature
> is crucial for everyone who builds Hudi from source. For example, when I
> tried to compile Hudi 0.12.0 with Flink 1.15, I used the command ‘mvn clean
> package -DskipTests -Dflink1.15 -Dscala-2.12’ but still got the Flink 1.14
> bundle. Also, the documentation highlights support for Flink 1.15, but the
> Hudi 0.12.0 GitHub readme.md doesn’t mention anything about Flink 1.15 in
> the compile section. All in all, there are many misleading points around
> the Hudi bundles, which have to be clarified asap.
>
>
> I really appreciate having this RFC to try to solve all these problems.
> On 10/10/2022 13:36, Shiyan Xu wrote:
> Hi Hudi devs and users,
>
> I've raised an RFC around Hudi bundles, aiming to address issues around
> dependency conflicts, and to establish standards for bundle jar usage and
> change process. Please have a look. Thanks!
>
> https://github.com/apache/hudi/pull/6902
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


Re: [DISCUSS] Build tool upgrade

2022-10-24 Thread Shiyan Xu
Thank you all for the valuable inputs! I think we can close this topic for
now, given the majority is leaning towards continuing with maven.

On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu  wrote:

> I have experienced some gradle development projects and want to share some
> thoughts.
>
> The flexibility and faster speed of gradle itself can certainly bring some
> advantages, but it will also greatly increase the troubleshooting time due
> to the bugs of gradle itself, and gradle DSL is very different from that of
> maven. There are also many learning costs for developers in the community.
>
> I think it does consume too much time on code release, but users or
> developers usually only compile part of the module.
>
> So I think, a certain advantage in build time alone is not enough to cover
> so much cost.
>
> Best,
> Zhaojing
>
> On Mon, Oct 17, 2022 at 19:22, Gary Li  wrote:
>
> > Hi folks,
> >
> > I'd share my thoughts as well. I personally won't build the whole project
> > too often, only before push to the remote branch or make big changes in
> > different modules. If I just make some changes and run a test, the IDE
> will
> > only build the necessary modules I believe. In addition, each time I deal
> > with dependency issues, the years of maven experience does help me locate
> > the issue quickly, especially when the dependency tree is pretty
> > complicated. The learning effort and the new environment setup effort are
> > considerable as well.
> >
> > Happy to learn if there are other benefits gradle or bazel could bring to
> > us, but if the only benefit is the xx% faster build time, I am a bit
> > unconvinced to make this change.
> >
> > Best,
> > Gary
> >
> > On Mon, Oct 17, 2022 at 2:58 PM Danny Chan  wrote:
> >
> > > I have a full experience with how Apache Calcite switches from Maven
> > > to Gradle, and I want to share some thoughts.
> > >
> > > The gradle build is fast, but it relies heavily on its local cache,
> > > usually it needs too much time to download these cache jars because
> > > gradle upgrade itself very frequently.
> > >
> > > The gradle is very flexible for building, but it also has many bugs,
> > > you may need more time to debug its bug compared with building with
> > > maven.
> > >
> > > The gradle DSL for building is a must to learn for all the developers.
> > >
> > > For all above cases, I don't think switching to gradle is a right
> > > decision for Apache Calcite. Julian Hyde which is the creator of
> > > Calcite may have more words to say here.
> > >
> > > So I would not suggest we do that for Hudi.
> > >
> > >
> > > Best,
> > > Danny Chan
> > >
> > > On Sat, Oct 1, 2022 at 13:48, Shiyan Xu  wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I'd like to raise a discussion around the build tool for Hudi.
> > > >
> > > > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> > > build
> > > > tool compared to modern ones like gradle or bazel. We all want faster
> > > > builds, however, we also need to consider the efforts and risks to
> > > upgrade,
> > > > and the developers' feedback on usability.
> > > >
> > > > What do you all think about upgrading to gradle or bazel? Please
> share
> > > your
> > > > thoughts. Thanks.
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > >
> >
>


-- 
Best,
Shiyan


Re: [DISCUSS] Hudi data TTL

2022-10-21 Thread stream2000
Yes we can have a talk about it. We will try our best to write the RFC, maybe 
publish it in a few weeks.


> On Oct 21, 2022, at 10:18, JerryYue <272614...@qq.com.INVALID> wrote:
> 
> Looking forward to the RFC.
> It's a good idea; we also need Hudi data TTL in some cases.
> Do we have any plan or timeline for this? We have also made some simple
> designs to implement it.
> Maybe we can have a talk about it.
> 
> On 2022/10/20 at 9:47 AM, “Bingeng
> Huang” hbgstc...@gmail.com> wrote:
> 
>Looking forward to the RFC.
>We can propose RFC about support TTL config using non-partition field after
> 
> 
> 
>    On Wed, Oct 19, 2022 at 14:42, sagar sumit  wrote:
> 
>> +1 Very nice idea. Looking forward to the RFC!
>> 
>> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
>> wrote:
>> 
>>> great proposal. Partition TTL is a good starting point. we can extend it
>> to
>>> other TTL strategies like column-based, and make it customizable and
>>> pluggable. Looking forward to the RFC!
>>> 
>>> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng >> 
>>> wrote:
>>> 
 Good idea,
 this is definitely worth an  RFC
 btw should it only depend on Hudi's partition? I feel it should be a
>> more
 common feature since sometimes customers' data can not update across
 partitions
 
 
 On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com>
>> wrote:
 
> Hi all, we have implemented a partition based data ttl management,
>>> which
> we can manage ttl for hudi partition by size, expired time and
> sub-partition count. When a partition is detected as outdated, we use
> delete partition interface to delete it, which will generate a
>> replace
> commit to mark the data as deleted. The real deletion will then done
>> by
> clean service.
> 
> 
> If community is interested in this idea, maybe we can propose a RFC
>> to
> discuss it in detail.
> 
> 
>> On Oct 19, 2022, at 10:06, Vinoth Chandar 
>> wrote:
>> 
>> +1 love to discuss this on a RFC proposal.
>> 
>> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> wrote:
>> 
>>> That's a very interesting idea.
>>> 
>>> Do you want to take a stab at writing a full proposal (in the form
>>> of
> RFC)
>>> for it?
>>> 
>>> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <
>> hbgstc...@gmail.com
 
>>> wrote:
>>> 
 Hi all,
 
 Do we have plan to integrate data TTL into HUDI, so we don't have
>>> to
 schedule a offline spark job to delete outdated data, just set a
>>> TTL
 config, then writer or some offline service will delete old data
>> as
 expected.
 
>>> 
> 
> 
 
 --
 *Jian Feng,冯健*
 Shopee | Engineer | Data Infrastructure
 
>>> 
>>> 
>>> --
>>> Best,
>>> Shiyan
>>> 
>> 
> 



Re: [DISCUSS] Hudi data TTL

2022-10-20 Thread JerryYue
Looking forward to the RFC.
It's a good idea; we also need Hudi data TTL in some cases.
Do we have any plan or timeline for this? We have also made some simple
designs to implement it.
Maybe we can have a talk about it.

On 2022/10/20 at 9:47 AM, “Bingeng Huang”  wrote:

Looking forward to the RFC.
We can propose an RFC about supporting a TTL config using non-partition fields afterwards.



On Wed, Oct 19, 2022 at 14:42, sagar sumit  wrote:

> +1 Very nice idea. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
> wrote:
>
> > great proposal. Partition TTL is a good starting point. we can extend it
> to
> > other TTL strategies like column-based, and make it customizable and
> > pluggable. Looking forward to the RFC!
> >
> > On Wed, Oct 19, 2022 at 11:40 AM Jian Feng  >
> > wrote:
> >
> > > Good idea,
> > > this is definitely worth an  RFC
> > > btw should it only depend on Hudi's partition? I feel it should be a
> more
> > > common feature since sometimes customers' data can not update across
> > > partitions
> > >
> > >
> > > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com>
> wrote:
> > >
> > > > Hi all, we have implemented a partition based data ttl management,
> > which
> > > > we can manage ttl for hudi partition by size, expired time and
> > > > sub-partition count. When a partition is detected as outdated, we 
use
> > > > delete partition interface to delete it, which will generate a
> replace
> > > > commit to mark the data as deleted. The real deletion will then done
> by
> > > > clean service.
> > > >
> > > >
> > > > If community is interested in this idea, maybe we can propose a RFC
> to
> > > > discuss it in detail.
> > > >
> > > >
> > > > > On Oct 19, 2022, at 10:06, Vinoth Chandar 
> wrote:
> > > > >
> > > > > +1 love to discuss this on a RFC proposal.
> > > > >
> > > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> > > > wrote:
> > > > >
> > > > >> That's a very interesting idea.
> > > > >>
> > > > >> Do you want to take a stab at writing a full proposal (in the 
form
> > of
> > > > RFC)
> > > > >> for it?
> > > > >>
> > > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <
> hbgstc...@gmail.com
> > >
> > > > >> wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Do we have plan to integrate data TTL into HUDI, so we don't 
have
> > to
> > > > >>> schedule a offline spark job to delete outdated data, just set a
> > TTL
> > > > >>> config, then writer or some offline service will delete old data
> as
> > > > >>> expected.
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> > > --
> > > *Jian Feng,冯健*
> > > Shopee | Engineer | Data Infrastructure
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>




Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Bingeng Huang
Looking forward to the RFC.
We can propose an RFC about supporting a TTL config using non-partition fields afterwards.
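A TTL keyed on a non-partition field would have to be evaluated per record rather than as a partition drop. A minimal sketch of that idea, with hypothetical field and config names (these are not Hudi configs):

```python
import time
from typing import Dict, List

# Hypothetical config: keep records whose event_ts is within 30 days.
TTL_SECONDS = 30 * 86400

def is_expired(record: Dict, now: float, ttl_seconds: float = TTL_SECONDS) -> bool:
    """Record-level TTL keyed on an event-time column instead of the
    partition path. Unlike partition TTL, this has to look at individual
    records, so it would translate into ordinary deletes rather than a
    cheap delete-partition replace commit."""
    return now - record["event_ts"] > ttl_seconds

now = time.time()
records = [
    {"id": 1, "event_ts": now - 60 * 86400},  # 60 days old -> expired
    {"id": 2, "event_ts": now - 1 * 86400},   # 1 day old  -> kept
]
to_delete = [r["id"] for r in records if is_expired(r, now)]
kept = [r["id"] for r in records if not is_expired(r, now)]
```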



On Wed, Oct 19, 2022 at 14:42, sagar sumit  wrote:

> +1 Very nice idea. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
> wrote:
>
> > great proposal. Partition TTL is a good starting point. we can extend it
> to
> > other TTL strategies like column-based, and make it customizable and
> > pluggable. Looking forward to the RFC!
> >
> > On Wed, Oct 19, 2022 at 11:40 AM Jian Feng  >
> > wrote:
> >
> > > Good idea,
> > > this is definitely worth an  RFC
> > > btw should it only depend on Hudi's partition? I feel it should be a
> more
> > > common feature since sometimes customers' data can not update across
> > > partitions
> > >
> > >
> > > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com>
> wrote:
> > >
> > > > Hi all, we have implemented a partition based data ttl management,
> > which
> > > > we can manage ttl for hudi partition by size, expired time and
> > > > sub-partition count. When a partition is detected as outdated, we use
> > > > delete partition interface to delete it, which will generate a
> replace
> > > > commit to mark the data as deleted. The real deletion will then done
> by
> > > > clean service.
> > > >
> > > >
> > > > If community is interested in this idea, maybe we can propose a RFC
> to
> > > > discuss it in detail.
> > > >
> > > >
> > > > > On Oct 19, 2022, at 10:06, Vinoth Chandar 
> wrote:
> > > > >
> > > > > +1 love to discuss this on a RFC proposal.
> > > > >
> > > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> > > > wrote:
> > > > >
> > > > >> That's a very interesting idea.
> > > > >>
> > > > >> Do you want to take a stab at writing a full proposal (in the form
> > of
> > > > RFC)
> > > > >> for it?
> > > > >>
> > > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <
> hbgstc...@gmail.com
> > >
> > > > >> wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Do we have plan to integrate data TTL into HUDI, so we don't have
> > to
> > > > >>> schedule a offline spark job to delete outdated data, just set a
> > TTL
> > > > >>> config, then writer or some offline service will delete old data
> as
> > > > >>> expected.
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> > > --
> > > *Jian Feng,冯健*
> > > Shopee | Engineer | Data Infrastructure
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread stream2000
Since it is marked as outdated by the TTL policy, we think it's better to
delete it anyway. Compaction & clustering should deal with the case where the
source data is already marked as deleted by TTL; otherwise some unused data
will still be left in the partition. What do you think?

> On Oct 19, 2022, at 15:09, Teng Huo  wrote:
> 
> Nice feature!
> @stream2000
> 
> Just one question, can it work with compaction logs? I mean, if there are 
> some log files already marked in a compaction plan, will they be deleted by 
> TTL?
> 
> From: sagar sumit 
> Sent: Wednesday, October 19, 2022 2:42:36 PM
> To: dev@hudi.apache.org 
> Subject: Re: [DISCUSS] Hudi data TTL
> 
> +1 Very nice idea. Looking forward to the RFC!
> 
> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
> wrote:
> 
>> great proposal. Partition TTL is a good starting point. we can extend it to
>> other TTL strategies like column-based, and make it customizable and
>> pluggable. Looking forward to the RFC!
>> 
>> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
>> wrote:
>> 
>>> Good idea,
>>> this is definitely worth an  RFC
>>> btw should it only depend on Hudi's partition? I feel it should be a more
>>> common feature since sometimes customers' data can not update across
>>> partitions
>>> 
>>> 
>>> On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
>>> 
>>>> Hi all, we have implemented a partition based data ttl management,
>> which
>>>> we can manage ttl for hudi partition by size, expired time and
>>>> sub-partition count. When a partition is detected as outdated, we use
>>>> delete partition interface to delete it, which will generate a replace
>>>> commit to mark the data as deleted. The real deletion will then done by
>>>> clean service.
>>>> 
>>>> 
>>>> If community is interested in this idea, maybe we can propose a RFC to
>>>> discuss it in detail.
>>>> 
>>>> 
>>>>> On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
>>>>> 
>>>>> +1 love to discuss this on a RFC proposal.
>>>>> 
>>>>> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
>>>> wrote:
>>>>> 
>>>>>> That's a very interesting idea.
>>>>>> 
>>>>>> Do you want to take a stab at writing a full proposal (in the form
>> of
>>>> RFC)
>>>>>> for it?
>>>>>> 
>>>>>> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang >> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> Do we have plan to integrate data TTL into HUDI, so we don't have
>> to
>>>>>>> schedule a offline spark job to delete outdated data, just set a
>> TTL
>>>>>>> config, then writer or some offline service will delete old data as
>>>>>>> expected.
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> *Jian Feng,冯健*
>>> Shopee | Engineer | Data Infrastructure
>>> 
>> 
>> 
>> --
>> Best,
>> Shiyan
>> 



Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Teng Huo
Nice feature!
@stream2000

Just one question, can it work with compaction logs? I mean, if there are some 
log files already marked in a compaction plan, will they be deleted by TTL?

From: sagar sumit 
Sent: Wednesday, October 19, 2022 2:42:36 PM
To: dev@hudi.apache.org 
Subject: Re: [DISCUSS] Hudi data TTL

+1 Very nice idea. Looking forward to the RFC!

On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
wrote:

> great proposal. Partition TTL is a good starting point. we can extend it to
> other TTL strategies like column-based, and make it customizable and
> pluggable. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
> wrote:
>
> > Good idea,
> > this is definitely worth an  RFC
> > btw should it only depend on Hudi's partition? I feel it should be a more
> > common feature since sometimes customers' data can not update across
> > partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
> >
> > > Hi all, we have implemented a partition based data ttl management,
> which
> > > we can manage ttl for hudi partition by size, expired time and
> > > sub-partition count. When a partition is detected as outdated, we use
> > > delete partition interface to delete it, which will generate a replace
> > > commit to mark the data as deleted. The real deletion will then done by
> > > clean service.
> > >
> > >
> > > If community is interested in this idea, maybe we can propose a RFC to
> > > discuss it in detail.
> > >
> > >
> > > > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> > > >
> > > > +1 love to discuss this on a RFC proposal.
> > > >
> > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> > > wrote:
> > > >
> > > >> That's a very interesting idea.
> > > >>
> > > >> Do you want to take a stab at writing a full proposal (in the form
> of
> > > RFC)
> > > >> for it?
> > > >>
> > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang  >
> > > >> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> Do we have plan to integrate data TTL into HUDI, so we don't have
> to
> > > >>> schedule a offline spark job to delete outdated data, just set a
> TTL
> > > >>> config, then writer or some offline service will delete old data as
> > > >>> expected.
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread sagar sumit
+1 Very nice idea. Looking forward to the RFC!

On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
wrote:

> great proposal. Partition TTL is a good starting point. we can extend it to
> other TTL strategies like column-based, and make it customizable and
> pluggable. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
> wrote:
>
> > Good idea,
> > this is definitely worth an  RFC
> > btw should it only depend on Hudi's partition? I feel it should be a more
> > common feature since sometimes customers' data can not update across
> > partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
> >
> > > Hi all, we have implemented a partition based data ttl management,
> which
> > > we can manage ttl for hudi partition by size, expired time and
> > > sub-partition count. When a partition is detected as outdated, we use
> > > delete partition interface to delete it, which will generate a replace
> > > commit to mark the data as deleted. The real deletion will then done by
> > > clean service.
> > >
> > >
> > > If community is interested in this idea, maybe we can propose a RFC to
> > > discuss it in detail.
> > >
> > >
> > > > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> > > >
> > > > +1 love to discuss this on a RFC proposal.
> > > >
> > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> > > wrote:
> > > >
> > > >> That's a very interesting idea.
> > > >>
> > > >> Do you want to take a stab at writing a full proposal (in the form
> of
> > > RFC)
> > > >> for it?
> > > >>
> > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang  >
> > > >> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> Do we have plan to integrate data TTL into HUDI, so we don't have
> to
> > > >>> schedule a offline spark job to delete outdated data, just set a
> TTL
> > > >>> config, then writer or some offline service will delete old data as
> > > >>> expected.
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Shiyan Xu
Great proposal. Partition TTL is a good starting point. We can extend it to
other TTL strategies, like column-based ones, and make it customizable and
pluggable. Looking forward to the RFC!

On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
wrote:

> Good idea,
> this is definitely worth an  RFC
> btw should it only depend on Hudi's partition? I feel it should be a more
> common feature since sometimes customers' data can not update across
> partitions
>
>
> On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
>
> > Hi all, we have implemented a partition based data ttl management, which
> > we can manage ttl for hudi partition by size, expired time and
> > sub-partition count. When a partition is detected as outdated, we use
> > delete partition interface to delete it, which will generate a replace
> > commit to mark the data as deleted. The real deletion will then done by
> > clean service.
> >
> >
> > If community is interested in this idea, maybe we can propose a RFC to
> > discuss it in detail.
> >
> >
> > > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> > >
> > > +1 love to discuss this on a RFC proposal.
> > >
> > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> > wrote:
> > >
> > >> That's a very interesting idea.
> > >>
> > >> Do you want to take a stab at writing a full proposal (in the form of
> > RFC)
> > >> for it?
> > >>
> > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang 
> > >> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> Do we have plan to integrate data TTL into HUDI, so we don't have to
> > >>> schedule a offline spark job to delete outdated data, just set a TTL
> > >>> config, then writer or some offline service will delete old data as
> > >>> expected.
> > >>>
> > >>
> >
> >
>
> --
> *Jian Feng,冯健*
> Shopee | Engineer | Data Infrastructure
>


-- 
Best,
Shiyan


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Jian Feng
Good idea,
this is definitely worth an RFC.
Btw, should it only depend on Hudi's partitions? I feel it should be a more
general feature, since sometimes customers' data cannot be updated across
partitions.


On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:

> Hi all, we have implemented a partition based data ttl management, which
> we can manage ttl for hudi partition by size, expired time and
> sub-partition count. When a partition is detected as outdated, we use
> delete partition interface to delete it, which will generate a replace
> commit to mark the data as deleted. The real deletion will then done by
> clean service.
>
>
> If community is interested in this idea, maybe we can propose a RFC to
> discuss it in detail.
>
>
> > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> >
> > +1 love to discuss this on a RFC proposal.
> >
> > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> wrote:
> >
> >> That's a very interesting idea.
> >>
> >> Do you want to take a stab at writing a full proposal (in the form of
> RFC)
> >> for it?
> >>
> >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang 
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Do we have plan to integrate data TTL into HUDI, so we don't have to
> >>> schedule a offline spark job to delete outdated data, just set a TTL
> >>> config, then writer or some offline service will delete old data as
> >>> expected.
> >>>
> >>
>
>

-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread stream2000
Hi all, we have implemented a partition-based data TTL management, with which
we can manage TTL for Hudi partitions by size, expiry time and sub-partition
count. When a partition is detected as outdated, we use the delete-partition
interface to delete it, which will generate a replace commit to mark the data
as deleted. The real deletion will then be done by the clean service.


If the community is interested in this idea, maybe we can propose an RFC to
discuss it in detail.
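The scheme described here (expire a partition by size, age, or sub-partition count; issue a delete-partition call that only writes a replace commit; leave physical file removal to the cleaner) can be sketched as follows. This is a minimal illustration with hypothetical types and function names, not Hudi's actual interfaces:

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class PartitionStats:
    path: str
    total_bytes: int
    last_modified: float  # epoch seconds
    sub_partitions: int

@dataclass
class TtlPolicy:
    """A partition is outdated when any configured limit is exceeded;
    None disables that check. Illustrative semantics only."""
    max_bytes: Optional[int] = None
    max_age_seconds: Optional[float] = None
    max_sub_partitions: Optional[int] = None

    def is_outdated(self, p: PartitionStats, now: float) -> bool:
        return (
            (self.max_bytes is not None and p.total_bytes > self.max_bytes)
            or (self.max_age_seconds is not None
                and now - p.last_modified > self.max_age_seconds)
            or (self.max_sub_partitions is not None
                and p.sub_partitions > self.max_sub_partitions)
        )

def expire_partitions(partitions: List[PartitionStats], policy: TtlPolicy,
                      delete_partition: Callable[[str], str],
                      now: Optional[float] = None) -> List[str]:
    """Issue a (hypothetical) delete-partition call for each outdated
    partition. In the scheme above, that call only writes a replace commit
    marking the data as deleted; physical removal is left to the clean
    service."""
    now = time.time() if now is None else now
    return [delete_partition(p.path)
            for p in partitions if policy.is_outdated(p, now)]

# Example: expire partitions untouched for more than 30 days.
now = 100 * 86400.0
stats = [
    PartitionStats("dt=2020-01-01", 10, now - 90 * 86400, 1),  # 90 days old
    PartitionStats("dt=2020-03-01", 10, now - 1 * 86400, 1),   # 1 day old
]
policy = TtlPolicy(max_age_seconds=30 * 86400)
expired = expire_partitions(stats, policy, delete_partition=lambda p: p, now=now)
```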


> On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> 
> +1 love to discuss this on a RFC proposal.
> 
> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin  wrote:
> 
>> That's a very interesting idea.
>> 
>> Do you want to take a stab at writing a full proposal (in the form of RFC)
>> for it?
>> 
>> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang 
>> wrote:
>> 
>>> Hi all,
>>> 
>>> Do we have plan to integrate data TTL into HUDI, so we don't have to
>>> schedule a offline spark job to delete outdated data, just set a TTL
>>> config, then writer or some offline service will delete old data as
>>> expected.
>>> 
>> 



Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Vinoth Chandar
+1, love to discuss this on an RFC proposal.

On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin  wrote:

> That's a very interesting idea.
>
> Do you want to take a stab at writing a full proposal (in the form of RFC)
> for it?
>
> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang 
> wrote:
>
> > Hi all,
> >
> > Do we have plan to integrate data TTL into HUDI, so we don't have to
> > schedule a offline spark job to delete outdated data, just set a TTL
> > config, then writer or some offline service will delete old data as
> > expected.
> >
>


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Alexey Kudinkin
That's a very interesting idea.

Do you want to take a stab at writing a full proposal (in the form of an RFC)
for it?

On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang  wrote:

> Hi all,
>
> Do we have plan to integrate data TTL into HUDI, so we don't have to
> schedule a offline spark job to delete outdated data, just set a TTL
> config, then writer or some offline service will delete old data as
> expected.
>


Re: [DISCUSS] Build tool upgrade

2022-10-17 Thread zhaojing yu
I have worked on some Gradle-based projects and want to share some
thoughts.

Gradle's flexibility and faster builds can certainly bring some
advantages, but it will also greatly increase troubleshooting time due
to Gradle's own bugs, and the Gradle DSL is very different from Maven's.
There is also a significant learning cost for developers in the community.

Full builds do consume a lot of time at release, but users and
developers usually only compile a subset of the modules.

So I think a certain advantage in build time alone is not enough to cover
so much cost.

Best,
Zhaojing

On Mon, Oct 17, 2022 at 19:22, Gary Li  wrote:

> Hi folks,
>
> I'd share my thoughts as well. I personally won't build the whole project
> too often, only before push to the remote branch or make big changes in
> different modules. If I just make some changes and run a test, the IDE will
> only build the necessary modules I believe. In addition, each time I deal
> with dependency issues, the years of maven experience does help me locate
> the issue quickly, especially when the dependency tree is pretty
> complicated. The learning effort and the new environment setup effort are
> considerable as well.
>
> Happy to learn if there are other benefits gradle or bazel could bring to
> us, but if the only benefit is the xx% faster build time, I am a bit
> unconvinced to make this change.
>
> Best,
> Gary
>
> On Mon, Oct 17, 2022 at 2:58 PM Danny Chan  wrote:
>
> > I have a full experience with how Apache Calcite switches from Maven
> > to Gradle, and I want to share some thoughts.
> >
> > The gradle build is fast, but it relies heavily on its local cache,
> > usually it needs too much time to download these cache jars because
> > gradle upgrade itself very frequently.
> >
> > The gradle is very flexible for building, but it also has many bugs,
> > you may need more time to debug its bug compared with building with
> > maven.
> >
> > The gradle DSL for building is a must to learn for all the developers.
> >
> > For all above cases, I don't think switching to gradle is a right
> > decision for Apache Calcite. Julian Hyde which is the creator of
> > Calcite may have more words to say here.
> >
> > So I would not suggest we do that for Hudi.
> >
> >
> > Best,
> > Danny Chan
> >
> > On Sat, Oct 1, 2022 at 13:48, Shiyan Xu  wrote:
> > >
> > > Hi all,
> > >
> > > I'd like to raise a discussion around the build tool for Hudi.
> > >
> > > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> > build
> > > tool compared to modern ones like gradle or bazel. We all want faster
> > > builds, however, we also need to consider the efforts and risks to
> > upgrade,
> > > and the developers' feedback on usability.
> > >
> > > What do you all think about upgrading to gradle or bazel? Please share
> > your
> > > thoughts. Thanks.
> > >
> > > --
> > > Best,
> > > Shiyan
> >
>


Re: [DISCUSS] Build tool upgrade

2022-10-17 Thread Gary Li
Hi folks,

I'd like to share my thoughts as well. I personally don't build the whole
project too often, only before pushing to the remote branch or making big
changes across different modules. If I just make some changes and run a
test, the IDE will only build the necessary modules, I believe. In
addition, each time I deal with dependency issues, my years of Maven
experience do help me locate the issue quickly, especially when the
dependency tree is pretty complicated. The learning effort and the new
environment setup effort are considerable as well.
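
For what it's worth, a typical Maven dependency-debugging session of the kind described above looks roughly like this (the module name and artifact coordinates are placeholders, not real Hudi modules):

```shell
# Print the dependency tree of one module, filtered to the group/artifact
# suspected of causing a conflict (placeholder coordinates).
mvn dependency:tree -pl some-module -Dincludes=com.example:conflicting-lib

# Show omitted/conflicting versions across the whole tree.
mvn dependency:tree -Dverbose | grep -n "conflicting-lib"
```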

Happy to learn if there are other benefits gradle or bazel could bring to
us, but if the only benefit is the xx% faster build time, I am a bit
unconvinced to make this change.

Best,
Gary

On Mon, Oct 17, 2022 at 2:58 PM Danny Chan  wrote:

> I have first-hand experience with how Apache Calcite switched from Maven
> to Gradle, and I want to share some thoughts.
>
> The Gradle build is fast, but it relies heavily on its local cache;
> it usually needs a lot of time to download these cached jars because
> Gradle upgrades itself very frequently.
>
> Gradle is very flexible for building, but it also has many bugs;
> you may need more time to debug its bugs compared with building with
> Maven.
>
> The Gradle DSL for building is a must-learn for all the developers.
>
> For all the above reasons, I don't think switching to Gradle was the
> right decision for Apache Calcite. Julian Hyde, who is the creator of
> Calcite, may have more to say here.
>
> So I would not suggest we do that for Hudi.
>
>
> Best,
> Danny Chan
>
Shiyan Xu  wrote on Sat, Oct 1, 2022 at 13:48:
> >
> > Hi all,
> >
> > I'd like to raise a discussion around the build tool for Hudi.
> >
> > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> build
> > tool compared to modern ones like gradle or bazel. We all want faster
> > builds, however, we also need to consider the efforts and risks to
> upgrade,
> > and the developers' feedback on usability.
> >
> > What do you all think about upgrading to gradle or bazel? Please share
> your
> > thoughts. Thanks.
> >
> > --
> > Best,
> > Shiyan
>


Re: [DISCUSS] Build tool upgrade

2022-10-17 Thread Danny Chan
I have first-hand experience with how Apache Calcite switched from Maven
to Gradle, and I want to share some thoughts.

The Gradle build is fast, but it relies heavily on its local cache;
it usually needs a lot of time to download these cached jars because
Gradle upgrades itself very frequently.

Gradle is very flexible for building, but it also has many bugs;
you may need more time to debug its bugs compared with building with
Maven.

The Gradle DSL for building is a must-learn for all the developers.

For all the above reasons, I don't think switching to Gradle was the
right decision for Apache Calcite. Julian Hyde, who is the creator of
Calcite, may have more to say here.

So I would not suggest we do that for Hudi.


Best,
Danny Chan

Shiyan Xu  wrote on Sat, Oct 1, 2022 at 13:48:
>
> Hi all,
>
> I'd like to raise a discussion around the build tool for Hudi.
>
> Maven has been a mature yet slow (10min to build on 2021 macbook pro) build
> tool compared to modern ones like gradle or bazel. We all want faster
> builds, however, we also need to consider the efforts and risks to upgrade,
> and the developers' feedback on usability.
>
> What do you all think about upgrading to gradle or bazel? Please share your
> thoughts. Thanks.
>
> --
> Best,
> Shiyan


Re: [DISCUSS] Diagnostic reporter

2022-10-14 Thread Forward Xu
+1. Thanks Shiyan Xu and Zhang Yue, this is a very useful feature.

Best,
Forward

sagar sumit  wrote on Mon, Sep 12, 2022 at 18:39:

> Thanks Zhang Yue for drafting the RFC.
> It's an interesting read! I have left some comments.
>
> While exposing certain info such as "sample_hoodie_key",
> we have to consider masking/obfuscation.
>
> Looking forward to the implementation.
>
> Regards,
> Sagar
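
One simple way to implement the masking/obfuscation suggested above (purely a sketch; the key value and digest length are made up for illustration):

```shell
# Replace the raw record key with a short one-way digest before it is
# written into the diagnostic report, so the report can be shared
# without leaking real key values.
sample_key="order_id:12345,ts:2022-09-12"   # made-up example key
masked=$(printf '%s' "$sample_key" | sha256sum | cut -c1-12)
echo "sample_hoodie_key=$masked"
```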
>
> On Wed, Sep 7, 2022 at 1:49 PM Yue Zhang  wrote:
>
> > Hi Hudi,
> > Just raised an RFC about this diagnostic reporter:
> > https://github.com/apache/hudi/pull/6600. Please feel free to leave any
> > comments or concerns if you are interested!
> >
> >
> > Yue Zhang
> > zhangyue921...@163.com
> >
> >
> > On 08/4/2022 19:38, Yue Zhang wrote:
> > Hi Shiyan and everyone,
> > This is a great idea! As a Hudi user, I also struggle with Hudi
> > troubleshooting sometimes. This feature will definitely be able to
> > reduce the burden.
> > So I volunteer to draft a discussion and maybe raise an RFC about it,
> > if you don't mind. Thanks :)
> >
> >
> > Yue Zhang
> > zhangyue921...@163.com
> >
> >
> > On 08/3/2022 00:44, 冯健 wrote:
> > Maybe we can start this with an audit feature? Since we need some sort of
> > "images" to represent "facts", we can create an identity for a writer to
> > link them. And in this audit file, we can label each operation with IP,
> > environment, platform, version, write config, etc.
> >
> > On Sun, 31 Jul 2022 at 12:18, Shiyan Xu 
> > wrote:
> >
> > To bubble this up
> >
> > On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar 
> wrote:
> >
> > +1 from me.
> >
> > It will be very useful if we can have something that can gather
> > troubleshooting info easily.
> > This part takes a while currently.
> >
> > On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
> > wrote:
> >
> > Hi all,
> >
> > When troubleshooting Hudi jobs in users' environments, we always ask
> > users
> > to share configs, environment info, check spark UI, etc. Here is an RFC
> > idea: can we extend the Hudi metrics system and make a diagnostic
> > reporter?
> > It can be turned on like a normal metrics reporter. It should collect
> > common troubleshooting info and save it to JSON or another
> > human-readable text format. Users should be able to run with it and
> > share the diagnosis file.
> > The RFC should discuss what info should / can be collected.
> >
> > Does this make sense? Anyone interested in driving the RFC design and
> > implementation work?
> >
> > --
> > Best,
> > Shiyan
> >
> >
> > --
> > Best,
> > Shiyan
> >
> >
>


Re: [DISCUSS] [RFC] Hudi bundle standards

2022-10-10 Thread Xinyao Tian
Hi Shiyan,


Having carefully read the RFC-63.md on the PR, I really think this feature is
crucial for everyone who builds Hudi from source. For example, when I tried to
compile Hudi 0.12.0 with Flink 1.15, I used the command 'mvn clean package
-DskipTests -Dflink1.15 -Dscala-2.12' but still got the Flink 1.14 bundle.
Also, the documentation highlights support for Flink 1.15, but the Hudi 0.12.0
GitHub readme.md doesn't mention anything about Flink 1.15 in the compile
section. All in all, there are many misleading points about Hudi bundles,
which have to be improved asap.


I really appreciate having this RFC to try to solve all these problems.
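
One way to check which Maven profile actually took effect for a given set of flags (a debugging sketch; the profile and property names mirror the command quoted above and may differ across Hudi versions):

```shell
# Ask Maven which profiles are active for the exact flags used above.
# If no flink1.15 profile appears in the output, the -D flags are not
# activating it and the build falls back to the default bundle.
mvn help:active-profiles -DskipTests -Dflink1.15 -Dscala-2.12

# Profiles can also be activated explicitly by id with -P, e.g.:
mvn clean package -DskipTests -Pflink1.15 -Pscala-2.12
```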
On 10/10/2022 13:36, Shiyan Xu wrote:
Hi Hudi devs and users,

I've raised an RFC around Hudi bundles, aiming to address issues around
dependency conflicts, and to establish standards for bundle jar usage and
change process. Please have a look. Thanks!

https://github.com/apache/hudi/pull/6902

--
Best,
Shiyan


Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Prashant Wason
+1 for incremental builds with a build cache, which would be a huge
productivity boost, especially when working with multiple branches at the
same time.

Prashant


On Mon, Oct 3, 2022 at 11:42 PM Alexey Kudinkin  wrote:

> I think a full project build is slowly gravitating towards 15 min
> already (it's about 12-14 min on my 2021 MacBook).
>
> @Vinoth the most important thing that Maven can't provide us with is
> local incremental builds. Currently you have to build the full dependency
> hierarchy of the project whenever you change even a single file. There
> are some limited workarounds, but they aren't really a replacement for
> fully incremental builds.
>
> Fully incremental builds will be a huge boost to Dev productivity.
>
> On Sun, Oct 2, 2022 at 11:40 PM Pratyaksh Sharma 
> wrote:
>
> > My two cents. I have seen open source projects take more than 20-25
> minutes
> > for building on maven, so I guess we are fine for now. But we can
> > definitely investigate and try to optimize if we can.
> >
> > On Sun, Oct 2, 2022 at 9:33 AM Shiyan Xu 
> > wrote:
> >
> > > Yes, Vinoth, agree on the efforts and impact being big.
> > >
> > > Some perf comparison on gradle vs maven can be found in
> > > https://gradle.org/gradle-vs-maven-performance/ where it claims
> > multi-fold
> > > build time reduction. I'd estimate maybe 2-4 min for a full build and
> > based
> > > on that.
> > >
> > > I mainly hope to collect some feedback on if build time is a dev
> > experience
> > > concern or if it's okay for people in general. If it's the latter case,
> > > then no need to investigate further at this point.
> > >
> > > On Sat, Oct 1, 2022 at 1:52 PM Vinoth Chandar 
> wrote:
> > >
> > > > Hi Raymond.
> > > >
> > > > This would be a large undertaking and a big change for everyone.
> > > >
> > > > What does the build time look like if we switch to gradle or bazel?
> And
> > > do
> > > > we know why it takes 10 min to build and why is that not okay? Given
> we
> > > all
> > > > use IDEs mostly anyway
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Fri, Sep 30, 2022 at 22:48 Shiyan Xu  >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to raise a discussion around the build tool for Hudi.
> > > > >
> > > > > Maven has been a mature yet slow (10min to build on 2021 macbook
> pro)
> > > > build
> > > > > tool compared to modern ones like gradle or bazel. We all want
> faster
> > > > > builds, however, we also need to consider the efforts and risks to
> > > > upgrade,
> > > > > and the developers' feedback on usability.
> > > > >
> > > > > What do you all think about upgrading to gradle or bazel? Please
> > share
> > > > your
> > > > > thoughts. Thanks.
> > > > >
> > > > > --
> > > > > Best,
> > > > > Shiyan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>


Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Alexey Kudinkin
I think a full project build is slowly gravitating towards 15 min already
(it's about 12-14 min on my 2021 MacBook).

@Vinoth the most important thing that Maven can't provide us with is local
incremental builds. Currently you have to build the full dependency
hierarchy of the project whenever you change even a single file. There are
some limited workarounds, but they aren't really a replacement for fully
incremental builds.

Fully incremental builds will be a huge boost to Dev productivity.
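
To illustrate the workaround being referred to, the two workflows look roughly like this (the module name is illustrative, not necessarily a real Hudi module):

```shell
# Maven workaround: rebuild one module plus the modules it depends on
# (-am = --also-make). This still rebuilds the whole upstream chain.
mvn install -pl hudi-client -am -DskipTests

# Gradle, by contrast, tracks task inputs/outputs, so an unchanged
# upstream module is reported UP-TO-DATE and skipped automatically:
./gradlew :hudi-client:build
```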

On Sun, Oct 2, 2022 at 11:40 PM Pratyaksh Sharma 
wrote:

> My two cents. I have seen open source projects take more than 20-25 minutes
> for building on maven, so I guess we are fine for now. But we can
> definitely investigate and try to optimize if we can.
>
> On Sun, Oct 2, 2022 at 9:33 AM Shiyan Xu 
> wrote:
>
> > Yes, Vinoth, agree on the efforts and impact being big.
> >
> > Some perf comparison on gradle vs maven can be found in
> > https://gradle.org/gradle-vs-maven-performance/ where it claims
> multi-fold
> > build time reduction. I'd estimate maybe 2-4 min for a full build and
> based
> > on that.
> >
> > I mainly hope to collect some feedback on if build time is a dev
> experience
> > concern or if it's okay for people in general. If it's the latter case,
> > then no need to investigate further at this point.
> >
> > On Sat, Oct 1, 2022 at 1:52 PM Vinoth Chandar  wrote:
> >
> > > Hi Raymond.
> > >
> > > This would be a large undertaking and a big change for everyone.
> > >
> > > What does the build time look like if we switch to gradle or bazel? And
> > do
> > > we know why it takes 10 min to build and why is that not okay? Given we
> > all
> > > use IDEs mostly anyway
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Fri, Sep 30, 2022 at 22:48 Shiyan Xu 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to raise a discussion around the build tool for Hudi.
> > > >
> > > > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> > > build
> > > > tool compared to modern ones like gradle or bazel. We all want faster
> > > > builds, however, we also need to consider the efforts and risks to
> > > upgrade,
> > > > and the developers' feedback on usability.
> > > >
> > > > What do you all think about upgrading to gradle or bazel? Please
> share
> > > your
> > > > thoughts. Thanks.
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Pratyaksh Sharma
My two cents. I have seen open source projects take more than 20-25 minutes
for building on maven, so I guess we are fine for now. But we can
definitely investigate and try to optimize if we can.

On Sun, Oct 2, 2022 at 9:33 AM Shiyan Xu 
wrote:

> Yes, Vinoth, agree on the efforts and impact being big.
>
> Some perf comparison on gradle vs maven can be found in
> https://gradle.org/gradle-vs-maven-performance/ where it claims multi-fold
> build time reduction. I'd estimate maybe 2-4 min for a full build and based
> on that.
>
> I mainly hope to collect some feedback on if build time is a dev experience
> concern or if it's okay for people in general. If it's the latter case,
> then no need to investigate further at this point.
>
> On Sat, Oct 1, 2022 at 1:52 PM Vinoth Chandar  wrote:
>
> > Hi Raymond.
> >
> > This would be a large undertaking and a big change for everyone.
> >
> > What does the build time look like if we switch to gradle or bazel? And
> do
> > we know why it takes 10 min to build and why is that not okay? Given we
> all
> > use IDEs mostly anyway
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Sep 30, 2022 at 22:48 Shiyan Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > I'd like to raise a discussion around the build tool for Hudi.
> > >
> > > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> > build
> > > tool compared to modern ones like gradle or bazel. We all want faster
> > > builds, however, we also need to consider the efforts and risks to
> > upgrade,
> > > and the developers' feedback on usability.
> > >
> > > What do you all think about upgrading to gradle or bazel? Please share
> > your
> > > thoughts. Thanks.
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Build tool upgrade

2022-10-01 Thread Shiyan Xu
Yes, Vinoth, agree on the efforts and impact being big.

Some perf comparisons of Gradle vs Maven can be found at
https://gradle.org/gradle-vs-maven-performance/ where it claims a
multi-fold build time reduction. Based on that, I'd estimate maybe 2-4
min for a full build.

I mainly hope to collect some feedback on whether build time is a dev
experience concern or whether it's okay for people in general. If it's
the latter case, then there's no need to investigate further at this
point.

On Sat, Oct 1, 2022 at 1:52 PM Vinoth Chandar  wrote:

> Hi Raymond.
>
> This would be a large undertaking and a big change for everyone.
>
> What does the build time look like if we switch to gradle or bazel? And do
> we know why it takes 10 min to build and why is that not okay? Given we all
> use IDEs mostly anyway
>
> Thanks
> Vinoth
>
> On Fri, Sep 30, 2022 at 22:48 Shiyan Xu 
> wrote:
>
> > Hi all,
> >
> > I'd like to raise a discussion around the build tool for Hudi.
> >
> > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> build
> > tool compared to modern ones like gradle or bazel. We all want faster
> > builds, however, we also need to consider the efforts and risks to
> upgrade,
> > and the developers' feedback on usability.
> >
> > What do you all think about upgrading to gradle or bazel? Please share
> your
> > thoughts. Thanks.
> >
> > --
> > Best,
> > Shiyan
> >
>


-- 
Best,
Shiyan


Re: [DISCUSS] Build tool upgrade

2022-09-30 Thread Vinoth Chandar
Hi Raymond.

This would be a large undertaking and a big change for everyone.

What does the build time look like if we switch to gradle or bazel? And do
we know why it takes 10 min to build and why is that not okay? Given we all
use IDEs mostly anyway

Thanks
Vinoth

On Fri, Sep 30, 2022 at 22:48 Shiyan Xu  wrote:

> Hi all,
>
> I'd like to raise a discussion around the build tool for Hudi.
>
> Maven has been a mature yet slow (10min to build on 2021 macbook pro) build
> tool compared to modern ones like gradle or bazel. We all want faster
> builds, however, we also need to consider the efforts and risks to upgrade,
> and the developers' feedback on usability.
>
> What do you all think about upgrading to gradle or bazel? Please share your
> thoughts. Thanks.
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-16 Thread 冯健
Hi Sagar,
HMS shouldn't be the core part; the external table location will depend on
which metastore the user is using.
I'm still working on it and will add more detail in the RFC PR:
https://github.com/apache/hudi/pull/6576


On Fri, 16 Sept 2022 at 11:28, sagar sumit  wrote:

> Automatic lifecycle management based on a few configurations
> would be very useful for the community.
>
> I read the description in
> https://issues.apache.org/jira/browse/HUDI-4677
> May I ask the rationale for choosing
> Hive Metastore to manage the snapshots?
>
> Perhaps the RFC will have more details. Looking forward to it!
>
> Regards,
> Sagar
>
>
> On Wed, Sep 14, 2022 at 8:13 AM 冯健  wrote:
>
> > Hi Ethan,
> >
> > Yes, based on the current situation, we still need to do much extra
> > work to provide the snapshot view feature for the users (or users do
> > this by themselves). I plan to merge the COW part of this feature into
> > 0.13.0 at least, and will consider your suggestion if time is tight.
> > Thanks
> >
> >
> >
> > On Wed, 14 Sept 2022 at 03:02, Y Ethan Guo  wrote:
> >
> > > Hi Feng Jian,
> > >
> > > Looking forward to the RFC!  Is the snapshot view management more like
> > > managing commits / savepoints in the Hudi timeline and hiding Hudi
> > > internals from the users?
> > >
> > > Do you plan to merge the implementation of snapshot view and lifecycle
> > > management for the next major release (0.13.0)?  Timeline-wise, if time
> > is
> > > tight, you may also consider scoping out a subset of features to target
> > > 0.13.0.
> > >
> > > Best,
> > > - Ethan
> > >
> > > On Mon, Sep 12, 2022 at 10:43 PM Sivabalan  wrote:
> > >
> > > > Sounds like a nice feature to have. Eagerly looking forward to the RFC.
> > > >
> > > > On Sat, 27 Aug 2022 at 20:51, 冯健  wrote:
> > > >
> > > > > I attached the image in this Jira Epic
> > > > > https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is
> WIP,
> > > > will
> > > > > create a pr in the next few days
> > > > > Yeah, the basic idea is to implement lifecycle management based on
> > the
> > > > > savepoint and time travel features, providing new ways for the user
> > to
> > > > > operate
> > > > > and coordinate. won't propose any new concept
> > > > >
> > > > > On Sun, 28 Aug 2022 at 02:06, Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > The dev email list does not support showing images unfortunately.
> > you
> > > > may
> > > > > > want to put it behind a link.
> > > > > >
> > > > > > As for the idea itself,
> > > > > >
> > > > > > What I plan to do is to let Hudi support release a snapshot view
> > and
> > > > > > > lifecycle management out-of-box.
> > > > > >
> > > > > >
> > > > > >  Are you planning to extend the savepoint feature to have
> lifecycle
> > > > mgmt
> > > > > > capabilities? We should consolidate overlapping features
> properly.
> > > > > >
> > > > > > On Sun, Aug 21, 2022 at 12:59 PM 冯健 
> wrote:
> > > > > >
> > > > > > > Hi team,
> > > > > > > [image: image.png]
> > > > > > > for the snapshot view scenario, Hudi already provides two
> key
> > > > > > > features to support it:
> > > > > > >
> > > > > > >- Time travel: user provides a timestamp to query a specific
> > > > > snapshot
> > > > > > >view of a Hudi table
> > > > > > >- Savepoint/restore: "savepoint" saves the table as of the
> > > commit
> > > > > time
> > > > > > >so that it lets you restore the table to this savepoint at a
> > > later
> > > > > > point in
> > > > > > >time if need be. but in this case, the user usually uses
> this
> > to
> > > > > > prevent
> > > > > > >cleaning snapshot view at a specific timestamp, only clean
> > > unused
> > > > > > files
> > > > > > >
> > > > > > The situation is that there is some inconvenience for users if
> > > > > > they use them directly:
> > > > > > >
> > > > > > >- Usually users incline to use a meaningful name instead of
> > > > querying
> > > > > > >Hudi table with a timestamp, using the timestamp in SQL may
> > lead
> > > > to
> > > > > > the
> > > > > > >wrong snapshot view being used. for example, we can announce
> > > that
> > > > a
> > > > > > new tag
> > > > > > >of hudi table with table_nameMMDD was released, then the
> > > user
> > > > > can
> > > > > > use
> > > > > > >this new table name to query.
> > > > > > >- Savepoint is not designed for this "snapshot view"
> scenario
> > in
> > > > the
> > > > > > >beginning, it is designed for disaster recovery. let's say a
> > new
> > > > > > snapshot
> > > > > > >view will be created every day, and it has 7 days retention,
> > we
> > > > > should
> > > > > > >support lifecycle management on top of it.
> > > > > > >
> > > > > > > What I plan to do is to let Hudi support release a snapshot
> view
> > > and
> > > > > > > lifecycle management out-of-box. We have already done some work
> > > when
> > > > > > > supporting customers' snapshot view requirements in my company,
> > and
> > > > > 

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-15 Thread sagar sumit
Automatic lifecycle management based on a few configurations
would be very useful for the community.

I read the description in
https://issues.apache.org/jira/browse/HUDI-4677
May I ask the rationale for choosing
Hive Metastore to manage the snapshots?

Perhaps the RFC will have more details. Looking forward to it!

Regards,
Sagar


On Wed, Sep 14, 2022 at 8:13 AM 冯健  wrote:

> Hi Ethan,
>
> Yes, based on the current situation, we still need to do much extra
> work to provide the snapshot view feature for the users (or users do
> this by themselves). I plan to merge the COW part of this feature into
> 0.13.0 at least, and will consider your suggestion if time is tight.
> Thanks
>
>
>
> On Wed, 14 Sept 2022 at 03:02, Y Ethan Guo  wrote:
>
> > Hi Feng Jian,
> >
> > Looking forward to the RFC!  Is the snapshot view management more like
> > managing commits / savepoints in the Hudi timeline and hiding Hudi
> > internals from the users?
> >
> > Do you plan to merge the implementation of snapshot view and lifecycle
> > management for the next major release (0.13.0)?  Timeline-wise, if time
> is
> > tight, you may also consider scoping out a subset of features to target
> > 0.13.0.
> >
> > Best,
> > - Ethan
> >
> > On Mon, Sep 12, 2022 at 10:43 PM Sivabalan  wrote:
> >
> > > Sounds like a nice feature to have. Eagerly looking forward to the RFC.
> > >
> > > On Sat, 27 Aug 2022 at 20:51, 冯健  wrote:
> > >
> > > > I attached the image in this Jira Epic
> > > > https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP,
> > > will
> > > > create a pr in the next few days
> > > > Yeah, the basic idea is to implement lifecycle management based on
> the
> > > > savepoint and time travel features, providing new ways for the user
> to
> > > > operate
> > > > and coordinate. won't propose any new concept
> > > >
> > > > On Sun, 28 Aug 2022 at 02:06, Shiyan Xu  >
> > > > wrote:
> > > >
> > > > > The dev email list does not support showing images unfortunately.
> you
> > > may
> > > > > want to put it behind a link.
> > > > >
> > > > > As for the idea itself,
> > > > >
> > > > > What I plan to do is to let Hudi support release a snapshot view
> and
> > > > > > lifecycle management out-of-box.
> > > > >
> > > > >
> > > > >  Are you planning to extend the savepoint feature to have lifecycle
> > > mgmt
> > > > > capabilities? We should consolidate overlapping features properly.
> > > > >
> > > > > On Sun, Aug 21, 2022 at 12:59 PM 冯健  wrote:
> > > > >
> > > > > > Hi team,
> > > > > > [image: image.png]
> > > > > > for the snapshot view scenario, Hudi already provides two key
> > > > > > features to support it:
> > > > > >
> > > > > >- Time travel: user provides a timestamp to query a specific
> > > > snapshot
> > > > > >view of a Hudi table
> > > > > >- Savepoint/restore: "savepoint" saves the table as of the
> > commit
> > > > time
> > > > > >so that it lets you restore the table to this savepoint at a
> > later
> > > > > point in
> > > > > >time if need be. but in this case, the user usually uses this
> to
> > > > > prevent
> > > > > >cleaning snapshot view at a specific timestamp, only clean
> > unused
> > > > > files
> > > > > >
> > > > > > The situation is that there is some inconvenience for users if
> > > > > > they use them directly:
> > > > > >
> > > > > >- Usually users incline to use a meaningful name instead of
> > > querying
> > > > > >Hudi table with a timestamp, using the timestamp in SQL may
> lead
> > > to
> > > > > the
> > > > > >wrong snapshot view being used. for example, we can announce
> > that
> > > a
> > > > > new tag
> > > > > >of hudi table with table_nameMMDD was released, then the
> > user
> > > > can
> > > > > use
> > > > > >this new table name to query.
> > > > > >- Savepoint is not designed for this "snapshot view" scenario
> in
> > > the
> > > > > >beginning, it is designed for disaster recovery. let's say a
> new
> > > > > snapshot
> > > > > >view will be created every day, and it has 7 days retention,
> we
> > > > should
> > > > > >support lifecycle management on top of it.
> > > > > >
> > > > > > What I plan to do is to let Hudi support release a snapshot view
> > and
> > > > > > lifecycle management out-of-box. We have already done some work
> > when
> > > > > > supporting customers' snapshot view requirements in my company,
> and
> > > > hope
> > > > > to
> > > > > > land this feature in Community too.
> > > > > >
> > > > > > Please feel free to let me know if you have any idea about this.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jian Feng
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best,
> > > > > Shiyan
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-13 Thread 冯健
Hi Ethan,

Yes, based on the current situation, we still need to do much extra
work to provide the snapshot view feature for the users (or users do
this by themselves). I plan to merge the COW part of this feature into
0.13.0 at least, and will consider your suggestion if time is tight.
Thanks



On Wed, 14 Sept 2022 at 03:02, Y Ethan Guo  wrote:

> Hi Feng Jian,
>
> Looking forward to the RFC!  Is the snapshot view management more like
> managing commits / savepoints in the Hudi timeline and hiding Hudi
> internals from the users?
>
> Do you plan to merge the implementation of snapshot view and lifecycle
> management for the next major release (0.13.0)?  Timeline-wise, if time is
> tight, you may also consider scoping out a subset of features to target
> 0.13.0.
>
> Best,
> - Ethan
>
> On Mon, Sep 12, 2022 at 10:43 PM Sivabalan  wrote:
>
> > Sounds like a nice feature to have. Eagerly looking forward to the RFC.
> >
> > On Sat, 27 Aug 2022 at 20:51, 冯健  wrote:
> >
> > > I attached the image in this Jira Epic
> > > https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP,
> > will
> > > create a pr in the next few days
> > > Yeah, the basic idea is to implement lifecycle management based on the
> > > savepoint and time travel features, providing new ways for the user to
> > > operate
> > > and coordinate. won't propose any new concept
> > >
> > > On Sun, 28 Aug 2022 at 02:06, Shiyan Xu 
> > > wrote:
> > >
> > > > The dev email list does not support showing images unfortunately. you
> > may
> > > > want to put it behind a link.
> > > >
> > > > As for the idea itself,
> > > >
> > > > What I plan to do is to let Hudi support release a snapshot view and
> > > > > lifecycle management out-of-box.
> > > >
> > > >
> > > >  Are you planning to extend the savepoint feature to have lifecycle
> > mgmt
> > > > capabilities? We should consolidate overlapping features properly.
> > > >
> > > > On Sun, Aug 21, 2022 at 12:59 PM 冯健  wrote:
> > > >
> > > > > Hi team,
> > > > > [image: image.png]
> > > > > for the snapshot view scenario, Hudi already provides two key
> > > > > features to support it:
> > > > >
> > > > >- Time travel: the user provides a timestamp to query a specific
> > > > >snapshot view of a Hudi table.
> > > > >- Savepoint/restore: "savepoint" saves the table as of the commit
> > > > >time so that it lets you restore the table to this savepoint at a
> > > > >later point in time if need be. But in this case, the user usually
> > > > >uses this to prevent cleaning the snapshot view at a specific
> > > > >timestamp, and only cleans unused files.
> > > > >
> > > > > The situation is that there is some inconvenience for users if
> > > > > they use them directly:
> > > > >
> > > > >- Users usually prefer to use a meaningful name instead of
> > > > >querying a Hudi table with a timestamp; using the timestamp in SQL
> > > > >may lead to the wrong snapshot view being used. For example, we
> > > > >can announce that a new tag of the Hudi table named table_nameMMDD
> > > > >was released, and then the user can use this new table name to
> > > > >query.
> > > > >- Savepoint was not designed for this "snapshot view" scenario in
> > > > >the beginning; it was designed for disaster recovery. Let's say a
> > > > >new snapshot view will be created every day with 7 days'
> > > > >retention; we should support lifecycle management on top of it.
> > > > >
> > > > > What I plan to do is to let Hudi support release a snapshot view
> and
> > > > > lifecycle management out-of-box. We have already done some work
> when
> > > > > supporting customers' snapshot view requirements in my company, and
> > > hope
> > > > to
> > > > > land this feature in Community too.
> > > > >
> > > > > Please feel free to let me know if you have any idea about this.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jian Feng
> > > > >
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-13 Thread Y Ethan Guo
Hi Feng Jian,

Looking forward to the RFC!  Is the snapshot view management more like
managing commits / savepoints in the Hudi timeline and hiding Hudi
internals from the users?

Do you plan to merge the implementation of snapshot view and lifecycle
management for the next major release (0.13.0)?  Timeline-wise, if time is
tight, you may also consider scoping out a subset of features to target
0.13.0.

Best,
- Ethan

On Mon, Sep 12, 2022 at 10:43 PM Sivabalan  wrote:

> Sounds like a nice feature to have. Eagerly looking forward to the RFC.
>
> On Sat, 27 Aug 2022 at 20:51, 冯健  wrote:
>
> > I attached the image in this Jira Epic
> > https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP,
> will
> > create a pr in the next few days
> > Yeah, the basic idea is to implement lifecycle management based on the
> > savepoint and time travel features, providing new ways for the user to
> > operate
> > and coordinate. won't propose any new concept
> >
> > On Sun, 28 Aug 2022 at 02:06, Shiyan Xu 
> > wrote:
> >
> > > The dev email list does not support showing images unfortunately. you
> may
> > > want to put it behind a link.
> > >
> > > As for the idea itself,
> > >
> > > What I plan to do is to let Hudi support release a snapshot view and
> > > > lifecycle management out-of-box.
> > >
> > >
> > >  Are you planning to extend the savepoint feature to have lifecycle
> mgmt
> > > capabilities? We should consolidate overlapping features properly.
> > >
> > > On Sun, Aug 21, 2022 at 12:59 PM 冯健  wrote:
> > >
> > > > Hi team,
> > > > [image: image.png]
> > > > for the snapshot view scenario, Hudi already provides two key
> > > > features to support it:
> > > >
> > > >- Time travel: user provides a timestamp to query a specific
> > snapshot
> > > >view of a Hudi table
> > > >- Savepoint/restore: "savepoint" saves the table as of the commit time
> > > >so that it lets you restore the table to this savepoint at a later
> > > >point in time if need be. But in this case, the user usually uses this
> > > >to prevent the snapshot view at a specific timestamp from being
> > > >cleaned, so that only unused files are cleaned
> > > >
> > > > There is some inconvenience for users if they use these directly:
> > > >
> > > >- Usually users prefer to use a meaningful name instead of querying a
> > > >Hudi table with a timestamp; using the timestamp in SQL may lead to
> > > >the wrong snapshot view being used. For example, we can announce that
> > > >a new tag of the Hudi table with table_nameMMDD was released; then the
> > > >user can use this new table name to query.
> > > >- Savepoint was not designed for this "snapshot view" scenario in the
> > > >beginning; it was designed for disaster recovery. Let's say a new
> > > >snapshot view will be created every day with 7-day retention; we
> > > >should support lifecycle management on top of it.
> > > >
> > > > What I plan to do is let Hudi support releasing a snapshot view and
> > > > lifecycle management out of the box. We have already done some work
> > > > supporting customers' snapshot view requirements at my company, and
> > > > hope to land this feature in the community too.
> > > >
> > > > Please feel free to let me know if you have any idea about this.
> > > >
> > > > Thanks,
> > > >
> > > > Jian Feng
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>
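The proposal above, mapping human-friendly tag names to commit instants with retention-based lifecycle management, can be sketched roughly as follows. This is a minimal illustration only; the `SnapshotTags` class and its methods are hypothetical and not part of Hudi:

```python
from datetime import datetime, timedelta

class SnapshotTags:
    """Hypothetical registry mapping tag names (e.g. 'orders20220901')
    to Hudi commit instants, with retention-based lifecycle management."""

    def __init__(self, retention_days=7):
        self.retention = timedelta(days=retention_days)
        self.tags = {}  # tag name -> (commit instant, creation time)

    def release(self, table, instant, created_at):
        # The tag name embeds the date, so users query by name rather
        # than remembering a raw commit timestamp.
        tag = f"{table}{created_at:%Y%m%d}"
        self.tags[tag] = (instant, created_at)
        return tag

    def resolve(self, tag):
        # A query layer would rewrite a query against the tag name into
        # a time-travel query "as of" the resolved commit instant.
        return self.tags[tag][0]

    def expire(self, now):
        # Lifecycle management: drop tags older than the retention window,
        # so the underlying savepoints can be released and files reclaimed.
        expired = [t for t, (_, ts) in self.tags.items()
                   if now - ts > self.retention]
        for t in expired:
            del self.tags[t]
        return expired

tags = SnapshotTags(retention_days=7)
t1 = tags.release("orders", "20220901103000", datetime(2022, 9, 1))
tags.release("orders", "20220910103000", datetime(2022, 9, 10))
assert t1 == "orders20220901"
assert tags.resolve("orders20220910") == "20220910103000"
assert tags.expire(datetime(2022, 9, 12)) == ["orders20220901"]
```

In a real implementation the registry would live in table metadata and the expiry step would coordinate with Hudi's savepoint release and cleaner, as the thread discusses.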


Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-12 Thread Sivabalan
Sounds like a nice feature to have. Eagerly looking forward to the RFC.

On Sat, 27 Aug 2022 at 20:51, 冯健  wrote:

> I attached the image in this Jira Epic
> https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP; I will
> create a PR in the next few days.
> Yeah, the basic idea is to implement lifecycle management based on the
> savepoint and time travel features, providing new ways for the user to
> operate and coordinate. It won't propose any new concepts.
>
> On Sun, 28 Aug 2022 at 02:06, Shiyan Xu 
> wrote:
>
> > The dev email list does not support showing images, unfortunately. You may
> > want to put it behind a link.
> >
> > As for the idea itself,
> >
> > What I plan to do is let Hudi support releasing a snapshot view and
> > > lifecycle management out of the box.
> >
> >
> >  Are you planning to extend the savepoint feature to have lifecycle mgmt
> > capabilities? We should consolidate overlapping features properly.
> >
> > On Sun, Aug 21, 2022 at 12:59 PM 冯健  wrote:
> >
> > > Hi team,
> > > [image: image.png]
> > > for the snapshot view scenario, Hudi already provides two key
> > > features to support it:
> > >
> > >- Time travel: user provides a timestamp to query a specific
> snapshot
> > >view of a Hudi table
> > >- Savepoint/restore: "savepoint" saves the table as of the commit time
> > >so that it lets you restore the table to this savepoint at a later point
> > >in time if need be. But in this case, the user usually uses this to
> > >prevent the snapshot view at a specific timestamp from being cleaned, so
> > >that only unused files are cleaned
> > >
> > > There is some inconvenience for users if they use these directly:
> > >
> > >- Usually users prefer to use a meaningful name instead of querying a
> > >Hudi table with a timestamp; using the timestamp in SQL may lead to the
> > >wrong snapshot view being used. For example, we can announce that a new
> > >tag of the Hudi table with table_nameMMDD was released; then the user
> > >can use this new table name to query.
> > >- Savepoint was not designed for this "snapshot view" scenario in the
> > >beginning; it was designed for disaster recovery. Let's say a new
> > >snapshot view will be created every day with 7-day retention; we should
> > >support lifecycle management on top of it.
> > >
> > > What I plan to do is let Hudi support releasing a snapshot view and
> > > lifecycle management out of the box. We have already done some work
> > > supporting customers' snapshot view requirements at my company, and hope
> > > to land this feature in the community too.
> > >
> > > Please feel free to let me know if you have any idea about this.
> > >
> > > Thanks,
> > >
> > > Jian Feng
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


-- 
Regards,
-Sivabalan


Re: [DISCUSS] Diagnostic reporter

2022-09-12 Thread sagar sumit
Thanks Zhang Yue for drafting the RFC.
It's an interesting read! I have left some comments.

While exposing certain info such as "sample_hoodie_key",
we have to consider masking/obfuscation.

Looking forward to the implementation.

Regards,
Sagar

On Wed, Sep 7, 2022 at 1:49 PM Yue Zhang  wrote:

> Hi Hudi,
> Just raised an RFC about this diagnostic reporter:
> https://github.com/apache/hudi/pull/6600. PLEASE feel free to leave any
> comments or concerns if you are interested!
>
>
> | |
> Yue Zhang
> |
> |
> zhangyue921...@163.com
> |
>
>
> On 08/4/2022 19:38,Yue Zhang wrote:
> Hi Shiyan and everyone,
> This is a great idea! As one of Hudi user, I also struggle to Hudi
> troubleshooting sometimes. With this feature, it will definitely be able to
> reduce the burden.
> So I volunteer to draft a discuss and maybe raise a RFC about if you
> don't mind. Thanks :)
>
>
> | |
> Yue Zhang
> |
> |
> zhangyue921...@163.com
> |
>
>
> On 08/3/2022 00:44,冯健 wrote:
> Maybe we can start this with an audit feature? Since we need some sort of
> "images" to represent “facts”, can create an identity of a writer to link
> them. and in this audit file, we can label each operation with IP,
> environment, platform, version, write config and etc.
>
> On Sun, 31 Jul 2022 at 12:18, Shiyan Xu 
> wrote:
>
> To bubble this up
>
> On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:
>
> +1 from me.
>
> It will be very useful if we can have something that can gather
> troubleshooting info easily.
> This part takes a while currently.
>
> On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
> wrote:
>
> Hi all,
>
> When troubleshooting Hudi jobs in users' environments, we always ask
> users
> to share configs, environment info, check spark UI, etc. Here is an RFC
> idea: can we extend the Hudi metrics system and make a diagnostic
> reporter?
> It can be turned on like a normal metrics reporter. it should collect
> common troubleshooting info and save to json or other human-readable
> text
> format. Users should be able to run with it and share the diagnosis
> file.
> The RFC should discuss what info should / can be collected.
>
> Does this make sense? Anyone interested in driving the RFC design and
> implementation work?
>
> --
> Best,
> Shiyan
>
>
> --
> Best,
> Shiyan
>
>


Re: [DISCUSS] Diagnostic reporter

2022-09-07 Thread Yue Zhang
Hi Hudi, 
Just raised an RFC about this diagnostic reporter:
https://github.com/apache/hudi/pull/6600. PLEASE feel free to leave any 
comments or concerns if you are interested!


| |
Yue Zhang
|
|
zhangyue921...@163.com
|


On 08/4/2022 19:38,Yue Zhang wrote:
Hi Shiyan and everyone,
This is a great idea! As a Hudi user, I also struggle with Hudi
troubleshooting sometimes. With this feature, it will definitely be able to
reduce the burden.
So I volunteer to draft a discussion and maybe raise an RFC about it, if you
don't mind. Thanks :)


| |
Yue Zhang
|
|
zhangyue921...@163.com
|


On 08/3/2022 00:44,冯健 wrote:
Maybe we can start this with an audit feature? Since we need some sort of
"images" to represent “facts”, can create an identity of a writer to link
them. and in this audit file, we can label each operation with IP,
environment, platform, version, write config and etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu  wrote:

To bubble this up

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:

+1 from me.

It will be very useful if we can have something that can gather
troubleshooting info easily.
This part takes a while currently.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
wrote:

Hi all,

When troubleshooting Hudi jobs in users' environments, we always ask
users
to share configs, environment info, check spark UI, etc. Here is an RFC
idea: can we extend the Hudi metrics system and make a diagnostic
reporter?
It can be turned on like a normal metrics reporter. it should collect
common troubleshooting info and save to json or other human-readable
text
format. Users should be able to run with it and share the diagnosis
file.
The RFC should discuss what info should / can be collected.

Does this make sense? Anyone interested in driving the RFC design and
implementation work?

--
Best,
Shiyan


--
Best,
Shiyan



Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-08-27 Thread 冯健
I attached the image in this Jira Epic
https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP; I will
create a PR in the next few days.
Yeah, the basic idea is to implement lifecycle management based on the
savepoint and time travel features, providing new ways for the user to operate
and coordinate. It won't propose any new concepts.

On Sun, 28 Aug 2022 at 02:06, Shiyan Xu  wrote:

> The dev email list does not support showing images, unfortunately. You may
> want to put it behind a link.
>
> As for the idea itself,
>
> What I plan to do is let Hudi support releasing a snapshot view and
> > lifecycle management out of the box.
>
>
>  Are you planning to extend the savepoint feature to have lifecycle mgmt
> capabilities? We should consolidate overlapping features properly.
>
> On Sun, Aug 21, 2022 at 12:59 PM 冯健  wrote:
>
> > Hi team,
> > [image: image.png]
> > for the snapshot view scenario, Hudi already provides two key
> > features to support it:
> >
> >- Time travel: user provides a timestamp to query a specific snapshot
> >view of a Hudi table
> >- Savepoint/restore: "savepoint" saves the table as of the commit time
> >so that it lets you restore the table to this savepoint at a later point
> >in time if need be. But in this case, the user usually uses this to
> >prevent the snapshot view at a specific timestamp from being cleaned, so
> >that only unused files are cleaned
> >
> > There is some inconvenience for users if they use these directly:
> >
> >- Usually users prefer to use a meaningful name instead of querying a
> >Hudi table with a timestamp; using the timestamp in SQL may lead to the
> >wrong snapshot view being used. For example, we can announce that a new
> >tag of the Hudi table with table_nameMMDD was released; then the user can
> >use this new table name to query.
> >- Savepoint was not designed for this "snapshot view" scenario in the
> >beginning; it was designed for disaster recovery. Let's say a new snapshot
> >view will be created every day with 7-day retention; we should support
> >lifecycle management on top of it.
> >
> > What I plan to do is let Hudi support releasing a snapshot view and
> > lifecycle management out of the box. We have already done some work
> > supporting customers' snapshot view requirements at my company, and hope
> > to land this feature in the community too.
> >
> > Please feel free to let me know if you have any idea about this.
> >
> > Thanks,
> >
> > Jian Feng
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-08-27 Thread Shiyan Xu
The dev email list does not support showing images, unfortunately. You may
want to put it behind a link.

As for the idea itself,

What I plan to do is let Hudi support releasing a snapshot view and
> lifecycle management out of the box.


 Are you planning to extend the savepoint feature to have lifecycle mgmt
capabilities? We should consolidate overlapping features properly.

On Sun, Aug 21, 2022 at 12:59 PM 冯健  wrote:

> Hi team,
> [image: image.png]
> for the snapshot view scenario, Hudi already provides two key
> features to support it:
>
>- Time travel: user provides a timestamp to query a specific snapshot
>view of a Hudi table
>- Savepoint/restore: "savepoint" saves the table as of the commit time
>so that it lets you restore the table to this savepoint at a later point in
>time if need be. But in this case, the user usually uses this to prevent the
>snapshot view at a specific timestamp from being cleaned, so that only
>unused files are cleaned
>
> There is some inconvenience for users if they use these directly:
>
>- Usually users prefer to use a meaningful name instead of querying a
>Hudi table with a timestamp; using the timestamp in SQL may lead to the
>wrong snapshot view being used. For example, we can announce that a new tag
>of the Hudi table with table_nameMMDD was released; then the user can use
>this new table name to query.
>- Savepoint was not designed for this "snapshot view" scenario in the
>beginning; it was designed for disaster recovery. Let's say a new snapshot
>view will be created every day with 7-day retention; we should support
>lifecycle management on top of it.
>
> What I plan to do is let Hudi support releasing a snapshot view and
> lifecycle management out of the box. We have already done some work
> supporting customers' snapshot view requirements at my company, and hope to
> land this feature in the community too.
>
> Please feel free to let me know if you have any idea about this.
>
> Thanks,
>
> Jian Feng
>


-- 
Best,
Shiyan


Re: [DISCUSS]: Integrate column stats index with all query engines

2022-08-10 Thread Pratyaksh Sharma
Surely we can work together once we get some feedback on the RFC Meng!

On Thu, Aug 11, 2022 at 9:32 AM 1037817390 
wrote:

> +1 for this
> It would be better to provide some filter converters to facilitate the
> integration of the engines,
> e.g., converting a Presto domain to a Hudi domain.
>
>
>
> And I have already finished the first version of data skipping/partition
> pruning/filter pushdown for Presto,
>
> https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce
>
> maybe we can work together
>
>
>
>
>
>
>
> 孟涛
> mengtao0...@qq.com
>
>
>
> 
>
>
>
>
> ------ Original message ------
> From: "dev" <vin...@apache.org>
> Sent: Thursday, August 11, 2022, 12:11 PM
> To: "dev"
> Subject: Re: [DISCUSS]: Integrate column stats index with all query engines
>
>
>
> +1 for this.
>
> Suggested new reviewers on the RFC.
> https://github.com/apache/hudi/pull/6345/files#r943073339
>
> On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma  
> wrote:
>
>  Hello community,
> 
>  With the introduction of multi modal index in Hudi, there is a lot of
> scope
>  for improvement on the querying side. There are 2 major ways of
> reducing
>  the data scan at the time of querying - partition pruning and file
> pruning.
>  While with the latest developments in the community, partition
> pruning is
>  supported for commonly used query engines like spark, presto and
> hive, File
>  pruning using column stats index is only supported for spark and
> flink.
> 
>  We intend to support data skipping for the rest of the engines as well
>  which include hive, presto and trino. I have written a draft RFC here
> -
>  https://github.com/apache/hudi/pull/6345.
> 
>  Please take a look and let me know what you think. Once we have some
>  feedback from the community, we can decide on the next steps.
> 


Re: [DISCUSS]: Integrate column stats index with all query engines

2022-08-10 Thread Vinoth Chandar
+1 for this.

Suggested new reviewers on the RFC.
https://github.com/apache/hudi/pull/6345/files#r943073339

On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma 
wrote:

> Hello community,
>
> With the introduction of multi modal index in Hudi, there is a lot of scope
> for improvement on the querying side. There are 2 major ways of reducing
> the data scan at the time of querying - partition pruning and file pruning.
> While with the latest developments in the community, partition pruning is
> supported for commonly used query engines like spark, presto and hive, File
> pruning using column stats index is only supported for spark and flink.
>
> We intend to support data skipping for the rest of the engines as well
> which include hive, presto and trino. I have written a draft RFC here -
> https://github.com/apache/hudi/pull/6345.
>
> Please take a look and let me know what you think. Once we have some
> feedback from the community, we can decide on the next steps.
>
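File pruning with a column stats index, as discussed in this thread, amounts to skipping data files whose per-column min/max ranges cannot overlap a query predicate. A minimal, engine-agnostic sketch (illustrative only; this is not the Hudi metadata table API):

```python
# Each data file carries min/max stats for a column, e.g. as stored in a
# column stats index. A file can be skipped when the predicate's value
# range does not overlap the file's [min, max] range.
files = [
    {"path": "f1.parquet", "min": 0,   "max": 99},
    {"path": "f2.parquet", "min": 100, "max": 199},
    {"path": "f3.parquet", "min": 200, "max": 299},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps [lo, hi]."""
    return [f["path"] for f in files if f["max"] >= lo and f["min"] <= hi]

# Predicate: WHERE col BETWEEN 150 AND 250 -> f1 can be skipped entirely.
assert prune(files, 150, 250) == ["f2.parquet", "f3.parquet"]
```

Engine integration is then a matter of translating each engine's native predicate representation (e.g. a Presto/Trino domain) into such ranges before planning the scan, which is the converter idea raised earlier in the thread.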


Re: [DISCUSS] Diagnostic reporter

2022-08-05 Thread Shiyan Xu
Sure, Zhang Yue, feel free to initiate the RFC!

On Fri, Aug 5, 2022 at 4:57 AM 田昕峣 (Xinyao Tian) 
wrote:

> Hi Shiyan and everyone,
>
>
> Definitely this feature is very important. We really need to gather error
> info to fix bugs more efficiently.
>
>
> If there's anything I can help with, please feel free to let me know :)
>
>
> Regards,
> Xinyao
>
>
>
>
> Hi Shiyan and everyone,
> This is a great idea! As a Hudi user, I also struggle with Hudi
> troubleshooting sometimes. With this feature, it will definitely be able to
> reduce the burden.
> So I volunteer to draft a discussion and maybe raise an RFC about it, if you
> don't mind. Thanks :)
>
>
> | |
> Yue Zhang
> |
> |
> zhangyue921...@163.com
> |
>
>
> On 08/3/2022 00:44,冯健 wrote:
> Maybe we can start this with an audit feature? Since we need some sort of
> "images" to represent “facts”, can create an identity of a writer to link
> them. and in this audit file, we can label each operation with IP,
> environment, platform, version, write config and etc.
>
> On Sun, 31 Jul 2022 at 12:18, Shiyan Xu 
> wrote:
>
> To bubble this up
>
> On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:
>
> +1 from me.
>
> It will be very useful if we can have something that can gather
> troubleshooting info easily.
> This part takes a while currently.
>
> On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
> wrote:
>
> Hi all,
>
> When troubleshooting Hudi jobs in users' environments, we always ask
> users
> to share configs, environment info, check spark UI, etc. Here is an RFC
> idea: can we extend the Hudi metrics system and make a diagnostic
> reporter?
> It can be turned on like a normal metrics reporter. it should collect
> common troubleshooting info and save to json or other human-readable
> text
> format. Users should be able to run with it and share the diagnosis
> file.
> The RFC should discuss what info should / can be collected.
>
> Does this make sense? Anyone interested in driving the RFC design and
> implementation work?
>
> --
> Best,
> Shiyan
>
>
> --
> Best,
> Shiyan
>
>

-- 
Best,
Shiyan


Re: [DISCUSS] Diagnostic reporter

2022-08-05 Thread Xinyao Tian
Hi Shiyan and everyone,


Definitely this feature is very important. We really need to gather error info
to fix bugs more efficiently.


If there's anything I can help with, please feel free to let me know :)


Regards,
Xinyao




Hi Shiyan and everyone,
This is a great idea! As a Hudi user, I also struggle with Hudi
troubleshooting sometimes. With this feature, it will definitely be able to
reduce the burden.
So I volunteer to draft a discussion and maybe raise an RFC about it, if you
don't mind. Thanks :)


| |
Yue Zhang
|
|
zhangyue921...@163.com
|


On 08/3/2022 00:44,冯健 wrote:
Maybe we can start this with an audit feature? Since we need some sort of
"images" to represent “facts”, can create an identity of a writer to link
them. and in this audit file, we can label each operation with IP,
environment, platform, version, write config and etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu  wrote:

To bubble this up

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:

+1 from me.

It will be very useful if we can have something that can gather
troubleshooting info easily.
This part takes a while currently.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
wrote:

Hi all,

When troubleshooting Hudi jobs in users' environments, we always ask
users
to share configs, environment info, check spark UI, etc. Here is an RFC
idea: can we extend the Hudi metrics system and make a diagnostic
reporter?
It can be turned on like a normal metrics reporter. it should collect
common troubleshooting info and save to json or other human-readable
text
format. Users should be able to run with it and share the diagnosis
file.
The RFC should discuss what info should / can be collected.

Does this make sense? Anyone interested in driving the RFC design and
implementation work?

--
Best,
Shiyan


--
Best,
Shiyan



Re: [DISCUSS] Diagnostic reporter

2022-08-04 Thread Yue Zhang
Hi Shiyan and everyone,
This is a great idea! As a Hudi user, I also struggle with Hudi
troubleshooting sometimes. With this feature, it will definitely be able to
reduce the burden.
So I volunteer to draft a discussion and maybe raise an RFC about it, if you
don't mind. Thanks :)


| |
Yue Zhang
|
|
zhangyue921...@163.com
|


On 08/3/2022 00:44,冯健 wrote:
Maybe we can start this with an audit feature? Since we need some sort of
"images" to represent “facts”, can create an identity of a writer to link
them. and in this audit file, we can label each operation with IP,
environment, platform, version, write config and etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu  wrote:

To bubble this up

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:

+1 from me.

It will be very useful if we can have something that can gather
troubleshooting info easily.
This part takes a while currently.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
wrote:

Hi all,

When troubleshooting Hudi jobs in users' environments, we always ask
users
to share configs, environment info, check spark UI, etc. Here is an RFC
idea: can we extend the Hudi metrics system and make a diagnostic
reporter?
It can be turned on like a normal metrics reporter. it should collect
common troubleshooting info and save to json or other human-readable
text
format. Users should be able to run with it and share the diagnosis
file.
The RFC should discuss what info should / can be collected.

Does this make sense? Anyone interested in driving the RFC design and
implementation work?

--
Best,
Shiyan


--
Best,
Shiyan



Re: [DISCUSS] Diagnostic reporter

2022-08-02 Thread 冯健
Maybe we can start this with an audit feature? Since we need some sort of
"images" to represent “facts”, can create an identity of a writer to link
them. and in this audit file, we can label each operation with IP,
environment, platform, version, write config and etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu  wrote:

> To bubble this up
>
> On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:
>
> > +1 from me.
> >
> > It will be very useful if we can have something that can gather
> > troubleshooting info easily.
> > This part takes a while currently.
> >
> > On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > When troubleshooting Hudi jobs in users' environments, we always ask
> > users
> > > to share configs, environment info, check spark UI, etc. Here is an RFC
> > > idea: can we extend the Hudi metrics system and make a diagnostic
> > reporter?
> > > It can be turned on like a normal metrics reporter. it should collect
> > > common troubleshooting info and save to json or other human-readable
> text
> > > format. Users should be able to run with it and share the diagnosis
> file.
> > > The RFC should discuss what info should / can be collected.
> > >
> > > Does this make sense? Anyone interested in driving the RFC design and
> > > implementation work?
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Diagnostic reporter

2022-07-30 Thread Shiyan Xu
To bubble this up

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:

> +1 from me.
>
> It will be very useful if we can have something that can gather
> troubleshooting info easily.
> This part takes a while currently.
>
> On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
> wrote:
>
> > Hi all,
> >
> > When troubleshooting Hudi jobs in users' environments, we always ask
> users
> > to share configs, environment info, check spark UI, etc. Here is an RFC
> > idea: can we extend the Hudi metrics system and make a diagnostic
> reporter?
> > It can be turned on like a normal metrics reporter. it should collect
> > common troubleshooting info and save to json or other human-readable text
> > format. Users should be able to run with it and share the diagnosis file.
> > The RFC should discuss what info should / can be collected.
> >
> > Does this make sense? Anyone interested in driving the RFC design and
> > implementation work?
> >
> > --
> > Best,
> > Shiyan
> >
>
-- 
Best,
Shiyan


Re: [DISCUSS] Diagnostic reporter

2022-06-15 Thread Vinoth Chandar
+1 from me.

It will be very useful if we can have something that can gather
troubleshooting info easily.
This part takes a while currently.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
wrote:

> Hi all,
>
> When troubleshooting Hudi jobs in users' environments, we always ask users
> to share configs, environment info, check spark UI, etc. Here is an RFC
> idea: can we extend the Hudi metrics system and make a diagnostic reporter?
> It can be turned on like a normal metrics reporter. it should collect
> common troubleshooting info and save to json or other human-readable text
> format. Users should be able to run with it and share the diagnosis file.
> The RFC should discuss what info should / can be collected.
>
> Does this make sense? Anyone interested in driving the RFC design and
> implementation work?
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-06-15 Thread Shimin Yang
Hi all, the proposal of hudi sync meeting for Chinese community is
attached. The first sync meeting will be held online on Thursday, June 29
at 10:00 AM CST. Welcome everyone to the meeting!
Hello everyone, below is the proposal document for the Hudi Chinese community
sync meeting. The first meeting will be held online at 10 AM Beijing time on
June 29. Everyone is welcome to join!
Hudi Chinese Community Sync Meeting Proposal


Vinoth Chandar wrote on Fri, May 27, 2022 at 07:10:

> Great! Thanks for volunteering
>
> On Thu, May 26, 2022 at 02:09 Shiyan Xu 
> wrote:
>
> > Awesome! looking forward to an initial proposal!
> >
> > On Thu, May 26, 2022 at 4:17 PM Shimin Yang  wrote:
> >
> > > Hi Shiyan, I'm from the ByteDance data lake team, and our team would
> > > like to drive and host the Hudi sync meetings for the Chinese community.
> > >
> > > Shiyan Xu wrote on Thu, May 26, 2022 at 16:14:
> > >
> > > > Related info: we are noting down the current community sync info here
> > > > https://hudi.apache.org/community/syncs
> > > >
> > > >
> > > > On Thu, May 26, 2022 at 3:44 PM Shiyan Xu <
> xu.shiyan.raym...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > This is a topic brought up previously, and also recently raised in
> > > > > this issue: we are thinking of hosting another regular sync meeting
> > > > > for the Chinese community, with a more suitable time, conversing in
> > > > > Chinese. This requires efforts on coordinating time, agenda,
> > > > > speakers, and hosting platform. Hence I would like to call out for
> > > > > volunteers to start driving this. Thank you.
> > > > >
> > > > > --
> > > > > Best,
> > > > > Shiyan
> > > > >
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Vinoth Chandar
Great! Thanks for volunteering

On Thu, May 26, 2022 at 02:09 Shiyan Xu  wrote:

> Awesome! looking forward to an initial proposal!
>
> On Thu, May 26, 2022 at 4:17 PM Shimin Yang  wrote:
>
> > Hi Shiyan, I'm from the ByteDance data lake team, and our team would like
> > to drive and host the Hudi sync meetings for the Chinese community.
> >
> > Shiyan Xu wrote on Thu, May 26, 2022 at 16:14:
> >
> > > Related info: we are noting down the current community sync info here
> > > https://hudi.apache.org/community/syncs
> > >
> > >
> > > On Thu, May 26, 2022 at 3:44 PM Shiyan Xu  >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is a topic brought up previously, and also recently raised in
> > > > this issue: we are thinking of hosting another regular sync meeting
> > > > for the Chinese community, with a more suitable time, conversing in
> > > > Chinese. This requires efforts on coordinating time, agenda, speakers,
> > > > and hosting platform. Hence I would like to call out for volunteers to
> > > > start driving this. Thank you.
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Shiyan Xu
Awesome! looking forward to an initial proposal!

On Thu, May 26, 2022 at 4:17 PM Shimin Yang  wrote:

> Hi Shiyan, I'm from the ByteDance data lake team, and our team would like to
> drive and host the Hudi sync meetings for the Chinese community.
>
> Shiyan Xu wrote on Thu, May 26, 2022 at 16:14:
>
> > Related info: we are noting down the current community sync info here
> > https://hudi.apache.org/community/syncs
> >
> >
> > On Thu, May 26, 2022 at 3:44 PM Shiyan Xu 
> > wrote:
> >
> > > Hi all,
> > >
> > > This is a topic brought up previously, and also recently raised in this
> > > issue: we are thinking of hosting another regular sync meeting for the
> > > Chinese community, with a more suitable time, conversing in Chinese.
> > > This requires efforts on coordinating time, agenda, speakers, and
> > > hosting platform. Hence I would like to call out for volunteers to
> > > start driving this. Thank you.
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


-- 
Best,
Shiyan


Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Shimin Yang
Hi Shiyan, I'm from the ByteDance data lake team, and our team would like to
drive and host the Hudi sync meetings for the Chinese community.

Shiyan Xu wrote on Thu, May 26, 2022 at 16:14:

> Related info: we are noting down the current community sync info here
> https://hudi.apache.org/community/syncs
>
>
> On Thu, May 26, 2022 at 3:44 PM Shiyan Xu 
> wrote:
>
> > Hi all,
> >
> > This is a topic brought up previously, and also recently raised in this
> > issue: we are thinking of
> > hosting another regular sync meeting for the Chinese community, with a
> > more suitable time, conversing in Chinese. This requires efforts on
> > coordinating time, agenda, speakers, and hosting platform. Hence I would
> > like to call out for volunteers to start driving this. Thank you.
> >
> > --
> > Best,
> > Shiyan
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Shiyan Xu
Related info: we are noting down the current community sync info here
https://hudi.apache.org/community/syncs


On Thu, May 26, 2022 at 3:44 PM Shiyan Xu 
wrote:

> Hi all,
>
> This is a topic brought up previously, and also recently raised in this
> issue: we are thinking of
> hosting another regular sync meeting for the Chinese community, with a
> more suitable time, conversing in Chinese. This requires efforts on
> coordinating time, agenda, speakers, and hosting platform. Hence I would
> like to call out for volunteers to start driving this. Thank you.
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


Re: [DISCUSS] Hudi community sync time

2022-05-17 Thread Bhavani Sudha
Sounds good. Thank you all for chiming in. Based on the responses we have
had here, we can move the existing community sync to later time.  I will
send a separate voting thread to finalize the exact time.

Thanks,
Sudha

On Thu, Apr 28, 2022 at 1:55 AM Pratyaksh Sharma 
wrote:

> I would propose 8 AM or 8.30 AM PST though since 9 AM PST will clash with
> my other meetings.
> But happy to go with time that suits most of the folks.
>
> On Thu, Apr 28, 2022 at 3:31 AM Vinoth Govindarajan <
> vinoth.govindara...@gmail.com> wrote:
>
> > +1 for a 9 am PST call; the current time is super early, hence I missed
> > one of the meetings in the past.
> >
> > Best,
> > Vinoth
> >
> >
> > On Tue, Apr 26, 2022 at 8:01 PM Vinoth Chandar 
> wrote:
> >
> > > +1 as well. Current PST times are pretty hard for many folks.
> > >
> > > On Sat, Apr 16, 2022 at 6:20 AM Gary Li 
> > wrote:
> > >
> > > > +1 for splitting into two sessions. The current schedule is challenging
> > > > for both US and Chinese folks. We can organize another session for the
> > > > Chinese timezone.
> > > >
> > > > Calling out to folks living in the Chinese timezone: please reply to
> > > > this email thread if you are interested in joining a sync meeting. We
> > > > can schedule one if we have enough interest.
> > > >
> > > > Best,
> > > > Gary
> > > >
> > > > On Sat, Apr 16, 2022 at 2:36 AM Bhavani Sudha <
> bhavanisud...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > Our current monthly community syncs happen around 7 am Pacific time on
> > > > > the last Wednesday of each month. It is already 10 pm in China, and we
> > > > > don't get to see Chinese folks in the community sync call. We have
> > > > > users from different time zones, and finding an overlap is challenging
> > > > > as it is. In this context I am proposing the following:
> > > > >
> > > > > - We can split the community syncs into two - one catered towards
> > > > > Chinese time and the other one that happens currently for the rest of
> > > > > the folks.
> > > > > - If we split it into two different syncs, we can move the 7 am
> > > > > Pacific time slot to 8 am or 9 am as well.
> > > > >
> > > > > Please share your thoughts on this proposal.
> > > > >
> > > > > Thanks,
> > > > > Sudha
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Hudi community sync time

2022-04-28 Thread Pratyaksh Sharma
I would propose 8 AM or 8:30 AM PST, though, since 9 AM PST will clash with
my other meetings.
But happy to go with time that suits most of the folks.

On Thu, Apr 28, 2022 at 3:31 AM Vinoth Govindarajan <
vinoth.govindara...@gmail.com> wrote:

> +1 for a 9 am PST call; the current time is super early, hence I missed one
> of the meetings in the past.
>
> Best,
> Vinoth
>
>
> On Tue, Apr 26, 2022 at 8:01 PM Vinoth Chandar  wrote:
>
> > +1 as well. Current PST times are pretty hard for many folks.
> >
> > On Sat, Apr 16, 2022 at 6:20 AM Gary Li 
> wrote:
> >
> > > +1 for splitting into two sessions. The current schedule is challenging
> > > for both US and Chinese folks. We can organize another session for the
> > > Chinese timezone.
> > >
> > > Calling out to folks living in the Chinese timezone: please reply to
> > > this email thread if you are interested in joining a sync meeting. We
> > > can schedule one if we have enough interest.
> > >
> > > Best,
> > > Gary
> > >
> > > On Sat, Apr 16, 2022 at 2:36 AM Bhavani Sudha  >
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > Our current monthly community syncs happen around 7 am Pacific time on
> > > > the last Wednesday of each month. It is already 10 pm in China, and we
> > > > don't get to see Chinese folks in the community sync call. We have
> > > > users from different time zones, and finding an overlap is challenging
> > > > as it is. In this context I am proposing the following:
> > > >
> > > > - We can split the community syncs into two - one catered towards
> > > > Chinese time and the other one that happens currently for the rest of
> > > > the folks.
> > > > - If we split it into two different syncs, we can move the 7 am
> > > > Pacific time slot to 8 am or 9 am as well.
> > > >
> > > > Please share your thoughts on this proposal.
> > > >
> > > > Thanks,
> > > > Sudha
> > > >
> > >
> >
>


Re: [DISCUSS] Hudi community sync time

2022-04-27 Thread Vinoth Govindarajan
+1 for a 9 am PST call; the current time is super early, hence I missed one
of the meetings in the past.

Best,
Vinoth


On Tue, Apr 26, 2022 at 8:01 PM Vinoth Chandar  wrote:

> +1 as well. Current PST times are pretty hard for many folks.
>
> On Sat, Apr 16, 2022 at 6:20 AM Gary Li  wrote:
>
> > +1 for splitting into two sessions. The current schedule is challenging
> > for both US and Chinese folks. We can organize another session for the
> > Chinese timezone.
> >
> > Calling out to folks living in the Chinese timezone: please reply to this
> > email thread if you are interested in joining a sync meeting. We can
> > schedule one if we have enough interest.
> >
> > Best,
> > Gary
> >
> > On Sat, Apr 16, 2022 at 2:36 AM Bhavani Sudha 
> > wrote:
> >
> > > Hello all,
> > >
> > > Our current monthly community syncs happen around 7 am Pacific time on
> > > the last Wednesday of each month. It is already 10 pm in China, and we
> > > don't get to see Chinese folks in the community sync call. We have
> > > users from different time zones, and finding an overlap is challenging
> > > as it is. In this context I am proposing the following:
> > >
> > > - We can split the community syncs into two - one catered towards
> > > Chinese time and the other one that happens currently for the rest of
> > > the folks.
> > > - If we split it into two different syncs, we can move the 7 am
> > > Pacific time slot to 8 am or 9 am as well.
> > >
> > > Please share your thoughts on this proposal.
> > >
> > > Thanks,
> > > Sudha
> > >
> >
>

