Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Bingeng Huang
Looking forward to the RFC.
We can propose an RFC about supporting TTL config using a non-partition field after



On Wed, Oct 19, 2022 at 14:42, sagar sumit wrote:

> +1 Very nice idea. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
> wrote:
>
> > great proposal. Partition TTL is a good starting point. we can extend it to
> > other TTL strategies like column-based, and make it customizable and
> > pluggable. Looking forward to the RFC!
> >
> > On Wed, Oct 19, 2022 at 11:40 AM Jian Feng
> > wrote:
> >
> > > Good idea,
> > > this is definitely worth an RFC
> > > btw should it only depend on Hudi's partition? I feel it should be a more
> > > common feature since sometimes customers' data can not update across
> > > partitions
> > >
> > >
> > > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
> > >
> > > > Hi all, we have implemented a partition based data ttl management, which
> > > > we can manage ttl for hudi partition by size, expired time and
> > > > sub-partition count. When a partition is detected as outdated, we use
> > > > delete partition interface to delete it, which will generate a replace
> > > > commit to mark the data as deleted. The real deletion will then done by
> > > > clean service.
> > > >
> > > >
> > > > If community is interested in this idea, maybe we can propose a RFC to
> > > > discuss it in detail.
> > > >
> > > >
> > > > > On Oct 19, 2022, at 10:06, Vinoth Chandar wrote:
> > > > >
> > > > > +1 love to discuss this on a RFC proposal.
> > > > >
> > > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin wrote:
> > > > >
> > > > >> That's a very interesting idea.
> > > > >>
> > > > >> Do you want to take a stab at writing a full proposal (in the form of
> > > > >> RFC) for it?
> > > > >>
> > > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <hbgstc...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Do we have plan to integrate data TTL into HUDI, so we don't have to
> > > > >>> schedule a offline spark job to delete outdated data, just set a TTL
> > > > >>> config, then writer or some offline service will delete old data as
> > > > >>> expected.
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> > > --
> > > *Jian Feng,冯健*
> > > Shopee | Engineer | Data Infrastructure
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread leesf
Great job!

On Thu, Oct 20, 2022 at 03:45, Alexey Kudinkin wrote:

> Thanks Zhaojing for masterfully navigating this release!
>
> On Wed, Oct 19, 2022 at 7:46 AM Vinoth Chandar  wrote:
>
> > Great job everyone!
> >
> > On Wed, Oct 19, 2022 at 07:11 zhaojing yu  wrote:
> >
> > > The Apache Hudi team is pleased to announce the release of Apache Hudi
> > > 0.12.1.
> > >
> > > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> > > and Incrementals. Apache Hudi manages storage of large analytical
> > > datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> > > storage) and provides the ability to query them.
> > >
> > > This release comes 2 months after 0.12.0. It includes more than
> > > 150 resolved issues, comprising of a few new features as well as
> > > general improvements and bug fixes. You can read the release
> > > highlights at https://hudi.apache.org/releases/release-0.12.1.
> > >
> > > > For details on how to use Hudi, please look at the quick start page located
> > > > at https://hudi.apache.org/docs/quick-start-guide.html
> > >
> > > If you'd like to download the source release, you can find it here:
> > > https://github.com/apache/hudi/releases/tag/release-0.12.1
> > >
> > > Release notes including the resolved issues can be found here:
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12352182
> > >
> > > > We welcome your help and feedback. For more information on how to report
> > > > problems, and to get involved, visit the project website at
> > > https://hudi.apache.org
> > >
> > > Thanks to everyone involved!
> > >
> > > Release Manager
> > >
> >
>


Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread Alexey Kudinkin
Thanks Zhaojing for masterfully navigating this release!

On Wed, Oct 19, 2022 at 7:46 AM Vinoth Chandar  wrote:

> Great job everyone!
>
> On Wed, Oct 19, 2022 at 07:11 zhaojing yu  wrote:
>
> > The Apache Hudi team is pleased to announce the release of Apache Hudi
> > 0.12.1.
> >
> > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> > and Incrementals. Apache Hudi manages storage of large analytical
> > datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> > storage) and provides the ability to query them.
> >
> > This release comes 2 months after 0.12.0. It includes more than
> > 150 resolved issues, comprising of a few new features as well as
> > general improvements and bug fixes. You can read the release
> > highlights at https://hudi.apache.org/releases/release-0.12.1.
> >
> > > For details on how to use Hudi, please look at the quick start page located
> > > at https://hudi.apache.org/docs/quick-start-guide.html
> >
> > If you'd like to download the source release, you can find it here:
> > https://github.com/apache/hudi/releases/tag/release-0.12.1
> >
> > Release notes including the resolved issues can be found here:
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12352182
> >
> > We welcome your help and feedback. For more information on how to report
> > problems, and to get involved, visit the project website at
> > https://hudi.apache.org
> >
> > Thanks to everyone involved!
> >
> > Release Manager
> >
>


Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread Vinoth Chandar
Great job everyone!

On Wed, Oct 19, 2022 at 07:11 zhaojing yu  wrote:

> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.12.1.
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> and Incrementals. Apache Hudi manages storage of large analytical
> datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> storage) and provides the ability to query them.
>
> This release comes 2 months after 0.12.0. It includes more than
> 150 resolved issues, comprising of a few new features as well as
> general improvements and bug fixes. You can read the release
> highlights at https://hudi.apache.org/releases/release-0.12.1.
>
> For details on how to use Hudi, please look at the quick start page located
> at https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.12.1
>
> Release notes including the resolved issues can be found here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182
>
> We welcome your help and feedback. For more information on how to report
> problems, and to get involved, visit the project website at
> https://hudi.apache.org
>
> Thanks to everyone involved!
>
> Release Manager
>


[ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread zhaojing yu
The Apache Hudi team is pleased to announce the release of Apache Hudi
0.12.1.

Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
and Incrementals. Apache Hudi manages storage of large analytical
datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
storage) and provides the ability to query them.

This release comes 2 months after 0.12.0. It includes more than
150 resolved issues, comprising a few new features as well as
general improvements and bug fixes. You can read the release
highlights at https://hudi.apache.org/releases/release-0.12.1.

For details on how to use Hudi, please look at the quick start page located
at https://hudi.apache.org/docs/quick-start-guide.html

If you'd like to download the source release, you can find it here:
https://github.com/apache/hudi/releases/tag/release-0.12.1

Release notes including the resolved issues can be found here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182

We welcome your help and feedback. For more information on how to report
problems, and to get involved, visit the project website at
https://hudi.apache.org

Thanks to everyone involved!

Release Manager


Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread stream2000
Since it is marked as outdated by the TTL policy, we think it's better to
delete it anyway. Compaction and clustering should handle the case where the
source data is already marked as deleted by TTL; otherwise some unused data
will still be left in the partition. What do you think?
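
For readers following the thread, here is a minimal sketch of the "expired time" check from the partition TTL idea quoted below. It is an illustration only, not code from the proposal or from Hudi: `PartitionInfo` and `find_expired_partitions` are hypothetical names, and the partition metadata is assumed to be available already. The sketch only selects candidates; per the thread, the selected partitions would then be handed to the delete partition interface, and the clean service would remove the files later.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class PartitionInfo:
    """Hypothetical per-partition metadata; this is not a Hudi class."""
    path: str                # partition path, e.g. "dt=2022-07-01"
    last_modified: datetime  # assumed: time of the last commit touching it


def find_expired_partitions(
    partitions: Iterable[PartitionInfo], ttl: timedelta, now: datetime
) -> List[str]:
    """Return the paths of partitions whose age is strictly greater than ttl."""
    return sorted(p.path for p in partitions if now - p.last_modified > ttl)


# Example: a 90-day TTL evaluated on 2022-10-19.
now = datetime(2022, 10, 19)
partitions = [
    PartitionInfo("dt=2022-07-01", datetime(2022, 7, 1)),    # 110 days old
    PartitionInfo("dt=2022-09-01", datetime(2022, 9, 1)),    # 48 days old
    PartitionInfo("dt=2022-10-18", datetime(2022, 10, 18)),  # 1 day old
]
expired = find_expired_partitions(partitions, timedelta(days=90), now)
print(expired)
```

The size and sub-partition-count policies mentioned in the thread would be separate predicates over the same metadata; they are left out here because the thread does not say how they are defined.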

> On Oct 19, 2022, at 15:09, Teng Huo  wrote:
> 
> Nice feature!
> @stream2000
> 
> Just one question, can it work with compaction logs? I mean, if there are 
> some log files already marked in a compaction plan, will they be deleted by 
> TTL?
> 
> From: sagar sumit 
> Sent: Wednesday, October 19, 2022 2:42:36 PM
> To: dev@hudi.apache.org 
> Subject: Re: [DISCUSS] Hudi data TTL
> 
> +1 Very nice idea. Looking forward to the RFC!
> 
> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
> wrote:
> 
>> great proposal. Partition TTL is a good starting point. we can extend it to
>> other TTL strategies like column-based, and make it customizable and
>> pluggable. Looking forward to the RFC!
>> 
>> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
>> wrote:
>> 
>>> Good idea,
>>> this is definitely worth an  RFC
>>> btw should it only depend on Hudi's partition? I feel it should be a more
>>> common feature since sometimes customers' data can not update across
>>> partitions
>>> 
>>> 
>>> On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
>>> 
>>>> Hi all, we have implemented a partition based data ttl management, which
>>>> we can manage ttl for hudi partition by size, expired time and
>>>> sub-partition count. When a partition is detected as outdated, we use
>>>> delete partition interface to delete it, which will generate a replace
>>>> commit to mark the data as deleted. The real deletion will then done by
>>>> clean service.
>>>>
>>>>
>>>> If community is interested in this idea, maybe we can propose a RFC to
>>>> discuss it in detail.
>>>>
>>>>
>>>>> On Oct 19, 2022, at 10:06, Vinoth Chandar wrote:
>>>>>
>>>>> +1 love to discuss this on a RFC proposal.
>>>>>
>>>>> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin wrote:
>>>>>
>>>>>> That's a very interesting idea.
>>>>>>
>>>>>> Do you want to take a stab at writing a full proposal (in the form of
>>>>>> RFC) for it?
>>>>>>
>>>>>> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Do we have plan to integrate data TTL into HUDI, so we don't have to
>>>>>>> schedule a offline spark job to delete outdated data, just set a TTL
>>>>>>> config, then writer or some offline service will delete old data as
>>>>>>> expected.
>>>>>>>
>>>>>>
>>>>
>>>>
>>> 
>>> --
>>> *Jian Feng,冯健*
>>> Shopee | Engineer | Data Infrastructure
>>> 
>> 
>> 
>> --
>> Best,
>> Shiyan
>> 



Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Teng Huo
Nice feature!
@stream2000

Just one question, can it work with compaction logs? I mean, if there are some 
log files already marked in a compaction plan, will they be deleted by TTL?

From: sagar sumit 
Sent: Wednesday, October 19, 2022 2:42:36 PM
To: dev@hudi.apache.org 
Subject: Re: [DISCUSS] Hudi data TTL

+1 Very nice idea. Looking forward to the RFC!

On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
wrote:

> great proposal. Partition TTL is a good starting point. we can extend it to
> other TTL strategies like column-based, and make it customizable and
> pluggable. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
> wrote:
>
> > Good idea,
> > this is definitely worth an  RFC
> > btw should it only depend on Hudi's partition? I feel it should be a more
> > common feature since sometimes customers' data can not update across
> > partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
> >
> > > > Hi all, we have implemented a partition based data ttl management, which
> > > we can manage ttl for hudi partition by size, expired time and
> > > sub-partition count. When a partition is detected as outdated, we use
> > > delete partition interface to delete it, which will generate a replace
> > > commit to mark the data as deleted. The real deletion will then done by
> > > clean service.
> > >
> > >
> > > If community is interested in this idea, maybe we can propose a RFC to
> > > discuss it in detail.
> > >
> > >
> > > > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> > > >
> > > > +1 love to discuss this on a RFC proposal.
> > > >
> > > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin wrote:
> > > > >
> > > > >> That's a very interesting idea.
> > > > >>
> > > > >> Do you want to take a stab at writing a full proposal (in the form of
> > > > >> RFC) for it?
> > > > >>
> > > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang
> > > > >> wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Do we have plan to integrate data TTL into HUDI, so we don't have to
> > > > >>> schedule a offline spark job to delete outdated data, just set a TTL
> > > > >>> config, then writer or some offline service will delete old data as
> > > > >>> expected.
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
>
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread sagar sumit
+1 Very nice idea. Looking forward to the RFC!

On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
wrote:

> great proposal. Partition TTL is a good starting point. we can extend it to
> other TTL strategies like column-based, and make it customizable and
> pluggable. Looking forward to the RFC!
>
> On Wed, Oct 19, 2022 at 11:40 AM Jian Feng 
> wrote:
>
> > Good idea,
> > this is definitely worth an  RFC
> > btw should it only depend on Hudi's partition? I feel it should be a more
> > common feature since sometimes customers' data can not update across
> > partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote:
> >
> > > > Hi all, we have implemented a partition based data ttl management, which
> > > we can manage ttl for hudi partition by size, expired time and
> > > sub-partition count. When a partition is detected as outdated, we use
> > > delete partition interface to delete it, which will generate a replace
> > > commit to mark the data as deleted. The real deletion will then done by
> > > clean service.
> > >
> > >
> > > If community is interested in this idea, maybe we can propose a RFC to
> > > discuss it in detail.
> > >
> > >
> > > > On Oct 19, 2022, at 10:06, Vinoth Chandar  wrote:
> > > >
> > > > +1 love to discuss this on a RFC proposal.
> > > >
> > > > > On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin wrote:
> > > > >
> > > > >> That's a very interesting idea.
> > > > >>
> > > > >> Do you want to take a stab at writing a full proposal (in the form of
> > > > >> RFC) for it?
> > > > >>
> > > > >> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang
> > > > >> wrote:
> > > > >>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Do we have plan to integrate data TTL into HUDI, so we don't have to
> > > > >>> schedule a offline spark job to delete outdated data, just set a TTL
> > > > >>> config, then writer or some offline service will delete old data as
> > > > >>> expected.
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
>
>
> --
> Best,
> Shiyan
>