Re: Re: [DISCUSS] Hudi data TTL

2023-03-31 Thread Sivabalan
Left some comments. Thanks!



-- 
Regards,
-Sivabalan


Re: Re: [DISCUSS] Hudi data TTL

2023-03-31 Thread 符其军
Hi community, we have submitted RFC-65, Partition TTL Management, in this PR:
https://github.com/apache/hudi/pull/8062. Let me know if you have any
questions or concerns about this proposal.


Re: [DISCUSS] Hudi data TTL

2022-10-21 Thread stream2000
Yes, we can have a talk about it. We will try our best to write the RFC and
hope to publish it in a few weeks.





Re: [DISCUSS] Hudi data TTL

2022-10-20 Thread JerryYue
Looking forward to the RFC.
It's a good idea; we also need Hudi data TTL in some cases.
Is there a plan or timeline for this? We also have some simple designs for
implementing it.
Maybe we can have a talk about it.





Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Bingeng Huang
Looking forward to the RFC.
Afterwards, we can propose an RFC on supporting TTL configuration based on
non-partition fields.





Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread stream2000
Since the data is marked as outdated by the TTL policy, we think it is better
to delete it anyway. Compaction and clustering should handle the case where the
source data has already been marked as deleted by TTL; otherwise some unused
data will still be left in the partition. What do you think?




Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Teng Huo
Nice feature!
@stream2000

Just one question: can it work with compaction logs? I mean, if some log files
are already marked in a compaction plan, will they be deleted by TTL?



Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread sagar sumit
+1 Very nice idea. Looking forward to the RFC!



Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Shiyan Xu
Great proposal. Partition TTL is a good starting point. We can extend it to
other TTL strategies, such as column-based ones, and make it customizable and
pluggable. Looking forward to the RFC!
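
One way to picture "customizable and pluggable" is a small registry of
strategies selected by name. The Python sketch below is only an illustration
under stated assumptions: `PartitionInfo`, `STRATEGY_FACTORIES`, `register`,
`expire_by_age`, `expire_by_size`, and `select_expired` are hypothetical names,
not Hudi classes or configs, and the thresholds are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PartitionInfo:
    """Hypothetical partition summary; not a Hudi class."""
    path: str
    age_days: int
    size_bytes: int


# A strategy is any callable that decides whether a partition has expired.
TtlStrategy = Callable[[PartitionInfo], bool]

# Registry of strategy factories, keyed by a name that a config could carry.
STRATEGY_FACTORIES: Dict[str, Callable[[int], TtlStrategy]] = {}


def register(name: str):
    """Decorator that adds a strategy factory to the registry under `name`."""
    def wrap(factory: Callable[[int], TtlStrategy]):
        STRATEGY_FACTORIES[name] = factory
        return factory
    return wrap


@register("expire_by_age")
def expire_by_age(max_age_days: int) -> TtlStrategy:
    # Expire partitions older than the given number of days.
    return lambda p: p.age_days > max_age_days


@register("expire_by_size")
def expire_by_size(max_bytes: int) -> TtlStrategy:
    # Expire partitions larger than the given size.
    return lambda p: p.size_bytes > max_bytes


def select_expired(
    partitions: List[PartitionInfo], strategy_name: str, threshold: int
) -> List[str]:
    """Look up the named strategy and return the paths it marks as expired."""
    strategy = STRATEGY_FACTORIES[strategy_name](threshold)
    return [p.path for p in partitions if strategy(p)]


partitions = [
    PartitionInfo("dt=2022-10-18", age_days=1, size_bytes=5_000),
    PartitionInfo("dt=2022-09-01", age_days=48, size_bytes=900),
]
by_age = select_expired(partitions, "expire_by_age", 30)
by_size = select_expired(partitions, "expire_by_size", 1_000)
print(by_age, by_size)
```

A column-based strategy would need record-level information rather than
partition summaries, so it would likely take a different input type than the
one assumed here.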



-- 
Best,
Shiyan


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Jian Feng
Good idea, this is definitely worth an RFC.
By the way, should it depend only on Hudi's partitions? I feel it should be a
more general feature, since sometimes customers' data cannot be updated across
partitions.



-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread stream2000
Hi all, we have implemented partition-based data TTL management, in which the
TTL of a Hudi partition can be managed by size, expiration time, and
sub-partition count. When a partition is detected as outdated, we use the
delete partition interface to delete it, which generates a replace commit that
marks the data as deleted. The actual deletion is then done by the clean
service.


If the community is interested in this idea, maybe we can propose an RFC to
discuss it in detail.
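
To make the detection step concrete, here is a minimal Python sketch of the
"expired time" check only. It is not Hudi's API: `PartitionStat`,
`find_expired_partitions`, and the use of a last-modified timestamp as the
expiry signal are assumptions for illustration. In the flow described above,
the returned partitions would then be handed to the delete partition
interface, and the clean service would remove the files later.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class PartitionStat:
    """Hypothetical summary of one partition; not a Hudi class."""
    path: str
    last_modified: datetime


def find_expired_partitions(
    stats: List[PartitionStat], ttl: timedelta, now: datetime
) -> List[str]:
    """Return the sorted paths of partitions last modified before now - ttl."""
    cutoff = now - ttl
    return sorted(s.path for s in stats if s.last_modified < cutoff)


# Example data: one fresh partition and two older than 30 days.
now = datetime(2022, 10, 19)
stats = [
    PartitionStat("dt=2022-10-18", datetime(2022, 10, 18)),
    PartitionStat("dt=2022-09-01", datetime(2022, 9, 1)),
    PartitionStat("dt=2022-08-15", datetime(2022, 8, 15)),
]
expired = find_expired_partitions(stats, ttl=timedelta(days=30), now=now)
print(expired)
```

The size and sub-partition count criteria mentioned above could be added as
further fields on the summary record and further predicates; they are left out
to keep the sketch short.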





Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Vinoth Chandar
+1, would love to discuss this in an RFC proposal.



Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Alexey Kudinkin
That's a very interesting idea.

Do you want to take a stab at writing a full proposal (in the form of RFC)
for it?



[DISCUSS] Hudi data TTL

2022-10-18 Thread Bingeng Huang
Hi all,

Do we have a plan to integrate data TTL into Hudi, so that we don't have to
schedule an offline Spark job to delete outdated data? Ideally we would just
set a TTL config, and the writer or some offline service would then delete old
data as expected.