stream2000 commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1211260244


##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic Hudi use cases, users partition Hudi data by time and are only 
interested in data from a recent period. Outdated data is useless 
and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the 
dataset from growing indefinitely.
+This proposal introduces Partition TTL Management policies to Hudi. Users can 
configure the policies directly through table configs or through call commands. With proper 
configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+A TTL management mechanism is an important feature for databases. Hudi already 
provides a `delete_partition` interface to delete outdated partitions. However, 
users still need to detect which partitions are outdated and call 
`delete_partition` manually, which means that users need to define and 
implement their own TTL policies and maintain proper statistics to find 
expired partitions by themselves. As the scale of installations grows, it becomes more 
important to implement a user-friendly TTL management mechanism for Hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Applying policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. Users want to manage partition TTL not only by expiration time but also by 
sub-partition count and sub-partition size. So we need to support the 
following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last 
modified time.

Review Comment:
   Maybe we can add a stash/restore mechanism to replace the commit/clean process 
of Hudi, instead of dealing with it in TTL management? TTL management should 
only decide which partitions are outdated and call `delete_partition` to delete 
them. If we want to retain the deleted data, we can add an extra mechanism in the 
`delete_partition` method.
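
The separation of concerns described above can be sketched roughly as follows. This is a minimal illustration in Python, not Hudi code: `PartitionStat`, `find_expired`, and `apply_ttl` are hypothetical names, and `delete_partition` is passed in as a plain callback standing in for the real interface. It assumes KEEP_BY_TIME means a partition expires once its last modified time is strictly older than N days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PartitionStat:
    """Hypothetical per-partition statistic; not a Hudi class."""
    path: str
    last_modified: datetime


def find_expired(stats: Iterable[PartitionStat], ttl_days: int,
                 now: datetime) -> List[str]:
    """KEEP_BY_TIME: return partitions last modified more than ttl_days ago."""
    cutoff = now - timedelta(days=ttl_days)
    return [s.path for s in stats if s.last_modified < cutoff]


def apply_ttl(stats: Iterable[PartitionStat], ttl_days: int, now: datetime,
              delete_partition: Callable[[str], None]) -> List[str]:
    """TTL management only decides what is outdated, then delegates.

    Whether the data is dropped or stashed for a later restore is left
    entirely to the delete_partition callback supplied by the caller.
    """
    expired = find_expired(stats, ttl_days, now)
    for path in expired:
        delete_partition(path)
    return expired
```

With this split, retaining deleted data would only require swapping the callback (for example, one that moves files to a stash location before removing them), while the TTL logic stays unchanged.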



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
