stream2000 commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1211346749


##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,163 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period
+of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) 
management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can 
config the policies by table config
+directly or by call commands. With proper configs set, Hudi can find out which 
partitions are outdated and delete them.
+
+## Background
+
+TTL management mechanism is an important feature for databases. Hudi already 
provides a delete_partition interface to
+delete outdated partitions. However, users still need to detect which 
partitions are outdated and
+call `delete_partition` manually, which means that users need to define and 
implement some kind of TTL policies and
+maintain proper statistics to find expired partitions by themselves. As the 
scale of installations grew, it's more
+important to implement a user-friendly TTL management mechanism for hudi.
+
+## Implementation
+
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Gathering proper partition statistics for TTL management
+- Executing TTL management
+
+### TTL Policy Definition
+
+We have four main considerations when designing TTL policy:
+
+1. Usually user hopes to manage partition TTL by expired time, so we support 
the policy `KEEP_BY_TIME` that partitions
+   will expire N days after their last modified time. Note that the last 
modified time here means any inserts/updates to
+   the partition. Also, we will make it easy to extend new policies so that 
users can implement their own polices to
+   decide which partitions are expired.
+2. User need to set different policies for different partitions. For example, 
the hudi table is partitioned
+   by `user_id`. For partition(user_id='1'), we set the policy that this 
partition will expire 7 days after its last
+   modified time, while for
+   partition(user_id='2') we can set the policy that it will expire 30 days 
after its last modified time.
+3. It's possible that users have multi-fields partitioning, and they don't 
want set policy for all partition. So we need
+   to support partition regex when defining
+   partition TTL policies. For example, for a hudi table partitioned by 
(user_id, ts), we can first set a default
+   policy(which policy value contains `*`)
+   that for all partitions matched (user_id=*/) will expire in 7 days, and 
explicitly set a policy that for partition
+   matched user_id=3 will expire in 30 days. Explicit policy will override 
default polices, which means default policies
+   only takes effects on partitions that do not match any explicit ttl 
policies.
+4. We may need to support record level ttl policy in the future. So partition 
TTL policy should be an implementation of
+   HoodieTTLPolicy.
+
+So we have the TTL policy definition like this (may change when implementing):
+
+```java
+public class HoodiePartitionTTLPolicy implements HoodieTTLPolicy {
+  public enum TTLPolicy {
+    // We supports keep by last modified time at the first version
+    KEEP_BY_TIME
+  }
+
+  // Partition spec for which the policy takes effect, could be a regex or a 
static partition value
+  private String partitionSpec;

Review Comment:
   Yes, it supports regex expressions for partitions and static partition value



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to