[GitHub] [hudi] nbalajee commented on a diff in pull request #8062: [HUDI-5823][RFC-65] RFC for Partition TTL Management

via GitHub Thu, 18 May 2023 12:08:07 -0700


nbalajee commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1198202805



##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only 
interested in data from a recent period of time. The outdated data is useless 
and costly,  we need a TTL(Time-To-Live) management mechanism to prevent the 
dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can 
config the policies by table config directly or by call commands. With proper 
configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already 
provides a delete_partition interface to delete outdated partitions. However, 
users still need to detect which partitions are outdated and call 
`delete_partition` manually, which means that users need to define and 
implement some kind of TTL policies and maintain proper statistics to find 
expired partitions by themself. As the scale of installations grew,  it's more 
important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by  expired time but also by 
sub-partitions count and sub-partitions size. So we need to support the 
following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last 
modified time.

Review Comment:
   When retiring old, unused, or not-accessed partitions, another approach we 
are taking internally is:
   (a) Stash the partitions to be cleaned up in a .stashedForDeletion folder (at 
the .hoodie level).
   (b) Partitions stashed for deletion wait in that folder for a week (or for 
the time dictated by the policy) before actually being deleted. If we realize 
that something has been deleted accidentally (for example, because of a bad 
policy configuration or a TTL exclusion that was not configured), we can move 
it back from the stash to recover quickly from the TTL event.
   (c) We can configure policies for the .stashedForDeletion/<partition>/ 
subfolders to manage the appropriate tiering level (for example, whether they 
should be moved to a warm or cold tier).
   (d) In addition to the deletePartitions() API, which would stash the folder 
(instead of deleting it) based on the configs, we would need a restore API to 
move the subfolders/files back to their original location.
   (e) Metadata left by the delete operation must be synced with the MDT to 
keep the file listing metadata in sync with the file system. (Where 
replication to a different region is supported, this would also require 
applying the changes to the replicated copies of the data.)
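
   To make steps (a), (b), and (d) concrete, here is a minimal sketch of the 
stash / restore / purge flow on a local file system. It is not Hudi code: the 
function names, the `.stashed_at` marker file, and the layout beneath 
`.stashedForDeletion` are assumptions made for illustration only, and the MDT 
sync from (e) and the tiering from (c) are not modeled.

```python
import shutil
from pathlib import Path

# Folder name taken from the comment above; everything below it is assumed.
STASH_DIR = Path(".hoodie") / ".stashedForDeletion"
MARKER = ".stashed_at"  # hypothetical marker file holding the stash timestamp


def stash_partition(table_root: Path, partition: str, now: float) -> Path:
    """Step (a): move a partition into the stash instead of deleting it."""
    src = table_root / partition
    dst = table_root / STASH_DIR / partition
    if dst.exists():
        raise FileExistsError(f"{partition} is already stashed")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    (dst / MARKER).write_text(str(now))  # remember when it was stashed
    return dst


def restore_partition(table_root: Path, partition: str) -> Path:
    """Step (d): move a stashed partition back to its original location."""
    src = table_root / STASH_DIR / partition
    dst = table_root / partition
    if dst.exists():
        raise FileExistsError(f"{partition} already exists in the table")
    (src / MARKER).unlink()
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst


def purge_expired(table_root: Path, retention_seconds: float, now: float) -> list:
    """Step (b): permanently delete partitions stashed longer than the retention."""
    stash = table_root / STASH_DIR
    purged = []
    if not stash.exists():
        return purged
    # Materialize the marker list first so deleting does not disturb iteration.
    for marker in sorted(stash.rglob(MARKER)):
        if not marker.exists():
            continue  # already removed together with a parent directory
        if now - float(marker.read_text()) >= retention_seconds:
            partition_dir = marker.parent
            purged.append(partition_dir.relative_to(stash).as_posix())
            shutil.rmtree(partition_dir)
    return purged
```

   Passing `now` explicitly (rather than reading the clock inside the 
functions) keeps the retention check easy to test; a real implementation would 
presumably derive the stash time from the timeline instead of a marker file.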


