[
https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-1276:
---------------------------------
Status: In Progress (was: Open)
> delete replaced file groups during clean
> ----------------------------------------
>
> Key: HUDI-1276
> URL: https://issues.apache.org/jira/browse/HUDI-1276
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: satish
> Assignee: Vinoth Chandar
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.7.0
>
>
> We clean replaced file groups during archival as part of PR#2048. But we may
> want to do this during the clean stage to prevent storage overhead.
> Outstanding questions:
> 1) With KEEP_LATEST_VERSIONS, when is a replaced file eligible to clean?
> Assume file group 'f1' has file slices f1_c1 and f1_c2. After that, 'f1' is
> replaced by some other file groups. If KEEP_LATEST_VERSIONS=2, when can we
> delete f1_c1 and f1_c2?
> Options:
> * We can introduce a new policy to delete replaced files. For example, we could
> fall back to KEEP_LATEST_COMMITS for replaced files.
> * Build a 'slice' across file groups. If we know the new files that are
> replacing 'f1', then we can treat them as a single slice and delete the oldest
> versions. This can get really complicated because f1 can be replaced by
> multiple file groups, which can then be replaced by some other file groups.
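The first option (falling back to KEEP_LATEST_COMMITS for replaced files) could look roughly like the sketch below. It is a simplified model, not Hudi code: the function name, the `replaced_at` mapping, and `commits_retained` are hypothetical, and it assumes instant times are strings that sort chronologically.

```python
from typing import Dict, List


def eligible_replaced_file_groups(
    completed_commits: List[str],
    replaced_at: Dict[str, str],
    commits_retained: int,
) -> List[str]:
    """Return replaced file group ids whose replace commit is older than
    the earliest commit still inside the retained window.

    completed_commits: instant times of completed commits (hypothetical input).
    replaced_at: file group id -> instant time of the commit that replaced it.
    commits_retained: how many of the latest commits to keep (assumed knob).
    """
    if commits_retained <= 0:
        raise ValueError("commits_retained must be positive")
    timeline = sorted(completed_commits)
    if len(timeline) <= commits_retained:
        # Every commit is still inside the retained window.
        return []
    earliest_retained = timeline[-commits_retained]
    return sorted(
        file_group
        for file_group, replace_commit in replaced_at.items()
        if replace_commit < earliest_retained
    )


if __name__ == "__main__":
    commits = ["001", "002", "003", "004", "005"]
    replaced = {"f1": "003", "f2": "005"}
    print(eligible_replaced_file_groups(commits, replaced, commits_retained=2))
```

Under this model, all slices of 'f1' (f1_c1 and f1_c2) become deletable together once the replacing commit leaves the retained window, which sidesteps the per-file-group version counting that KEEP_LATEST_VERSIONS would need.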
> 2) If there is a savepoint on the fileId that is eligible to clean, can we
> delete it?
> Options:
> * Do not delete the file. Clean and archival cannot make progress. We need a
> mechanism to notify that clean and archival are blocked.
> * Ignore savepoints and delete the file. This breaks the savepoint contract.
> (This is the current behavior when deleting files during archival.)
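A minimal sketch of the first option for savepoints, assuming the cleaner already has the set of savepointed files in hand. The function and parameter names are hypothetical and do not correspond to Hudi APIs; the point is only that blocked files are returned separately so they can be reported.

```python
from typing import Iterable, List, Set, Tuple


def split_by_savepoint(
    candidates: Iterable[str], savepointed: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split clean candidates into (deletable, blocked).

    A candidate is blocked when it is referenced by a savepoint. The blocked
    list is what a notification mechanism would surface to the operator.
    """
    deletable: List[str] = []
    blocked: List[str] = []
    for name in candidates:
        if name in savepointed:
            blocked.append(name)
        else:
            deletable.append(name)
    return deletable, blocked


if __name__ == "__main__":
    deletable, blocked = split_by_savepoint(["f1_c1", "f1_c2", "f3_c1"], {"f1_c2"})
    print("deletable:", deletable)
    print("blocked by savepoint:", blocked)
```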
> 3) If there is a pending/inflight compaction on the fileId that is eligible to
> clean, can we delete it? What happens to the scheduled compaction if we delete it?
> * This is unlikely to happen because we don't replace files that have pending
> compaction. Also, after a file is replaced, it is not visible to compaction,
> so no further compaction can be scheduled on it. However, if for any reason
> we see replaced files that have pending compaction and are eligible to
> clean, it's probably better to block clean and archival.
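The "block clean and archival" suggestion for pending compaction could be expressed as a guard that fails loudly instead of deleting. This is only a sketch under assumptions: `CleanBlockedError` and `assert_clean_not_blocked` are hypothetical names, and how Hudi would actually track pending compactions is not modeled here.

```python
from typing import Iterable


class CleanBlockedError(RuntimeError):
    """Raised when clean must not proceed (hypothetical error type)."""


def assert_clean_not_blocked(
    eligible_file_groups: Iterable[str],
    pending_compaction_file_groups: Iterable[str],
) -> None:
    """Raise if any replaced file group eligible to clean still has a
    pending or inflight compaction; otherwise return None."""
    conflicts = sorted(set(eligible_file_groups) & set(pending_compaction_file_groups))
    if conflicts:
        raise CleanBlockedError(
            "replaced file groups with pending compaction: " + ", ".join(conflicts)
        )


if __name__ == "__main__":
    assert_clean_not_blocked(["f1"], ["f2"])  # no overlap, so clean may proceed
    try:
        assert_clean_not_blocked(["f1", "f2"], ["f2"])
    except CleanBlockedError as err:
        print("clean blocked:", err)
```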
--
This message was sent by Atlassian Jira
(v8.3.4#803005)