[
https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261437#comment-17261437
]
Vinoth Chandar commented on HUDI-1276:
--------------------------------------
[~satishkotha] Dumping my thoughts here. I spent some time thinking about this.
> If KEEP_LATEST_VERSIONS=2, when can we delete f1_c1, f1_c2?
Yes, we should be able to delete them. By definition, KEEP_LATEST_VERSIONS is
designed for scenarios where the query engine should expect that the version
it is querying could be cleaned. So, once there is a replace, the next clean
should be able to delete the whole replaced file group.
> If there is a savepoint on the fileId that is eligible to clean, can we
> delete it?
No, we should not. We should fix the savepoint logic such that when a savepoint
is deleted we get an instant that gives us the files that are now "freed up"
for cleaning. I'll see how/if we can do this easily.
> If there is a pending/inflight compaction on the fileId that is eligible to
> clean, can we delete it? What happens to compaction scheduled if we delete it?
Same answer. Once replaced, we should be able to delete based on cleaning
policy.
For CLEAN_BY_COMMITS: On each cleaner run, we already determine the range of
commit instants that are eligible for cleaning. For a `REPLACE COMMIT` in that
range, this means that we are okay cleaning even the newly written file groups,
so it should be very safe to assume that we can delete the old file groups
replaced by that replace commit. This honors the contract with the user.
Let me know if you see any issues with this.
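To make the three answers above concrete, here is a minimal sketch of the
decision in Python. It is not Hudi code: ReplacedFileGroup,
earliest_retained_instant, replaced_groups_to_clean and the commits_retained
parameter are hypothetical names used only for illustration. It also assumes
that instant times compare lexicographically and that "eligible for cleaning"
means "older than the earliest retained commit".

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Policy names as used in this thread (plain strings, not Hudi enums).
KEEP_LATEST_VERSIONS = "KEEP_LATEST_VERSIONS"
CLEAN_BY_COMMITS = "CLEAN_BY_COMMITS"


@dataclass(frozen=True)
class ReplacedFileGroup:
    """Hypothetical record: a file group plus the replace commit that superseded it."""
    file_id: str
    replace_instant: str


def earliest_retained_instant(completed_instants: List[str],
                              commits_retained: int) -> Optional[str]:
    """Return the oldest instant that must be kept, or None if nothing is older."""
    ordered = sorted(completed_instants)
    if commits_retained < 1 or len(ordered) <= commits_retained:
        return None
    return ordered[-commits_retained]


def replaced_groups_to_clean(policy: str,
                             replaced: List[ReplacedFileGroup],
                             completed_instants: List[str],
                             commits_retained: int,
                             savepointed_file_ids: Set[str]) -> List[str]:
    """Return file ids of replaced file groups the next clean may delete."""
    completed = set(completed_instants)
    cutoff = earliest_retained_instant(completed_instants, commits_retained)
    deletable = []
    for group in replaced:
        # A savepoint always protects the file group.
        if group.file_id in savepointed_file_ids:
            continue
        # Only a completed replace commit counts as "replaced".
        if group.replace_instant not in completed:
            continue
        if policy == KEEP_LATEST_VERSIONS:
            # Once replaced, the whole file group can go at the next clean.
            deletable.append(group.file_id)
        elif policy == CLEAN_BY_COMMITS:
            # Delete only if the replace commit itself is in the cleanable range.
            if cutoff is not None and group.replace_instant < cutoff:
                deletable.append(group.file_id)
    return deletable


replaced = [
    ReplacedFileGroup("f1", "003"),
    ReplacedFileGroup("f2", "005"),
    ReplacedFileGroup("f3", "003"),  # protected by a savepoint below
]
completed = ["001", "002", "003", "004", "005", "006"]

print(replaced_groups_to_clean(KEEP_LATEST_VERSIONS, replaced, completed, 2, {"f3"}))
print(replaced_groups_to_clean(CLEAN_BY_COMMITS, replaced, completed, 2, {"f3"}))
```

Pending compaction is deliberately not special-cased in this sketch, matching
the answer above that a replaced file group is deleted purely based on the
cleaning policy.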
> delete replaced file groups during clean
> ----------------------------------------
>
> Key: HUDI-1276
> URL: https://issues.apache.org/jira/browse/HUDI-1276
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: satish
> Assignee: satish
> Priority: Blocker
> Fix For: 0.7.0
>
>
> We clean replaced file groups during archival as part of PR#2048. But we may
> want to do this during the clean stage to prevent storage overhead.
> Outstanding questions:
> 1) With KEEP_LATEST_VERSIONS, when is a replaced file eligible to clean?
> Assume file group 'f1' has slices f1_c1 and f1_c2. After that, 'f1' is
> replaced by some other file groups. If KEEP_LATEST_VERSIONS=2, when can we
> delete f1_c1, f1_c2?
> Options:
> * We can introduce a new policy to delete replaced files. For example, we
> could fall back to KEEP_LATEST_COMMITS for replaced files.
> * Build a 'slice' across file groups. If we know the new files that are
> replacing 'f1', then we can treat them as a single slice and delete the
> oldest versions. This can get really complicated because f1 can be replaced
> by multiple file groups, which can then be replaced by some other file groups.
> 2) If there is a savepoint on the fileId that is eligible to clean, can we
> delete it?
> Options:
> * Do not delete the file. Clean and archival cannot make progress. We need a
> mechanism to notify that clean and archival are blocked.
> * Ignore savepoints and delete the file. This breaks the contract. (This is
> the current behavior when deleting files during archival.)
> 3) If there is a pending/inflight compaction on the fileId that is eligible to
> clean, can we delete it? What happens to compaction scheduled if we delete it?
> * This is unlikely to happen because we don't replace files that have pending
> compaction. Also, after a file is replaced, it is not visible to compaction,
> so no further compaction can be scheduled on it. However, if for any reason
> we see replaced files that have pending compaction and are eligible to
> clean, it's probably better to block clean and archival.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)