[ 
https://issues.apache.org/jira/browse/HUDI-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-1276:
-------------------------
    Priority: Blocker  (was: Major)

> delete replaced file groups during clean
> ----------------------------------------
>
>                 Key: HUDI-1276
>                 URL: https://issues.apache.org/jira/browse/HUDI-1276
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: satish
>            Assignee: satish
>            Priority: Blocker
>             Fix For: 0.7.0
>
>
> We clean replaced file groups during archival as part of PR#2048. But we may 
> want to do this during the clean stage to reduce storage overhead.
> Outstanding questions:
> 1) With KEEP_LATEST_VERSIONS, when is a replaced file eligible to clean? 
> Assume file group 'f1' has file slices f1_c1 and f1_c2. After that, 'f1' is 
> replaced by some other file groups. If KEEP_LATEST_VERSIONS=2, when can we 
> delete f1_c1 and f1_c2?
> Options:
> * We can introduce a new policy to delete replaced files. For example, we 
> could fall back to KEEP_LATEST_COMMITS for replaced files.
> * Build a 'slice' across file groups. If we know the new files that are 
> replacing 'f1', then we can treat them as a single slice and delete the 
> oldest versions. This can get really complicated because f1 can be replaced 
> by multiple file groups, which can then be replaced by some other file groups.
> 2) If there is a savepoint on the fileId that is eligible to clean, can we 
> delete it?
> Options: 
> * Do not delete the file. Clean and archival cannot make progress. We need a 
> mechanism to notify that clean and archival are blocked.
> * Ignore savepoints and delete the file. This breaks the savepoint contract. 
> (This is the current behavior when deleting files during archival.)
> 3) If there is a pending/inflight compaction on the fileId that is eligible to 
> clean, can we delete it? What happens to the scheduled compaction if we 
> delete it?
> * This is unlikely to happen because we don't replace files that have pending 
> compaction. Also, after a file is replaced, it is not visible to compaction, 
> so no further compaction can be scheduled on it. However, if for any reason 
> we see replaced files that have pending compaction and are eligible to 
> clean, it's probably better to block clean and archival.
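
To make the options above concrete, here is a minimal Java sketch of one possible combination: fall back to a KEEP_LATEST_COMMITS-style check for replaced file groups (question 1), and block rather than delete when a savepoint or pending compaction references the file group (questions 2 and 3). Everything here is hypothetical: the class, the enum, the method name and the parameters are illustrative assumptions and not Hudi APIs, and the retention check is a simplification of how the real cleaner counts commits.

```java
import java.util.List;

/** Hypothetical sketch only; none of these names exist in Hudi. */
class ReplacedFileGroupCleanSketch {

    enum Decision { DELETE, RETAIN, BLOCKED }

    /**
     * Decide what to do with a file group that has been replaced.
     *
     * @param replaceInstant   instant time of the commit that replaced the file group
     * @param completedCommits completed commit instant times, sorted ascending
     * @param commitsRetained  number of latest commits to retain (assumed policy knob)
     */
    static Decision decide(String replaceInstant,
                           List<String> completedCommits,
                           int commitsRetained,
                           boolean hasSavepoint,
                           boolean hasPendingCompaction) {
        if (commitsRetained <= 0) {
            throw new IllegalArgumentException("commitsRetained must be positive");
        }
        // Not enough history yet: nothing falls outside the retention window.
        if (completedCommits.size() <= commitsRetained) {
            return Decision.RETAIN;
        }
        // Oldest commit that is still inside the retention window.
        String earliestRetained =
            completedCommits.get(completedCommits.size() - commitsRetained);
        // Instant times are assumed to sort lexicographically.
        if (replaceInstant.compareTo(earliestRetained) >= 0) {
            return Decision.RETAIN;
        }
        // Eligible by age, but a savepoint or pending compaction still refers
        // to it: block instead of silently breaking the contract.
        if (hasSavepoint || hasPendingCompaction) {
            return Decision.BLOCKED;
        }
        return Decision.DELETE;
    }

    public static void main(String[] args) {
        List<String> commits = List.of("001", "002", "003", "004", "005");
        System.out.println(decide("002", commits, 2, false, false));
        System.out.println(decide("004", commits, 2, false, false));
        System.out.println(decide("002", commits, 2, true, false));
    }
}
```

A BLOCKED result is where the notification mechanism mentioned under question 2 would hook in, so that operators can see that clean and archival are not making progress.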



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
