[ 
https://issues.apache.org/jira/browse/HUDI-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260142#comment-17260142
 ] 

Prashant Wason commented on HUDI-1459:
--------------------------------------

The replace functionality works as follows:
 # When HoodieWriteClient::insertOverwrite() is called, the existing files in 
the partitions are not touched; all newly inserted data is written to new 
files instead.
 # HoodieWriteClient::insertOverwrite() creates a .replaceCommit, which is 
another type of commit.
 # The "replaced" files are "ignored" by the code in the FileSystemView 
handlers, even though the files are still present in the partitions.
 # During the archival process, replaced files older than the archival instant 
time are deleted by the code in ReplaceArchivalHelper.

 
So until archival happens, the replaced files still exist on disk and the 
restore API can restore them.
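
The lifecycle above can be sketched as a toy model. This is only an 
illustration under stated assumptions: ReplaceLifecycleSketch and its methods 
are hypothetical names, not Hudi classes, and instant times are assumed to be 
strings that order lexicographically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of the replace lifecycle; not Hudi code.
class ReplaceLifecycleSketch {
    private final Set<String> onDisk = new TreeSet<>();
    // replaced file -> instant time of the replace commit that replaced it
    private final Map<String, String> replacedAt = new HashMap<>();

    // A regular write simply adds a file.
    void write(String file) {
        onDisk.add(file);
    }

    // Models insertOverwrite: the new file is written, the old files are
    // only marked as replaced and stay on disk.
    void insertOverwrite(String instantTime, String newFile, Set<String> replacedFiles) {
        onDisk.add(newFile);
        for (String f : replacedFiles) {
            replacedAt.put(f, instantTime);
        }
    }

    // What a reader sees: replaced files are filtered out of the view.
    Set<String> visibleFiles() {
        Set<String> visible = new TreeSet<>(onDisk);
        visible.removeAll(replacedAt.keySet());
        return visible;
    }

    // What is physically present, including replaced files.
    Set<String> filesOnDisk() {
        return new TreeSet<>(onDisk);
    }

    // Archival deletes replaced files whose replace instant is older than
    // the archival instant (assumed lexicographic ordering of instants).
    void archive(String archivalInstant) {
        replacedAt.entrySet().removeIf(e -> {
            if (e.getValue().compareTo(archivalInstant) < 0) {
                onDisk.remove(e.getKey());
                return true;
            }
            return false;
        });
    }
}
```

Between insertOverwrite() and archive(), filesOnDisk() and visibleFiles() 
differ, which is exactly the window in which a restore can still bring the 
replaced files back.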
 
From the RFC-15 perspective, we want to ensure parity between the files on disk 
and the listing within the metadata table. *So we need to keep the "replaced" 
files within the metadata until they are actually deleted during the archival 
process.* 
 
I am listing various options of doing this below (differences are highlighted 
in green):
 
*Option 1:* 
1. When processing a .replaceCommit, add the new files created to metadata 
(like .commit would have been handled)
{color:#00875a}2. During archival, require the metadata table to have been 
synced (similar to how we may require it for restore){color}
{color:#00875a}3. Perform a commit to metadata table with deleted files before 
completing the archival{color}
 
{color:#0747a6}PRO: none{color}
{color:#0747a6}CON: Require sync of metadata table{color}
{color:#0747a6}CON: Difficult to handle failed archival (partial files deleted, 
etc){color}
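
A minimal sketch of the ordering Option 1 implies, assuming hypothetical names 
throughout (Option1ArchivalSketch, metadataSynced, archiveReplacedFiles are 
illustrative and not Hudi APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of Option 1; not Hudi code.
class Option1ArchivalSketch {
    final Set<String> filesOnDisk = new TreeSet<>();
    final Set<String> metadataListing = new TreeSet<>();
    boolean metadataSynced = false;

    // Deletes replaced files, then records the deletions in the metadata
    // listing before archival is considered complete.
    List<String> archiveReplacedFiles(List<String> replacedFiles) {
        // Step 2: archival is refused unless the metadata table is synced.
        if (!metadataSynced) {
            throw new IllegalStateException("metadata table must be synced before archival");
        }
        List<String> deleted = new ArrayList<>();
        for (String f : replacedFiles) {
            if (filesOnDisk.remove(f)) {
                deleted.add(f);
            }
        }
        // Step 3: commit the deleted files to the metadata table.
        metadataListing.removeAll(deleted);
        return deleted;
    }
}
```

The sketch does not model a crash between the delete loop and the metadata 
commit, which is the failed-archival case listed as a CON above.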
 
*Option 2:* 
1. When processing a .replaceCommit, add the new files created to metadata 
(like .commit would have been handled)
{color:#00875a}2. During archival, save the deleted files in a new 
HoodieCleanMetadata as a .clean instant (as files were deleted){color}
{color:#00875a}3. Extend HoodieCleanMetadata to add a flag dueToReplaceArchival 
(or some better name) so we can differentiate that this was not a real 
clean{color}
{color:#00875a}4. Retain the cleanMetadata.getEarliestCommitToRetain() from 
the last actual clean to support incremental cleaning.{color}
{color:#0747a6}PRO: In tune with how metadata table sync works {color}
{color:#0747a6}CON: Overloading of .clean instant {color}
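
One way to picture the extended clean record from Option 2 is sketched below. 
CleanRecordSketch and forReplaceArchival are hypothetical; the real 
HoodieCleanMetadata is not reproduced here, and only the dueToReplaceArchival 
flag and the carried-over earliest commit to retain are modelled.

```java
import java.util.List;

// Hypothetical stand-in for an extended clean record; not HoodieCleanMetadata.
class CleanRecordSketch {
    final String instantTime;
    final List<String> deletedFiles;
    final String earliestCommitToRetain;
    // True when the "clean" only records files deleted by replace archival.
    final boolean dueToReplaceArchival;

    CleanRecordSketch(String instantTime, List<String> deletedFiles,
                      String earliestCommitToRetain, boolean dueToReplaceArchival) {
        this.instantTime = instantTime;
        this.deletedFiles = deletedFiles;
        this.earliestCommitToRetain = earliestCommitToRetain;
        this.dueToReplaceArchival = dueToReplaceArchival;
    }

    // Builds the record written during archival. The earliest commit to
    // retain is copied from the last actual clean so that incremental
    // cleaning keeps working from the same point.
    static CleanRecordSketch forReplaceArchival(String instantTime,
                                                List<String> deletedFiles,
                                                CleanRecordSketch lastActualClean) {
        return new CleanRecordSketch(instantTime, deletedFiles,
                lastActualClean.earliestCommitToRetain, true);
    }
}
```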
 
*Option 3:*
{color:#00875a}1. When processing a .replaceCommit, add the new files created 
to metadata (like .commit would have been handled). In addition, add the 
"replaced" fileIds with a "replace" flag (we currently have a single delete 
flag) and the "timeWhenReplaced".{color}
{color:#00875a}2. When merging instants (in-memory or when updating the 
metadata table), read the ArchivedTimeline to get the last archived instant. If 
the timeWhenReplaced is < lastArchivedInstant, then consider these files 
deleted.{color}
{color:#0747a6}PRO: All complexity within the merge process of metadata 
code{color}
{color:#0747a6}CON: Need to save additional data (minor, as the size of the 
table won't increase much; these are simply flags and ONLY for replace){color}
{color:#0747a6}CON: Only the replaced fileIds are saved within the 
HoodieCommitMetadata. We will need to convert them to the actual file paths to 
save within the metadata or during the in-memory merge. *This cannot be done by 
listing within the metadata table, as there is a circular dependency: we cannot 
list from the metadata table until the in-memory merge is complete.*{color}
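
The merge rule in Option 3 can be sketched as follows. Option3MergeSketch, 
FileRecord and listFiles are hypothetical names; the sketch assumes that file 
paths are already resolved (which the last CON above says is the hard part) and 
that instant times compare lexicographically as strings.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the Option 3 merge rule; not Hudi code.
class Option3MergeSketch {
    static final class FileRecord {
        final boolean deleted;          // the existing delete flag
        final String timeWhenReplaced;  // null unless the file was replaced

        FileRecord(boolean deleted, String timeWhenReplaced) {
            this.deleted = deleted;
            this.timeWhenReplaced = timeWhenReplaced;
        }
    }

    // Returns the files the metadata listing should report. A replaced file
    // is reported until its replace time is older than the last archived
    // instant, at which point archival is assumed to have deleted it.
    static Set<String> listFiles(Map<String, FileRecord> records, String lastArchivedInstant) {
        Set<String> live = new TreeSet<>();
        for (Map.Entry<String, FileRecord> e : records.entrySet()) {
            FileRecord r = e.getValue();
            if (r.deleted) {
                continue;
            }
            boolean replacedAndArchived = r.timeWhenReplaced != null
                    && lastArchivedInstant != null
                    && r.timeWhenReplaced.compareTo(lastArchivedInstant) < 0;
            if (!replacedAndArchived) {
                live.add(e.getKey());
            }
        }
        return live;
    }
}
```

With this rule no extra commit is needed at archival time; the listing changes 
as soon as the archived timeline advances past timeWhenReplaced.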
 
 
 
I prefer option 3.
 

> Support for handling of REPLACE instants
> ----------------------------------------
>
>                 Key: HUDI-1459
>                 URL: https://issues.apache.org/jira/browse/HUDI-1459
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Prashant Wason
>            Priority: Blocker
>             Fix For: 0.7.0
>
>
> Once we rebase to master, we need to handle replace instants as well, as they 
> show up on the timeline. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
