[
https://issues.apache.org/jira/browse/HUDI-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260142#comment-17260142
]
Prashant Wason commented on HUDI-1459:
--------------------------------------
The replace functionality is as follows:
# When HoodieWriteClient::insertOverwrite() is called, the existing files in
partitions will not be affected and all new inserted data will be written to
new files instead
# HoodieWriteClient::insertOverwrite() creates a .replaceCommit which is
another type of commit
# The “replaced” files are “ignored” via the code in FileSystemView handlers
even though the files are still present on the partitions
# During the archival process, the replaced files older than the archival
instant time are deleted through code in ReplaceArchivalHelper.
So until archival happens, the replaced files still exist on disk and the
restore API can restore them.
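The lifecycle above can be sketched roughly as follows. This is a simplified model, not Hudi's FileSystemView or ReplaceArchivalHelper: the class and method names are hypothetical, partitions are ignored, and archival deletes every replaced file rather than only those older than the archival instant time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: files stay on disk after a replace, but readers skip
// any file that has been marked as replaced.
class ReplacedFileViewSketch {
    private final List<String> filesOnDisk = new ArrayList<>();
    private final Set<String> replacedFiles = new HashSet<>();

    void addFile(String file) {
        filesOnDisk.add(file);
    }

    // Models insertOverwrite(): existing files are only marked as replaced,
    // and the new data lands in new files.
    void insertOverwrite(List<String> newFiles) {
        replacedFiles.addAll(filesOnDisk);
        filesOnDisk.addAll(newFiles);
    }

    // What a reader sees: everything on disk minus the replaced files.
    List<String> visibleFiles() {
        List<String> visible = new ArrayList<>();
        for (String f : filesOnDisk) {
            if (!replacedFiles.contains(f)) {
                visible.add(f);
            }
        }
        return visible;
    }

    // Models archival: only now are the replaced files physically removed.
    void archive() {
        filesOnDisk.removeAll(replacedFiles);
        replacedFiles.clear();
    }

    List<String> filesOnDisk() {
        return new ArrayList<>(filesOnDisk);
    }
}
```

Between insertOverwrite() and archive(), filesOnDisk() and visibleFiles() disagree, which is the window in which a restore can still bring the replaced files back.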
From an RFC-15 perspective, we want to ensure parity between the files on disk
and the listing within the metadata table. *So we need to keep the "replaced"
files within the metadata until they are actually deleted during the archival
process.*
I am listing the options for doing this below (differences are highlighted
in green):
*Option 1:*
1. When processing a .replaceCommit, add the new files created to metadata
(like .commit would have been handled)
{color:#00875a}2. During archival, require the metadata table to have been
synced (similar to how we may require it for restore){color}
{color:#00875a}3. Perform a commit to metadata table with deleted files before
completing the archival{color}
{color:#0747a6}PRO: none{color}
{color:#0747a6}CON: Requires sync of metadata table{color}
{color:#0747a6}CON: Difficult to handle failed archival (partial files deleted,
etc){color}
*Option 2:*
1. When processing a .replaceCommit, add the new files created to metadata
(like .commit would have been handled)
{color:#00875a}2. During archival, save the deleted files in a new
HoodieCleanMetadata as a .clean instant (as files were deleted){color}
{color:#00875a}3. Extend HoodieCleanMetadata to add a flag dueToReplaceArchival
(or some better name) so we can differentiate that this was not a real
clean{color}
{color:#00875a}4. Retain the cleanMetadata.getEarliestCommitToRetain() from the
last actual clean to support incremental cleaning.{color}
{color:#0747a6}PRO: In tune with how metadata table sync works{color}
{color:#0747a6}CON: Overloading of .clean instant{color}
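A minimal sketch of what steps 2 to 4 of Option 2 could look like. CleanRecordSketch is a hypothetical stand-in for HoodieCleanMetadata, not its real shape; only the names dueToReplaceArchival and earliestCommitToRetain come from the proposal above, and the factory method is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HoodieCleanMetadata extended with the new flag.
final class CleanRecordSketch {
    final List<String> deletedFiles;
    final String earliestCommitToRetain;
    // True when the files were deleted by replace archival, not a real clean.
    final boolean dueToReplaceArchival;

    CleanRecordSketch(List<String> deletedFiles, String earliestCommitToRetain,
                      boolean dueToReplaceArchival) {
        this.deletedFiles = new ArrayList<>(deletedFiles);
        this.earliestCommitToRetain = earliestCommitToRetain;
        this.dueToReplaceArchival = dueToReplaceArchival;
    }

    // Built during archival: records the deleted replaced files, but carries
    // the retention point forward from the last actual clean (step 4) so that
    // incremental cleaning still starts from the right instant.
    static CleanRecordSketch fromReplaceArchival(List<String> deletedFiles,
                                                 CleanRecordSketch lastActualClean) {
        String retain = lastActualClean == null
                ? null
                : lastActualClean.earliestCommitToRetain;
        return new CleanRecordSketch(deletedFiles, retain, true);
    }
}
```

Metadata table sync would treat such a record like any other clean (apply the deletes), while the cleaner would check the flag to tell it apart from a real clean.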
*Option 3:*
{color:#00875a}1. When processing a .replaceCommit, add the new files created
to metadata (like .commit would have been handled). In addition, add the
"replaced" fileIds with a "replace" flag (we currently have a single delete
flag) and the "timeWhenReplaced".{color}
{color:#00875a}2. When merging instants (in-memory or when updating the
metadata table), read the ArchivedTimeline to get the last archived instant. If
the timeWhenReplaced is < lastArchivedInstant then consider these files
deleted.{color}
{color:#0747a6}PRO: All complexity within the merge process of metadata
code{color}
{color:#0747a6}CON: Need to save additional data (minor, as the size of the
table won't increase much since these are simply flags and ONLY for
replace){color}
{color:#0747a6}CON: Only the replaced fileIds are saved within the
HoodieCommitMetadata. We will need to convert them to the actual file paths to
save within the metadata or during the in-memory merge. *This cannot be done by
listing within the metadata table as there is a circular dependency - we cannot
list from the metadata table until the in-memory merge is complete.*{color}
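The merge rule in step 2 of Option 3 could be sketched like this. The class, the FileEntry record and the method names are hypothetical; the sketch also assumes instant times are fixed-width timestamp strings, so that string comparison matches chronological order.

```java
import java.util.ArrayList;
import java.util.List;

class ReplaceMergeSketch {
    // Hypothetical metadata entry carrying the proposed "replace" flag and
    // the timeWhenReplaced (null when the file was never replaced).
    static final class FileEntry {
        final String fileId;
        final boolean replaced;
        final String timeWhenReplaced;

        FileEntry(String fileId, boolean replaced, String timeWhenReplaced) {
            this.fileId = fileId;
            this.replaced = replaced;
            this.timeWhenReplaced = timeWhenReplaced;
        }
    }

    // A replaced file counts as deleted only once archival has moved past
    // the instant at which it was replaced.
    static boolean isDeleted(FileEntry e, String lastArchivedInstant) {
        return e.replaced
                && lastArchivedInstant != null
                && e.timeWhenReplaced.compareTo(lastArchivedInstant) < 0;
    }

    // Files the metadata table should still list, i.e. those still on disk.
    static List<String> filesStillListed(List<FileEntry> entries,
                                         String lastArchivedInstant) {
        List<String> listed = new ArrayList<>();
        for (FileEntry e : entries) {
            if (!isDeleted(e, lastArchivedInstant)) {
                listed.add(e.fileId);
            }
        }
        return listed;
    }
}
```

Note that the sketch works on fileIds only; mapping them to actual file paths is the open problem described in the last CON above.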
I prefer option 3.
> Support for handling of REPLACE instants
> ----------------------------------------
>
> Key: HUDI-1459
> URL: https://issues.apache.org/jira/browse/HUDI-1459
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Writer Core
> Reporter: Vinoth Chandar
> Assignee: Prashant Wason
> Priority: Blocker
> Fix For: 0.7.0
>
>
> Once we rebase to master, we need to handle replace instants as well, as they
> show up on the timeline.
>
>