[
https://issues.apache.org/jira/browse/HUDI-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718336#comment-17718336
]
sivabalan narayanan commented on HUDI-6153:
-------------------------------------------
Thanks for the detailed write-up.
Regarding restore:
Case where the cleaner is aggressive:
* Generally we can only restore to a commit which was never cleaned up. So,
what are the chances that Clean5 and Clean8 in the above example actually
cleaned up something prior to C3 that needs to be fixed (because all files
pertaining to C3 should be intact)? In other words, wouldn't Clean5 and Clean8
very likely clean up only files whose commit times are > C3, so that we don't
even need to worry about re-syncing?
* Generally we will see more clean commits retained in the active timeline
compared to regular commits. So, I suspect cleans are archived more
aggressively compared to regular commits.
* Regarding archival being more aggressive: out of the box, archival stops at
the first savepointed commit. So, we don't need to solve this case.
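To make the first point concrete, here is a minimal sketch of the check I have in mind. It is not Hudi code: the function name and the input shape (a map from clean instant to the commit times of the files that clean deleted) are hypothetical, and it assumes instant times are fixed-width strings that compare in timeline order. A clean is flagged only if it deleted a file written at or before the restore instant.

```python
def cleans_needing_resync(restore_instant, deleted_by_clean):
    """Return the clean instants that deleted files written at or before
    the restore instant (hypothetical helper, not a Hudi API).

    deleted_by_clean: dict mapping a clean instant to the list of commit
    times of the files that clean removed.
    """
    flagged = []
    for clean_instant, commit_times in deleted_by_clean.items():
        # Cleans that completed before the restore point are not undone.
        if clean_instant <= restore_instant:
            continue
        # Only files at or before the restore instant matter for re-syncing.
        if any(t <= restore_instant for t in commit_times):
            flagged.append(clean_instant)
    return sorted(flagged)


# Illustrative data only: restore to C3, with Clean5 and Clean8 after it.
likely_case = {"005": ["004"], "008": ["006", "007"]}
unlikely_case = {"005": ["002"], "008": ["006"]}
print(cleans_needing_resync("003", likely_case))
print(cleans_needing_resync("003", unlikely_case))
```

If the first case is the common one, the list is empty and there is nothing to re-sync after the restore.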
But I am in line with your proposal to do an FS-based listing and an MDT-based
listing and apply any mismatches. We need this regardless. As of now, in OSS,
if a table enters an inconsistent state, we don't have any easy way to recover
other than suggesting that users completely delete the MDT and re-bootstrap.
So, we should expose this in the CLI as well.
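A rough sketch of the listing comparison, to make the idea concrete. This is plain Python over in-memory listings, not the actual MDT writer or any Hudi API: the function name and the dict-of-partition-to-file-names shape are assumptions for illustration. The filesystem listing is treated as the source of truth.

```python
def diff_listings(fs_listing, mdt_listing):
    """Compare a filesystem listing with a metadata-table listing.

    Both arguments map a partition path to an iterable of file names
    (hypothetical shape). Returns, per mismatched partition, the files
    to add to the MDT and the stale entries to remove from it.
    """
    fixes = {}
    for partition in fs_listing.keys() | mdt_listing.keys():
        fs_files = set(fs_listing.get(partition, ()))
        mdt_files = set(mdt_listing.get(partition, ()))
        missing_in_mdt = fs_files - mdt_files  # on storage, unknown to MDT
        stale_in_mdt = mdt_files - fs_files    # in MDT, gone from storage
        if missing_in_mdt or stale_in_mdt:
            fixes[partition] = {
                "add": sorted(missing_in_mdt),
                "remove": sorted(stale_in_mdt),
            }
    return fixes


# Illustrative data only.
fs = {"2023/05/01": ["f1.parquet", "f2.parquet"], "2023/05/02": ["f3.parquet"]}
mdt = {"2023/05/01": ["f1.parquet", "f9.parquet"], "2023/05/02": ["f3.parquet"]}
print(diff_listings(fs, mdt))
```

A CLI command could print this diff first (dry run) and apply it only on confirmation.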
Let me know what you think. We can sync up face to face to discuss more.
> Change the rollback mechanism for MDT to actual rollbacks rather than
> appending revert blocks
> ---------------------------------------------------------------------------------------------
>
> Key: HUDI-6153
> URL: https://issues.apache.org/jira/browse/HUDI-6153
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
>
> When rolling back completed commits for indexes like record-index, the list
> of all keys removed from the dataset is required. This information cannot be
> available during rollback processing in MDT since the files have already been
> deleted during the rollback inflight processing.
> Hence, the current MDT rollback mechanism of adding -files, -col_stats
> entries does not work for record index.
> This PR changes the rollback mechanism to actually roll back deltacommits on
> the MDT. This makes the rollback handling faster and keeps the MDT in sync
> with the dataset.