[
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397538#comment-17397538
]
ASF GitHub Bot commented on HUDI-2119:
--------------------------------------
prashantwason commented on a change in pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#discussion_r687072435
##########
File path:
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java
##########
@@ -480,6 +482,131 @@ public void testRollbackUnsyncedCommit(HoodieTableType tableType) throws Excepti
client.syncTableMetadata();
validateMetadata(client);
}
+
+ // If an unsynced commit is automatically rolled back during next commit, the rollback commit gets a timestamp
+ // greater than the new commit which is started. Ensure that in this case the rollback is not processed
+ // as the earlier failed commit would not have been committed.
+ //
+ // Dataset: C1 C2 C3.inflight[failed] C4 R5[rolls back C3]
+ // Metadata: C1.delta C2.delta
+ //
+ // When R5 completes, C3.xxx will be deleted. When C4 completes, C4 and R5 will be committed to Metadata Table in
Review comment:
Having a config is acceptable, since HUDI has a variety of use cases.
We saw a production issue where the metadata table did not have the latest
files. Had no exception been thrown:
1. Only an error would have been logged
2. The metadata table would have returned an older file listing
3. Data would have been written to older versions of files
This is a classic data loss scenario, and it would be very difficult to
catch once the logs have rolled over. Hence, I want to err on the side of
data consistency.
Contrast this with another component in HUDI that does something similar:
if the RemoteFileSystemView is unable to get the listing from the TimelineServer,
it logs an error and then falls back to file listing. That fallback raises no
data consistency issues, so it makes sense as a way to avoid user
intervention. But wrong results from the metadata table can lead to data
consistency issues.
BTW, there may be a better way to sync the rollback. Also, much of this
complexity will go away with the synchronous design.
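
To make the trade-off in this comment concrete, here is a minimal sketch of a listing call that either fails fast or falls back, depending on a config flag. Every name in it (`ListingPolicy`, `listFiles`, `failOnError`, `InconsistentMetadataException`) is hypothetical and chosen for illustration; none of it is Hudi's actual API.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: these names are illustrative and are not Hudi's API.
class ListingPolicy {
    /** Raised when the metadata listing cannot be trusted and fail-fast is enabled. */
    static class InconsistentMetadataException extends RuntimeException {
        InconsistentMetadataException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Returns the listing from the metadata source. If that source throws,
     * either rethrows (failOnError = true) or falls back to the direct listing.
     */
    static List<String> listFiles(Supplier<List<String>> metadataListing,
                                  Supplier<List<String>> directListing,
                                  boolean failOnError) {
        try {
            return metadataListing.get();
        } catch (RuntimeException e) {
            if (failOnError) {
                // Err on the side of data consistency: surface the problem to the writer.
                throw new InconsistentMetadataException("metadata listing failed", e);
            }
            // Falling back is safe only because a direct listing is authoritative;
            // returning a stale metadata listing here would not be.
            System.err.println("metadata listing failed, falling back: " + e.getMessage());
            return directListing.get();
        }
    }
}
```

The fallback branch corresponds to the RemoteFileSystemView case, where the alternative source is still correct. The fail-fast branch corresponds to the metadata table case, where continuing would mean writing against an older file listing.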
--
This is an automated message from the Apache Git Service.
> Syncing of rollbacks to metadata table does not work in all cases
> -----------------------------------------------------------------
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Blocker
> Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> The metadata table assumes that a rollback is to be applied if the
> instant-being-rolled-back has a timestamp less than the last deltacommit time
> on the metadata timeline. We do not explicitly check if the
> instant-being-rolled-back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a
> failed/earlier commit. If the files being deleted were never actually
> committed to the metadata table earlier, the deletes cannot be consolidated
> during metadata table reads. This leads to a HoodieMetadataException as we
> cannot differentiate this from a bug where we might have missed committing a
> commit to the metadata table.
>
>
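
The gap described in the issue can be sketched as two predicates: the timestamp-only check the description attributes to the metadata table, and a stricter check that also asks whether the rolled-back instant was ever synced. This is an illustration under stated assumptions, not the fix in the pull request: the class and method names are hypothetical, and instants are modelled as fixed-width strings so that string order matches time order.

```java
import java.util.Set;

// Hypothetical sketch: names are illustrative and are not Hudi's API.
class RollbackSyncCheck {
    /**
     * Timestamp-only check from the issue description: apply the rollback when
     * the rolled-back instant is older than the last deltacommit on the
     * metadata timeline. Assumes fixed-width instant strings.
     */
    static boolean shouldApplyByTimestamp(String rolledBackInstant, String lastDeltaCommit) {
        return rolledBackInstant.compareTo(lastDeltaCommit) < 0;
    }

    /**
     * Stricter check: apply the rollback only when the rolled-back instant was
     * actually written to the metadata table.
     */
    static boolean shouldApplyIfSynced(String rolledBackInstant, Set<String> syncedInstants) {
        return syncedInstants.contains(rolledBackInstant);
    }
}
```

With the timeline from the test comment (C1 and C2 synced, C3 failed, C4 completed, R5 rolling back C3), the timestamp check accepts the rollback of C3 once C4 is on the metadata timeline, even though C3 was never synced. The stricter check rejects it.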
--
This message was sent by Atlassian Jira
(v8.3.4#803005)