yihua commented on issue #6686:
URL: https://github.com/apache/hudi/issues/6686#issuecomment-1254272868
@asankadarshana007 The consistency check, when enabled, runs while removing
invalid data files: (1) check that all paths to delete exist, (2) delete them,
(3) wait for all paths to disappear, to account for eventual consistency.
Note that this logic is not needed on storage with strong consistency. Because
the invalid data files are now determined from the markers, a marker can exist
for a data file that has not started being written, so check (1) fails, which
is expected and okay. Since there is no use case for eventual consistency at
the moment, we don't maintain this logic.
Let me know if turning off `hoodie.consistency.check.enabled` solves your
problem. You can close the ticket if all is good.
```
if (!invalidDataPaths.isEmpty()) {
  LOG.info("Removing duplicate data files created due to task retries before committing. Paths=" + invalidDataPaths);
  Map<String, List<Pair<String, String>>> invalidPathsByPartition = invalidDataPaths.stream()
      .map(dp -> Pair.of(new Path(basePath, dp).getParent().toString(), new Path(basePath, dp).toString()))
      .collect(Collectors.groupingBy(Pair::getKey));
  // Ensure all files in delete list is actually present. This is mandatory for an eventually consistent FS.
  // Otherwise, we may miss deleting such files. If files are not found even after retries, fail the commit
  if (consistencyCheckEnabled) {
    // This will either ensure all files to be deleted are present.
    waitForAllFiles(context, invalidPathsByPartition, FileVisibility.APPEAR);
  }
  // Now delete partially written files
  context.setJobStatus(this.getClass().getSimpleName(), "Delete all partially written files: " + config.getTableName());
  deleteInvalidFilesByPartitions(context, invalidPathsByPartition);
  // Now ensure the deleted files disappear
  if (consistencyCheckEnabled) {
    // This will either ensure all files to be deleted are absent.
    waitForAllFiles(context, invalidPathsByPartition, FileVisibility.DISAPPEAR);
  }
}
```
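To make the failure mode concrete, here is a minimal, self-contained sketch of
the same three-step flow against the local filesystem. It is not Hudi code: the
class `InvalidFileCleanup`, its method signatures, and the retry count and sleep
interval are assumptions made for illustration, and `java.nio.file` stands in
for the Hadoop `FileSystem` API. It shows that a path which has a marker but was
never written makes step (1) fail when the check is enabled, and is simply
skipped when the check is disabled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Hudi.
class InvalidFileCleanup {

    enum FileVisibility { APPEAR, DISAPPEAR }

    // Polls until every path has the wanted visibility; gives up after maxAttempts.
    static boolean waitForAllFiles(List<Path> paths, FileVisibility target,
                                   int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            boolean allMatch = paths.stream()
                .allMatch(p -> Files.exists(p) == (target == FileVisibility.APPEAR));
            if (allMatch) {
                return true;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis);
            }
        }
        return false;
    }

    static void removeInvalidFiles(List<Path> invalidPaths, boolean consistencyCheckEnabled)
            throws IOException, InterruptedException {
        if (invalidPaths.isEmpty()) {
            return;
        }
        // (1) Only with the check enabled: every path to delete must be visible.
        if (consistencyCheckEnabled
                && !waitForAllFiles(invalidPaths, FileVisibility.APPEAR, 3, 10L)) {
            throw new IOException("Not all files to delete are visible: " + invalidPaths);
        }
        // (2) Delete the files; a path that was never written is skipped.
        for (Path p : invalidPaths) {
            Files.deleteIfExists(p);
        }
        // (3) Only with the check enabled: wait until the deletions are visible.
        if (consistencyCheckEnabled
                && !waitForAllFiles(invalidPaths, FileVisibility.DISAPPEAR, 3, 10L)) {
            throw new IOException("Deleted files are still visible: " + invalidPaths);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("cleanup-demo");
        Path written = Files.createFile(dir.resolve("written.parquet"));
        // Stands in for a data file whose marker exists but whose write never started.
        Path neverWritten = dir.resolve("marker-only.parquet");

        // Check disabled: the written file is deleted, the missing one is ignored.
        removeInvalidFiles(Arrays.asList(written, neverWritten), false);
        System.out.println("written file still exists: " + Files.exists(written));

        // Check enabled: step (1) gives up because one path never appears.
        try {
            removeInvalidFiles(Arrays.asList(neverWritten), true);
        } catch (IOException e) {
            System.out.println("check failed: " + e.getMessage());
        }
    }
}
```

On a strongly consistent store, steps (1) and (3) add nothing, which is why
running with the check disabled is enough in this sketch.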