yihua commented on code in PR #8782:
URL: https://github.com/apache/hudi/pull/8782#discussion_r1207126726
##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/HoodieTableFileSystemView.java:
##########
@@ -199,7 +201,7 @@ protected boolean isPendingCompactionScheduledForFileId(HoodieFileGroupId fgId)
 protected void resetPendingCompactionOperations(Stream<Pair<String, CompactionOperation>> operations) {
 // Build fileId to Pending Compaction Instants
 this.fgIdToPendingCompaction = createFileIdToPendingCompactionMap(operations.map(entry ->
- Pair.of(entry.getValue().getFileGroupId(), Pair.of(entry.getKey(), entry.getValue()))).collect(Collectors.toMap(Pair::getKey, Pair::getValue)));
+ Pair.of(entry.getValue().getFileGroupId(), Pair.of(entry.getKey(), entry.getValue()))).collect(Collectors.toConcurrentMap(Pair::getKey, Pair::getValue)));
 }
Review Comment:
Synced with @danny0405 offline and resolved the confusion.
`AbstractTableFileSystemView::sync()` syncs the file system view from storage
to memory by loading the latest active timeline. The consistency of the file
system view with respect to the latest instant is guarded by the write lock of
the reentrant read-write lock in the file system view. However, `sync()` does
not load file slices or the bootstrap base file mapping
(`fgIdToBootstrapBaseFile`). Subsequent partition view calls load file slices
against the latest timeline. Note that these partition view calls can arrive
concurrently from multiple Spark executors, which triggers concurrent
execution of
`AbstractTableFileSystemView::ensurePartitionLoadedCorrectly(String
partition)`. Inside `ensurePartitionLoadedCorrectly`,
`addFilesToView(statuses)` is called to update the bootstrap base file mapping
(`fgIdToBootstrapBaseFile`). Since `fgIdToBootstrapBaseFile` was not a
concurrent hash map before this change, multiple threads could update it at
the same time without synchronization. Concurrently updating an ordinary hash
map, even with distinct keys, can still corrupt the data in the map.
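
To make the last point concrete, here is a minimal standalone sketch, not
Hudi code. The class `ConcurrentViewMapSketch`, its methods, and the
`fg-`/`base-` key and value strings are hypothetical stand-ins for a view
map such as `fgIdToBootstrapBaseFile`. It only assumes standard JDK behavior:
`Collectors.toConcurrentMap` yields a `ConcurrentMap`, which tolerates
concurrent `put` calls on distinct keys, whereas a plain `HashMap` gives no
such guarantee.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ConcurrentViewMapSketch {

  // Hypothetical stand-in for building a view map from a stream of entries.
  // toConcurrentMap (instead of toMap) returns a thread-safe ConcurrentMap.
  public static Map<String, String> buildInitialMap(int n) {
    return IntStream.range(0, n).boxed()
        .collect(Collectors.toConcurrentMap(i -> "fg-" + i, i -> "base-" + i));
  }

  // Simulates several partition loads updating the same map at once.
  // Each thread writes its own distinct keys; returns the final map size.
  public static int loadPartitionsConcurrently(
      Map<String, String> map, int threads, int keysPerThread) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    for (int t = 0; t < threads; t++) {
      final int id = t;
      pool.submit(() -> {
        try {
          start.await(); // release all writers together to maximize overlap
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        for (int k = 0; k < keysPerThread; k++) {
          map.put("p" + id + "-fg-" + k, "base-" + k);
        }
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("writers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return map.size();
  }

  public static void main(String[] args) {
    Map<String, String> view = buildInitialMap(4);
    int size = loadPartitionsConcurrently(view, 8, 1000);
    System.out.println("entries after concurrent load: " + size);
  }
}
```

With the concurrent map, every distinct key written by every thread is
retained. Swapping in `new HashMap<>()` for the map makes the outcome
undefined: entries may be lost during a concurrent resize even though no two
threads ever write the same key.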