codope commented on a change in pull request #4810:
URL: https://github.com/apache/hudi/pull/4810#discussion_r808806167
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##########
@@ -380,6 +380,19 @@ protected boolean
isBaseFileDueToPendingCompaction(HoodieBaseFile baseFile) {
&&
baseFile.getCommitTime().equals(compactionWithInstantTime.get().getKey());
}
+ /**
+ * With async clustering, it is possible to see partial/complete base-files
due to inflight-clustering, Ignore those
+ * base-files.
+ *
+ * @param baseFile base File
+ */
+ protected boolean isBaseFileDueToPendingClustering(HoodieBaseFile baseFile) {
+ List<String> pendingReplaceInstants =
+
metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toList());
Review comment:
Can we not reuse `isPendingClusteringScheduledForFileId()` or
`getPendingClusteringInstant()`? So, we maintain a map of
`fgIdToPendingClustering` which supports various methods. If we can reuse one
of them then we need to call active timeline.
##########
File path:
hudi-common/src/test/java/org/apache/hudi/common/table/view/TestHoodieTableFileSystemView.java
##########
@@ -1537,6 +1542,234 @@ public void testPendingClusteringOperations() throws
IOException {
assertFalse(fileIds.contains(fileId3));
}
+ /**
+ *
+ * create hoodie table like
+ * .
+ * ├── .hoodie
+ * │ ├── .aux
+ * │ │ └── .bootstrap
+ * │ │ ├── .fileids
+ * │ │ └── .partitions
+ * │ ├── .temp
+ * │ ├── 1.commit
+ * │ ├── 1.commit.requested
+ * │ ├── 1.inflight
+ * │ ├── 2.replacecommit
+ * │ ├── 2.replacecommit.inflight
+ * │ ├── 2.replacecommit.requested
+ * │ ├── 3.commit
+ * │ ├── 3.commit.requested
+ * │ ├── 3.inflight
+ * │ ├── archived
+ * │ └── hoodie.properties
+ * └── 2020
+ * └── 06
+ * └── 27
+ * ├── 5fe477d2-0150-46d4-833c-1e9cc8da9948_1-0-1_3.parquet
+ * ├── 7e3208c8-fdec-4254-9682-8fff1e51ee8d_1-0-1_2.parquet
+ * ├── e04b0e2d-1467-46b2-8ea6-f4fe950965a5_1-0-1_1.parquet
+ * └── f3936b66-b3db-4fc8-a6d0-b1a7559016e6_1-0-1_1.parquet
+ *
+ * First test fsView API with finished clustering:
+ * 1. getLatestBaseFilesBeforeOrOn
+ * 2. getBaseFileOn
+ * 3. getLatestBaseFilesInRange
+ * 4. getAllBaseFiles
+ * 5. getLatestBaseFiles
Review comment:
What about other base file related APIs like `fetchLatestBaseFiles`,
`fetchAllBaseFiles`? Are they all covered by this change?
PS: I think we should take a follow up task to make FSView APIs more
uniform.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]