vinishjail97 commented on code in PR #814:
URL: https://github.com/apache/incubator-xtable/pull/814#discussion_r2897259556


##########
xtable-core/src/main/java/org/apache/xtable/hudi/HudiPathUtils.java:
##########
@@ -28,4 +31,18 @@ public static String getPartitionPath(Path tableBasePath, 
Path filePath) {
     int endIndex = pathStr.length() - fileName.length() - 1;
     return endIndex <= startIndex ? "" : pathStr.substring(startIndex, 
endIndex);
   }
+
+  /** Filters out metadata/hidden directory paths like _delta_log and .hoodie. 
*/
+  public static List<String> filterMetadataPaths(List<String> partitionPaths) {
+    return partitionPaths.stream()
+        .filter(
+            p -> {
+              if (p.isEmpty()) {
+                return true;
+              }
+              String name = new Path(p).getName();
+              return !name.startsWith("_") && !name.startsWith(".");

Review Comment:
   The `startsWith("_")` filter is too broad. It would silently drop legitimate 
user partitions where the partition column *name* starts with `_`, e.g., 
`_status=active` or `_year=2024` — `getName()` returns the last path component, 
so a partition path like `region=US/_status=active` would be incorrectly 
filtered.
   
   Consider filtering only known metadata directory names (`.hoodie`, 
`_delta_log`) instead of any path starting with these characters? 



##########
xtable-core/src/test/java/org/apache/xtable/hudi/catalog/TestHudiCatalogPartitionSyncTool.java:
##########
@@ -120,6 +120,21 @@ private void setupCommonMocks() {
     mockHudiCatalogPartitionSyncTool = createMockHudiPartitionSyncTool();
   }
 
+  @Test
+  void testGetAllPartitionPathsOnStorageFiltersMetadataDirs() {
+    setupCommonMocks();
+    try (MockedStatic<FSUtils> mockFSUtils = mockStatic(FSUtils.class)) {
+      mockFSUtils
+          .when(() -> FSUtils.getAllPartitionPaths(any(), eq(TEST_BASE_PATH), 
eq(true), eq(false)))
+          .thenReturn(Arrays.asList("key1", "_delta_log", ".hoodie", "key2"));
+
+      List<String> result =
+          
mockHudiCatalogPartitionSyncTool.getAllPartitionPathsOnStorage(TEST_BASE_PATH);
+
+      assertEquals(Arrays.asList("key1", "key2"), result);

Review Comment:
   `filterMetadataPaths` is called in 3 places (`BaseFileUpdatesExtractor`, 
`HudiDataFileExtractor`, `HudiCatalogPartitionSyncTool`) but is only tested 
indirectly through one of them. There is no `TestHudiPathUtils` unit test for 
the method itself.
   
   Please add a dedicated unit test covering:
   1. Nested path with a metadata dir as the last segment (e.g., 
`year=2024/_delta_log` → filtered)
   2. Empty string → kept
   3. A partition whose column name starts with `_` (e.g., `_status=active`) to 
explicitly document whether this is expected to be filtered or not



##########
xtable-core/src/main/java/org/apache/xtable/hudi/HudiPathUtils.java:
##########
@@ -28,4 +31,18 @@ public static String getPartitionPath(Path tableBasePath, 
Path filePath) {
     int endIndex = pathStr.length() - fileName.length() - 1;
     return endIndex <= startIndex ? "" : pathStr.substring(startIndex, 
endIndex);
   }
+
+  /** Filters out metadata/hidden directory paths like _delta_log and .hoodie. 
*/
+  public static List<String> filterMetadataPaths(List<String> partitionPaths) {
+    return partitionPaths.stream()
+        .filter(
+            p -> {
+              if (p.isEmpty()) {
+                return true;
+              }
+              String name = new Path(p).getName();

Review Comment:
   Nit: `new Path(p).getName()` allocates a Hadoop `Path` object just to 
extract the last segment of a relative partition path string. Consider 
`p.substring(p.lastIndexOf('/') + 1)` to avoid the unnecessary allocation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to