xushiyan commented on code in PR #238:
URL: https://github.com/apache/hudi-rs/pull/238#discussion_r1921735319


##########
crates/core/src/table/fs_view.rs:
##########
@@ -73,34 +77,84 @@ impl FileSystemView {
             .collect();
         let mut partition_paths = Vec::new();
         for dir in top_level_dirs {
-            partition_paths.extend(get_leaf_dirs(storage, Some(&dir)).await?);
+            let leaf_paths = if partition_pruner.is_empty() {
+                // full dirs listing
+                get_leaf_dirs(storage, Some(&dir)).await?
+            } else {
+                // leveled pruning
+                Self::list_partition_paths_with_leveled_pruning(storage, dir, 
partition_pruner, 0)
+                    .await?
+            };
+            partition_paths.extend(leaf_paths);
         }
         if partition_paths.is_empty() {
+            // TODO: reconsider is it reasonable to add empty partition path? 
For partitioned table, we should return empty vec rather than vec with empty 
string
             partition_paths.push("".to_string())
         }
-        if partition_pruner.is_empty() {
-            return Ok(partition_paths);
-        }
 
-        Ok(partition_paths
-            .into_iter()
-            .filter(|path_str| partition_pruner.should_include(path_str))
-            .collect())
+        Ok(partition_paths.into_iter().collect())
+    }
+
+    #[async_recursion]
+    async fn list_partition_paths_with_leveled_pruning(
+        storage: &Storage,
+        path: String,
+        partition_pruner: &PartitionPruner,
+        current_level: usize,
+    ) -> Result<Vec<String>> {

Review Comment:
   can this be merged with the `get_leaf_dirs()`? so that we don't need to go 2 
paths that one does multi-level pruning and the other does not. We can by 
default use multi-level pruning and early return when possible.
   
   Check out this line 
https://github.com/apache/hudi-rs/pull/251/files#diff-a6c5a16b1d83ac9b5d186863a123fd1b273daf8710c12444c9f1b6f6fec96914R140
   
   we probably need to move `get_leaf_dirs()` into `FileLister` such that it's 
the lister's responsibility to use pruner to check partition segments.
   
   PartitionPruner owns the partition schema, so we should make use of it to 
understand which level is currently parsed and how many levels there should be. 
   
   To make code logic easy to understand, maybe iterative is better than 
recursive here - just loop through all the top level dirs and go down according 
to the schema.



##########
crates/core/src/table/fs_view.rs:
##########
@@ -73,34 +77,84 @@ impl FileSystemView {
             .collect();
         let mut partition_paths = Vec::new();
         for dir in top_level_dirs {
-            partition_paths.extend(get_leaf_dirs(storage, Some(&dir)).await?);
+            let leaf_paths = if partition_pruner.is_empty() {
+                // full dirs listing
+                get_leaf_dirs(storage, Some(&dir)).await?
+            } else {
+                // leveled pruning
+                Self::list_partition_paths_with_leveled_pruning(storage, dir, 
partition_pruner, 0)
+                    .await?
+            };
+            partition_paths.extend(leaf_paths);
         }
         if partition_paths.is_empty() {
+            // TODO: reconsider is it reasonable to add empty partition path? 
For partitioned table, we should return empty vec rather than vec with empty 
string
             partition_paths.push("".to_string())
         }

Review Comment:
   thanks for pointing this out. This should be improved. I've made changes for 
this in https://github.com/apache/hudi-rs/pull/251 . In short, we should return 
`""` for non-partitioned tables, this aligns with the convention we're using in 
timeline commit metadata where `""` is also used as a key for the partition 
write stats. We should return empty list like you said here for partitioned 
table if there is not yet any partition written.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to