xushiyan commented on code in PR #238:
URL: https://github.com/apache/hudi-rs/pull/238#discussion_r1921735319
##########
crates/core/src/table/fs_view.rs:
##########
@@ -73,34 +77,84 @@ impl FileSystemView {
.collect();
let mut partition_paths = Vec::new();
for dir in top_level_dirs {
- partition_paths.extend(get_leaf_dirs(storage, Some(&dir)).await?);
+ let leaf_paths = if partition_pruner.is_empty() {
+ // full dirs listing
+ get_leaf_dirs(storage, Some(&dir)).await?
+ } else {
+ // leveled pruning
+ Self::list_partition_paths_with_leveled_pruning(storage, dir,
partition_pruner, 0)
+ .await?
+ };
+ partition_paths.extend(leaf_paths);
}
if partition_paths.is_empty() {
+ // TODO: reconsider is it reasonable to add empty partition path?
For partitioned table, we should return empty vec rather than vec with empty
string
partition_paths.push("".to_string())
}
- if partition_pruner.is_empty() {
- return Ok(partition_paths);
- }
- Ok(partition_paths
- .into_iter()
- .filter(|path_str| partition_pruner.should_include(path_str))
- .collect())
+ Ok(partition_paths.into_iter().collect())
+ }
+
+ #[async_recursion]
+ async fn list_partition_paths_with_leveled_pruning(
+ storage: &Storage,
+ path: String,
+ partition_pruner: &PartitionPruner,
+ current_level: usize,
+ ) -> Result<Vec<String>> {
Review Comment:
can this be merged with the `get_leaf_dirs()`? so that we don't need to go 2
paths that one does multi-level pruning and the other does not. We can by
default use multi-level pruning and early return when possible.
Check out this line
https://github.com/apache/hudi-rs/pull/251/files#diff-a6c5a16b1d83ac9b5d186863a123fd1b273daf8710c12444c9f1b6f6fec96914R140
we probably need to move `get_leaf_dirs()` into `FileLister` such that it's
the lister's responsibility to use pruner to check partition segments.
PartitionPruner owns the partition schema, so we should make use of it to
understand which level is currently parsed and how many levels there should be.
To make code logic easy to understand, maybe iterative is better than
recursive here - just loop through all the top level dirs and go down according
to the schema.
##########
crates/core/src/table/fs_view.rs:
##########
@@ -73,34 +77,84 @@ impl FileSystemView {
.collect();
let mut partition_paths = Vec::new();
for dir in top_level_dirs {
- partition_paths.extend(get_leaf_dirs(storage, Some(&dir)).await?);
+ let leaf_paths = if partition_pruner.is_empty() {
+ // full dirs listing
+ get_leaf_dirs(storage, Some(&dir)).await?
+ } else {
+ // leveled pruning
+ Self::list_partition_paths_with_leveled_pruning(storage, dir,
partition_pruner, 0)
+ .await?
+ };
+ partition_paths.extend(leaf_paths);
}
if partition_paths.is_empty() {
+ // TODO: reconsider is it reasonable to add empty partition path?
For partitioned table, we should return empty vec rather than vec with empty
string
partition_paths.push("".to_string())
}
Review Comment:
thanks for pointing this out. This should be improved. I've made changes for
this in https://github.com/apache/hudi-rs/pull/251 . In short, we should return
`""` for non-partitioned tables, this aligns with the convention we're using in
timeline commit metadata where `""` is also used as a key for the partition
write stats. We should return empty list like you said here for partitioned
table if there is not yet any partition written.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]