Re: [PR] feat: Add substrait support for multiple files [datafusion]

via GitHub Tue, 24 Feb 2026 08:27:27 -0800


benbellick commented on code in PR #19729:
URL: https://github.com/apache/datafusion/pull/19729#discussion_r2848212825



##########
datafusion/substrait/src/logical_plan/consumer/rel/read_rel.rs:
##########
@@ -176,45 +176,100 @@ pub async fn from_read_rel(
             }))
         }
         Some(ReadType::LocalFiles(lf)) => {
-            fn extract_filename(name: &str) -> Option<String> {
-                let corrected_url =
-                    if name.starts_with("file://") && 
!name.starts_with("file:///") {
-                        name.replacen("file://", "file:///", 1)
-                    } else {
-                        name.to_string()
-                    };
-
-                Url::parse(&corrected_url).ok().and_then(|url| {
-                    let path = url.path();
-                    std::path::Path::new(path)
-                        .file_name()
-                        .map(|filename| filename.to_string_lossy().to_string())
-                })
+            /// Extracts the path string from a PathType variant.
+            /// Normalizes file:// URLs to file:/// format for proper URL 
parsing.
+            fn extract_path(path_type: Option<&PathType>) -> Option<String> {
+                let path_str = match path_type? {
+                    PathType::UriPath(p) => p.clone(),
+                    PathType::UriPathGlob(p) => p.clone(),
+                    PathType::UriFile(p) => p.clone(),
+                    PathType::UriFolder(p) => p.clone(),
+                };
+
+                // Normalize file:// to file:/// for proper URL parsing
+                let normalized = if path_str.starts_with("file://")
+                    && !path_str.starts_with("file:///")
+                {
+                    path_str.replacen("file://", "file:///", 1)
+                } else {
+                    path_str
+                };

Review Comment:
   Furthermore, shouldn't we pass around the `Url` instead of a string file 
path? We may need the scheme later. [Per the substrait 
spec](https://github.com/substrait-io/substrait/blob/7636cee7e5ebde47384e2da13c44f840b3fc288c/proto/substrait/algebra.proto#L176-L178),
 we may use the scheme to determine if we actually need to read the file from 
e.g. `s3`. 



##########
datafusion/substrait/src/logical_plan/consumer/rel/read_rel.rs:
##########
@@ -176,45 +176,100 @@ pub async fn from_read_rel(
             }))
         }
         Some(ReadType::LocalFiles(lf)) => {
-            fn extract_filename(name: &str) -> Option<String> {
-                let corrected_url =
-                    if name.starts_with("file://") && 
!name.starts_with("file:///") {
-                        name.replacen("file://", "file:///", 1)
-                    } else {
-                        name.to_string()
-                    };
-
-                Url::parse(&corrected_url).ok().and_then(|url| {
-                    let path = url.path();
-                    std::path::Path::new(path)
-                        .file_name()
-                        .map(|filename| filename.to_string_lossy().to_string())
-                })
+            /// Extracts the path string from a PathType variant.
+            /// Normalizes file:// URLs to file:/// format for proper URL 
parsing.
+            fn extract_path(path_type: Option<&PathType>) -> Option<String> {
+                let path_str = match path_type? {
+                    PathType::UriPath(p) => p.clone(),
+                    PathType::UriPathGlob(p) => p.clone(),
+                    PathType::UriFile(p) => p.clone(),
+                    PathType::UriFolder(p) => p.clone(),
+                };
+
+                // Normalize file:// to file:/// for proper URL parsing
+                let normalized = if path_str.starts_with("file://")
+                    && !path_str.starts_with("file:///")
+                {
+                    path_str.replacen("file://", "file:///", 1)
+                } else {
+                    path_str
+                };

Review Comment:
   Just a question, I realize that this is consistent with the previous code, 
but why is it the responsibility of this point in the code to "normalize" the 
url? 
   
   I put normalize in quotes because aren't we actually just taking an 
incorrect url and modifying it to be correct? I would consider that malformed 
substrait and throw an error personally. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: Add substrait support for multiple files [datafusion]

Reply via email to