benbellick commented on code in PR #19729:
URL: https://github.com/apache/datafusion/pull/19729#discussion_r2848212825
##########
datafusion/substrait/src/logical_plan/consumer/rel/read_rel.rs:
##########
@@ -176,45 +176,100 @@ pub async fn from_read_rel(
}))
}
Some(ReadType::LocalFiles(lf)) => {
- fn extract_filename(name: &str) -> Option<String> {
- let corrected_url =
- if name.starts_with("file://") &&
!name.starts_with("file:///") {
- name.replacen("file://", "file:///", 1)
- } else {
- name.to_string()
- };
-
- Url::parse(&corrected_url).ok().and_then(|url| {
- let path = url.path();
- std::path::Path::new(path)
- .file_name()
- .map(|filename| filename.to_string_lossy().to_string())
- })
+ /// Extracts the path string from a PathType variant.
+ /// Normalizes file:// URLs to file:/// format for proper URL
parsing.
+ fn extract_path(path_type: Option<&PathType>) -> Option<String> {
+ let path_str = match path_type? {
+ PathType::UriPath(p) => p.clone(),
+ PathType::UriPathGlob(p) => p.clone(),
+ PathType::UriFile(p) => p.clone(),
+ PathType::UriFolder(p) => p.clone(),
+ };
+
+ // Normalize file:// to file:/// for proper URL parsing
+ let normalized = if path_str.starts_with("file://")
+ && !path_str.starts_with("file:///")
+ {
+ path_str.replacen("file://", "file:///", 1)
+ } else {
+ path_str
+ };
Review Comment:
Furthermore, shouldn't we pass around the `Url` instead of a string file
path? We may need the scheme later. [Per the substrait
spec](https://github.com/substrait-io/substrait/blob/7636cee7e5ebde47384e2da13c44f840b3fc288c/proto/substrait/algebra.proto#L176-L178),
we may use the scheme to determine if we actually need to read the file from
e.g. `s3`.
##########
datafusion/substrait/src/logical_plan/consumer/rel/read_rel.rs:
##########
@@ -176,45 +176,100 @@ pub async fn from_read_rel(
}))
}
Some(ReadType::LocalFiles(lf)) => {
- fn extract_filename(name: &str) -> Option<String> {
- let corrected_url =
- if name.starts_with("file://") &&
!name.starts_with("file:///") {
- name.replacen("file://", "file:///", 1)
- } else {
- name.to_string()
- };
-
- Url::parse(&corrected_url).ok().and_then(|url| {
- let path = url.path();
- std::path::Path::new(path)
- .file_name()
- .map(|filename| filename.to_string_lossy().to_string())
- })
+ /// Extracts the path string from a PathType variant.
+ /// Normalizes file:// URLs to file:/// format for proper URL
parsing.
+ fn extract_path(path_type: Option<&PathType>) -> Option<String> {
+ let path_str = match path_type? {
+ PathType::UriPath(p) => p.clone(),
+ PathType::UriPathGlob(p) => p.clone(),
+ PathType::UriFile(p) => p.clone(),
+ PathType::UriFolder(p) => p.clone(),
+ };
+
+ // Normalize file:// to file:/// for proper URL parsing
+ let normalized = if path_str.starts_with("file://")
+ && !path_str.starts_with("file:///")
+ {
+ path_str.replacen("file://", "file:///", 1)
+ } else {
+ path_str
+ };
Review Comment:
Just a question, I realize that this is consistent with the previous code,
but why is it the responsibility of this point in the code to "normalize" the
url?
I put normalize in quotes because aren't we actually just taking an
incorrect url and modifying it to be correct? I would consider that malformed
substrait and throw an error personally.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]