jccampagne opened a new issue, #238:
URL: https://github.com/apache/arrow-rs-object-store/issues/238
**Describe the bug**
The issue was discovered when trying to read a percent-encoded filename
using DataFusion.
Upon investigation, the issue may lie partly in the `LocalFileSystem` function
`path_to_filesystem`: this function percent-encodes the path when calling
`extend` on `parts`.
As such, a file such as `L%3ABC.parquet` cannot be accessed by
`LocalFileSystem`.
`L%3ABC.parquet` is the actual filename on the file system as reported by
the command line `ls`, for example.
Note: `L%3ABC` is the percent-encoded version of `L:BC`.
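To make the suspected double encoding concrete, here is a minimal sketch using only the standard library. The helper `encode_segment` is hypothetical and escapes only `%` and `:`; it is a simplified stand-in, not the encoding rules that `object_store` actually applies. It only illustrates why encoding a name that already contains `%3A` produces `%253A`.

```rust
/// Percent-encode a path segment, escaping only `%` and `:`.
/// Hypothetical, simplified stand-in for illustration; this is NOT
/// the encoding set used by `object_store`.
fn encode_segment(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for c in segment.chars() {
        match c {
            '%' => out.push_str("%25"),
            ':' => out.push_str("%3A"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    // A logical name containing a colon encodes to the name used in this report.
    println!("{}", encode_segment("L:BC.parquet"));
    // Encoding the on-disk name again escapes the `%` itself.
    println!("{}", encode_segment("L%3ABC.parquet"));
}
```

Under these assumptions, the second call yields `L%253ABC.parquet`, which is the name that appears in the `NotFound` error reported in this issue.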
**To Reproduce**
Create a data file (not using `LocalFileSystem`) named `L%3ABC.parquet`.
Creating the file on the file system outside of `LocalFileSystem` is easiest.
This file could be created from tools like Python/Pandas, a database, etc.
Try reading it with DataFusion as follows:
```
{
let options = ParquetReadOptions::default();
let fs_name = "L%3ABC.parquet"; // this is the name on the file system
let r = ctx.read_parquet(fs_name, options).await.unwrap();
}
```
This will fail with an error similar to (added newlines):
```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
ObjectStore(NotFound { path: "...../data/L%253ABC.parquet",
source: Os { code: 2, kind: NotFound, message: "No such file or
directory" } })', src/main.rs:88:62
```
Trying other variations does not help: `L:BC.parquet` produces an error elsewhere.
If the file is named on the filesystem `L%253ABC.parquet`, one can read the
file with:
```
let fs_name = "L%3ABC.parquet"; // this will read L%253ABC.parquet
let r = ctx.read_parquet(fs_name, options).await.unwrap();
```
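The mismatch between the on-disk name and the resolved name can also be illustrated without DataFusion, using only `std::fs`. This is a rough sketch: the resolved name `L%253ABC.parquet` is taken from the error message above, while the helper names and the scratch directory are hypothetical. The files created are empty placeholders, not valid Parquet files.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Returns true if `name` can be opened inside `dir`.
fn can_open(dir: &Path, name: &str) -> bool {
    File::open(dir.join(name)).is_ok()
}

/// Creates the file as named in this report, then checks whether the
/// double-encoded name resolves before and after that second file exists.
fn demo(dir: &Path) -> io::Result<(bool, bool)> {
    fs::create_dir_all(dir)?;
    // The file as it exists on disk in this report (empty placeholder).
    File::create(dir.join("L%3ABC.parquet"))?;
    // The name from the NotFound error message.
    let resolved = "L%253ABC.parquet";
    let before = can_open(dir, resolved);
    // Creating a file under the double-encoded name makes the lookup succeed.
    File::create(dir.join(resolved))?;
    let after = can_open(dir, resolved);
    Ok((before, after))
}

fn main() -> io::Result<()> {
    // Hypothetical scratch directory; removed first so the run starts clean.
    let dir = std::env::temp_dir().join("percent_name_demo");
    let _ = fs::remove_dir_all(&dir);
    let (before, after) = demo(&dir)?;
    println!("found before: {before}, found after: {after}");
    fs::remove_dir_all(&dir)
}
```

The first lookup fails because the literal name `L%253ABC.parquet` does not exist on disk; it only succeeds once a file with that exact name is created.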
**Expected behavior**
The file `L%3ABC.parquet` should be read by the context.
**Additional context**
After some investigation, there might be some double encoding going on in
Object Store's `LocalFileSystem`, in `Config::path_to_filesystem` here:
https://github.com/apache/arrow-rs/blob/38764c26db511ea13538042f229e817562f02f74/object_store/src/local.rs#L228
More specifically `extend()` will percent-encode the string again:
https://github.com/apache/arrow-rs/blob/38764c26db511ea13538042f229e817562f02f74/object_store/src/local.rs#L235
I worked out a tentative patch, but I came across this "illegal" path test:
https://github.com/apache/arrow-rs/blob/38764c26db511ea13538042f229e817562f02f74/object_store/src/local.rs#L1200
With the patch, such a file can be accessed, so the test fails here:
https://github.com/apache/arrow-rs/blob/38764c26db511ea13538042f229e817562f02f74/object_store/src/local.rs#L1211
The patch handles this kind of filename (as opposed to a "Path").
Could someone clarify how this kind of path should be treated, so that I can
either adjust the patch or change the test?
**Tests**
Below are some cases that show the behaviour.
The additional `seg = ...` log lines come from the function `path_to_filesystem`,
where I added a simple log for each segment:
https://github.com/apache/arrow-rs/blob/38764c26db511ea13538042f229e817562f02f74/object_store/src/local.rs#L228
Cases 1 and 2 fail, but case 3 succeeds.
***Case 1***
File present:
```
L%3ABC.parquet
```
Code:
```
let ctx = SessionContext::new();
let options = ParquetReadOptions::default();
let fs_name = "data/L%3ABC.parquet"; // this is as on FS
let r = ctx.read_parquet(fs_name, options).await.unwrap();
```
Result:
```
==================================================
seg = PathPart { raw: "Users" }
seg = PathPart { raw: "jc" }
seg = PathPart { raw: "src" }
seg = PathPart { raw: "rust" }
seg = PathPart { raw: "timeseries" }
seg = PathPart { raw: "data" }
seg = PathPart { raw: "L%253ABC.parquet" }
url = Url { scheme: "file", cannot_be_a_base: false, username: "", password:
None, host: None, port: None, path:
"/Users/jc/src/rust/timeseries/data/L%25253ABC.parquet", query: None, fragment:
None }
res = Ok("/Users/jc/src/rust/timeseries/data/L%253ABC.parquet")
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
ObjectStore(NotFound { path:
"/Users/jc/src/rust/timeseries/data/L%253ABC.parquet", source: Os { code: 2,
kind: NotFound, message: "No such file or directory" } })', src/main.rs:85:54
```
***Case 2***
File present:
```
L%253ABC.parquet
```
Code:
```
let ctx = SessionContext::new();
let options = ParquetReadOptions::default();
let fs_name = "data/L%3ABC.parquet"; // on the FS the file is named L%253ABC.parquet
let r = ctx.read_parquet(fs_name, options).await.unwrap();
```
Result:
```
==================================================
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
IoError(Os { code: 2, kind: NotFound, message: "No such file or directory" })',
src/main.rs:85:54
```
***Case 3***
Files present (yes I had to have the 2 files on disk to make this work):
```
L%253ABC.parquet
L%3ABC.parquet
```
Code:
```
let ctx = SessionContext::new();
let options = ParquetReadOptions::default();
let fs_name = "data/L%3ABC.parquet"; // this is as on FS
let r = ctx.read_parquet(fs_name, options).await.unwrap();
```
Result:
```
==================================================
seg = PathPart { raw: "Users" }
seg = PathPart { raw: "jc" }
seg = PathPart { raw: "src" }
seg = PathPart { raw: "rust" }
seg = PathPart { raw: "timeseries" }
seg = PathPart { raw: "data" }
seg = PathPart { raw: "L%253ABC.parquet" }
url = Url { scheme: "file", cannot_be_a_base: false, username: "", password:
None, host: None, port: None, path:
"/Users/jc/src/rust/timeseries/data/L%25253ABC.parquet", query: None, fragment:
None }
res = Ok("/Users/jc/src/rust/timeseries/data/L%253ABC.parquet")
seg = PathPart { raw: "Users" }
seg = PathPart { raw: "jc" }
seg = PathPart { raw: "src" }
seg = PathPart { raw: "rust" }
seg = PathPart { raw: "timeseries" }
seg = PathPart { raw: "data" }
seg = PathPart { raw: "L%253ABC.parquet" }
url = Url { scheme: "file", cannot_be_a_base: false, username: "", password:
None, host: None, port: None, path:
"/Users/jc/src/rust/timeseries/data/L%25253ABC.parquet", query: None, fragment:
None }
res = Ok("/Users/jc/src/rust/timeseries/data/L%253ABC.parquet")
seg = PathPart { raw: "Users" }
seg = PathPart { raw: "jc" }
seg = PathPart { raw: "src" }
seg = PathPart { raw: "rust" }
seg = PathPart { raw: "timeseries" }
seg = PathPart { raw: "data" }
seg = PathPart { raw: "L%253ABC.parquet" }
url = Url { scheme: "file", cannot_be_a_base: false, username: "", password:
None, host: None, port: None, path:
"/Users/jc/src/rust/timeseries/data/L%25253ABC.parquet", query: None, fragment:
None }
res = Ok("/Users/jc/src/rust/timeseries/data/L%253ABC.parquet")
```