schenksj opened a new issue, #4546:
URL: https://github.com/apache/datafusion-comet/issues/4546

   ## Describe the bug
   
   Comet's native scan (`native_datafusion`) fails to read local-filesystem 
files whose path contains a literal `%` (and, in combination, spaces). The read 
fails with:
   
   ```
   SparkException: Parquet error: External: Object at location
   /.../s p a r k %2a-<uuid>/_change_data/test%file%prefix-cdc-....snappy.parquet
   not found: No such file or directory (os error 2)
   ```
   
   even though the file exists at exactly that location on disk.
   
   ## To Reproduce
   
   Read a Parquet file from a local path that contains a literal `%` (e.g. a 
directory named `s p a r k %2a` or a file named `test%file%prefix-...`) with 
the native scan enabled.
   
   Surfaced by the Delta test suite under Comet: Delta's 
`DeletionVectorsTestUtils` forces temp dirs to the prefix `"s p a r k %2a"` to 
stress-test path handling, which makes `DeltaVacuumSuite`'s 
`vacuum for cdc - update/merge` and `vacuum for cdc - delete tombstones` fail 
(the `_change_data` files can't be read).
   
   ## Root cause
   
   `native/core/src/execution/planner.rs` builds each `PartitionedFile`'s 
object-store location from the Spark-supplied `file://` URL like this:
   
   ```rust
   // Spark sends the path over as URL-encoded, parse that first.
   let url = Url::parse(file.file_path.as_ref())?;
   let path = Path::from_url_path(url.path())?;
   partitioned_file.object_meta.location = path;
   ```
   
   `Path::from_url_path` percent-decodes the URL path. For a **local** file 
whose real name contains a literal `%`, the decode + `LocalFileSystem` 
round-trip does not reproduce the on-disk name, so the file is reported 
missing. (The same `Url::parse` + `Path::from_url_path` pattern is used in 
`parquet_support.rs::prepare_object_store_with_configs`.)
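
To illustrate why a stray percent-decode is dangerous for such names, here is a minimal, standard-library-only sketch. The helpers `percent_decode` and `percent_encode` are hypothetical and hand-rolled for this example; they are not the `object_store` or `url` implementations, and the directory and file names are shortened stand-ins for the ones in the error above. The sketch assumes a name reaches a decoder while its literal `%` is still unescaped; this issue does not confirm that this is the exact step that fails in Comet. It only shows that decoding is lossy unless it is paired with exactly one encode.

```rust
/// Hypothetical helper: percent-decode a string roughly the way a URL path
/// decoder would. A `%` that is not followed by two hex digits is kept as-is.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Hypothetical helper: percent-encode only `%` and space, which is enough
/// for this demonstration (a real encoder escapes many more characters).
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for c in input.chars() {
        match c {
            '%' => out.push_str("%25"),
            ' ' => out.push_str("%20"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    // Stand-in for an on-disk name with literal `%` characters and spaces.
    let on_disk = "s p a r k %2a-dir/test%file%prefix.parquet";

    // Decoding the raw name corrupts it: `%2a` is a valid escape for `*`.
    let decoded_raw = percent_decode(on_disk);
    assert_eq!(decoded_raw, "s p a r k *-dir/test%file%prefix.parquet");

    // Encoding exactly once and decoding exactly once reproduces the name.
    let round_trip = percent_decode(&percent_encode(on_disk));
    assert_eq!(round_trip, on_disk);

    println!("raw decode: {decoded_raw}");
    println!("round trip: {round_trip}");
}
```

Note that `%fi` and `%pr` survive the raw decode only because they are not valid hex escapes, while `%2a` is silently rewritten. This is presumably why the Delta test prefix uses `%2a`.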
   
   For local files the correct object_store API is `Path::from_absolute_path` 
(applied to the decoded OS path), which is round-trip-safe with 
`LocalFileSystem`. Remote stores (S3/GCS/Azure/HDFS) must keep the existing 
URL-path handling.
   
   ## Expected behavior
   
   The native scan should read local files whose paths contain `%`, spaces, and 
other special characters — matching Spark's own reader and `LocalFileSystem` 
semantics.
   
   ## Additional context
   
   - object_store 0.13.2.
   - Affects local (`file://`) reads only; remote object stores use URL paths 
and are unaffected.
   - A fix is in progress: branch on the `file` scheme and use 
`url.to_file_path()` + `Path::from_absolute_path` for local files, keeping 
`from_url_path` for remote schemes. Will include a Rust unit test (object_store 
round-trip for a `%`/space path) and a Spark-level red/green test (reading a 
Parquet file at such a path).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

