schenksj opened a new issue, #4546: URL: https://github.com/apache/datafusion-comet/issues/4546
## Describe the bug

Comet's native scan (`native_datafusion`) fails to read local-filesystem files whose path contains a literal `%` (and, in combination, spaces). The read fails with:

```
SparkException: Parquet error: External: Object at location /.../s p a r k %2a-<uuid>/_change_data/test%file%prefix-cdc-....snappy.parquet not found: No such file or directory (os error 2)
```

even though the file exists at exactly that location on disk.

## To Reproduce

Read a Parquet file from a local path that contains a literal `%` (e.g. a directory named `s p a r k %2a` or a file named `test%file%prefix-...`) with the native scan enabled.

Surfaced by the Delta test suite under Comet: Delta's `DeletionVectorsTestUtils` forces temp dirs to the prefix `"s p a r k %2a"` to stress-test path handling, which makes `DeltaVacuumSuite`'s `vacuum for cdc - update/merge` and `vacuum for cdc - delete tombstones` fail (the `_change_data` files can't be read).

## Root cause

`native/core/src/execution/planner.rs` builds each `PartitionedFile`'s object-store location from the Spark-supplied `file://` URL like this:

```rust
// Spark sends the path over as URL-encoded, parse that first.
let url = Url::parse(file.file_path.as_ref())?;
let path = Path::from_url_path(url.path())?;
partitioned_file.object_meta.location = path;
```

`Path::from_url_path` percent-decodes the URL path. For a **local** file whose real name contains a literal `%`, the decode + `LocalFileSystem` round-trip does not reproduce the on-disk name, so the file is reported missing. (The same `Url::parse` + `Path::from_url_path` pattern is used in `parquet_support.rs::prepare_object_store_with_configs`.)

For local files the correct object_store API is `Path::from_absolute_path` (applied to the decoded OS path), which is round-trip-safe with `LocalFileSystem`. Remote stores (S3/GCS/Azure/HDFS) must keep the existing URL-path handling.
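
To illustrate why a literal `%` is fragile here, the following is a minimal, std-only sketch. It does not reproduce object_store's code: `percent_encode` and `percent_decode` are hypothetical toy helpers, the `/tmp/...` path is made up, and reading the failing round trip as "one decode more than there were encodes" is an assumption based on the description above, not something the issue confirms.

```rust
/// Map an ASCII hex digit to its value (toy helper, hypothetical).
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Percent-encode every byte except unreserved characters and `/`.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for &b in input.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' | b'/' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Percent-decode `%XX` sequences; anything that is not a valid
/// `%XX` sequence is copied through unchanged (a simplification).
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

With an on-disk name of `/tmp/s p a r k %2a/test%file%prefix.parquet`, encoding once and decoding once returns the original name, but a second decode turns the literal `%2a` into `*`, which names a different (nonexistent) directory. Any pipeline that decodes more often than it encodes will hit a "not found" error of this kind.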

## Expected behavior

The native scan should read local files whose paths contain `%`, spaces, and other special characters, matching Spark's own reader and `LocalFileSystem` semantics.

## Additional context

- object_store 0.13.2.
- Affects local (`file://`) reads only; remote object stores use URL paths and are unaffected.
- A fix is in progress: branch on the `file` scheme and use `url.to_file_path()` + `Path::from_absolute_path` for local files, keeping `from_url_path` for remote schemes. Will include a Rust unit test (object_store round-trip for a `%`/space path) and a Spark-level red/green test (reading a Parquet file at such a path).
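
The scheme branch described in the fix could look roughly like the std-only sketch below. It only shows the decision, not the actual patch: `PathStrategy` and `choose_strategy` are hypothetical names, and the real code would presumably branch on a parsed `Url` rather than on the raw string.

```rust
/// Which object_store path constructor a location should go through
/// (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq)]
enum PathStrategy {
    /// `file` scheme: decode to an OS path, then use `Path::from_absolute_path`.
    LocalAbsolutePath,
    /// Any other scheme (S3/GCS/Azure/HDFS): keep using `Path::from_url_path`.
    RemoteUrlPath,
}

/// Pick a strategy from the scheme of a raw URL string.
/// Returns `None` when the input has no syntactically valid scheme.
fn choose_strategy(raw_url: &str) -> Option<PathStrategy> {
    let (scheme, _rest) = raw_url.split_once(':')?;

    // RFC 3986: a scheme starts with a letter, followed by letters,
    // digits, `+`, `-`, or `.`.
    let mut chars = scheme.chars();
    let first = chars.next()?;
    if !first.is_ascii_alphabetic() {
        return None;
    }
    if !chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.')) {
        return None;
    }

    if scheme.eq_ignore_ascii_case("file") {
        Some(PathStrategy::LocalAbsolutePath)
    } else {
        Some(PathStrategy::RemoteUrlPath)
    }
}
```

For example, `choose_strategy("file:///tmp/a.parquet")` selects the local branch, while `choose_strategy("s3://bucket/key")` keeps the URL-path handling. Keeping the decision in one place means the two call sites named in the root cause could share it.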
