greedAuguria opened a new pull request, #19651: URL: https://github.com/apache/datafusion/pull/19651

## Which issue does this PR close?

- Closes #19650.

## Rationale for this change

Currently, when DataFusion parses Hive-style partitioned paths (e.g., `s3://bucket/table/city=San%20Francisco/`), it extracts the partition value literally as `San%20Francisco`. Tools such as Apache Spark and Apache Hive URL-encode special characters in partition values when writing to object stores. This PR makes DataFusion decode these values (to `San Francisco`) during listing, which prevents data mismatches and keeps behavior consistent with other engines.

## What changes are included in this PR?

1. **Logic Update**: Updated `parse_partitions_for_path` in `datafusion/catalog-listing/src/helpers.rs` to percent-decode partition values.
2. **Signature Change**: Changed the return type of `parse_partitions_for_path` from `Option<Vec<&str>>` to `Option<Vec<String>>` because decoded strings require new allocations.
3. **Call Site Updates**: Updated internal callers in `datafusion-catalog-listing` to accommodate the owned `String` return type.
4. **Dependencies**: Added the `percent-encoding` crate to `datafusion-catalog-listing`.

## Are these changes tested?

Yes, new unit tests were added to `datafusion/catalog-listing/src/helpers.rs` covering:

- Standard URL-encoded characters (e.g., `%2F` for `/`).
- Spaces encoded as `%20`.
- Multi-byte UTF-8 characters.
- Forgiving behavior for malformed encoding (matching `percent-encoding` crate defaults).

## Are there any user-facing changes?

Yes:

- **Data Behavior**: Partition values extracted from file paths are now percent-decoded instead of remaining URL-encoded.
- **API Change**: The public helper function `parse_partitions_for_path` now returns `Option<Vec<String>>` instead of `Option<Vec<&str>>`.
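
To illustrate the decoding behavior described above, here is a minimal, std-only sketch. It is not the code from this PR: the PR uses the `percent-encoding` crate inside `parse_partitions_for_path`, whereas the names `percent_decode_lossy`, `hex_val`, and `parse_partition_values` below are hypothetical and the path handling is heavily simplified. The sketch assumes malformed escapes are passed through unchanged and that the path is split on `/` before decoding, so an encoded `%2F` inside a value is not treated as a path separator.

```rust
/// Hypothetical helper: percent-decode one partition value.
/// Malformed escapes (e.g. "%zz" or a trailing "%") are copied through unchanged.
fn percent_decode_lossy(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A valid escape is '%' followed by exactly two hex digits.
        if bytes[i] == b'%' && i + 3 <= bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push((hi << 4) | lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Decoded bytes may not be valid UTF-8; invalid sequences become U+FFFD.
    String::from_utf8_lossy(&out).into_owned()
}

/// Value of a single ASCII hex digit, or None if `b` is not a hex digit.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical, simplified stand-in for a Hive-style partition parser.
/// Returns the decoded values for `cols` in order, or None if the path
/// does not start with `table_prefix` or the `name=value` segments do not match.
fn parse_partition_values(
    table_prefix: &str,
    file_path: &str,
    cols: &[&str],
) -> Option<Vec<String>> {
    let rest = file_path.strip_prefix(table_prefix)?;
    // Split on '/' first, then decode each value, so "%2F" stays inside a value.
    let mut segments = rest.split('/').filter(|s| !s.is_empty());
    let mut values = Vec::with_capacity(cols.len());
    for col in cols {
        let segment = segments.next()?;
        let (name, value) = segment.split_once('=')?;
        if name != *col {
            return None;
        }
        // Decoding allocates, which is why the result holds owned Strings.
        values.push(percent_decode_lossy(value));
    }
    Some(values)
}
```

Under these assumptions, calling `parse_partition_values("table/", "table/city=San%20Francisco/year=2024/part-0.parquet", &["city", "year"])` yields the values `San Francisco` and `2024`. The owned `String` results reflect the same constraint the PR cites for its signature change: a decoded value can differ from the bytes in the path, so it cannot always be returned as a borrowed `&str`.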
