greedAuguria opened a new pull request, #19651:
URL: https://github.com/apache/datafusion/pull/19651

   ## Which issue does this PR close?
   
   - Closes #19650.
   
   ## Rationale for this change
   
   Currently, when DataFusion parses Hive-style partitioned paths (e.g., 
`s3://bucket/table/city=San%20Francisco/`), it extracts the partition value 
literally as `San%20Francisco`. 
   
   Standard practice in tools like Apache Spark and Apache Hive is to 
URL-encode special characters in partition values when writing to object 
stores. This PR makes DataFusion decode these values (to `San Francisco`) 
during the listing process, so the partition column holds the decoded value 
rather than the encoded form, consistent with other engines.
   
   ## What changes are included in this PR?
   
   1.  **Logic Update**: Updated `parse_partitions_for_path` in 
`datafusion/catalog-listing/src/helpers.rs` to percent-decode partition values.
   2.  **Signature Change**: Changed the return type of 
`parse_partitions_for_path` from `Option<Vec<&str>>` to `Option<Vec<String>>` 
because decoded strings require new allocations.
   3.  **Call Site Updates**: Updated internal callers in 
`datafusion-catalog-listing` to accommodate the owned `String` return type.
   4.  **Dependencies**: Added `percent-encoding` crate to 
`datafusion-catalog-listing`.
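
As a rough illustration of items 1 and 2, the sketch below decodes `key=value` path segments into owned strings. It is not the code from this PR: `percent_decode_lossy` and `partition_values` are hypothetical names, the decoder is hand-rolled so the example has no dependencies (the PR uses the `percent-encoding` crate instead), and the lossy handling of invalid UTF-8 is an assumption.

```rust
/// Decode `%XX` escapes in `input`, copying malformed escapes through as-is.
/// Hand-rolled stand-in for the `percent-encoding` crate (assumption: the PR
/// calls into that crate rather than code like this).
fn percent_decode_lossy(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A valid escape needs '%' followed by two hex digits.
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        // Ordinary byte, or a '%' that does not start a valid escape.
        out.push(bytes[i]);
        i += 1;
    }
    // Assumption: invalid UTF-8 is replaced rather than treated as an error.
    String::from_utf8_lossy(&out).into_owned()
}

/// Hypothetical, simplified analogue of `parse_partitions_for_path`.
/// `relative_path` is the file path below the table root; `partition_cols`
/// lists the expected partition column names in order.
fn partition_values(relative_path: &str, partition_cols: &[&str]) -> Option<Vec<String>> {
    let mut segments = relative_path.split('/');
    let mut values = Vec::with_capacity(partition_cols.len());
    for col in partition_cols {
        let segment = segments.next()?;
        let (key, value) = segment.split_once('=')?;
        if key != *col {
            return None;
        }
        // The decoded text can differ from the bytes in the path, so it is
        // returned as an owned `String` instead of a borrowed `&str`.
        values.push(percent_decode_lossy(value));
    }
    Some(values)
}
```

With this sketch, `partition_values("city=San%20Francisco/part-0.parquet", &["city"])` yields `Some(vec!["San Francisco".to_string()])`, while a path whose first segment is not `city=...` yields `None`.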
   
   ## Are these changes tested?
   
   Yes, new unit tests were added to 
`datafusion/catalog-listing/src/helpers.rs` covering:
   - Standard URL-encoded characters (e.g., `%2F` for `/`).
   - Spaces encoded as `%20`.
   - Multi-byte UTF-8 characters.
   - Forgiving behavior for malformed encoding (matching `percent-encoding` 
crate defaults).
   
   ## Are there any user-facing changes?
   
   Yes:
   - **Data Behavior**: Partition values extracted from file paths will now be 
correctly decoded instead of remaining URL-encoded.
   - **API Change**: The public helper function `parse_partitions_for_path` now 
returns `Option<Vec<String>>` instead of `Option<Vec<&str>>`.
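
For downstream code written against the borrowed return type, one possible adaptation is to take borrowed views of the owned strings at the call site. The sketch below is illustrative only: `owned_partition_values` stands in for the new return shape, and `join_values` is a hypothetical helper that still expects `&[&str]`.

```rust
/// Stand-in for the new return shape, `Option<Vec<String>>`.
/// (Hypothetical; the real function takes a path and partition columns.)
fn owned_partition_values() -> Option<Vec<String>> {
    Some(vec!["San Francisco".to_string(), "2024".to_string()])
}

/// Hypothetical downstream helper written against the old borrowed shape.
fn join_values(values: &[&str]) -> String {
    values.join("|")
}

/// Adapt the owned values back to `&str` views without cloning them.
fn call_legacy_helper() -> Option<String> {
    let owned = owned_partition_values()?;
    let borrowed: Vec<&str> = owned.iter().map(String::as_str).collect();
    Some(join_values(&borrowed))
}
```

The borrowed views live only as long as `owned`, so callers that previously stored the `&str` values alongside the path would need to store the `String` values instead.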
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

