[PR] feat: Use PartitionValueExtractor interface in Spark reader path [hudi]

via GitHub Tue, 13 Jan 2026 13:20:45 -0800


suryaprasanna opened a new pull request, #17850:
URL: https://github.com/apache/hudi/pull/17850


   ### Describe the issue this Pull Request addresses
   
   This PR enables the use of custom `PartitionValueExtractor` implementations 
when reading Hudi tables in Spark, allowing users to define custom logic for 
extracting partition values from partition paths.
   Previously, the `PartitionValueExtractor` interface was used only during 
write/sync operations, not during read operations.
   
   ### Summary and Changelog
   
   Users can now configure custom partition value extractors for read 
operations using the `hoodie.datasource.read.partition.value.extractor.class` 
option, enabling support for non-standard partition path formats.
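
As a minimal sketch of how the new option might be supplied, the snippet below builds the reader options map in plain Java. Only the option key comes from this PR description; the helper class `ReadOptionsSketch` and the extractor class name `com.example.MyExtractor` are hypothetical placeholders, and the Spark call is shown only as a comment because it needs a live `SparkSession`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; it is not part of Hudi.
final class ReadOptionsSketch {
    // Option key as named in this PR description.
    static final String EXTRACTOR_KEY =
        "hoodie.datasource.read.partition.value.extractor.class";

    // Builds the options map that selects a custom extractor for reads.
    static Map<String, String> readOptions(String extractorClassName) {
        Map<String, String> opts = new HashMap<>();
        opts.put(EXTRACTOR_KEY, extractorClassName);
        return opts;
    }

    // With a SparkSession `spark` and a table path `basePath` in scope,
    // the map would typically be passed to the reader like this:
    //   spark.read().format("hudi")
    //        .options(readOptions("com.example.MyExtractor"))
    //        .load(basePath);
}
```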
   
     **Changes:**
     - Moved `PartitionValueExtractor` interface from `hudi-sync-common` to 
`hudi-common` for broader accessibility
     - Added `PARTITION_VALUE_EXTRACTOR_CLASS` config to `HoodieTableConfig` 
and `DataSourceOptions`
     - Updated `HoodieSparkUtils.createPartitionSchema()` to use 
`PartitionValueExtractor` for partition value extraction
     - Updated `HoodieFileIndex` and related classes to support custom 
partition value extractors
     - Added test implementation `TestCustomSlashPartitionValueExtractor` 
demonstrating date formatting from slash-separated paths (yyyy/MM/dd → 
yyyy-MM-dd)
     - Updated all existing `PartitionValueExtractor` implementations to use 
the relocated interface
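
To illustrate the kind of extractor the test implementation above describes, here is a minimal, self-contained sketch. The local `PartitionValueExtractor` interface is a stand-in modeled on Hudi's single-method interface, declared here only so the example compiles without Hudi on the classpath; `SlashToDashDateExtractor` is a hypothetical name and is not the class added in this PR.

```java
import java.util.Collections;
import java.util.List;

// Stand-in for Hudi's interface (assumption: one method that maps a
// partition path to its list of partition values).
interface PartitionValueExtractor {
    List<String> extractPartitionValuesInPath(String partitionPath);
}

// Hypothetical extractor: turns a "yyyy/MM/dd" path into one
// "yyyy-MM-dd" partition value.
final class SlashToDashDateExtractor implements PartitionValueExtractor {
    @Override
    public List<String> extractPartitionValuesInPath(String partitionPath) {
        String[] parts = partitionPath.split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected a yyyy/MM/dd partition path but got: " + partitionPath);
        }
        // Join year, month, and day with dashes to form a single date value.
        return Collections.singletonList(String.join("-", parts));
    }
}
```

Under these assumptions, the path `2026/01/13` would yield the single partition value `2026-01-13` rather than three separate values.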
   
   ### Impact
   
   **Public API Changes:**
    - New config option: 
`hoodie.datasource.read.partition.value.extractor.class` for Spark reads
    - `PartitionValueExtractor` interface relocated from 
`org.apache.hudi.sync.common.model` to `org.apache.hudi.hive.sync`
   
   **User-Facing Changes:**
   Users can now customize partition value extraction during read operations by 
providing a custom `PartitionValueExtractor` implementation.
   
   ### Risk Level
   
   **Low** - This is an additive change that maintains backward compatibility. 
When no custom extractor is specified, the default behavior remains unchanged 
(standard slash-based splitting). The change has been tested with custom 
partition value extractor implementations.
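
The fallback described above can be sketched as follows. This is not the code in this PR: the resolver class, the default extractor, and the stand-in interface are all hypothetical, and reflective loading by class name is an assumption about how a configured class might be instantiated.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for Hudi's interface, declared locally for illustration.
interface PartitionValueExtractor {
    List<String> extractPartitionValuesInPath(String partitionPath);
}

// Hypothetical default: plain slash-based splitting of the partition path.
final class DefaultSlashExtractor implements PartitionValueExtractor {
    @Override
    public List<String> extractPartitionValuesInPath(String partitionPath) {
        return Arrays.asList(partitionPath.split("/"));
    }
}

// Hypothetical resolver: uses the default when no class is configured,
// otherwise instantiates the configured class via its no-arg constructor.
final class ExtractorResolver {
    static PartitionValueExtractor resolve(String className) {
        if (className == null || className.isEmpty()) {
            return new DefaultSlashExtractor();
        }
        try {
            return (PartitionValueExtractor) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot instantiate extractor class: " + className, e);
        }
    }
}
```

Keeping the default path separate from the reflective path means that a table read without the option never touches class loading, which is one way an additive change like this can stay backward compatible.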
   
   ### Documentation Update
   
     Documentation should be updated to include:
     - New config option 
`hoodie.datasource.read.partition.value.extractor.class` in the configuration 
reference (WIP)
     - Example usage of custom `PartitionValueExtractor` implementations for 
read operations (WIP)
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

