[PR] feat(reader): Adapt the HoodieFileGroupReader to read the native form… [hudi]

via GitHub Thu, 25 Jun 2026 21:03:40 -0700


cshuo opened a new pull request, #19072:
URL: https://github.com/apache/hudi/pull/19072


   …at log files
   
   ### Describe the issue this Pull Request addresses
   
   Hudi is introducing RFC-103 native-format log files (e.g. `.log.parquet`, 
`.deletes.parquet`), where the log file is itself a native columnar file 
carrying records plus footer metadata, rather than legacy inline log blocks. 
For compatibility and migration, the existing `HoodieFileGroupReader` read path 
must be able to consume both legacy inline logs and the new native logs within 
the same file slice — automatically and with no user-facing config or API 
change.
   
   This PR adapts the log-reading stack so native data and delete log files are 
detected and read transparently through the existing block-processing and merge 
machinery.
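
As a rough illustration of the detection step, the sketch below classifies a log file by its name. It is not the PR's code: the class `NativeLogKindSketch`, its `Kind` enum, and the assumption that a plain suffix check is sufficient are all hypothetical. Only the `.log.parquet` and `.deletes.parquet` suffixes come from the description above.

```java
/** Hypothetical classifier; NOT the PR's actual detection logic. */
final class NativeLogKindSketch {
  enum Kind { NATIVE_DATA, NATIVE_DELETE, LEGACY }

  // Assumption: native logs can be told apart from legacy inline logs
  // purely by file-name suffix. Real naming rules may be stricter.
  static Kind classify(String fileName) {
    if (fileName.endsWith(".deletes.parquet")) {
      return Kind.NATIVE_DELETE;
    }
    if (fileName.endsWith(".log.parquet")) {
      return Kind.NATIVE_DATA;
    }
    // Anything else is treated as a legacy inline-block log file.
    return Kind.LEGACY;
  }
}
```

With this shape, a reader can inspect every log file in a file slice and choose the data, delete, or legacy path per file, which is what allows a mixed slice to be read without any user-facing switch.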
   
   Fixes https://github.com/apache/hudi/issues/19057.
   
   ### Summary and Changelog
   
   - Added a synthetic reader-side block type `NATIVE_FILE_DATA_BLOCK` in 
`HoodieLogBlock`, routed through the existing `HoodieDataBlock` path in 
`BaseHoodieLogRecordReader`.
   - Introduced `HoodieNativeLogFileReader`, which treats each native log file 
as a single block, recovers block-header metadata from the native file footer 
(`hoodie.log.format.metadata`) with a fallback to individually-stored entries, 
and dispatches to data vs. delete blocks (CDC explicitly unsupported for now).
   - Added `HoodieNativeFileDataBlock` and `HoodieNativeDeleteBlock`, which 
read records directly from the native file via 
`HoodieReaderContext`/`HoodieIOFactory` instead of inline block content; 
write/serialize paths throw `UnsupportedOperationException` (read-only).
   - Extended `HoodieLogFormatReader` with a `createReader()` factory that 
detects native files (`FSUtils.matchNativeLogFile`) and falls back to 
`HoodieLogFileReader` for legacy files; the legacy constructor passes a `null` 
context and fails fast with `HoodieNotSupportedException` when it encounters a 
native file without a file group reader context.
   - Added `RecordContext.getValueAsJava` / `getOrderingValueAsJava` (the 
default returns the native Java value) with engine overrides for Flink 
(`FlinkRecordContext`), Spark (`BaseSparkInternalRecordContext`), and Hive 
(`HiveRecordContext`) to keep `DeleteRecord` ordering values in their native 
Java types and avoid double conversion at merge time.
   - Updated `AbstractTableFileSystemView` real-time file filtering to 
recognize native log files.
   - Added unit tests (`TestHoodieNativeLogFileReader`, 
`TestFlinkRecordContext`, `TestAvroRecordContext`, and a file system view 
test) and an end-to-end `TestHoodieFileGroupReaderNativeLogs` exercising a 
mixed legacy + native file slice, plus supporting test harness utilities.
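
The factory behaviour described in the changelog (native detection, legacy fallback, fail-fast on a missing context) can be sketched roughly as follows. This is a minimal stand-alone model, not the PR's implementation: `LogReaderFactorySketch`, `LogReader`, and `ReaderContext` are hypothetical names, the detection predicate is passed in rather than calling `FSUtils.matchNativeLogFile`, and `UnsupportedOperationException` stands in for `HoodieNotSupportedException`.

```java
import java.util.function.Predicate;

/** Hypothetical model of the createReader() dispatch; not Hudi's real classes. */
final class LogReaderFactorySketch {
  /** Stand-in for a log reader; describe() only reports which path was chosen. */
  interface LogReader {
    String describe();
  }

  /** Stand-in for the file group reader context; null models the legacy constructor. */
  static final class ReaderContext { }

  static LogReader createReader(String logFileName,
                                ReaderContext context,
                                Predicate<String> isNativeLogFile) {
    if (!isNativeLogFile.test(logFileName)) {
      // Legacy inline-block log: no reader context is needed.
      return () -> "legacy:" + logFileName;
    }
    if (context == null) {
      // Fail fast: a native file cannot be read without a reader context.
      throw new UnsupportedOperationException(
          "Native log file " + logFileName + " requires a file group reader context");
    }
    // Native file: the whole file is treated as a single block.
    return () -> "native:" + logFileName;
  }
}
```

Checking the file kind before the context means legacy callers that never see a native file keep working with a `null` context, which matches the "legacy read paths are unaffected" claim in this description.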
   
   ### Impact
   
   - **Functional impact**: The file group reader can now read native-format 
data and delete log files alongside legacy inline logs within the same file 
slice. No config or API changes; the adaptation is internal and automatic. 
Native CDC log reading is intentionally not yet supported and throws a clear 
exception.
   - **Maintainability**: Reuses the existing `HoodieDataBlock`/delete-block 
processing and merge paths via a synthetic block type, keeping the 
compaction/merge machinery untouched. Adds a small, well-scoped 
`getValueAsJava` extension point on `RecordContext`.
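
To illustrate the kind of extension point `getValueAsJava` provides, here is a minimal sketch under stated assumptions. The method signature, the `BaseContext`/`EngineContext` classes, and the `EngineString` type are all hypothetical stand-ins; the real `RecordContext` API and the engine-specific types it converts may differ.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical model of a default-plus-override conversion hook. */
final class RecordContextSketch {
  /** Stand-in for an engine-internal string representation (UTF-8 bytes). */
  static final class EngineString {
    final byte[] utf8;

    EngineString(String s) {
      this.utf8 = s.getBytes(StandardCharsets.UTF_8);
    }
  }

  /** Default behaviour: the value is assumed to already be a plain Java object. */
  static class BaseContext {
    Object getValueAsJava(Object value) {
      return value;
    }
  }

  /** Engine override: unwrap the engine-internal type into a java.lang.String. */
  static final class EngineContext extends BaseContext {
    @Override
    Object getValueAsJava(Object value) {
      if (value instanceof EngineString) {
        return new String(((EngineString) value).utf8, StandardCharsets.UTF_8);
      }
      return super.getValueAsJava(value);
    }
  }
}
```

Converting once, when the delete record is read, leaves the merge step with plain Java values to compare, which is the double-conversion saving the changelog refers to.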
   
   ### Risk Level
   
   Low. The change is additive and gated on native-file detection, so legacy 
read paths are unaffected and fail-fast guards cover unsupported contexts. 
Mitigation: unit tests cover footer-header parsing/precedence/validation and 
per-engine `getValueAsJava` conversions, and an end-to-end test validates 
reading a mixed legacy + native parquet data/delete file slice through the 
`HoodieFileGroupReader`.
   
   ### Documentation Update
   
   None. This is an internal read-path adaptation with no new user-facing 
config, API, or storage-format change introduced by this PR (the native log 
format itself is covered by RFC-103).
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable


