cshuo opened a new pull request, #19072: URL: https://github.com/apache/hudi/pull/19072
…at log files

### Describe the issue this Pull Request addresses

Hudi is introducing RFC-103 native-format log files (e.g. `.log.parquet`, `.deletes.parquet`), where the log file is itself a native columnar file carrying records plus footer metadata, rather than legacy inline log blocks. For compatibility and migration, the existing `HoodieFileGroupReader` read path must be able to consume both legacy inline logs and the new native logs within the same file slice, automatically and with no user-facing config or API change. This PR adapts the log-reading stack so native data and delete log files are detected and read transparently through the existing block-processing and merge machinery.

Fixes https://github.com/apache/hudi/issues/19057.

### Summary and Changelog

- Added a synthetic reader-side block type `NATIVE_FILE_DATA_BLOCK` in `HoodieLogBlock`, routed through the existing `HoodieDataBlock` path in `BaseHoodieLogRecordReader`.
- Introduced `HoodieNativeLogFileReader`, which treats each native log file as a single block, recovers block-header metadata from the native file footer (`hoodie.log.format.metadata`) with a fallback to individually-stored entries, and dispatches to data vs. delete blocks (CDC explicitly unsupported for now).
- Added `HoodieNativeFileDataBlock` and `HoodieNativeDeleteBlock`, which read records directly from the native file via `HoodieReaderContext`/`HoodieIOFactory` instead of inline block content; write/serialize paths throw `UnsupportedOperationException` (read-only).
- Extended `HoodieLogFormatReader` with a `createReader()` factory that detects native files (`FSUtils.matchNativeLogFile`) and falls back to `HoodieLogFileReader` for legacy files; the legacy constructor passes a `null` context and fails fast with `HoodieNotSupportedException` when a native file is encountered without an FG reader context.
- Added `RecordContext.getValueAsJava` / `getOrderingValueAsJava` (default returns native Java value) with engine overrides for Flink (`FlinkRecordContext`), Spark (`BaseSparkInternalRecordContext`), and Hive (`HiveRecordContext`) to keep `DeleteRecord` ordering values in native Java type and avoid double-conversion at merge time.
- Updated `AbstractTableFileSystemView` real-time file filtering to recognize native log files.
- Added unit tests (`TestHoodieNativeLogFileReader`, `TestFlinkRecordContext`, `TestAvroRecordContext`, FSV test) and an end-to-end `TestHoodieFileGroupReaderNativeLogs` exercising a mixed legacy + native file slice, plus supporting test harness utilities.

### Impact

- **Functional impact**: The FG reader can now read native-format data and delete log files alongside legacy inline logs within the same file slice. No config or API changes; the adaptation is internal and automatic. Native CDC log reading is intentionally not yet supported and throws a clear exception.
- **Maintainability**: Reuses the existing `HoodieDataBlock`/delete-block processing and merge paths via a synthetic block type, keeping the compaction/merge machinery untouched. Adds a small, well-scoped `getValueAsJava` extension point on `RecordContext`.

### Risk Level

Low. The change is additive and gated on native-file detection, so legacy read paths are unaffected and fail-fast guards cover unsupported contexts.

Mitigation: unit tests cover footer-header parsing/precedence/validation and per-engine `getValueAsJava` conversions, and an end-to-end test validates reading a mixed legacy + native parquet data/delete file slice through the `HoodieFileGroupReader`.

### Documentation Update

None. This is an internal read-path adaptation with no new user-facing config, API, or storage-format change introduced by this PR (the native log format itself is covered by RFC-103).
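For reviewers, the reader selection described in the changelog (native-file detection, legacy fallback, and the fail-fast guard when no reader context is available) can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: the class `NativeLogDispatchSketch`, the `ReaderKind` enum, and both methods are hypothetical names; matching on file-name suffix alone and choosing data vs. delete by suffix are assumptions; and `UnsupportedOperationException` stands in for `HoodieNotSupportedException`.

```java
final class NativeLogDispatchSketch {

    /** Hypothetical label for the reader a log file would be routed to. */
    public enum ReaderKind { LEGACY_INLINE, NATIVE_DATA, NATIVE_DELETE }

    // Suffixes taken from the examples in the PR description; detecting
    // native files by suffix alone is an assumption made for this sketch.
    private static final String NATIVE_DATA_SUFFIX = ".log.parquet";
    private static final String NATIVE_DELETE_SUFFIX = ".deletes.parquet";

    /** Returns true when the file name looks like a native-format log file. */
    public static boolean isNativeLogFile(String fileName) {
        return fileName.endsWith(NATIVE_DATA_SUFFIX)
            || fileName.endsWith(NATIVE_DELETE_SUFFIX);
    }

    /**
     * Picks a reader kind for one log file. readerContext is a placeholder
     * for the engine reader context; null models the legacy constructor path.
     */
    public static ReaderKind chooseReader(String fileName, Object readerContext) {
        if (!isNativeLogFile(fileName)) {
            // Legacy inline logs never need the reader context.
            return ReaderKind.LEGACY_INLINE;
        }
        if (readerContext == null) {
            // Fail fast instead of misreading a native file as inline blocks.
            throw new UnsupportedOperationException(
                "Native log file " + fileName + " requires a reader context");
        }
        return fileName.endsWith(NATIVE_DELETE_SUFFIX)
            ? ReaderKind.NATIVE_DELETE
            : ReaderKind.NATIVE_DATA;
    }

    public static void main(String[] args) {
        Object ctx = new Object();
        // File names here are illustrative only.
        System.out.println(chooseReader("example.log.1", null));
        System.out.println(chooseReader("example.log.parquet", ctx));
        System.out.println(chooseReader("example.deletes.parquet", ctx));
    }
}
```

The point of the sketch is the ordering of the checks: detection happens first, so a legacy file with a `null` context still takes the unchanged legacy path, and only a native file without a context triggers the exception.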
### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
