danny0405 opened a new pull request, #18987: URL: https://github.com/apache/hudi/pull/18987
### Describe the issue this Pull Request addresses

RFC-103 introduces an LSM tree file-group layout where base and log files are sorted by record key and merged with a streaming k-way merge. The reader side needs a dedicated implementation for that layout without changing the existing `HoodieFileGroupReader` path.

The design also uses native parquet log files instead of Avro log files with embedded parquet data blocks. Native data logs use `<fileId>_<writeToken>_<instant>_<version>.parquet`, and native delete logs use `<fileId>_<writeToken>_<instant>_<version>.delete.parquet`, so common file-name parsing and file-system view classification need to recognize those files correctly.

### Summary and Changelog

Adds a separate LSM file-group reader for native parquet log files and updates common log-file parsing to recognize RFC-style native parquet data/delete logs.

#### Commit 1: feat:(DNM) add a lsm-tree based FG reader (`f0b63593dedd`)

- Added `HoodieLsmFileGroupReader` as a separate reader entry point instead of modifying `HoodieFileGroupReader`.
- Added `LsmFileGroupRecordIterator` to perform streaming sorted k-way merge over one active record per base/log file.
- Implemented the k-way merge with a loser-tree state machine, deterministic same-key ordering, and existing `BufferedRecordMerger` semantics.
- Preserved existing tie behavior for equal ordering values by processing sources in merge order: base file first, then log files ordered by instant/version/write token/suffix, so later log records win when ordering values are equal.
- Read native parquet data logs directly through `HoodieReaderContext` and added reader-side handling for native delete parquet logs with the fixed delete schema.
- Added native parquet log parsing in `FSUtils` and `HoodieLogFile`, including data log and `.delete.parquet` delete log names.
- Updated `AbstractTableFileSystemView` so native parquet log files are classified as log files and excluded from base-file discovery.
- Added `TestHoodieLogFile` coverage for native parquet data/delete log parsing and helper extraction.

### Impact

This adds a new reader implementation for LSM file groups without changing the existing `HoodieFileGroupReader` behavior. It affects common file-name parsing and file-system view classification for native parquet log files, enabling readers to distinguish native log v2 files from regular parquet base files.

No writer path, table config default, or existing Avro log reader behavior is changed. The main compatibility impact is that RFC-style native parquet log files are now recognized as Hudi log files by common utilities.

### Risk Level

medium

The change touches common file parsing and file-system view classification, which are core read-path utilities. The new LSM reader also implements merge ordering semantics that must stay consistent with existing file-group merge behavior.

Risk is mitigated by keeping the LSM reader separate from `HoodieFileGroupReader`, preserving existing merge APIs, and validating with:

- `mvn -pl hudi-common -DskipTests compile`
- `mvn -pl hudi-common -DskipITs -Dtest=TestHoodieLogFile test`

### Documentation Update

none

This PR adds reader implementation and native log-file recognition but does not introduce a new user-facing config, default behavior change, or public documentation surface in this repo. The behavior follows the RFC-103 design.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
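
For readers following the merge-order bullet in the changelog, the tie rule can be illustrated with a minimal sketch. This is not the PR's code: it swaps the loser tree for a plain `java.util.PriorityQueue`, replaces `BufferedRecordMerger` with a bare "higher ordering value wins, later source wins ties" rule, and ignores deletes. The names `SortedMergeSketch`, `Rec`, `Head`, and `demoValues` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; not Hudi's LsmFileGroupRecordIterator.
final class SortedMergeSketch {
  static final class Rec {
    final String key;
    final long ordering;
    final String value;
    Rec(String key, long ordering, String value) {
      this.key = key;
      this.ordering = ordering;
      this.value = value;
    }
  }

  // One active record per source, plus the source's position in merge order
  // (0 = base file, 1..n = log files, oldest to newest).
  private static final class Head {
    final Rec rec;
    final int source;
    final Iterator<Rec> rest;
    Head(Rec rec, int source, Iterator<Rec> rest) {
      this.rec = rec;
      this.source = source;
      this.rest = rest;
    }
  }

  // Each input list must already be sorted by key.
  static List<Rec> merge(List<List<Rec>> sources) {
    // Order by key, then by source position, so equal keys surface in merge order.
    PriorityQueue<Head> heap = new PriorityQueue<>(
        Comparator.comparing((Head h) -> h.rec.key).thenComparingInt(h -> h.source));
    for (int i = 0; i < sources.size(); i++) {
      Iterator<Rec> it = sources.get(i).iterator();
      if (it.hasNext()) {
        heap.add(new Head(it.next(), i, it));
      }
    }
    List<Rec> out = new ArrayList<>();
    Rec current = null;
    while (!heap.isEmpty()) {
      Head h = heap.poll();
      if (current != null && !current.key.equals(h.rec.key)) {
        out.add(current); // key changed: the previous key is fully merged
        current = null;
      }
      // ">=" lets the later source win when ordering values are equal.
      if (current == null || h.rec.ordering >= current.ordering) {
        current = h.rec;
      }
      if (h.rest.hasNext()) {
        heap.add(new Head(h.rest.next(), h.source, h.rest));
      }
    }
    if (current != null) {
      out.add(current);
    }
    return out;
  }

  static List<String> demoValues() {
    List<Rec> base = Arrays.asList(new Rec("a", 1, "base-a"), new Rec("b", 5, "base-b"));
    List<Rec> log1 = Arrays.asList(new Rec("a", 1, "log1-a"), new Rec("c", 1, "log1-c"));
    List<Rec> log2 = Arrays.asList(new Rec("b", 3, "log2-b"));
    List<String> values = new ArrayList<>();
    for (Rec r : merge(Arrays.asList(base, log1, log2))) {
      values.add(r.value);
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(demoValues()); // prints [log1-a, base-b, log1-c]
  }
}
```

In the demo, key `a` ties on the ordering value, so the log record replaces the base record, while key `b` keeps the base record because its ordering value is higher than the log's. Only one record per source is held at a time, which is what makes the merge streaming.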

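
The native log naming scheme described above can also be sketched as a small parser. The regex and the `NativeLogNameSketch` / `Parsed` names below are assumptions for illustration, not the patterns added to `FSUtils` or `HoodieLogFile`. The sketch further assumes that none of the four segments contains an underscore and that base files carry only three underscore-separated segments, which is what lets a four-segment name be told apart from a base file.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the parsing code in FSUtils or HoodieLogFile.
final class NativeLogNameSketch {
  // <fileId>_<writeToken>_<instant>_<version>.parquet, with an optional ".delete" marker.
  private static final Pattern NATIVE_LOG =
      Pattern.compile("^([^_]+)_([^_]+)_([^_]+)_(\\d+)(\\.delete)?\\.parquet$");

  static final class Parsed {
    final String fileId;
    final String writeToken;
    final String instant;
    final int version;
    final boolean isDelete;
    Parsed(String fileId, String writeToken, String instant, int version, boolean isDelete) {
      this.fileId = fileId;
      this.writeToken = writeToken;
      this.instant = instant;
      this.version = version;
      this.isDelete = isDelete;
    }
  }

  // Returns empty when the name does not follow the native log pattern.
  static Optional<Parsed> parse(String fileName) {
    Matcher m = NATIVE_LOG.matcher(fileName);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(new Parsed(
        m.group(1), m.group(2), m.group(3),
        Integer.parseInt(m.group(4)),
        m.group(5) != null)); // group 5 is present only for ".delete.parquet"
  }

  public static void main(String[] args) {
    // Example names are made up for the demo.
    System.out.println(parse("fg1_1-0-1_20240101000000_2.parquet").isPresent());
    System.out.println(parse("fg1_1-0-1_20240101000000_3.delete.parquet").get().isDelete);
    System.out.println(parse("fg1_1-0-1_20240101000000.parquet").isPresent());
  }
}
```

A file-system view could use a check of this shape to route four-segment parquet names to log-file handling before base-file discovery sees them.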