danny0405 opened a new pull request, #18911:
URL: https://github.com/apache/hudi/pull/18911
### Describe the issue this Pull Request addresses
This closes #18907.
Flink Lance base-file support previously excluded merge-on-read (MOR)
tables, so MOR write/read flows could not use Lance base files even when the
Flink path had Lance-specific readers available. This blocked users from
combining Lance base files with MOR log-file merging, CDC base-file reads, and
async compaction in Flink SQL pipelines.
This PR expands the Flink Lance path to support merge-on-read writes and
reads while keeping the existing schema-evolution restriction for Lance files.
### Summary and Changelog
This PR enables Lance base files for Flink merge-on-read tables, wires Lance
readers into MOR and CDC base-file reads, and updates tests to cover both
Parquet and Lance MOR base/log-file reads.
#### Working tree: Support Flink Lance MOR write and read path
- Allows `hoodie.table.base.file.format = LANCE` for Flink merge-on-read
tables by removing the previous MOR rejection in `HoodieTableFactory`.
- Adds Lance base-file handling in `MergeOnReadInputFormat` using
`HoodieRowDataLanceReader` and requested-schema projection.
- Adds Lance base-file handling in `HoodieCdcSplitReaderFunction` so CDC
split reads can load Lance base files.
- Narrows `FlinkRowDataReaderContext` schema-evolution rejection to only
fail when a non-empty merge schema is required, while still rejecting actual
Lance schema evolution.
- Updates the Lance unsupported-path error message to avoid saying Lance is
Spark-only.
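The narrowed schema-evolution check described above can be pictured roughly as follows. This is only an illustrative sketch, not the code in `FlinkRowDataReaderContext`: the class name, the `BaseFileFormat` enum, the `rejects`/`validate` helpers, and the representation of the merge schema as an optional list of field names are all hypothetical stand-ins chosen to keep the example self-contained.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: reject Lance only when a non-empty merge schema is required.
final class LanceSchemaEvolutionGuardSketch {

  // Stand-in for the base file format setting; not a Hudi type.
  enum BaseFileFormat { PARQUET, LANCE }

  // Returns true only for LANCE combined with a present, non-empty merge schema.
  // An absent or empty merge schema means no evolution is actually needed.
  static boolean rejects(BaseFileFormat format, Optional<List<String>> mergeSchemaFields) {
    boolean evolutionRequired = mergeSchemaFields.map(f -> !f.isEmpty()).orElse(false);
    return format == BaseFileFormat.LANCE && evolutionRequired;
  }

  // Throws for the rejected combination and is a no-op otherwise.
  static void validate(BaseFileFormat format, Optional<List<String>> mergeSchemaFields) {
    if (rejects(format, mergeSchemaFields)) {
      throw new UnsupportedOperationException(
          "Schema evolution is not supported for Lance base files (illustrative message)");
    }
  }

  public static void main(String[] args) {
    // Plain Lance MOR read with no merge schema: allowed.
    validate(BaseFileFormat.LANCE, Optional.empty());
    // Lance with an empty merge schema: also allowed under the narrowed check.
    validate(BaseFileFormat.LANCE, Optional.of(List.of()));
    // Lance with a real merge schema: still rejected.
    try {
      validate(BaseFileFormat.LANCE, Optional.of(List.of("new_col")));
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The point of the sketch is that the rejection hinges on whether evolution is actually required, not on the mere presence of schema-evolution plumbing in the read path.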
#### Working tree: Tests and validation
- Adds `ITTestHoodieDataSource.testLanceFormatMergeOnReadUpsertWriteAndRead`
for Flink SQL MOR upsert/write/read with Lance base files and async compaction
enabled through SQL table options.
- Parameterizes `TestInputFormat.testReadBaseAndLogFiles` to run for both
`PARQUET` and `LANCE`.
- Updates table-factory and Hive catalog assertions for the new Lance
support boundary.
- Adds a test utility helper for checking completed compaction timeline
state.
- Validation run:
  - `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -DskipIT -Dcheckstyle.skip -Drat.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestInputFormat#testReadBaseAndLogFiles test`
  - `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs=false -DskipIT=false -Dcheckstyle.skip -Drat.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=ITTestHoodieDataSource#testLanceFormatMergeOnReadUpsertWriteAndRead test`
### Impact
This expands Flink user-facing behavior by allowing Lance base files with
merge-on-read tables and by enabling MOR/CDC read paths to read Lance base
files. Schema evolution for Flink Lance base files remains unsupported. There
is no new public API, but the accepted configuration surface changes because
Flink MOR tables can now use `hoodie.table.base.file.format = LANCE`.
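As a rough illustration of the newly accepted configuration, the sketch below assembles a Flink SQL `CREATE TABLE` statement for a MOR table with Lance base files. The table name, columns, and path are hypothetical, and the option set is an assumption based on the commonly used Hudi Flink options (`connector`, `path`, `table.type`, `compaction.async.enabled`) plus the `hoodie.table.base.file.format` key named in this PR. It is not taken from the PR's tests.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that renders a Flink SQL DDL string; it does not run Flink.
final class LanceMorDdlSketch {

  // Table options for a MOR table with Lance base files and async compaction.
  static Map<String, String> lanceMorOptions(String path) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("connector", "hudi");
    options.put("path", path);
    options.put("table.type", "MERGE_ON_READ");
    options.put("hoodie.table.base.file.format", "LANCE");
    options.put("compaction.async.enabled", "true");
    return options;
  }

  // Renders the DDL with an illustrative three-column schema.
  static String createTableSql(String tableName, Map<String, String> options) {
    String with = options.entrySet().stream()
        .map(e -> "  '" + e.getKey() + "' = '" + e.getValue() + "'")
        .collect(Collectors.joining(",\n"));
    return "CREATE TABLE " + tableName + " (\n"
        + "  uuid STRING,\n"
        + "  name STRING,\n"
        + "  ts TIMESTAMP(3),\n"
        + "  PRIMARY KEY (uuid) NOT ENFORCED\n"
        + ") WITH (\n" + with + "\n)";
  }

  public static void main(String[] args) {
    // Hypothetical table name and local path, for illustration only.
    System.out.println(
        createTableSql("lance_mor_demo", lanceMorOptions("file:///tmp/lance_mor_demo")));
  }
}
```

The resulting string could be passed to a Flink `TableEnvironment` in a real pipeline. Before this change, the same option combination would have been rejected by `HoodieTableFactory`.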
### Risk Level
medium
This touches Flink MOR read/write behavior, CDC split reads, table factory
validation, and a storage-format-specific reader path. Risk is mitigated by
targeted unit and integration coverage for Lance MOR SQL writes/reads, MOR
base/log-file reads, and table factory validation. One targeted IT run
passed only on a Surefire retry, after the first attempt hit a transient row
assertion mismatch.
### Documentation Update
Required. The Flink/base-file-format support matrix or configuration
documentation should be updated to note that Lance base files are supported for
Flink merge-on-read tables, with schema evolution still unsupported.
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable