danny0405 opened a new pull request, #18911:
URL: https://github.com/apache/hudi/pull/18911

   ### Describe the issue this Pull Request addresses
   
   This closes #18907.
   
   Flink Lance base-file support previously excluded merge-on-read tables, so 
MOR write/read flows could not use Lance base files even though the Flink path 
already had Lance-specific readers. This blocked users from combining Lance 
base files with MOR log-file merging, CDC base-file reads, and async 
compaction in Flink SQL pipelines.
   
   This PR expands the Flink Lance path to support merge-on-read writes and 
reads while keeping the existing schema-evolution restriction for Lance files.
   
   ### Summary and Changelog
   
   This PR enables Lance base files for Flink merge-on-read tables, wires Lance 
readers into MOR and CDC base-file reads, and updates tests to cover both 
Parquet and Lance MOR base/log-file reads.
   
   #### Working tree: Support Flink Lance MOR write and read path
   - Allows `hoodie.table.base.file.format = LANCE` for Flink merge-on-read 
tables by removing the previous MOR rejection in `HoodieTableFactory`.
   - Adds Lance base-file handling in `MergeOnReadInputFormat` using 
`HoodieRowDataLanceReader` and requested-schema projection.
   - Adds Lance base-file handling in `HoodieCdcSplitReaderFunction` so CDC 
split reads can load Lance base files.
   - Narrows the schema-evolution rejection in `FlinkRowDataReaderContext` so 
that it fails only when a non-empty merge schema is required; actual Lance 
schema evolution is still rejected.
   - Updates the Lance unsupported-path error message to avoid saying Lance is 
Spark-only.
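
The narrowed schema-evolution check described in the list above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `LanceSchemaGuard`, `BaseFileFormat`, `validate`, the list-of-field-names representation of the merge schema, and the error message are all hypothetical stand-ins. The actual check lives in `FlinkRowDataReaderContext` and works on Hudi's own schema types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LanceSchemaGuard {

    /** Hypothetical stand-in for the table's base file format. */
    enum BaseFileFormat { PARQUET, LANCE }

    /**
     * Rejects a Lance read only when a non-empty merge schema is required.
     * mergeSchemaFields is a hypothetical simplification: an empty (or null)
     * list means no schema merging is needed for this read.
     */
    static void validate(BaseFileFormat format, List<String> mergeSchemaFields) {
        boolean mergeRequired = mergeSchemaFields != null && !mergeSchemaFields.isEmpty();
        if (format == BaseFileFormat.LANCE && mergeRequired) {
            // Hypothetical message; the PR's actual wording is not shown here.
            throw new UnsupportedOperationException(
                "Schema evolution is not supported for Lance base files");
        }
    }

    public static void main(String[] args) {
        // Lance with no merge schema: accepted once the check is narrowed.
        validate(BaseFileFormat.LANCE, Collections.<String>emptyList());
        // Parquet with a merge schema: unaffected by the Lance restriction.
        validate(BaseFileFormat.PARQUET, Arrays.asList("new_col"));
        try {
            // Lance with a non-empty merge schema: still rejected.
            validate(BaseFileFormat.LANCE, Arrays.asList("new_col"));
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is the ordering of the two conditions: the format alone no longer triggers a failure, only the combination of Lance and a required schema merge does.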
   
   #### Working tree: Tests and validation
   - Adds `ITTestHoodieDataSource.testLanceFormatMergeOnReadUpsertWriteAndRead` 
for Flink SQL MOR upsert/write/read with Lance base files and async compaction 
enabled through SQL table options.
   - Parameterizes `TestInputFormat.testReadBaseAndLogFiles` to run for both 
`PARQUET` and `LANCE`.
   - Updates table-factory and Hive catalog assertions for the new Lance 
support boundary.
   - Adds a test utility helper for checking completed compaction timeline 
state.
   - Validation run:
     - `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -DskipIT -Dcheckstyle.skip -Drat.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestInputFormat#testReadBaseAndLogFiles test`
     - `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs=false -DskipIT=false -Dcheckstyle.skip -Drat.skip=true -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=ITTestHoodieDataSource#testLanceFormatMergeOnReadUpsertWriteAndRead test`
   
   ### Impact
   
   This expands Flink user-facing behavior by allowing Lance base files with 
merge-on-read tables and by enabling MOR/CDC read paths to read Lance base 
files. Schema evolution for Flink Lance base files remains unsupported. There 
is no new public API, but the accepted configuration surface changes because 
Flink MOR tables can now use `hoodie.table.base.file.format = LANCE`.
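
To illustrate the configuration surface described above, the sketch below assembles a Flink SQL DDL statement for a merge-on-read table with Lance base files. It is only an illustration, not taken from this PR's tests: the class `LanceMorDdl`, the table name, path, and column list are hypothetical, and the `connector`, `path`, `table.type`, and `compaction.async.enabled` keys are assumed from the existing Hudi Flink options. Only `hoodie.table.base.file.format = LANCE` comes from the description itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LanceMorDdl {

    /** Builds a CREATE TABLE statement; table name, path, and columns are illustrative. */
    static String buildDdl(String tableName, String path) {
        // LinkedHashMap keeps the options in insertion order for readable output.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("connector", "hudi");
        options.put("path", path);
        options.put("table.type", "MERGE_ON_READ");
        options.put("hoodie.table.base.file.format", "LANCE");
        // Assumed key for async compaction in Flink SQL table options.
        options.put("compaction.async.enabled", "true");

        String withClause = options.entrySet().stream()
            .map(e -> "  '" + e.getKey() + "' = '" + e.getValue() + "'")
            .collect(Collectors.joining(",\n"));

        return "CREATE TABLE " + tableName + " (\n"
            + "  id INT PRIMARY KEY NOT ENFORCED,\n"
            + "  name STRING,\n"
            + "  ts TIMESTAMP(3)\n"
            + ") WITH (\n"
            + withClause + "\n"
            + ")";
    }

    public static void main(String[] args) {
        // Print the statement; it could then be passed to a Flink TableEnvironment.
        System.out.println(buildDdl("lance_mor_demo", "file:///tmp/lance_mor_demo"));
    }
}
```

Before this change, a table definition combining `MERGE_ON_READ` with the Lance base file format would have been rejected by `HoodieTableFactory`.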
   
   ### Risk Level
   
   medium
   
   This touches Flink MOR read/write behavior, CDC split reads, table factory 
validation, and a storage-format-specific reader path. Risk is mitigated by 
targeted unit and integration coverage for Lance MOR SQL writes/reads, MOR 
base/log-file reads, and table factory validation. One targeted IT run failed 
initially with a transient row-assertion mismatch and passed on Surefire 
retry.
   
   ### Documentation Update
   
   Required. The Flink/base-file-format support matrix or configuration 
documentation should be updated to note that Lance base files are supported for 
Flink merge-on-read tables, with schema evolution still unsupported.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable

