danny0405 opened a new pull request, #18862: URL: https://github.com/apache/hudi/pull/18862
### Describe the issue this Pull Request addresses

Flink Lance base-file support was previously scoped to append-only COPY_ON_WRITE tables without primary keys. That prevented keyed COW tables and upsert workloads from using `hoodie.table.base.file.format=LANCE`, even though the Flink RowData Lance writer and reader path can participate in the generic COW write/merge flow.

The Flink Lance reader also did not implement `filterRowKeys`, which is needed by keyed COW/index-related paths that need to find candidate record keys in existing base files.

### Summary and Changelog

This PR extends Flink Lance support from append-only COW tables to keyed COPY_ON_WRITE tables, while keeping MOR and schema evolution unsupported.

#### Commit 1: feat: add lance format support for Flink COW table (`43004d1b7ed`)

- Relaxed `HoodieTableFactory` Lance validation so Flink Lance is scoped to COPY_ON_WRITE tables rather than append-only/no-primary-key tables.
- Kept validation errors for MERGE_ON_READ Lance tables and schema evolution with Lance base files.
- Implemented `HoodieRowDataLanceReader.filterRowKeys` by scanning projected record keys and returning matching key/position pairs.
- Updated `HoodieRowDataLanceWriter` documentation from append-only base files to general Flink RowData base files.
- Updated `TestHoodieTableFactory` to allow Lance COW tables with primary keys, upsert operation, and explicit record-key fields.
- Added `ITTestHoodieDataSource.testLanceFormatCopyOnWriteUpsertWriteAndRead` to write initial records, apply updates, and read merged results from a Lance COW table.

### Impact

Flink users can now configure Lance as the base file format for COPY_ON_WRITE tables that use primary keys and upsert writes. No new config keys or public APIs are added. Compatibility constraints remain: Flink Lance support is still rejected for MERGE_ON_READ tables and schema evolution.
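
The compatibility constraints above can be sketched as a small validation routine. This is a minimal illustration, not the actual `HoodieTableFactory` code: the class name, method name, and error messages are hypothetical, and the `table.type` and `schema.evolution.enabled` option keys as well as the `PARQUET` and `COPY_ON_WRITE` defaults are assumptions. Only `hoodie.table.base.file.format` is taken from this description.

```java
import java.util.Map;

// Hypothetical sketch of the relaxed Lance validation; not the real HoodieTableFactory.
class LanceValidationSketch {

  // Option key named in this PR description.
  static final String BASE_FILE_FORMAT = "hoodie.table.base.file.format";
  // Assumed option keys, used here only for illustration.
  static final String TABLE_TYPE = "table.type";
  static final String SCHEMA_EVOLUTION = "schema.evolution.enabled";

  // Throws IllegalArgumentException when Lance is combined with an unsupported mode.
  static void validateLance(Map<String, String> options) {
    String format = options.getOrDefault(BASE_FILE_FORMAT, "PARQUET");
    if (!"LANCE".equalsIgnoreCase(format)) {
      return; // the Lance constraints only apply to Lance base files
    }
    String tableType = options.getOrDefault(TABLE_TYPE, "COPY_ON_WRITE");
    if ("MERGE_ON_READ".equalsIgnoreCase(tableType)) {
      throw new IllegalArgumentException("Lance base files are not supported for MERGE_ON_READ tables");
    }
    if (Boolean.parseBoolean(options.getOrDefault(SCHEMA_EVOLUTION, "false"))) {
      throw new IllegalArgumentException("Schema evolution is not supported with Lance base files");
    }
    // No check on primary keys or write operation: keyed COW upserts pass.
  }

  public static void main(String[] args) {
    validateLance(Map.of(BASE_FILE_FORMAT, "LANCE", TABLE_TYPE, "COPY_ON_WRITE"));
    System.out.println("Lance COW table accepted");
    try {
      validateLance(Map.of(BASE_FILE_FORMAT, "LANCE", TABLE_TYPE, "MERGE_ON_READ"));
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

The sketch deliberately has no branch for primary keys or the write operation, which mirrors the relaxation this PR describes: the table type and schema evolution are the only remaining gates.
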
The change affects Flink table factory validation and Lance row-key filtering in the Flink reader path.

### Risk Level

medium

This touches storage-format behavior on the Flink COW read/write path and enables a previously rejected keyed/upsert mode. Risk is mitigated by targeted factory validation coverage and a batch SQL integration test for Lance COW upsert write/read behavior.

Validation evidence:

- `git diff --check` passed.
- Attempted `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipITs -DskipIT=true -Dtest=TestHoodieTableFactory -Dsurefire.failIfNoSpecifiedTests=false test`, but the reactor failed before reaching `hudi-flink` due to an unrelated existing compile error in `hudi-flink-client`: missing `DataTypeAdapter.variantParquetAnnotation()` referenced by `ParquetSchemaConverter.java`.

### Documentation Update

none

No new configuration is introduced. Existing Lance base-file-format documentation may need a follow-up update if the engine support matrix documents Flink append-only versus COW behavior.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
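
For reviewers, the `filterRowKeys` behavior described in the changelog (scan the projected record keys, return matching key/position pairs) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `HoodieRowDataLanceReader`: the class name is hypothetical, an in-memory list stands in for the projected record-key column of a Lance base file, and `Map.Entry<String, Long>` stands in for whatever pair type the real reader returns.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of key/position filtering; not the real Lance reader.
class FilterRowKeysSketch {

  // Scans record keys in file order and returns (key, row position) pairs
  // for every key that appears in the candidate set.
  static Set<Map.Entry<String, Long>> filterRowKeys(List<String> recordKeysInFileOrder,
                                                    Set<String> candidateRowKeys) {
    Set<Map.Entry<String, Long>> matches = new LinkedHashSet<>();
    long position = 0L;
    for (String key : recordKeysInFileOrder) {
      if (candidateRowKeys.contains(key)) {
        matches.add(new SimpleImmutableEntry<>(key, position));
      }
      position++; // positions are zero-based offsets within the file
    }
    return matches;
  }

  public static void main(String[] args) {
    // Stand-in for the record-key column projected from a base file.
    List<String> keys = List.of("id1", "id2", "id3", "id4");
    // "id9" is a candidate that does not exist in the file.
    System.out.println(filterRowKeys(keys, Set.of("id2", "id4", "id9")));
    // prints [id2=1, id4=3]
  }
}
```

Reading only the record-key column keeps the scan cheap relative to materializing full rows. How the real reader handles an empty candidate set is not stated in this description, so the sketch does not model it.
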
