danny0405 opened a new pull request, #18741: URL: https://github.com/apache/hudi/pull/18741
### Describe the issue this Pull Request addresses

Flink currently does not have an append-only Lance base-file path for tables without primary keys. Lance support needs Flink-specific RowData writer and reader plumbing, table-format validation, and COW input-format handling so pk-less append-only tables can ingest and read Lance base files.

This PR adds Lance support for Flink COPY_ON_WRITE append-only INSERT tables without primary keys. It explicitly rejects unsupported Lance combinations such as primary-key tables, MERGE_ON_READ tables, and non-INSERT write operations. This PR introduces user-visible storage-format support for Flink under those constraints.

### Summary and Changelog

Flink can now write and read Lance base files for append-only COW tables without primary keys, including projected reads from Lance files.

#### Working tree: Add Flink Lance append-only writer/reader support

- Added `HoodieRowDataLanceWriter`, `HoodieFlinkLanceArrowUtils`, and `HoodieBloomFilterStringWriteSupport` for primitive RowData-to-Arrow Lance writes.
- Updated `HoodieRowDataCreateHandle` and `HoodieRowDataFileWriterFactory` to dispatch writers by base-file extension and create Lance writers.
- Added `HoodieRowDataLanceReader` and wired it through `HoodieRowDataFileReaderFactory`, `FlinkRowDataReaderContext`, and `CopyOnWriteInputFormat`.
- Updated Lance reader resource ownership so iterator close releases Arrow reader, Lance reader, allocator, and parent metadata reader resources.
- Reordered Lance reader vectors by requested field name so projected reads like `select name, uuid` return columns in Flink projection order.
- Marked Lance files unsplittable in `CopyOnWriteInputFormat`.
- Added `StreamerUtil.getLanceWriteConfig(...)` and persisted `hoodie.table.base.file.format` during table initialization.
- Updated `HoodieTableFactory` validation to allow Lance only for append-only COPY_ON_WRITE INSERT tables without primary keys.
- Replaced the old Flink Lance rejection IT with `ITTestHoodieDataSource#testLanceFormatAppendOnlyWriteAndRead`.
- Added catalog/table-factory tests for append-only Lance table creation and unsupported keyed, MOR, and non-INSERT Lance writes.

### Impact

This enables a new Flink write/read path for Lance base files, scoped to COPY_ON_WRITE append-only tables without primary keys. Existing Parquet behavior is preserved, with writer creation now dispatched by file extension instead of always creating a Parquet writer. Unsupported Lance table shapes fail early with explicit validation messages.

Complex/nested logical types are not supported by the Flink Lance Arrow conversion helpers in this change; unsupported types throw `HoodieNotSupportedException`.

### Risk Level

medium

This touches storage-format writer/reader paths, Flink table validation, COW input-format reading, and table initialization. The main risks are projection correctness, resource cleanup, and accidental enablement for unsupported table modes.

Mitigation includes focused compile and integration/unit coverage:

- `mvn -pl hudi-flink-datasource/hudi-flink -am -DskipTests -DskipITs -Dscala-2.12 compile`
- `mvn -pl hudi-flink-datasource/hudi-flink -am -Dscala-2.12 -Dtest=TestHoodieTableFactory#testLanceFormatSupportedForAppendOnlyTables,org.apache.hudi.table.catalog.TestHoodieCatalog#testCreateAppendOnlyLanceTableWithoutPrimaryKey,ITTestHoodieDataSource#testLanceFormatAppendOnlyWriteAndRead -Dsurefire.failIfNoSpecifiedTests=false test`

The focused test run passed with `Tests run: 3, Failures: 0, Errors: 0`.

### Documentation Update

Documentation update is recommended because this adds new user-facing Flink Lance support with important constraints: only COPY_ON_WRITE append-only INSERT tables without primary keys are supported, and primitive columns are supported by the current RowData/Arrow conversion path.
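
For readers who want the supported table shape at a glance, the constraints described above can be sketched as a small standalone check. This is only an illustration of the rules stated in this description, not the code in `HoodieTableFactory`: the class name `LanceValidationSketch`, the method `checkLanceSupport`, its parameters, and the error messages are all hypothetical.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: the class, method, parameter names, and messages are
// illustrative only and are not taken from HoodieTableFactory.
final class LanceValidationSketch {

  private LanceValidationSketch() {
  }

  /**
   * Returns an error message when the table shape is not supported for Lance
   * base files, or an empty Optional when the configuration is acceptable.
   */
  static Optional<String> checkLanceSupport(
      String baseFileFormat, String tableType, String operation, List<String> primaryKeys) {
    // Non-Lance formats (for example Parquet) are not restricted by these rules.
    if (!"lance".equalsIgnoreCase(baseFileFormat)) {
      return Optional.empty();
    }
    // Rule 1: Lance is limited to tables without primary keys.
    if (!primaryKeys.isEmpty()) {
      return Optional.of("Lance is only supported for tables without primary keys");
    }
    // Rule 2: Lance is limited to COPY_ON_WRITE tables (MERGE_ON_READ is rejected).
    if (!"COPY_ON_WRITE".equalsIgnoreCase(tableType)) {
      return Optional.of("Lance is only supported for COPY_ON_WRITE tables, got: " + tableType);
    }
    // Rule 3: Lance is limited to the INSERT write operation.
    if (!"insert".equalsIgnoreCase(operation)) {
      return Optional.of("Lance is only supported for INSERT writes, got: " + operation);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Supported shape: pk-less, COPY_ON_WRITE, INSERT.
    System.out.println(checkLanceSupport("lance", "COPY_ON_WRITE", "insert", List.of()));
    // Unsupported shapes: MOR table, non-INSERT operation, keyed table.
    System.out.println(checkLanceSupport("lance", "MERGE_ON_READ", "insert", List.of()));
    System.out.println(checkLanceSupport("lance", "COPY_ON_WRITE", "upsert", List.of()));
    System.out.println(checkLanceSupport("lance", "COPY_ON_WRITE", "insert", List.of("uuid")));
  }
}
```

Returning a message instead of throwing keeps the sketch easy to exercise; the actual change fails early with explicit validation messages, as noted under Impact.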
### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
