wombatu-kun opened a new pull request, #18341: URL: https://github.com/apache/hudi/pull/18341
### Describe the issue this Pull Request addresses

Closes https://github.com/apache/hudi/issues/17684

### Summary and Changelog

Implement `canWrite()` in `HoodieSparkLanceWriter`, analogous to `HoodieBaseParquetWriter.canWrite()`, by tracking cumulative Arrow buffer sizes in the base class and adding periodic size-limit checks in the Spark writer.

`HoodieStorageConfig`: Added a `LANCE_MAX_FILE_SIZE` config property (key `hoodie.lance.max.file.size`, default 120 MB) and a `lanceMaxFileSize(long)` builder method, consistent with the existing Parquet/ORC/HFile config entries.

`HoodieBaseLanceWriter`: Added a `totalFlushedDataSize` field and a `getDataSize()` accessor. In `flushBatch()`, after `arrowWriter.finishBatch()` sets the row count, the method now iterates over `root.getFieldVectors()` and accumulates `vector.getBufferSize()` into `totalFlushedDataSize` before writing to Lance. This provides an uncompressed Arrow buffer size estimate analogous to `ParquetWriter.getDataSize()`.

`HoodieSparkLanceWriter`:
- Added `MIN_RECORDS_FOR_SIZE_CHECK` = 100 and `MAX_RECORDS_FOR_SIZE_CHECK` = 10000 constants (mirroring the Parquet constants).
- Added `maxFileSize` and `recordCountForNextSizeCheck` fields.
- Updated the main constructor to accept a `long maxFileSize`; the no-arg secondary constructor now delegates with `Long.MAX_VALUE` (no limit); a new secondary constructor accepting an explicit `maxFileSize` is added for use by `HoodieInternalRowFileWriterFactory`.
- `canWrite()` implementation: checks periodically based on `recordCountForNextSizeCheck`, computes the average record size from `getDataSize() / writtenCount`, returns false when within two average records of `maxFileSize`, and adaptively schedules the next check.

`HoodieSparkFileWriterFactory`: Reads `LANCE_MAX_FILE_SIZE` from config and passes it to the `HoodieSparkLanceWriter` constructor.
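For reference, the size-check scheduling described above can be sketched roughly as follows. This is a hedged, self-contained simplification, not the actual Hudi code: the class, field, and method names here mirror the PR description, but the real writer accumulates `vector.getBufferSize()` over `root.getFieldVectors()` inside `flushBatch()` rather than taking a per-record size.

```java
// Hypothetical, simplified sketch of the periodic size check described in the PR.
class LanceSizeCheckSketch {
  private static final int MIN_RECORDS_FOR_SIZE_CHECK = 100;
  private static final int MAX_RECORDS_FOR_SIZE_CHECK = 10000;

  private final long maxFileSize;
  private long writtenCount = 0;
  // In the real writer this is accumulated from Arrow vector buffer sizes in flushBatch().
  private long totalFlushedDataSize = 0;
  private long recordCountForNextSizeCheck = MIN_RECORDS_FOR_SIZE_CHECK;

  LanceSizeCheckSketch(long maxFileSize) {
    this.maxFileSize = maxFileSize;
  }

  // Stand-in for write(): takes a per-record size instead of summing Arrow buffers.
  void write(long recordSizeBytes) {
    writtenCount++;
    totalFlushedDataSize += recordSizeBytes;
  }

  long getDataSize() {
    return totalFlushedDataSize;
  }

  // Periodic check modeled on the HoodieBaseParquetWriter.canWrite() pattern.
  boolean canWrite() {
    if (writtenCount >= recordCountForNextSizeCheck) {
      long dataSize = getDataSize();
      long avgRecordSize = dataSize / writtenCount;
      // Stop writing when within two average records of the size limit.
      if (dataSize > (maxFileSize - avgRecordSize * 2)) {
        return false;
      }
      // Adaptively schedule the next check: check less often as the file grows,
      // bounded by the min/max record-count constants.
      recordCountForNextSizeCheck = writtenCount
          + Math.min(Math.max(writtenCount / 2, MIN_RECORDS_FOR_SIZE_CHECK),
                     MAX_RECORDS_FOR_SIZE_CHECK);
    }
    return true;
  }
}
```

With a 10 KB limit and 50-byte records, `canWrite()` stays true through the first check at 100 records (5000 bytes written) and returns false once the accumulated size comes within two average records of the limit.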
`HoodieInternalRowFileWriterFactory`: `getInternalRowFileWriter` reads `LANCE_MAX_FILE_SIZE` and passes it (through `newLanceInternalRowFileWriter`) to the new `HoodieSparkLanceWriter` constructor.

### Impact

Tracks a proper implementation that checks whether the file has reached the size threshold and, if so, rolls the write over to a new file.

### Risk Level

None.

### Documentation Update

Need to document the new `LANCE_MAX_FILE_SIZE` config property (`hoodie.lance.max.file.size`, default 120 MB).

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
