wombatu-kun opened a new pull request, #18341:
URL: https://github.com/apache/hudi/pull/18341

   ### Describe the issue this Pull Request addresses
   
   Closes https://github.com/apache/hudi/issues/17684
   
   ### Summary and Changelog
   
   Implement canWrite() in HoodieSparkLanceWriter analogously to 
HoodieBaseParquetWriter.canWrite() by tracking cumulative Arrow buffer sizes in 
the base class and adding periodic size-limit checks in the Spark writer.  
   
   `HoodieStorageConfig`: Added `LANCE_MAX_FILE_SIZE` config property (key 
`hoodie.lance.max.file.size`, default 120 MB) and a `lanceMaxFileSize(long)` 
builder method, consistent with the existing Parquet/ORC/HFile config entries.  
   
   `HoodieBaseLanceWriter`: Added a `totalFlushedDataSize` field and a
`getDataSize()` accessor. In `flushBatch()`, after `arrowWriter.finishBatch()` 
sets the row count, the method now iterates over `root.getFieldVectors()` and 
accumulates `vector.getBufferSize()` into `totalFlushedDataSize` before writing 
to Lance. This provides an uncompressed Arrow buffer size estimate analogous to 
`ParquetWriter.getDataSize()`.
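
   The size tracking above can be modeled without Arrow as a plain running sum (a sketch with stubbed buffer sizes; the real code sums `vector.getBufferSize()` over `root.getFieldVectors()` after each finished batch):

```java
import java.util.List;

// Sketch of the size tracking in HoodieBaseLanceWriter (field/method names
// taken from this PR; Arrow vectors are stubbed out as plain buffer sizes).
class LanceSizeTracker {
  private long totalFlushedDataSize = 0L;

  // Mirrors the accumulation in flushBatch(): after a batch is finished,
  // sum the buffer size of every field vector in the batch.
  void onBatchFlushed(List<Long> perVectorBufferSizes) {
    for (long bufferSize : perVectorBufferSizes) {
      totalFlushedDataSize += bufferSize;
    }
  }

  // Analogous to ParquetWriter.getDataSize(): an uncompressed size estimate.
  long getDataSize() {
    return totalFlushedDataSize;
  }
}
```

   Because the sum is over uncompressed Arrow buffers, it is an upper-bound-style estimate rather than the on-disk Lance file size.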
   
   `HoodieSparkLanceWriter`: 
   - Added `MIN_RECORDS_FOR_SIZE_CHECK` = 100 and `MAX_RECORDS_FOR_SIZE_CHECK` = 10000 constants (mirroring the Parquet constants).  
   - Added `maxFileSize` and `recordCountForNextSizeCheck` fields.  
   - Updated the main constructor to accept a `long maxFileSize` parameter; the no-arg secondary constructor now delegates with `Long.MAX_VALUE` (no limit), and a new secondary constructor accepting an explicit `maxFileSize` is added for use by `HoodieInternalRowFileWriterFactory`.  
   - `canWrite()` implementation: checks periodically based on 
`recordCountForNextSizeCheck`, computes average record size from 
`getDataSize()/writtenCount`, returns false when within two average records of 
`maxFileSize`, and adaptively schedules the next check.  
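
   The adaptive check described above can be sketched as a minimal standalone model (assuming integer arithmetic and a halfway-to-limit scheduling heuristic; the real writer's formula may differ):

```java
// Hedged sketch of the periodic canWrite() check, modeled on the
// HoodieBaseParquetWriter approach. Constant names come from this PR;
// the scheduling details are illustrative assumptions.
class SizeCheckSketch {
  private static final long MIN_RECORDS_FOR_SIZE_CHECK = 100;
  private static final long MAX_RECORDS_FOR_SIZE_CHECK = 10_000;

  private final long maxFileSize;
  private long recordCountForNextSizeCheck = MIN_RECORDS_FOR_SIZE_CHECK;

  SizeCheckSketch(long maxFileSize) {
    this.maxFileSize = maxFileSize;
  }

  // writtenCount: records written so far; dataSize: current getDataSize() estimate.
  // Returns false once the file is within two average records of maxFileSize.
  boolean canWrite(long writtenCount, long dataSize) {
    if (writtenCount < recordCountForNextSizeCheck) {
      return true; // skip the size check most of the time
    }
    long avgRecordSize = dataSize / Math.max(writtenCount, 1);
    if (dataSize + 2 * avgRecordSize >= maxFileSize) {
      return false;
    }
    // Adaptively schedule the next check: roughly halfway to the projected
    // limit, clamped to [MIN_RECORDS_FOR_SIZE_CHECK, MAX_RECORDS_FOR_SIZE_CHECK].
    long recordsUntilFull = avgRecordSize > 0
        ? (maxFileSize - dataSize) / avgRecordSize
        : MAX_RECORDS_FOR_SIZE_CHECK;
    recordCountForNextSizeCheck = writtenCount
        + Math.min(Math.max(recordsUntilFull / 2, MIN_RECORDS_FOR_SIZE_CHECK),
                   MAX_RECORDS_FOR_SIZE_CHECK);
    return true;
  }
}
```

   With, say, a 1000-byte limit and ~5-byte records, the sketch keeps writing until the size estimate comes within two average records of the limit, then returns false so the caller can roll over to a new file.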
   
   `HoodieSparkFileWriterFactory`: Reads `LANCE_MAX_FILE_SIZE` from config and 
passes it to the `HoodieSparkLanceWriter` constructor.  
   
   `HoodieInternalRowFileWriterFactory`: the `getInternalRowFileWriter` method
reads `LANCE_MAX_FILE_SIZE` and passes it (through 
`newLanceInternalRowFileWriter`) to the new `HoodieSparkLanceWriter` 
constructor.
   
   ### Impact
   
   Adds a proper implementation that checks whether a Lance file has reached the configured size threshold and, if so, rolls the write over to a new file.
   
   ### Risk Level
   
   None. Writers created via the no-arg `HoodieSparkLanceWriter` constructor keep the previous unlimited behavior (`Long.MAX_VALUE`).
   
   ### Documentation Update
   
   The new `LANCE_MAX_FILE_SIZE` config property (`hoodie.lance.max.file.size`, default 120 MB) needs to be added to the configuration documentation.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
