KiteSoar commented on code in PR #17831:
URL: https://github.com/apache/hudi/pull/17831#discussion_r2688765041
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -151,13 +154,20 @@ protected void populateVectorSchemaRoot(List<InternalRow> records) {
   /**
    * Check if writer can accept more records based on file size.
-   * Uses filesystem-based size checking (similar to ORC/HFile approach).
+   * Checks the actual file size on storage and compares against the configured threshold.
    *
    * @return true if writer can accept more records, false if file size limit reached
    */
   public boolean canWrite() {
-    //TODO https://github.com/apache/hudi/issues/17684
-    return true;
+    try {
+      if (!storage.exists(path)) {
Review Comment:
I looked into this and found that Hudi currently has two patterns for file size tracking:

1. **FileSystem-level** (ORC/HFile): Uses `HoodieWrapperFileSystem.getBytesWritten()`
2. **Writer-level** (Parquet): Uses the writer's internal `getDataSize()` API

For Lance, option 1 won't work because Lance writes through JNI into native Rust code, completely bypassing the Java FileSystem layer. Option 2 won't work either, since the Lance library doesn't expose any size-tracking API.
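For reference, this is roughly the shape of those two patterns (an illustrative sketch, not the actual Hudi implementations; the `fs`, `writer`, `path`, and `maxFileSize` arguments are assumed to be whatever the real writers hold):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
import org.apache.parquet.hadoop.ParquetWriter;

// Illustrative only: rough shape of the two existing size-tracking patterns.
class SizeCheckPatterns {

  // Pattern 1 (ORC/HFile): the wrapper filesystem counts bytes as they are
  // streamed through it. Lance's JNI writes never pass through this layer,
  // so the counter would never advance.
  static boolean canWriteFsLevel(HoodieWrapperFileSystem fs, Path path, long maxFileSize) {
    return fs.getBytesWritten(path) < maxFileSize;
  }

  // Pattern 2 (Parquet): the writer reports its own in-progress data size.
  // The Lance writer exposes no comparable API.
  static boolean canWriteWriterLevel(ParquetWriter<?> writer, long maxFileSize) {
    return writer.getDataSize() < maxFileSize;
  }
}
```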
The most viable approach seems to be **record-based estimation** (similar to what Parquet does).

Do you see any better alternatives?
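For concreteness, a minimal sketch of what record-based estimation could look like (class and field names here are hypothetical, not from this PR):

```java
// A minimal sketch of record-based estimation: track records written and
// multiply by an average serialized record size. All names are hypothetical.
class LanceFileSizeEstimator {
  private final long maxFileSize;    // configured target file size in bytes
  private final long avgRecordSize;  // seeded from config, or sampled from early batches
  private long writtenRecordCount;

  LanceFileSizeEstimator(long maxFileSize, long avgRecordSize) {
    this.maxFileSize = maxFileSize;
    this.avgRecordSize = avgRecordSize;
  }

  // Call after each batch is handed to the native Lance writer.
  void recordsWritten(long count) {
    writtenRecordCount += count;
  }

  // O(1) check: no storage round-trip, no dependency on Lance internals.
  boolean canWrite() {
    return writtenRecordCount * avgRecordSize < maxFileSize;
  }
}
```

The obvious trade-off is estimation error when record sizes vary; occasionally reconciling the estimate against the actual file size on storage (as this PR's `storage.exists(path)` check already does) could correct the drift without paying that cost on every write.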