wombatu-kun commented on code in PR #18341:
URL: https://github.com/apache/hudi/pull/18341#discussion_r2998810674


##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/io/storage/TestHoodieSparkLanceWriter.java:
##########
@@ -414,6 +414,92 @@ public void testWriteStructType() throws Exception {
     }
   }
 
+  /**
+   * canWrite() must return true before any records are written (written count 
is 0,
+   * below the initial check threshold of MIN_RECORDS_FOR_SIZE_CHECK=100).
+   */
+  @Test
+  public void testCanWriteTrueBeforeAnyWrite() throws Exception {
+    StructType schema = createSchemaWithoutMetaFields();
+    StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_canwrite_initial.lance");
+    try (HoodieSparkLanceWriter writer = new HoodieSparkLanceWriter(
+        path, schema, instantTime, taskContextSupplier, storage, false, 
Long.MAX_VALUE)) {
+      assertTrue(writer.canWrite(), "canWrite() must return true before any 
records are written");
+    }
+  }
+
+  /**
+   * canWrite() must always return true when maxFileSize is Long.MAX_VALUE,
+   * regardless of how many records are written and how many batch flushes 
occur.
+   */
+  @Test
+  public void testCanWriteAlwaysTrueWithNoLimit() throws Exception {
+    StructType schema = createSchemaWithoutMetaFields();
+    StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_canwrite_no_limit.lance");
+    // Write more than DEFAULT_BATCH_SIZE (1000) to trigger at least one batch 
flush
+    int recordCount = 1500;
+    try (HoodieSparkLanceWriter writer = new HoodieSparkLanceWriter(
+        path, schema, instantTime, taskContextSupplier, storage, false, 
Long.MAX_VALUE)) {
+      for (int i = 0; i < recordCount; i++) {
+        assertTrue(writer.canWrite(), "canWrite() should always be true when 
maxFileSize is unlimited");
+        writer.writeRow("key" + i, createRow(i, "User" + i, 20L + i));
+      }
+    }
+  }
+
+  /**
+   * canWrite() must return false once the cumulative Arrow buffer size of 
flushed batches
+   * exceeds maxFileSize. A tiny maxFileSize (100 bytes) is used so that a 
single flushed
+   * batch of 1000 records (≫ 100 bytes) is guaranteed to exceed the limit.
+   */
+  @Test
+  public void testCanWriteReturnsFalseAfterFileSizeLimitExceeded() throws 
Exception {
+    StructType schema = createSchemaWithoutMetaFields();
+    StoragePath path = new StoragePath(tempDir.getAbsolutePath() + 
"/test_canwrite_exceeded.lance");
+    // 100 bytes is far smaller than a single flushed batch of 1000 records
+    long tinyMaxFileSize = 100L;
+
+    try (HoodieSparkLanceWriter writer = new HoodieSparkLanceWriter(
+        path, schema, instantTime, taskContextSupplier, storage, false, 
tinyMaxFileSize)) {
+
+      assertTrue(writer.canWrite(), "canWrite() must be true before writing 
any records");
+
+      // Write exactly DEFAULT_BATCH_SIZE records to force the first batch 
flush
+      for (int i = 0; i < 1000; i++) {
+        writer.writeRow("key" + i, createRow(i, "User" + i, 20L + i));
+      }
+
+      // After the flush the accumulated data size >> tinyMaxFileSize, so 
canWrite() must be false
+      assertFalse(writer.canWrite(),
+          "canWrite() must return false after flushed data size exceeds 
maxFileSize");
+    }
+  }
+
+  /**
+   * A write loop that respects canWrite() must stop well before the 
artificial upper bound
+   * when maxFileSize is tiny, demonstrating that canWrite() correctly gates 
record writes.
+   */
+  @Test
+  public void testCanWriteStopsWriteLoop() throws Exception {
+    StructType schema = createSchemaWithoutMetaFields();

Review Comment:
   Parameterize one test with a populateMetaFields boolean instead of duplicating the tests.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -109,8 +119,56 @@ public HoodieSparkLanceWriter(StoragePath file,
  /**
   * Convenience constructor for row writing without pre-embedded Hudi metadata fields.
   *
   * <p>Delegates to the full constructor with a null instant time, metadata-field
   * population disabled, and no bloom filter.
   *
   * @param file Path where the Lance file will be written
   * @param sparkSchema Spark schema for the data
   * @param taskContextSupplier Task context supplier for partition ID
   * @param storage HoodieStorage instance
   * @param maxFileSize Configurable file size limit for the written file
   */
  public HoodieSparkLanceWriter(StoragePath file,
                                StructType sparkSchema,
                                TaskContextSupplier taskContextSupplier,
                                HoodieStorage storage,
                                long maxFileSize) {
    this(file, sparkSchema, null, taskContextSupplier, storage, false, Option.empty(), maxFileSize);
  }
+
+  /**
+   * Constructor for Spark Lance writer used for internal row writing with 
pre-embedded metadata
+   * and a configurable file size limit.
+   *
+   * @param file Path where Lance file will be written
+   * @param sparkSchema Spark schema for the data
+   * @param instantTime Instant time for the commit
+   * @param taskContextSupplier Task context supplier for partition ID
+   * @param storage HoodieStorage instance
+   * @param populateMetaFields Whether to populate Hudi metadata fields
+   * @param bloomFilterOpt Optional bloom filter for record key tracking
+   * @throws IOException if writer initialization fails
+   */
+  public HoodieSparkLanceWriter(StoragePath file,

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to