Re: [PR] feat(append): dynamically cap log-block buffer at per-task memory [hudi]

via GitHub Wed, 27 May 2026 12:58:15 -0700


hudi-agent commented on code in PR #18860:
URL: https://github.com/apache/hudi/pull/18860#discussion_r3313492630



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/IOUtils.java:
##########
@@ -98,6 +98,45 @@ public static long 
getMaxMemoryPerCompaction(TaskContextSupplier context, Map<St
     return getMaxMemoryAllowedForMerge(context, fraction);
   }
 
+  /**
+   * Returns {@code Option.of(per-task memory ceiling in bytes)} when the 
engine exposes
+   * {@link EngineProperty#TOTAL_MEMORY_AVAILABLE}, {@link 
EngineProperty#MEMORY_FRACTION_IN_USE},
+   * and {@link EngineProperty#TOTAL_CORES_PER_EXECUTOR} (Spark). Returns 
{@code Option.empty()}
+   * otherwise (engines that do not expose memory/cores) so callers can fall 
back to a static
+   * config-driven cap rather than the spillable-map 1GB default.
+   *
+   * <p>Formula mirrors {@link #getMaxMemoryAllowedForMerge}: {@code 
userAvailableMemory =
+   * totalExecutorMemory * (1 - memoryFraction) / executorTaskNum}, then 
multiplied by the
+   * supplied fraction. Honors {@link EngineProperty#SINGLE_TASK_CORES} when 
present (divides
+   * cores per task into total cores to get task slots).
+   *
+   * <p>Distinct from {@link #getMaxMemoryAllowedForMerge} in two ways: 
returns {@code Option}
+   * rather than collapsing the absent case to a default, and accepts the 
floor as a parameter
+   * so callers (e.g., {@code HoodieAppendHandle}) can use a smaller floor 
than the spillable-map
+   * 100MB.
+   */
+  public static Option<Long> getMaxMemoryAllowedForLogAppend(

Review Comment:
   🤖 nit: `minFloor` is redundant since a floor is already a minimum — could 
you rename to just `floor` or `minBytes`? Same applies to the docs above.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -615,12 +646,22 @@ protected boolean needsUpdateLocation() {
     return true;
   }
 
-  private void writeToBuffer(HoodieRecord<T> record) {
+  /**
+   * Buffers {@code record} for the next log block. Returns the record 
actually appended to
+   * {@link #recordList} — the post-{@code prepareRecord} clone of {@code 
populatedRecord}
+   * carrying a fully-materialized Avro {@code IndexedRecord} with prepended 
meta-fields —
+   * or {@code null} when nothing was appended to {@code recordList} (delete, 
ignored, or
+   * partition-mismatch failure). The returned reference is what {@link 
#flushToDiskIfRequired}
+   * sizes; sizing the incoming record under-counts heap because the incoming 
payload is
+   * typically still in its compact/deflated wire form, whereas the buffered 
record holds
+   * the fully-deserialized Avro graph plus meta-fields.

Review Comment:
   🤖 nit: this javadoc (and the similar ones on 
`bufferRecord`/`bufferInsertAndUpdate`/`flushToDiskIfRequired`) is quite long 
and leans on internal implementation details (`prepareRecord` clones, deflated 
payloads, etc.). Could you trim to a 1-2 line contract + a brief "why" note? 
Future refactors will silently drift from these descriptions.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -691,6 +752,42 @@ protected void flushToDiskIfRequired(HoodieRecord record, 
boolean appendDeleteBl
     }
   }
 
+  @VisibleForTesting
+  void setSizeEstimator(SizeEstimator<HoodieRecord> sizeEstimator) {
+    this.sizeEstimator = sizeEstimator;
+  }
+
+  @VisibleForTesting
+  long getAverageRecordSize() {
+    return averageRecordSize;
+  }
+
+  @VisibleForTesting
+  long getNumberOfRecords() {
+    return numberOfRecords;
+  }
+
+  @VisibleForTesting
+  long getEstimatedNumberOfBytesWritten() {
+    return estimatedNumberOfBytesWritten;
+  }
+
+  @VisibleForTesting
+  void simulateBufferedRecord(HoodieRecord bufferedRecord) {

Review Comment:
   🤖 nit: `simulateBufferedRecord` is purely a test fixture (increments 
`numberOfRecords` then calls the gate). Could you keep this logic inside the 
test's `TestableAppendHandle` subclass rather than the production class? The 
other `@VisibleForTesting` accessors here are simple getters/setters, but this 
one bakes test orchestration into the production API.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(append): dynamically cap log-block buffer at per-task memory [hudi]

Reply via email to