rahil-c commented on code in PR #18341:
URL: https://github.com/apache/hudi/pull/18341#discussion_r2995377049
##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/lance/HoodieBaseLanceWriter.java:
##########
@@ -214,6 +216,15 @@ public void close() throws IOException {
}
}
+  /**
+   * Returns the total number of bytes accumulated across all flushed Arrow batches.
+   * Computed as the sum of each field vector's buffer size at flush time, providing
+   * an uncompressed estimate analogous to {@code ParquetWriter.getDataSize()}.
+   */
+  protected long getDataSize() {
+    return totalFlushedDataSize;
Review Comment:
Review Comment:
@wombatu-kun I am wondering, in general, whether `getDataSize()` should actually be tracking the bytes accumulated in memory, as opposed to only the flushed bytes.
This is what I see Parquet does:
<img width="981" height="374" alt="Image"
src="https://github.com/user-attachments/assets/e3f4b210-1e04-4ae7-b6df-3002ba9b4ffc"
/> where it reports all buffered data.
The reason I bring this up is that with your current impl, if the current batch is in progress (not yet flushed), it is invisible to `getDataSize()`. With a DEFAULT_BATCH_SIZE of 1000, there could be up to 999 records of unreported data. For small `maxFileSize` values or large records, this could cause overshoot.
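To make the concern concrete, here is a minimal sketch of the accounting change I have in mind: track the in-progress batch's bytes in a separate counter and include them in `getDataSize()`. The field and method names (`currentBatchDataSize`, `recordAppended`, `batchFlushed`) are hypothetical illustrations, not the actual writer's API.

```java
// Hedged sketch only: a standalone class modeling the proposed accounting.
// totalFlushedDataSize mirrors the PR's field; currentBatchDataSize and the
// two callbacks below are assumed names for illustration.
public class LanceWriterSizeSketch {
  private long totalFlushedDataSize = 0L;   // bytes of all flushed batches
  private long currentBatchDataSize = 0L;   // bytes buffered in the open batch

  // Would be called for each record appended to the in-progress Arrow batch.
  void recordAppended(long approxRecordBytes) {
    currentBatchDataSize += approxRecordBytes;
  }

  // Would be called when the in-progress batch is flushed to the Lance file.
  void batchFlushed() {
    totalFlushedDataSize += currentBatchDataSize;
    currentBatchDataSize = 0L;
  }

  // Like ParquetWriter.getDataSize(): flushed bytes plus buffered bytes,
  // so up to DEFAULT_BATCH_SIZE - 1 buffered records are no longer invisible.
  long getDataSize() {
    return totalFlushedDataSize + currentBatchDataSize;
  }
}
```

With this shape, a size-based roll check against `maxFileSize` sees buffered records immediately instead of only after the next flush.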
Let me know what you think though?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]