KiteSoar commented on code in PR #17831:
URL: https://github.com/apache/hudi/pull/17831#discussion_r2688765041
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -151,13 +154,20 @@ protected void populateVectorSchemaRoot(List<InternalRow> records) {
   /**
    * Check if writer can accept more records based on file size.
-   * Uses filesystem-based size checking (similar to ORC/HFile approach).
+   * Checks the actual file size on storage and compares against the configured threshold.
    *
    * @return true if writer can accept more records, false if file size limit reached
    */
   public boolean canWrite() {
-    //TODO https://github.com/apache/hudi/issues/17684
-    return true;
+    try {
+      if (!storage.exists(path)) {
Review Comment:
I looked into this and found that Hudi currently has two patterns for file size tracking:

1. **FileSystem-level** (ORC/HFile): Uses `HoodieWrapperFileSystem.getBytesWritten()`
2. **Writer-level** (Parquet): Uses the writer's internal `getDataSize()` API

For Lance, option 1 won't work because Lance writes through JNI into native Rust code, completely bypassing the Java FileSystem layer. Option 2 won't work either, since the Lance library doesn't expose any size-tracking API.
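For reference, this is roughly the shape of those two patterns (an illustrative sketch, not the actual Hudi implementations; the `fs`, `writer`, `path`, and `maxFileSize` arguments are assumed to be whatever the real writers hold):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
import org.apache.parquet.hadoop.ParquetWriter;

// Illustrative only: rough shape of the two existing size-tracking patterns.
class SizeCheckPatterns {

  // Pattern 1 (ORC/HFile): the wrapper filesystem counts bytes as they are
  // streamed through it. Lance's JNI writes never pass through this layer,
  // so the counter would never advance.
  static boolean canWriteFsLevel(HoodieWrapperFileSystem fs, Path path, long maxFileSize) {
    return fs.getBytesWritten(path) < maxFileSize;
  }

  // Pattern 2 (Parquet): the writer reports its own in-progress data size.
  // The Lance writer exposes no comparable API.
  static boolean canWriteWriterLevel(ParquetWriter<?> writer, long maxFileSize) {
    return writer.getDataSize() < maxFileSize;
  }
}
```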
The most viable approach seems to be **record-based estimation** (similar to what Parquet does).

Do you see any better alternatives?
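For concreteness, a minimal sketch of what record-based estimation could look like (class and field names here are hypothetical, not from this PR):

```java
// A minimal sketch of record-based estimation: track records written and
// multiply by an average serialized record size. All names are hypothetical.
class LanceFileSizeEstimator {
  private final long maxFileSize;    // configured target file size in bytes
  private final long avgRecordSize;  // seeded from config, or sampled from early batches
  private long writtenRecordCount;

  LanceFileSizeEstimator(long maxFileSize, long avgRecordSize) {
    this.maxFileSize = maxFileSize;
    this.avgRecordSize = avgRecordSize;
  }

  // Call after each batch is handed to the native Lance writer.
  void recordsWritten(long count) {
    writtenRecordCount += count;
  }

  // O(1) check: no storage round-trip, no dependency on Lance internals.
  boolean canWrite() {
    return writtenRecordCount * avgRecordSize < maxFileSize;
  }
}
```

The obvious trade-off is estimation error when record sizes vary; occasionally reconciling the estimate against the actual file size on storage (as this PR's `storage.exists(path)` check already does) could correct the drift without paying that cost on every write.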