voonhous opened a new issue, #19001:
URL: https://github.com/apache/hudi/issues/19001

   ### Describe the problem
   
   In `HoodieAppendHandle.close()`, for every produced `WriteStatus` the handle 
calls `storage.getPathInfo(<logFilePath>).getLength()` to record the final log 
file size. On object stores (S3/GCS) each call is a remote HEAD request, so a 
delta commit touching K file groups issues at least K extra round trips (one 
per log file written), purely to read a size the handle already knows.
   
   The size is fully determined by the writes the handle just performed: 
`HoodieLogFormatWriter.appendBlocks` returns an `AppendResult` with the start 
offset and the total bytes appended (covering every on-disk byte: magic, 
header, content, footers, reverse-pointer long), and these already populate the 
delta write stat's `logOffset` and `fileSizeInBytes`. Appends within a handle 
are contiguous, so a log file's length equals `logOffset + fileSizeInBytes`.
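
   The invariant can be checked with a small simulation. This is only a sketch and does not use Hudi classes: `AppendResultSketch` and `appendBlock` are hypothetical stand-ins for `AppendResult` and `HoodieLogFormatWriter.appendBlocks`, and an in-memory buffer stands in for the log file.

```java
import java.io.ByteArrayOutputStream;

public class ContiguousAppendSketch {
    /** Hypothetical stand-in for AppendResult: start offset and bytes written. */
    static final class AppendResultSketch {
        final long offset;
        final long size;

        AppendResultSketch(long offset, long size) {
            this.offset = offset;
            this.size = size;
        }
    }

    // In-memory stand-in for a log file on storage.
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();

    /** Appends one block; reports where it started and how many bytes it added. */
    AppendResultSketch appendBlock(byte[] block) {
        long start = file.size();
        file.write(block, 0, block.length);
        return new AppendResultSketch(start, file.size() - start);
    }

    long actualLength() {
        return file.size();
    }

    public static void main(String[] args) {
        ContiguousAppendSketch log = new ContiguousAppendSketch();
        log.appendBlock(new byte[40]); // content already in the file before this handle

        // Two contiguous appends made by the same (simulated) handle.
        AppendResultSketch first = log.appendBlock(new byte[128]);
        AppendResultSketch second = log.appendBlock(new byte[64]);

        // Accumulate the way the issue describes: offset of the first append,
        // plus the total bytes appended by the handle.
        long logOffset = first.offset;
        long fileSizeInBytes = first.size + second.size;

        System.out.println(logOffset + fileSizeInBytes == log.actualLength());
    }
}
```

   The check holds only because the appends are contiguous; if another writer interleaved bytes into the same file, the sum would no longer equal the file length.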
   
   ### Proposed fix
   
   In `close()`, set each log file's final size to `stat.getLogOffset() + 
stat.getFileSizeInBytes()` instead of issuing a `getPathInfo`/HEAD per file. 
The value is identical to `getPathInfo(<logFilePath>).getLength()` (any pre-block 
bytes are absorbed into `logOffset`, and `closeStream()` appends no trailer), 
and it removes one remote round trip per log file per file group on the MOR 
write path. Same class of change as the merged clean-path getPathInfo removal 
(#18963).
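
   A rough sketch of the before/after behavior, under stated assumptions: `StatSketch` and `CountingStorage` are hypothetical types invented for illustration (not Hudi's `HoodieDeltaWriteStat` or `HoodieStorage`), and the fake storage simply counts each length lookup as one HEAD request.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CloseSizeSketch {
    /** Hypothetical stand-in for the delta write stat fields named in the issue. */
    static final class StatSketch {
        final String path;
        final long logOffset;
        final long fileSizeInBytes;

        StatSketch(String path, long logOffset, long fileSizeInBytes) {
            this.path = path;
            this.logOffset = logOffset;
            this.fileSizeInBytes = fileSizeInBytes;
        }
    }

    /** Fake object store; every length lookup models one remote HEAD request. */
    static final class CountingStorage {
        private final Map<String, Long> lengths = new HashMap<>();
        int headRequests = 0;

        void put(String path, long length) {
            lengths.put(path, length);
        }

        long getLength(String path) {
            headRequests++;
            return lengths.get(path);
        }
    }

    /** Behavior described in the issue: one storage lookup per log file. */
    static List<Long> sizesViaStorage(CountingStorage storage, List<StatSketch> stats) {
        List<Long> sizes = new ArrayList<>();
        for (StatSketch stat : stats) {
            sizes.add(storage.getLength(stat.path));
        }
        return sizes;
    }

    /** Proposed behavior: derive the size from the stat, with no storage call. */
    static List<Long> sizesFromStats(List<StatSketch> stats) {
        List<Long> sizes = new ArrayList<>();
        for (StatSketch stat : stats) {
            sizes.add(stat.logOffset + stat.fileSizeInBytes);
        }
        return sizes;
    }

    public static void main(String[] args) {
        CountingStorage storage = new CountingStorage();
        storage.put("fg1.log", 300L);
        storage.put("fg2.log", 512L);

        List<StatSketch> stats = new ArrayList<>();
        stats.add(new StatSketch("fg1.log", 100L, 200L));
        stats.add(new StatSketch("fg2.log", 0L, 512L));

        List<Long> derived = sizesFromStats(stats);
        int callsAfterDerived = storage.headRequests; // still 0: no storage access
        List<Long> fetched = sizesViaStorage(storage, stats);

        System.out.println(derived.equals(fetched));
        System.out.println(callsAfterDerived + " vs " + storage.headRequests + " lookups");
    }
}
```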
   
   Will raise a PR for this.
   

