umehrot2 commented on issue #2408:
URL: https://github.com/apache/hudi/issues/2408#issuecomment-758320870


   I took a deeper look at this. For you this seems to be happening in the 
archival code path:
   ```
    at org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)
    at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)
    at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)
    at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)
   ```
   
   `HoodieTimelineArchiveLog` needs to write log files containing commit
records, similar to how log files are written for MOR tables. However, I
notice a couple of issues in this code:
   - The default maximum log block size of 256 MB, defined
[here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51),
is not used by this class; it is only applied when writing MOR log blocks. As
a result, there is no real control over the size of the block being written,
which can overflow `ByteArrayOutputStream`, whose maximum size is
`Integer.MAX_VALUE - 8`. That seems to be what is happening in this scenario:
the code path inside `ByteArrayOutputStream` hits an integer overflow. So we
need to apply the maximum block size concept here as well.
   - In addition, I see a bug in the code
[here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302):
even after flushing the records to a file once a batch reaches 10 records (the
default), it does not clear the list and keeps accumulating records. This
seems logically wrong as well (it duplicates records), apart from the fact
that it keeps increasing the size of the log blocks being written.
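
To make the two points above concrete, here is a minimal sketch of what a bounded, batch-clearing write loop could look like. It is not the actual `HoodieTimelineArchiveLog` code: the class `BatchedArchiveSketch`, the method `toBlocks`, and both limit parameters are hypothetical names for illustration, and records are modeled as plain byte arrays rather than Avro records.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models "blocks" as lists of serialized records.
class BatchedArchiveSketch {

    // Groups records into blocks, closing a block when it reaches either
    // maxRecords entries or would exceed maxBytes of payload.
    static List<List<byte[]>> toBlocks(List<byte[]> records, int maxRecords, long maxBytes) {
        List<List<byte[]>> blocks = new ArrayList<>();
        List<byte[]> batch = new ArrayList<>();
        long batchBytes = 0;

        for (byte[] record : records) {
            // Close the current block first if this record would push it past the byte cap.
            if (!batch.isEmpty() && batchBytes + record.length > maxBytes) {
                blocks.add(new ArrayList<>(batch));
                batch.clear();
                batchBytes = 0;
            }
            batch.add(record);
            batchBytes += record.length;

            // Close the block once the record-count limit is reached.
            if (batch.size() >= maxRecords) {
                blocks.add(new ArrayList<>(batch));
                batch.clear(); // without this, every later block would repeat earlier records
                batchBytes = 0;
            }
        }
        // Write out whatever is left over.
        if (!batch.isEmpty()) {
            blocks.add(new ArrayList<>(batch));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            records.add(new byte[4]);
        }
        System.out.println(toBlocks(records, 10, 1024).size() + " blocks");
    }
}
```

Clearing the batch after each flush keeps every block at a bounded size and prevents a record from being written more than once; the byte cap plays the role that the maximum log block size plays for MOR log blocks.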
   
   I will open a JIRA to track this issue.

