[ 
https://issues.apache.org/jira/browse/HUDI-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1604:
--------------------------------------
    Labels: sev:triage user-support-issues  (was: user-support-issues)

> Fix archival max log size and potentially a bug in archival
> -----------------------------------------------------------
>
>                 Key: HUDI-1604
>                 URL: https://issues.apache.org/jira/browse/HUDI-1604
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Cleaner
>    Affects Versions: 0.7.0
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: sev:triage, user-support-issues
>
> Gist of the issue, from Udit:
>  
> I took a deeper look at this. For you this seems to be happening in the 
> archival code path:
>  
> {{at org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)}}
> {{at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)}}
> {{at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)}}
> {{at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)}}
> The failure is in {{HoodieTimelineArchiveLog}}, which writes archived commit 
> records to log files, similar to how log files are written for MOR tables. 
> However, in this code I notice a couple of issues:
>  * The default maximum log block size of 256 MB defined 
> [here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51]
>  is not used by this class; it is only applied when writing MOR log blocks. 
> As a result, there is no real control over the size of the block it can end 
> up writing, which can overflow {{ByteArrayOutputStream}}, whose maximum 
> capacity is {{Integer.MAX_VALUE - 8}} bytes. That is what seems to be 
> happening in this scenario: an integer overflow along that code path inside 
> {{ByteArrayOutputStream}}. So we need to apply the maximum block size limit 
> here as well.
>  * In addition, I see a bug in the code 
> [here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302]:
>  even after flushing the records out to a file once the batch size of 10 
> (the default) is reached, it does not clear the list and just keeps 
> accumulating records. This is logically wrong (records get duplicated 
> across batches), and on top of that it keeps growing the size of the log 
> file blocks being written.
> Reference: https://github.com/apache/hudi/issues/2408#issuecomment-758320870
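
To make the first point concrete, here is a minimal, hypothetical Java sketch of the proposed fix. The class, method names, cap, and record sizes are all illustrative (not Hudi's actual code); the idea is simply to roll over to a new log block before the buffered bytes exceed a configured maximum, so the in-memory buffer never approaches the {{ByteArrayOutputStream}} limit:

```java
// Hypothetical sketch (not Hudi's actual code): roll over to a new log
// block before the buffered bytes exceed a configured cap, so the
// in-memory buffer stays far below ByteArrayOutputStream's practical
// limit of Integer.MAX_VALUE - 8 bytes.
public class BlockSizeGuard {
    // Illustrative cap, mirroring the 256 MB default in HoodieStorageConfig.
    static final long MAX_BLOCK_BYTES = 256L * 1024 * 1024;

    static int blocksFlushed = 0;

    // Stand-in for writing one bounded block to the archive log.
    static void flushBlock(long bytes) {
        blocksFlushed++;
    }

    public static void main(String[] args) {
        long buffered = 0;
        // Four serialized records of ~100 MB each (sizes are made up).
        long[] recordSizes = {100L << 20, 100L << 20, 100L << 20, 100L << 20};
        for (long size : recordSizes) {
            if (buffered > 0 && buffered + size > MAX_BLOCK_BYTES) {
                flushBlock(buffered); // flush before the cap would be exceeded
                buffered = 0;
            }
            buffered += size;
        }
        if (buffered > 0) {
            flushBlock(buffered); // flush the final partial block
        }
        System.out.println(blocksFlushed + " blocks written");
    }
}
```

With these made-up sizes, the 400 MB of records are split across two blocks instead of being accumulated into one oversized buffer.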

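For the second point, a hedged sketch of the corrected batching logic (names assumed, not the actual {{HoodieTimelineArchiveLog}} code): flush every batch-size records and then clear the buffer, so each record is archived exactly once:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (names assumed, not the actual
// HoodieTimelineArchiveLog code): flush every BATCH_SIZE records and
// then clear the buffer, so each record is written exactly once.
public class ArchiveBatchSketch {
    static final int BATCH_SIZE = 10; // matches the default batch size of 10

    static int flushCount = 0;
    static int recordsWritten = 0;

    // Stand-in for writing one batch of records to the archive log file.
    static void writeToFile(List<String> batch) {
        flushCount++;
        recordsWritten += batch.size();
    }

    public static void main(String[] args) {
        List<String> buffer = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            buffer.add("record-" + i);
            if (buffer.size() >= BATCH_SIZE) {
                writeToFile(buffer);
                buffer.clear(); // the fix: without this, later batches re-write earlier records
            }
        }
        if (!buffer.isEmpty()) {
            writeToFile(buffer); // flush the remainder
        }
        System.out.println(flushCount + " flushes, " + recordsWritten + " records written");
    }
}
```

Without the {{buffer.clear()}} call, the same records would be re-written in every subsequent flush, duplicating data and growing each successive log block.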


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
