[
https://issues.apache.org/jira/browse/HUDI-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan reassigned HUDI-1604:
-----------------------------------------
Assignee: sivabalan narayanan
> Fix archival max log size and potentially a bug in archival
> -----------------------------------------------------------
>
> Key: HUDI-1604
> URL: https://issues.apache.org/jira/browse/HUDI-1604
> Project: Apache Hudi
> Issue Type: Bug
> Components: Cleaner
> Affects Versions: 0.7.0
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: sev:triage, user-support-issues
>
> Gist of the issue from Udit
>
> I took a deeper look at this. For you this seems to be happening in the
> archival code path:
>
> {{ at org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)
> at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)
> at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)
> at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)}}
> {{HoodieTimelineArchiveLog}} needs to write log files with commit records,
> similar to how log files are written for MOR tables. However, in this code
> I notice a couple of issues:
> * The default maximum log block size of 256 MB defined
> [here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51]
> is not used by this class; it is only applied when writing MOR log blocks.
> As a result, nothing bounds the size of the block it writes, which can
> overflow {{ByteArrayOutputStream}}, whose maximum size is
> {{Integer.MAX_VALUE - 8}}. That seems to be what is happening in this
> scenario: an integer overflow along that code path inside
> {{ByteArrayOutputStream}}. So we need to apply the maximum block size
> concept here as well.
> * In addition, I see a bug in the code
> [here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302]:
> even after flushing the records to a file once a batch of 10 (the default)
> is reached, it does not clear the list and just keeps accumulating records.
> This is logically wrong as well (it duplicates records), apart from the
> fact that it keeps increasing the size of the log file blocks being written.
> Reference: https://github.com/apache/hudi/issues/2408#issuecomment-758320870
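The two fixes suggested in the description (bound the block size, and clear the list after each flush) can be sketched together as follows. This is only a minimal illustration under stated assumptions, not the actual Hudi code: the class {{ArchiveBatcher}}, its fields and its methods are hypothetical names, records are modelled as plain byte arrays, and "writing a block" is simulated by appending a copy of the pending list to an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of Hudi.
final class ArchiveBatcher {
    private final int maxRecordsPerBlock;   // analogous to the batch size of 10
    private final long maxBlockBytes;       // analogous to a max log block size
    private final List<byte[]> pending = new ArrayList<>();
    private final List<List<byte[]>> flushedBlocks = new ArrayList<>();
    private long pendingBytes = 0;

    ArchiveBatcher(int maxRecordsPerBlock, long maxBlockBytes) {
        this.maxRecordsPerBlock = maxRecordsPerBlock;
        this.maxBlockBytes = maxBlockBytes;
    }

    void add(byte[] record) {
        // Flush first if this record would push the block past the byte cap.
        if (!pending.isEmpty() && pendingBytes + record.length > maxBlockBytes) {
            flush();
        }
        pending.add(record);
        pendingBytes += record.length;
        // Flush once the record-count batch size is reached.
        if (pending.size() >= maxRecordsPerBlock) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Stand-in for writing one log block to the archive file.
        flushedBlocks.add(new ArrayList<>(pending));
        // Reset the buffer so the next block does not repeat these records.
        pending.clear();
        pendingBytes = 0;
    }

    List<List<byte[]>> blocks() {
        return flushedBlocks;
    }
}
```

With a batch size of 10 and a generous byte cap, adding 25 small records and calling {{flush()}} once at the end yields blocks of 10, 10 and 5 records, so every record is written exactly once. A single record larger than the cap still goes out as its own block in this sketch; how the real code should treat that case is left open here.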
--
This message was sent by Atlassian Jira
(v8.3.4#803005)