hudi-bot opened a new issue, #14750:
URL: https://github.com/apache/hudi/issues/14750

   Gist of the issue from Udit
   
    
   
   I took a deeper look at this. For you this seems to be happening in the 
archival code path:
   
    
   
   {{ at 
org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:309)
    at 
org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:282)
    at 
org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:133)
    at 
org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:381)}}
   
   {{HoodieTimelineArchiveLog}} needs to write log files with commit records, 
similar to how log files are written for MOR tables. However, in this code I 
notice a couple of issues:
    * The default maximum log block size of 256 MB defined 
[here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L51]
 is not used by this class; it is only applied when writing MOR log blocks. As 
a result, there is no real control over the size of the block it ends up 
writing, which can potentially overflow {{ByteArrayOutputStream}}, whose 
maximum size is {{Integer.MAX_VALUE - 8}}. That seems to be what is happening 
in this scenario: an integer overflow along that code path inside 
{{ByteArrayOutputStream}}. So we need to apply the maximum block size concept 
here as well.
    * In addition, I see a bug in the code 
[here|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L302],
 where even after flushing the records to a file after a batch of 10 (the 
default), it does not clear the list and just goes on accumulating records. 
This seems logically wrong as well (duplication), apart from the fact that it 
keeps increasing the size of the log file blocks being written.
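
The two points above can be illustrated with a minimal sketch. It is not Hudi code: `SizeBoundedBatcher`, its fields, and the byte-array records are hypothetical stand-ins, and the list of flushed record counts stands in for appending a block to the archive log. The sketch flushes when either a record-count limit or a byte limit would be exceeded, and it always clears the buffer after a flush so that no record is written twice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names are illustrative and do not exist in Hudi.
final class SizeBoundedBatcher {
    private final int maxRecords;   // analogous to the archival batch size
    private final long maxBytes;    // analogous to a maximum log block size
    private final List<byte[]> buffer = new ArrayList<>();
    private final List<Integer> flushedRecordCounts = new ArrayList<>();
    private long bufferedBytes = 0;

    SizeBoundedBatcher(int maxRecords, long maxBytes) {
        this.maxRecords = maxRecords;
        this.maxBytes = maxBytes;
    }

    void add(byte[] record) {
        // Flush first if this record would push the pending block past the byte cap.
        // A single record larger than maxBytes is still written, alone in its block.
        if (!buffer.isEmpty() && bufferedBytes + record.length > maxBytes) {
            flush();
        }
        buffer.add(record);
        bufferedBytes += record.length;
        if (buffer.size() >= maxRecords) {
            flush();
        }
    }

    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Stand-in for building one block from the buffer and appending it.
        flushedRecordCounts.add(buffer.size());
        // Clearing here is what prevents duplication and unbounded block growth.
        buffer.clear();
        bufferedBytes = 0;
    }

    List<Integer> flushedRecordCounts() {
        return new ArrayList<>(flushedRecordCounts);
    }
}
```

With a record limit of 10 and small records, 25 records would be written as blocks of 10, 10, and 5 (after a final `flush()`); with large records the byte cap cuts blocks earlier.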
   
   Reference: https://github.com/apache/hudi/issues/2408#issuecomment-758320870
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1604
   - Type: Bug
   - Affects version(s):
     - 0.7.0
   
   
   ---
   
   
   ## Comments
   
   09/Feb/21 11:46;shivnarayan;I don't think there is a bug with respect to 
clearing the records. Check 
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L362.;;;
   
   ---
   
   16/Dec/21 01:43;shivnarayan;Hey [~uditme]: regarding the 2nd issue, in the 
latest master at least, I see we do clear the list of records after each write. 
   {code:java}
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
     if (records.size() > 0) {
       Map<HeaderMetadataType, String> header = new HashMap<>();
       header.put(HoodieLogBlock.HeaderMetadataType.SCHEMA, wrapperSchema.toString());
       final String keyField = table.getMetaClient().getTableConfig().getRecordKeyFieldProp();
       HoodieAvroDataBlock block = new HoodieAvroDataBlock(records, header, keyField);
       writer.appendBlock(block);
       records.clear();
     }
   } {code}
   So, assuming the 2nd issue is not valid anymore, let's talk about the 1st 
issue reported. 
   
   I get your point that we don't honor the log block size. But given that the 
archival batch size can be controlled via config, I wonder whether we really 
need to honor the log block size, because to one log block we send N records, 
where N is the archival batch size. So, unless users set the archival batch 
size to some very large number, we should not hit an overflow with respect to 
log block size, in my understanding. Please do correct me if there is a gap in 
my understanding. 
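
As a rough back-of-the-envelope check of this reasoning, the sketch below computes how large the average serialized record would have to be for one batch to exceed the {{Integer.MAX_VALUE - 8}} limit mentioned in the issue. The class and method names are hypothetical, and it assumes the block size is simply batch size times average record size, ignoring header and serialization overhead.

```java
// Hypothetical helper for illustration only; not part of Hudi.
final class ArchiveBlockEstimate {
    // Limit quoted in the issue for ByteArrayOutputStream.
    static final long MAX_BUFFER_BYTES = Integer.MAX_VALUE - 8L;

    // Largest average record size (bytes) that still fits one batch in the buffer.
    static long maxAvgRecordBytes(int batchSize) {
        return MAX_BUFFER_BYTES / batchSize;
    }

    // True if batchSize records of avgRecordBytes each would exceed the limit.
    static boolean mayOverflow(int batchSize, long avgRecordBytes) {
        return Math.multiplyExact((long) batchSize, avgRecordBytes) > MAX_BUFFER_BYTES;
    }
}
```

Under these assumptions, a batch of 10 overflows only if records average more than about 214 MB each, so the limit is reached either with a very large batch size or with unusually large individual records.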
   
   Reducing the priority to sev:high for now. Let's brainstorm and see what 
the best way to take this on is. 
   
    ;;;

