garyli1019 commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r436089449



##########
File path: hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -301,7 +301,7 @@ protected static long averageBytesPerRecord(HoodieTimeline commitTimeline, int d
               .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
           long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
           long totalRecordsWritten = commitMetadata.fetchTotalRecordsWritten();
-          if (totalBytesWritten > 0 && totalRecordsWritten > 0) {
+          if (totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit() && totalRecordsWritten > 0) {

Review comment:
       Got your point. Making a small commit is already an edge case for me, so I didn't think of continuously making small commits. But I agree we can use a new config to have that flexibility.
   
   We have discussed subtracting the bloom filter size before, but preferred not to do it, IIUC. Major reasons:
   > * Even if we deduct the size of the bloom filter, there will be other metadata, so `totalWriteBytes` still does not represent the total record size. When the situation we discussed above happens, small files may still be produced.
   > * This would increase the complexity when we handle other indexing, like HbaseIndexing.
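   For illustration, here is a minimal, self-contained sketch of the thresholding idea in the diff above: skip commits whose `totalBytesWritten` is at or below a small-file limit, so that metadata overhead (such as bloom filters) in tiny commits does not skew the average-record-size estimate. This is not the actual `UpsertPartitioner` code; the class name, the `List<long[]>` input, and the parameter names are hypothetical stand-ins for the commit metadata and `hoodie.parquet.small.file.limit` config.

```java
import java.util.List;

public class AvgBytesSketch {

  // Hypothetical stand-in for averageBytesPerRecord: each commit is a
  // {totalBytesWritten, totalRecordsWritten} pair. Commits at or below
  // smallFileLimit are skipped because their bytes are dominated by
  // metadata (e.g. bloom filters), not record payload.
  static long averageBytesPerRecord(List<long[]> commits, long smallFileLimit, long defaultSize) {
    for (long[] c : commits) {
      long bytes = c[0];
      long records = c[1];
      if (bytes > smallFileLimit && records > 0) {
        return bytes / records;
      }
    }
    // No sufficiently large commit found: fall back to a configured default.
    return defaultSize;
  }

  public static void main(String[] args) {
    List<long[]> commits = List.of(
        new long[] {1024L, 10L},                // small commit: skipped, metadata-dominated
        new long[] {200_000_000L, 1_000_000L}); // large commit: used for the estimate
    // With a 100 MB small-file limit, only the second commit qualifies.
    System.out.println(averageBytesPerRecord(commits, 100L * 1024 * 1024, 1024L));
  }
}
```

   With the original `> 0` check, the 1 KB commit would have been used and the estimate would come out near 100 bytes per record; with the limit it is 200 bytes per record, taken from the large commit only.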
   