vinothchandar commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r435946857



##########
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -301,7 +301,7 @@ protected static long averageBytesPerRecord(HoodieTimeline commitTimeline, int d
               .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
           long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
           long totalRecordsWritten = commitMetadata.fetchTotalRecordsWritten();
-          if (totalBytesWritten > 0 && totalRecordsWritten > 0) {
+          if (totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit() && totalRecordsWritten > 0) {

Review comment:
       I understand the issues that you are solving. It’s real and we are on the same page there.
   
   > If there is at least one file in the timeline larger than a small file 
then it should be not using the default value
   
   Actually the condition is more like: at least one commit with total bytes written greater than the small file limit. Let’s walk through how this may play out with an example.
   
   Let’s say the table is empty and we start doing small commits with 5 records 
each.
   
   Commit 1: no candidates for small files; inserts get grouped into (100k default) record buckets and written to file1_c1.
   Commit 2: the previous file is a small file; inserts are turned into updates to file1_c2, but since the last commit wrote only 5 records, the default value is used.
   ...
   This keeps happening until file1_cn, when the file is eventually rewritten to exceed the small file limit.
   Commit n+1: uses the derived record size instead of the default value. But there are no more small files, so a new file group file2_cn+1 is created.
   Commit n+2: we will keep using commit n’s derived value as long as it’s on the active timeline.
   
   A corner case here: if commit n is archived before we have another commit that grows file2 beyond the small file size, we will fall back to the default value again after we read all the commits in the active timeline. This is why I suggest we add a config, record.size.estimation.threshold, to control what fraction of the small parquet file size we can use for estimation.
   
   Makes sense? (Alternatively, we could look into computing the bloom filter overhead from the configs and subtracting that from totalBytesWritten.. that’s probably better?)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

