vinothchandar commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r435631043
##########
File path:
hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
##########
@@ -1041,10 +1041,12 @@ private HoodieWriteConfig getSmallInsertWriteConfig(int insertSplitSize, boolean
     HoodieWriteConfig.Builder builder = getConfigBuilder(useNullSchema ? NULL_SCHEMA : TRIP_EXAMPLE_SCHEMA);
     return builder
         .withCompactionConfig(
-            HoodieCompactionConfig.newBuilder().compactionSmallFileSize(HoodieTestDataGenerator.SIZE_PER_RECORD * 15)
-                .insertSplitSize(insertSplitSize).build()) // tolerate upto 15 records
+            HoodieCompactionConfig.newBuilder()
+                .compactionSmallFileSize(HoodieTestDataGenerator.BYTES_PER_RECORD * 150 + HoodieTestDataGenerator.BLOOM_FILTER_BYTES)
Review comment:
Can we keep a single variable in HoodieTestDataGenerator that sums this up? This code really need not be aware of how that value is computed.
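For example, something along these lines (a sketch only; the summed constant's name and the bloom filter size are illustrative, not from the PR):

```java
// Hypothetical sketch: fold the per-record size and the bloom filter
// overhead into one constant inside HoodieTestDataGenerator, so callers
// just reference the summed value.
public class HoodieTestDataGenerator {
  // values mirroring the diff above; BLOOM_FILTER_BYTES is an assumed value
  public static final int BYTES_PER_RECORD = (int) (1.2 * 1024);
  public static final int BLOOM_FILTER_BYTES = 1024; // illustrative

  // single summed variable this comment asks for (name is illustrative)
  public static final int SMALL_FILE_SIZE_FOR_150_RECORDS =
      BYTES_PER_RECORD * 150 + BLOOM_FILTER_BYTES;
}
```

The test above would then call `compactionSmallFileSize(HoodieTestDataGenerator.SMALL_FILE_SIZE_FOR_150_RECORDS)` without knowing the breakdown.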
##########
File path:
hudi-client/src/test/java/org/apache/hudi/testutils/HoodieTestDataGenerator.java
##########
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
   // based on examination of sample file, the schema produces the following per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int BYTES_PER_RECORD = (int) (1.2 * 1024);
Review comment:
Why did the size shrink that much? Was this wrong?
##########
File path:
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -301,7 +301,7 @@ protected static long averageBytesPerRecord(HoodieTimeline commitTimeline, int d
         .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
     long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
     long totalRecordsWritten = commitMetadata.fetchTotalRecordsWritten();
-    if (totalBytesWritten > 0 && totalRecordsWritten > 0) {
+    if (totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit() && totalRecordsWritten > 0) {
Review comment:
So the main change here is that the code will keep looking for a commit that wrote at least smallFileLimit bytes, so that we get a better estimate? Wonder if this will have any side effects. For e.g. if there are a bunch of small commits like the ones you are referring to, then it means we may fall back to the default value a lot more often?
What do you think about introducing another config here that controls this threshold as a fraction of parquetSmallFileLimit? I feel we can default it to 0.5, as opposed to the 1.0 you have now.
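To sketch what that might look like (simplified, standalone version of the estimation loop; the method signature, arrays, and config name here are illustrative, not the actual Hudi API):

```java
// Hedged sketch of the suggested fraction-based threshold: skip commits
// whose written bytes fall below smallFileLimit * fraction when estimating
// the average record size, instead of comparing against the full limit.
public class RecordSizeEstimate {
  static final long DEFAULT_AVG_RECORD_SIZE = 1024; // fallback, illustrative

  // bytesWritten/recordsWritten are per-commit totals, oldest first;
  // we scan latest-first, like the timeline walk in UpsertPartitioner.
  public static long averageBytesPerRecord(long[] bytesWritten, long[] recordsWritten,
                                           long smallFileLimit, double fraction) {
    long threshold = (long) (smallFileLimit * fraction);
    for (int i = bytesWritten.length - 1; i >= 0; i--) {
      if (bytesWritten[i] > threshold && recordsWritten[i] > 0) {
        return bytesWritten[i] / recordsWritten[i];
      }
    }
    return DEFAULT_AVG_RECORD_SIZE; // no commit large enough: fall back
  }
}
```

With fraction = 0.5, a run of small commits is still skipped, but a moderately sized earlier commit can contribute an estimate before we give up and use the default.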
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]