alexeykudinkin commented on code in PR #7362:
URL: https://github.com/apache/hudi/pull/7362#discussion_r1042915969


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "record size estimate compute dynamically based on commit 
metadata. "
           + " This is critical in computing the insert parallelism and 
bin-packing inserts into small files.");
 
+  public static final ConfigProperty<String> 
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty

Review Comment:
   Let's also move these configs under HoodieWriteConfig



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##########
@@ -419,4 +427,23 @@ public Partitioner getLayoutPartitioner(WorkloadProfile 
profile, String layoutPa
   protected void 
runPrecommitValidators(HoodieWriteMetadata<HoodieData<WriteStatus>> 
writeMetadata) {
     SparkValidatorUtils.runValidators(config, writeMetadata, context, table, 
instantTime);
   }
+
+  private int dynamicSampleRecordSize(JavaRDD<HoodieRecord<T>> inputRecords) {
+    int dynamicSampleRecordSize = config.getCopyOnWriteRecordSizeEstimate();
+    long inputRecordsCount = inputRecords.count();

Review Comment:
   We can't invoke count here -- this will execute the whole RDD 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "record size estimate compute dynamically based on commit 
metadata. "
           + " This is critical in computing the insert parallelism and 
bin-packing inserts into small files.");
 
+  public static final ConfigProperty<String> 
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty

Review Comment:
   I see that you're following a pattern there, but i don't think we should 
prefix COPY_ON_WRITE since there's no reason for us to limit usage of this 
config to just COW (we can re-use it for similar purposes for MOR as well)



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "record size estimate compute dynamically based on commit 
metadata. "
           + " This is critical in computing the insert parallelism and 
bin-packing inserts into small files.");
 
+  public static final ConfigProperty<String> 
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
+          .key("hoodie.copyonwrite.record.dynamic.sample.maxnum")
+          .defaultValue(String.valueOf(100))
+          .withDocumentation("Although dynamic sampling is adopted, if the 
record size assumed by the user is unreasonable during the first write 
execution, "

Review Comment:
   I think we'd need to revisit this description to make it more clear:
   ```
   Controls absolute value of the maximum number of records that will have to 
be sample to determine average record size
   ```
   
   Let's also make sure it's clear that Record size config, this one and 
Sampling ratio ones are mutually exclusive



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "record size estimate compute dynamically based on commit 
metadata. "
           + " This is critical in computing the insert parallelism and 
bin-packing inserts into small files.");
 
+  public static final ConfigProperty<String> 
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
+          .key("hoodie.copyonwrite.record.dynamic.sample.maxnum")

Review Comment:
   Let's make sure names of the ones related to record size estimation do share 
the prefix: 
   `hoodie.record.size.dynamic.sampling.max`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to