alexeykudinkin commented on code in PR #7362:
URL: https://github.com/apache/hudi/pull/7362#discussion_r1042915969
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ "record size estimate compute dynamically based on commit
metadata. "
+ " This is critical in computing the insert parallelism and
bin-packing inserts into small files.");
+ public static final ConfigProperty<String>
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
Review Comment:
Let's also move these configs under HoodieWriteConfig
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##########
@@ -419,4 +427,23 @@ public Partitioner getLayoutPartitioner(WorkloadProfile profile, String layoutPa
  protected void runPrecommitValidators(HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata) {
    SparkValidatorUtils.runValidators(config, writeMetadata, context, table, instantTime);
  }
+
+ private int dynamicSampleRecordSize(JavaRDD<HoodieRecord<T>> inputRecords) {
+ int dynamicSampleRecordSize = config.getCopyOnWriteRecordSizeEstimate();
+ long inputRecordsCount = inputRecords.count();
Review Comment:
We can't invoke `count()` here -- this will trigger execution of the whole RDD
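
One alternative, as a sketch only (this is not the PR's implementation; `RecordSizeSampler`, `estimateAvgRecordSize`, and the size function are illustrative names): average the serialized sizes over a bounded sample obtained via `JavaRDD.take(maxNum)`, which evaluates only as many partitions as needed, instead of forcing a full pass with `count()`:

```java
// Hedged sketch: estimate average record size from a small,
// already-collected sample rather than scanning the whole dataset.
import java.util.List;
import java.util.function.ToLongFunction;

public class RecordSizeSampler {

  // Average the sizes of the sampled records; fall back to the
  // configured static estimate when no sample is available.
  static <T> long estimateAvgRecordSize(List<T> sample,
                                        ToLongFunction<T> sizeOf,
                                        long fallbackEstimate) {
    if (sample.isEmpty()) {
      return fallbackEstimate;
    }
    long totalBytes = 0;
    for (T record : sample) {
      totalBytes += sizeOf.applyAsLong(record);
    }
    return Math.max(1, totalBytes / sample.size());
  }
}
```

In the executor above, something like `inputRecords.take(maxNum)` could supply the sample without materializing the full RDD.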
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ "record size estimate compute dynamically based on commit
metadata. "
+ " This is critical in computing the insert parallelism and
bin-packing inserts into small files.");
+ public static final ConfigProperty<String>
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
Review Comment:
I see that you're following a pattern there, but I don't think we should use the
COPY_ON_WRITE prefix, since there's no reason to limit this config to just COW
(we can reuse it for similar purposes for MOR as well)
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ "record size estimate compute dynamically based on commit
metadata. "
+ " This is critical in computing the insert parallelism and
bin-packing inserts into small files.");
+ public static final ConfigProperty<String>
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
+ .key("hoodie.copyonwrite.record.dynamic.sample.maxnum")
+ .defaultValue(String.valueOf(100))
+ .withDocumentation("Although dynamic sampling is adopted, if the
record size assumed by the user is unreasonable during the first write
execution, "
Review Comment:
I think we need to revisit this description to make it clearer:
```
Controls the absolute value of the maximum number of records that will have to
be sampled to determine the average record size
```
Let's also make sure it's clear that the record size config, this one, and the
sampling ratio one are mutually exclusive
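
For illustration only (the final wording is up to the PR author; the key and default are unchanged from the PR), the declaration could read:

```java
// Sketch of a revised documentation string, not the PR's actual code.
public static final ConfigProperty<String> COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
    .key("hoodie.copyonwrite.record.dynamic.sample.maxnum")
    .defaultValue(String.valueOf(100))
    .withDocumentation("Controls the absolute maximum number of records that will be sampled "
        + "to determine the average record size. Note: this config, the static record size "
        + "estimate config, and the sampling ratio config are mutually exclusive.");
```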
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -179,6 +179,20 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ "record size estimate compute dynamically based on commit
metadata. "
+ " This is critical in computing the insert parallelism and
bin-packing inserts into small files.");
+ public static final ConfigProperty<String>
COPY_ON_WRITE_RECORD_DYNAMIC_SAMPLE_MAXNUM = ConfigProperty
+ .key("hoodie.copyonwrite.record.dynamic.sample.maxnum")
Review Comment:
Let's make sure the names of the configs related to record size estimation share
a common prefix:
`hoodie.record.size.dynamic.sampling.max`
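
Applying the suggested prefix, the declaration might look like the following sketch (the constant name `RECORD_SIZE_DYNAMIC_SAMPLING_MAX` and the documentation text are assumed, not from the PR):

```java
// Illustrative renaming only; final key and wording are up to the PR author.
public static final ConfigProperty<String> RECORD_SIZE_DYNAMIC_SAMPLING_MAX = ConfigProperty
    .key("hoodie.record.size.dynamic.sampling.max")
    .defaultValue(String.valueOf(100))
    .withDocumentation("Controls the absolute maximum number of records that will be sampled "
        + "to determine the average record size.");
```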
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]