vinothchandar commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r435631043
##########
File path:
hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
##########
@@ -1041,10 +1041,12 @@ private HoodieWriteConfig getSmallInsertWriteConfig(int insertSplitSize, boolean
     HoodieWriteConfig.Builder builder = getConfigBuilder(useNullSchema ? NULL_SCHEMA : TRIP_EXAMPLE_SCHEMA);
     return builder
         .withCompactionConfig(
-            HoodieCompactionConfig.newBuilder().compactionSmallFileSize(HoodieTestDataGenerator.SIZE_PER_RECORD * 15)
-                .insertSplitSize(insertSplitSize).build()) // tolerate upto 15 records
+            HoodieCompactionConfig.newBuilder()
+                .compactionSmallFileSize(HoodieTestDataGenerator.BYTES_PER_RECORD * 150 + HoodieTestDataGenerator.BLOOM_FILTER_BYTES)
Review comment:
Can we keep a single variable in HoodieTestDataGenerator that sums this up? This code really need not be aware of how that value is computed.
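For example, something along these lines (a sketch only; the summed constant's name and the bloom filter size are illustrative, not from the PR):

```java
// Hypothetical sketch: fold the per-record size and the bloom filter
// overhead into one constant inside HoodieTestDataGenerator, so callers
// just reference the summed value.
public class HoodieTestDataGenerator {
  // values mirroring the diff above; BLOOM_FILTER_BYTES is an assumed value
  public static final int BYTES_PER_RECORD = (int) (1.2 * 1024);
  public static final int BLOOM_FILTER_BYTES = 1024; // illustrative

  // single summed variable this comment asks for (name is illustrative)
  public static final int SMALL_FILE_SIZE_FOR_150_RECORDS =
      BYTES_PER_RECORD * 150 + BLOOM_FILTER_BYTES;
}
```

The test above would then call `compactionSmallFileSize(HoodieTestDataGenerator.SMALL_FILE_SIZE_FOR_150_RECORDS)` without knowing the breakdown.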
##########
File path:
hudi-client/src/test/java/org/apache/hudi/testutils/HoodieTestDataGenerator.java
##########
@@ -70,7 +70,9 @@
 public class HoodieTestDataGenerator {
   // based on examination of sample file, the schema produces the following per record size
-  public static final int SIZE_PER_RECORD = 50 * 1024;
+  public static final int BYTES_PER_RECORD = (int) (1.2 * 1024);
Review comment:
Why did the size shrink that much? Was this wrong?
##########
File path:
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -301,7 +301,7 @@ protected static long averageBytesPerRecord(HoodieTimeline commitTimeline, int d
         .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
     long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
     long totalRecordsWritten = commitMetadata.fetchTotalRecordsWritten();
-    if (totalBytesWritten > 0 && totalRecordsWritten > 0) {
+    if (totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit() && totalRecordsWritten > 0) {
Review comment:
So the main change here is that the code will keep looking for a commit that wrote at least smallFileLimit bytes, so that we get a better estimate? Wonder if this will have any side effects. For e.g. if there are a bunch of small commits like the ones you are referring to, then it means we may fall back to the default value a lot more often?
What do you think about introducing another config here that controls this threshold as a fraction of parquetSmallFileLimit? I feel we can default it to 0.5, as opposed to the 1.0 you have now.
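To sketch what that might look like (simplified, standalone version of the estimation loop; the method signature, arrays, and config name here are illustrative, not the actual Hudi API):

```java
// Hedged sketch of the suggested fraction-based threshold: skip commits
// whose written bytes fall below smallFileLimit * fraction when estimating
// the average record size, instead of comparing against the full limit.
public class RecordSizeEstimate {
  static final long DEFAULT_AVG_RECORD_SIZE = 1024; // fallback, illustrative

  // bytesWritten/recordsWritten are per-commit totals, oldest first;
  // we scan latest-first, like the timeline walk in UpsertPartitioner.
  public static long averageBytesPerRecord(long[] bytesWritten, long[] recordsWritten,
                                           long smallFileLimit, double fraction) {
    long threshold = (long) (smallFileLimit * fraction);
    for (int i = bytesWritten.length - 1; i >= 0; i--) {
      if (bytesWritten[i] > threshold && recordsWritten[i] > 0) {
        return bytesWritten[i] / recordsWritten[i];
      }
    }
    return DEFAULT_AVG_RECORD_SIZE; // no commit large enough: fall back
  }
}
```

With fraction = 0.5, a run of small commits is still skipped, but a moderately sized earlier commit can contribute an estimate before we give up and use the default.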
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]