garyli1019 commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r436089449
##########
File path:
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -301,7 +301,7 @@ protected static long averageBytesPerRecord(HoodieTimeline commitTimeline, int d
         .fromBytes(commitTimeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
     long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
     long totalRecordsWritten = commitMetadata.fetchTotalRecordsWritten();
-    if (totalBytesWritten > 0 && totalRecordsWritten > 0) {
+    if (totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit() && totalRecordsWritten > 0) {
Review comment:
Got your point. Making a small commit was already an edge case in my mind, so
I didn't think of continuously making small commits. But I agree we can use a
new config to have that flexibility.
We have discussed subtracting the bloom filter size before, but preferred not
to do it, IIUC. Major reasons:
> * Even if we deduct the size of the bloom filter, there will be other
metadata, so `totalWriteBytes` still does not represent the total record size.
When the situation we discussed above happens, it is possible that small files
will still be produced.
> * It would increase the complexity when we handle other indexing, like
HBase indexing.
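For context, the estimate under discussion can be sketched as a simplified, standalone version (not the actual Hudi class; `CommitStats` and the method parameters here are hypothetical stand-ins for `HoodieCommitMetadata`, `hoodieWriteConfig.getParquetSmallFileLimit()`, and the configured default record size):

```java
import java.util.List;

public class AvgRecordSizeSketch {

  // Hypothetical stand-in for the totals fetched from HoodieCommitMetadata.
  static class CommitStats {
    final long totalBytesWritten;
    final long totalRecordsWritten;

    CommitStats(long totalBytesWritten, long totalRecordsWritten) {
      this.totalBytesWritten = totalBytesWritten;
      this.totalRecordsWritten = totalRecordsWritten;
    }
  }

  /**
   * Estimate average bytes per record from the most recent commit whose
   * written bytes exceed the small-file threshold. Commits below the
   * threshold are skipped, so that bloom-filter and other file metadata
   * overhead in tiny files does not dominate the estimate.
   */
  static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                    long smallFileLimitBytes,
                                    long defaultEstimate) {
    for (CommitStats c : commitsNewestFirst) {
      // Mirrors the patched condition: bytes must exceed the small-file
      // limit (not merely be > 0) before the commit is trusted.
      if (c.totalBytesWritten > smallFileLimitBytes && c.totalRecordsWritten > 0) {
        return (long) Math.ceil((double) c.totalBytesWritten / c.totalRecordsWritten);
      }
    }
    // No sufficiently large commit found: fall back to the default estimate.
    return defaultEstimate;
  }
}
```

With this guard, a stream of small commits (e.g. a 50 KB commit against a 100 MB limit) no longer skews the estimate; the iteration falls through to an older, larger commit or to the default.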
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]