vinothchandar commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r435946857
##########
File path:
hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -301,7 +301,7 @@ protected static long averageBytesPerRecord(HoodieTimeline
commitTimeline, int d
.fromBytes(commitTimeline.getInstantDetails(instant).get(),
HoodieCommitMetadata.class);
long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
long totalRecordsWritten = commitMetadata.fetchTotalRecordsWritten();
- if (totalBytesWritten > 0 && totalRecordsWritten > 0) {
+ if (totalBytesWritten > hoodieWriteConfig.getParquetSmallFileLimit()
&& totalRecordsWritten > 0) {
Review comment:
I understand the issue you are solving. It's real, and we are on the
same page there.
> If there is at least one file in the timeline larger than a small file
then it should be not using the default value
Actually, the condition is more like: at least one commit with total bytes
written greater than the small file limit. Let's walk through how this may play
out with an example. Say the table starts empty and we keep making small
commits of 5 records each.
Commit 1: no candidates for small files. Inserts get grouped into
(100k default) record buckets and written to file1_c1.
Commit 2: the previous file is a small file; inserts are turned into updates
to file1_c2, but since the last commit wrote only 5 records, it will use the
default value.
...
This keeps happening until file1_cn is eventually rewritten to be larger than
the small file limit.
Commit n+1: will use the derived record size instead of the default value. But
there are no more small files, so a new file group file2_cn+1 is created.
Commit n+2: we will keep using commit n's derived value as long as it's on
the active timeline.
A corner case here: if commit n is archived before another commit grows
file2 beyond the small file size, we will fall back to the default value
again, after we read all the commits in the active timeline.
This is why I suggest we add a config, record.size.estimation.threshold, to
control what fraction of the small parquet file size we can use for estimation.
Makes sense? (Alternatively, we could look into computing the bloom
filter overhead from configs and subtracting that from totalBytesWritten..
it's probably better?)
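A rough sketch of that alternative, where the bloom filter sizing formula, the parameter names, and the one-filter-per-file assumption are all illustrative, not taken from Hudi:

```java
// Hypothetical sketch: subtract estimated bloom filter overhead from
// totalBytesWritten before deriving bytes per record.
public class BloomAdjustedEstimator {

    /** Approximate bloom filter size in bytes for n entries at false-positive rate p. */
    static long bloomFilterSizeBytes(long numEntries, double fpp) {
        // Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
        double bits = -numEntries * Math.log(fpp) / (Math.log(2) * Math.log(2));
        return (long) Math.ceil(bits / 8.0);
    }

    static long averageBytesPerRecord(long totalBytesWritten, long totalRecordsWritten,
                                      int filesWritten, long bloomEntries, double bloomFpp) {
        // Assume each written file carries one bloom filter; subtract that overhead first.
        long overhead = (long) filesWritten * bloomFilterSizeBytes(bloomEntries, bloomFpp);
        long payload = Math.max(0, totalBytesWritten - overhead);
        return totalRecordsWritten > 0 ? payload / totalRecordsWritten : 0;
    }
}
```

The appeal is that for very small commits the fixed per-file overhead dominates totalBytesWritten, so removing it first could make even small commits usable for estimation.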
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]