honeyaya opened a new pull request, #7255:
URL: https://github.com/apache/hudi/pull/7255
Use the default record size estimate in
averageBytesPerRecord() when the estimation threshold is less than 0
### Change Logs
Currently, Hudi derives the average record size from the records written in
previous commits and uses it to estimate how many records to pack into one
file. The relevant code is UpsertPartitioner.averageBytesPerRecord().
However, we found that a single data file could grow to 600~700M while most
other files stay below 200M.
- Reason
1. When totalBytesWritten/totalRecordsWritten for the last commit is very small
but the records in the next commit are much larger, too many records are
packed into each file and the data files become very large.
- Solution plans
1. Plan 1: calculate avgSize over the past several commits instead of only one.
However, getCommitMetadata is expensive, so this function might become slow,
and we did not choose this plan.
1. Plan 2: use the configured record size estimate, since our record size is
fixed in some sense. Because the Hudi community discourages adding another
boolean config to control whether to use the last commit's avgSize, we reuse
the estimation threshold instead: when it is less than 0, we use the default
record size estimate.
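The behavior described in Plan 2 can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the actual patch: `CommitStats`, the parameter names, and the example values (a 1024-byte default estimate and a 104857600-byte small file limit) are assumptions made for the example, and the real method reads its inputs from the commit timeline and the write config.

```java
import java.util.Arrays;
import java.util.List;

class RecordSizeEstimateSketch {

    /** Hypothetical stand-in for the write stats stored in one commit's metadata. */
    static final class CommitStats {
        final long totalBytesWritten;
        final long totalRecordsWritten;

        CommitStats(long totalBytesWritten, long totalRecordsWritten) {
            this.totalBytesWritten = totalBytesWritten;
            this.totalRecordsWritten = totalRecordsWritten;
        }
    }

    /**
     * Sketch of the proposed logic. Commits are expected newest first.
     * All parameters are illustrative; they are not the real Hudi signature.
     */
    static long averageBytesPerRecord(List<CommitStats> commitsNewestFirst,
                                      long defaultRecordSizeEstimate,
                                      double estimationThreshold,
                                      long smallFileLimitBytes) {
        // Proposed change: a negative threshold skips commit metadata entirely.
        if (estimationThreshold < 0) {
            return defaultRecordSizeEstimate;
        }
        long fileSizeThreshold = (long) (estimationThreshold * smallFileLimitBytes);
        // Walk commits in reverse order until one is large enough to trust.
        for (CommitStats stats : commitsNewestFirst) {
            if (stats.totalBytesWritten > fileSizeThreshold && stats.totalRecordsWritten > 0) {
                return (long) Math.ceil((1.0 * stats.totalBytesWritten) / stats.totalRecordsWritten);
            }
        }
        // No commit qualified, so fall back to the configured estimate.
        return defaultRecordSizeEstimate;
    }

    public static void main(String[] args) {
        List<CommitStats> commits = Arrays.asList(
                new CommitStats(200_000_000L, 4_000_000L),   // newest commit
                new CommitStats(500_000_000L, 1_000_000L));  // older commit
        System.out.println(averageBytesPerRecord(commits, 1024L, 1.0, 104_857_600L));
        System.out.println(averageBytesPerRecord(commits, 1024L, -1.0, 104_857_600L));
    }
}
```

With a non-negative threshold the sketch still trusts the newest sufficiently large commit, so a commit of unusually small records can skew the estimate. With a negative threshold the estimate no longer depends on commit history at all.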
### Impact
Small. Only UpsertPartitioner.averageBytesPerRecord() is affected.
### Risk level (write none, low medium or high below)
Low. The new behavior applies only when the estimation threshold is less than 0.
### Documentation Update
_Describe any necessary documentation update if there is any new feature,
config, or user-facing change_
- _The config description must be updated_
> We use the previous commits' metadata to calculate the estimated record
> size and use it to bin pack records into partitions. If the previous commit is
> too small to make an accurate estimation, Hudi will search commits in the
> reverse order, until we find a commit that has totalBytesWritten larger than
> (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold). Will use
> hoodie.copyonwrite.record.size.estimate value when this value is less than 0.
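As a hedged illustration of how a writer might opt in to this behavior, the sketch below builds a set of write properties. The key `hoodie.record.size.estimation.threshold` is an assumption about the name of the threshold config (only `hoodie.copyonwrite.record.size.estimate` is named in this description), and the helper `fixedRecordSizeProps` is hypothetical.

```java
import java.util.Properties;

class RecordSizeConfigSketch {

    /** Hypothetical helper: properties that force the fixed record size estimate. */
    static Properties fixedRecordSizeProps(long recordSizeBytes) {
        Properties props = new Properties();
        // Assumed key for the estimation threshold; a negative value would
        // select the default estimate under the change proposed here.
        props.setProperty("hoodie.record.size.estimation.threshold", "-1");
        // Key named in the config description above.
        props.setProperty("hoodie.copyonwrite.record.size.estimate", Long.toString(recordSizeBytes));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(fixedRecordSizeProps(2048L));
    }
}
```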
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed