scxwhite opened a new pull request #4400:
URL: https://github.com/apache/hudi/pull/4400
Brief change log
- Improve compaction
I found that when the compaction plan is generated, the delta log files under
each file group are arranged in the natural (ascending) order of instant time.
In the majority of cases, we can assume that the latest data is in the latest
delta log file, so we sort the files by instant time in descending order
instead. This largely avoids rewriting data during compaction and therefore
reduces compaction time.
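As a rough illustration of the reordering described above, the following sketch sorts a file group's delta log files newest first. `DeltaLog` and `newestFirst` are hypothetical names used only for this example, not Hudi classes or methods, and the sketch assumes instant times are equal-length timestamp strings, so that string order matches chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LogFileOrdering {
    // Hypothetical stand-in for a delta log file (not a Hudi class).
    static final class DeltaLog {
        final String instantTime;
        final String path;

        DeltaLog(String instantTime, String path) {
            this.instantTime = instantTime;
            this.path = path;
        }
    }

    // Returns a copy of the list sorted by instant time, newest first.
    // Assumes equal-length timestamp strings, so lexicographic order
    // is the same as chronological order.
    static List<DeltaLog> newestFirst(List<DeltaLog> logs) {
        List<DeltaLog> sorted = new ArrayList<>(logs);
        sorted.sort(Comparator.comparing((DeltaLog l) -> l.instantTime).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<DeltaLog> logs = Arrays.asList(
            new DeltaLog("20211201100000", "log.1"),
            new DeltaLog("20211203100000", "log.3"),
            new DeltaLog("20211202100000", "log.2"));
        for (DeltaLog l : newestFirst(logs)) {
            System.out.println(l.instantTime + " " + l.path);
        }
    }
}
```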
In addition, when reading a delta log file, we compare each record already in
the external spillable map with the incoming delta log record. If the old
record is selected, there is no need to write it back to the external spillable
map. Rewriting data wastes a lot of resources once data has spilled to disk.
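The effect of skipping the write can be sketched as follows. This is a minimal model rather than the code in this pull request: `MergeSketch`, `Rec`, and `countWrites` are hypothetical names, a plain `HashMap` stands in for the external spillable map, the record with the larger ordering value is assumed to win, and ties are assumed to keep the existing record.

```java
import java.util.HashMap;
import java.util.Map;

final class MergeSketch {
    // Hypothetical record: a key, an ordering value, and a payload.
    static final class Rec {
        final String key;
        final long orderingVal;
        final String payload;

        Rec(String key, long orderingVal, String payload) {
            this.key = key;
            this.orderingVal = orderingVal;
            this.payload = payload;
        }
    }

    // HashMap stands in for the external spillable map (assumption).
    final Map<String, Rec> records = new HashMap<>();
    int writes = 0; // number of puts, i.e. potential writes to disk

    void process(Rec incoming) {
        Rec existing = records.get(incoming.key);
        // Assumption: the incoming record wins only with a strictly larger ordering value.
        if (existing == null || incoming.orderingVal > existing.orderingVal) {
            records.put(incoming.key, incoming);
            writes++;
        }
        // Otherwise the old record is kept and nothing is written back.
    }

    // Feeds records for a single key in the given order and returns the write count.
    static int countWrites(long... orderingValues) {
        MergeSketch m = new MergeSketch();
        for (long v : orderingValues) {
            m.process(new Rec("k1", v, "v" + v));
        }
        return m.writes;
    }

    public static void main(String[] args) {
        System.out.println(countWrites(1, 2, 3)); // oldest first: prints 3
        System.out.println(countWrites(3, 2, 1)); // newest first: prints 1
    }
}
```

In this model, reading the newest log file first means every older version of a key loses the comparison, so the two changes reinforce each other: one write per key instead of one per version.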
This pull request is already covered by existing tests.
Committer checklist
- [*] Has a corresponding
[JIRA](https://issues.apache.org/jira/browse/HUDI-3069) in PR title & commit
- [*] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.