[
https://issues.apache.org/jira/browse/HUDI-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882200#comment-17882200
]
sivabalan narayanan commented on HUDI-7975:
-------------------------------------------
If we characterize the different workloads, they fall into the categories below.
1. pure bulk inserts
2. bulk inserts + clustering.
3. bulk insert once and then insert.
4. bulk insert once and then upserts.
5. inserts.
6. upserts.
- We want to add a marker in the timeline recording that cleaning has been
taken care of up to instant X.
Why don't we make an intelligent guess based on the last few commits?
Say someone configured num_commits-based cleaning and set the config to 25.
We can check the last 50 commits (2x the num_commits config value).
- If all of the latest 50 are bulk inserts without any other operation types,
we can assume it's a purely bulk insert pipeline and skip just the clean
scheduling.
- If we find any write operation other than bulk insert, then we do regular
clean planning.
If the total number of active entries in the timeline is < 50, again, we
trigger regular clean scheduling.
2x is mainly to account for "bulk insert + clustering", because file groups
replaced by clustering are not immediately available to be cleaned up. We
can only clean them after 25 commits (in this context). So, we consider the
last 2X (or 2X + 5) commits to determine whether we really need to trigger
clean scheduling or not.
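
A minimal sketch of the check described above, in Python for brevity. This is
not Hudi code: the function name, the string labels for operation types, and
the idea of passing the active timeline as a plain list are all assumptions
made for illustration.

```python
from typing import Sequence

# Hypothetical label for a bulk insert commit; not a Hudi identifier.
BULK_INSERT = "bulk_insert"


def should_schedule_clean(
    operations: Sequence[str],
    num_commits_retained: int,
    window_multiplier: int = 2,
) -> bool:
    """Guess whether clean scheduling is worth triggering.

    `operations` holds the operation type of each commit in the active
    timeline, oldest first. `num_commits_retained` stands in for the
    num_commits cleaning config (25 in the example above).
    """
    if num_commits_retained <= 0:
        raise ValueError("num_commits_retained must be positive")

    window = window_multiplier * num_commits_retained

    # Fewer active entries than the window: fall back to regular scheduling.
    if len(operations) < window:
        return True

    # Skip scheduling only if every commit in the window is a bulk insert.
    recent = operations[-window:]
    return any(op != BULK_INSERT for op in recent)
```

With the config at 25, a timeline of 50 or more bulk inserts and nothing else
skips scheduling, while a single upsert or clustering commit inside the last
50 brings regular clean planning back.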
So, the above logic will pan out as below for the 6 scenarios above.
1: For the first 50 commits, clean planning will kick in as usual. After that,
no clean scheduling will trigger.
2: For the first 50 commits, clean planning will kick in as usual, and then an
actual clean might be seen for the clustering commit. Once we have a clean in
the timeline, the incremental cleaner will hold on to the boundary. Depending
on the cadence of clustering, cleans will be added to the timeline.
3,4,5,6: Regular cleans will happen.
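
To see how scenarios 1 and 2 play out, here is a rough simulation in Python.
It is only a sketch under simplifying assumptions: the timeline is a growing
list that is never archived, the check runs after every commit, the
incremental cleaner is not modeled, and the names `plans_clean` and `simulate`
and the operation labels are hypothetical.

```python
from typing import List


def plans_clean(timeline: List[str], retained: int = 25) -> bool:
    """True if regular clean planning would run for this timeline."""
    window = 2 * retained
    if len(timeline) < window:
        return True
    return any(op != "bulk_insert" for op in timeline[-window:])


def simulate(workload: List[str], retained: int = 25) -> List[bool]:
    """Apply the check after each commit; entry i is for i + 1 commits."""
    timeline: List[str] = []
    decisions: List[bool] = []
    for op in workload:
        timeline.append(op)
        decisions.append(plans_clean(timeline, retained))
    return decisions


def with_clustering(total: int, every: int) -> List[str]:
    """Bulk inserts with a clustering commit at every `every`-th position."""
    return [
        "cluster" if i % every == 0 else "bulk_insert"
        for i in range(1, total + 1)
    ]


# Scenario 1: pure bulk inserts.
pure_bulk = simulate(["bulk_insert"] * 80)

# Scenario 2: clustering more often than the 50-commit window.
frequent = simulate(with_clustering(160, every=30))

# Scenario 2: clustering less often than the 50-commit window.
sparse = simulate(with_clustering(160, every=100))

if __name__ == "__main__":
    print("first skip (pure bulk):", pure_bulk.index(False) + 1)
    print("any skip (clustering every 30):", not all(frequent))
    print("skipped commits (clustering every 100):", sparse.count(False))
```

In this model, clustering that recurs within the window keeps clean planning
on for every commit, while sparse clustering turns it back on only while the
clustering commit is still inside the last 50.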
> Transfer extrametada to new commits when new data is not ingeested to trigger
> table services on the dataset
> -----------------------------------------------------------------------------------------------------------
>
> Key: HUDI-7975
> URL: https://issues.apache.org/jira/browse/HUDI-7975
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Surya Prasanna Yalla
> Assignee: Surya Prasanna Yalla
> Priority: Major
> Labels: pull-request-available
>