[
https://issues.apache.org/jira/browse/HUDI-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon reassigned HUDI-8970:
--------------------------
Assignee: voon
> Improve RunCompaction Procedure does not run for all pending compactions when
> op is scheduleAndExecute
> ------------------------------------------------------------------------------------------------------
>
> Key: HUDI-8970
> URL: https://issues.apache.org/jira/browse/HUDI-8970
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: voon
> Assignee: voon
> Priority: Major
>
> The current op modes for RunCompactionProcedure are as follows:
>
> # schedule - schedule a new plan
> # execute - if specific instants are given, execute them; otherwise execute
> all pending plans
> # scheduleandexecute - schedule a new plan and then execute it; if no plan
> is generated during schedule, execute all pending plans
>
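The three op modes listed above can be modelled as a small selection function. This is only a toy sketch of the behavior as this ticket describes it, not Hudi code: `select_instants`, `pending`, `new_plan`, and `requested` are hypothetical names, and the instant values are made-up timestamps that are assumed to sort chronologically as strings.

```python
# Toy model of the op modes described in this ticket; NOT Hudi code.
# All names (select_instants, pending, new_plan, requested) are hypothetical.
from typing import List, Optional


def select_instants(op: str,
                    pending: List[str],
                    new_plan: Optional[str] = None,
                    requested: Optional[List[str]] = None) -> List[str]:
    """Return the compaction instants that would be executed.

    pending   -- instants of plans that were already pending before this call
    new_plan  -- instant produced by the schedule step, or None if none was generated
    requested -- specific instants passed by the user (used by 'execute' only)
    """
    op = op.lower()
    if op == "schedule":
        return []                     # schedules only, executes nothing
    if op == "execute":
        if requested:
            # execute only the requested instants that are actually pending
            return sorted(i for i in pending if i in requested)
        return sorted(pending)        # no instants given: all pending plans
    if op == "scheduleandexecute":
        if new_plan is not None:
            return [new_plan]         # current behavior: only the new plan
        return sorted(pending)        # nothing scheduled: all pending plans
    raise ValueError(f"unknown op: {op}")


# Two plans are already pending and the schedule step produces a third.
older = ["20250101090000", "20250101100000"]
print(select_instants("scheduleandexecute", older, new_plan="20250101110000"))
# prints ['20250101110000']; the two older plans stay pending
```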
> While the current implementation holds true to the above specification, it
> is not very user-friendly.
>
> There is no option to schedule a new compaction plan and then execute all
> pending compaction plans. If a previous `scheduleandexecute` run fails and
> the user's scheduled job retries, the retry generates a new compaction plan
> and executes only that plan, leaving the earlier pending compaction plan on
> the table.
>
> Without user intervention, some log files may keep growing unchecked,
> making the eventual compaction more computationally expensive.
>
> This ticket therefore proposes changing the `scheduleandexecute` op to the
> following behavior:
> # schedule a new plan and then execute {color:#de350b}*ALL*{color} pending
> plans; if no plan is generated during schedule, execute all pending plans
>
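To make the difference concrete, the retry scenario can be walked through with a toy model that compares the current and the proposed `scheduleandexecute` selection. This is a sketch under assumptions, not the actual procedure code: both function names are hypothetical, and the instants are made-up timestamps assumed to sort chronologically as strings.

```python
# Toy comparison of current vs. proposed scheduleandexecute; NOT Hudi code.
# Function names and instant values are hypothetical.
from typing import List, Optional


def current_schedule_and_execute(pending: List[str],
                                 new_plan: Optional[str]) -> List[str]:
    """Current behavior: run only the new plan, or all pending if none was scheduled."""
    if new_plan is not None:
        return [new_plan]
    return sorted(pending)


def proposed_schedule_and_execute(pending: List[str],
                                  new_plan: Optional[str]) -> List[str]:
    """Proposed behavior: run every pending plan, including the new one if any."""
    candidates = list(pending)
    if new_plan is not None:
        candidates.append(new_plan)
    return sorted(candidates)


# Run 1 scheduled a plan and then failed, so that plan is still pending.
stale = ["20250101090000"]
# Run 2 (the retry) schedules a fresh plan.
retry_plan = "20250101100000"

all_plans = set(stale + [retry_plan])
left_current = sorted(all_plans - set(current_schedule_and_execute(stale, retry_plan)))
left_proposed = sorted(all_plans - set(proposed_schedule_and_execute(stale, retry_plan)))

print(left_current)   # ['20250101090000'] -> the stale plan is left behind
print(left_proposed)  # [] -> nothing is left pending
```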
> This would make the RunCompactionProcedure more user-friendly for workflows
> that run on a fixed schedule.
>
> If users want to cap the number of pending plans executed during ad-hoc
> runs (for example, at 1), they can still do so by using the LIMIT keyword.
>
> Note that limiting to 1 will not execute the compaction that was just
> scheduled; instead, it will execute the oldest pending plan.
>
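The interaction between LIMIT and plan ordering can be illustrated with a short sketch. It assumes, as stated above, that pending plans are taken oldest first; `pick_with_limit` is a hypothetical helper, the instants are made-up timestamps assumed to sort chronologically as strings, and the actual SQL syntax for LIMIT is not shown here.

```python
# Toy illustration of oldest-first selection under a limit; NOT Hudi code.
# pick_with_limit and the instant values are hypothetical.
from typing import List, Optional


def pick_with_limit(pending: List[str], limit: Optional[int] = None) -> List[str]:
    """Order pending plans oldest first and keep at most `limit` of them."""
    if limit is not None and limit < 0:
        raise ValueError("limit must be non-negative")
    ordered = sorted(pending)          # oldest instant first
    return ordered if limit is None else ordered[:limit]


already_pending = ["20250101090000", "20250101100000"]
just_scheduled = "20250101110000"

chosen = pick_with_limit(already_pending + [just_scheduled], limit=1)
print(chosen)  # ['20250101090000'] -> the oldest plan, not the one just scheduled
```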
--
This message was sent by Atlassian Jira
(v8.20.10#820010)