voon created HUDI-8970:
--------------------------
Summary: Improve RunCompaction Procedure does not run for all
pending compactions when op is scheduleAndExecute
Key: HUDI-8970
URL: https://issues.apache.org/jira/browse/HUDI-8970
Project: Apache Hudi
Issue Type: Improvement
Reporter: voon
The current op modes for RunCompactionProcedure are as follows:
# schedule - schedule a new plan
# execute - if specific instants are given, execute them; otherwise execute all
pending plans
# scheduleandexecute - schedule a new plan and then execute it; if no plan is
generated during schedule, execute all pending plans
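The three op modes above can be sketched as a small selection function. This is only an illustrative model, not the actual RunCompactionProcedure code: the function name `plans_to_execute`, its parameters, and the use of sortable instant strings are all assumptions made for the example.

```python
from typing import List, Optional


def plans_to_execute(op: str,
                     pending: List[str],
                     new_plan: Optional[str] = None,
                     instants: Optional[List[str]] = None) -> List[str]:
    """Model which compaction instants get executed under the current semantics.

    pending  -- instants of plans that were already pending (hypothetical input)
    new_plan -- instant produced by the schedule step, or None if no plan was generated
    instants -- instants explicitly requested by the caller (only used by 'execute')
    """
    op = op.lower()
    if op == "schedule":
        # Only a plan is created; nothing is executed.
        return []
    if op == "execute":
        if instants:
            # Assumption: only requested instants that are actually pending run.
            return [i for i in sorted(pending) if i in instants]
        return sorted(pending)
    if op == "scheduleandexecute":
        if new_plan is not None:
            # Current behaviour: only the freshly scheduled plan is executed.
            return [new_plan]
        # No new plan was generated, so fall back to all pending plans.
        return sorted(pending)
    raise ValueError(f"unsupported op: {op}")
```

Note that in this model `scheduleandexecute` ignores older pending plans whenever the schedule step produces a new one.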
While the current implementation holds true to the above specification,
it is not very user-friendly.
There is no option to schedule a new compaction plan and then execute all
pending compaction plans. If a previous `scheduleandexecute` run fails and the
user's scheduled job retries, the retry will generate and execute a new
compaction plan, leaving the earlier plan pending on the table.
Without user intervention, some log files may keep growing, making the
eventual compaction more computationally expensive.
So, this ticket proposes changing the `scheduleandexecute` op to the
following implementation:
# schedule a new plan and then execute *ALL* pending plans; if no plan is
generated during schedule, execute all pending plans
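To make the difference concrete, here is a minimal sketch that replays the failed-run-then-retry scenario under both behaviours. The function names and the timestamp-like instants are hypothetical and exist only for illustration; this is not the procedure's real code.

```python
from typing import List, Optional


def current_scheduleandexecute(pending: List[str], new_plan: Optional[str]) -> List[str]:
    # Current behaviour: run only the new plan, or all pending if none was scheduled.
    return [new_plan] if new_plan is not None else sorted(pending)


def proposed_scheduleandexecute(pending: List[str], new_plan: Optional[str]) -> List[str]:
    # Proposed behaviour: always run every pending plan, including the new one.
    candidates = list(pending)
    if new_plan is not None:
        candidates.append(new_plan)
    return sorted(candidates)


def left_pending(pending: List[str], new_plan: Optional[str], executed: List[str]) -> List[str]:
    # Plans that remain on the timeline after the executed ones complete.
    all_plans = set(pending)
    if new_plan is not None:
        all_plans.add(new_plan)
    return sorted(all_plans - set(executed))


# A previous run scheduled this plan but failed before completing it.
stale = ["20250101000000"]
# The retry schedules a fresh plan.
retry_plan = "20250102000000"

after_current = left_pending(stale, retry_plan, current_scheduleandexecute(stale, retry_plan))
after_proposed = left_pending(stale, retry_plan, proposed_scheduleandexecute(stale, retry_plan))

print("left pending (current): ", after_current)
print("left pending (proposed):", after_proposed)
```

Under these assumptions the current behaviour leaves the stale plan pending after the retry, while the proposed behaviour clears it.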
This would make the RunCompactionProcedure more user-friendly for workflows
that are triggered on a fixed schedule.
If users would like to cap the number of pending plans executed during ad-hoc
runs (for example, at 1), they can still do so by using the LIMIT keyword.
Take note that setting LIMIT to 1 will not execute the compaction plan that
was just scheduled; instead, it will execute the oldest pending plan.
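The LIMIT interaction described above can be illustrated with a short sketch. It assumes that pending plans are ordered oldest-first by instant and that the limit keeps the first N of them; `select_with_limit` and its parameters are hypothetical names, not part of the procedure.

```python
from typing import List, Optional


def select_with_limit(pending: List[str],
                      new_plan: Optional[str] = None,
                      limit: Optional[int] = None) -> List[str]:
    """Pick the plans to execute under the proposed op, oldest first."""
    candidates = list(pending)
    if new_plan is not None:
        candidates.append(new_plan)
    # Assumption: instants sort lexicographically in creation order.
    candidates.sort()
    # With no limit, every pending plan runs; otherwise only the oldest N.
    return candidates if limit is None else candidates[:limit]


older_plans = ["20250101000000", "20250102000000"]
just_scheduled = "20250103000000"

chosen = select_with_limit(older_plans, just_scheduled, limit=1)
print(chosen)
```

In this model the selected plan is the oldest pending one, not the plan that was just scheduled.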