[
https://issues.apache.org/jira/browse/HUDI-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon updated HUDI-8970:
-----------------------
Description:
The current op modes for RunCompactionProcedure are as follows:
# schedule - schedule a new plan
# execute - if specific instants exist, execute them, otherwise execute all
pending plans
# scheduleandexecute - schedule a new plan and then execute it; if no plan is
generated during scheduling, execute all pending plans
While the current implementation adheres to the above specification, it is not
very user-friendly.
There is no option to schedule a new compaction plan and then execute all
pending compaction plans. If a previous `scheduleandexecute` run fails and the
user's job retries, the retry generates a new compaction plan and executes only
that plan, leaving the earlier plan pending on the table.
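
The op modes and the retry scenario described above can be illustrated with a small toy model. This is only a sketch of the behavior as worded in this ticket, not Hudi code: the function name `run_compaction_current`, its parameters, the `fail` flag, and the instant values "001" and "002" are all hypothetical.

```python
from typing import List, Optional


def run_compaction_current(
    op: str,
    pending: List[str],
    new_plan: Optional[str] = None,
    instants: Optional[List[str]] = None,
    fail: bool = False,
) -> List[str]:
    """Toy model of the current op modes (hypothetical, not the Hudi API).

    pending  -- pending plan instants, oldest first (mutated in place)
    new_plan -- instant that scheduling would produce, or None if no plan
    instants -- specific instants to execute (only used by 'execute')
    fail     -- simulate an execution failure: targeted plans stay pending
    Returns the instants that were executed successfully.
    """
    if op == "schedule":
        if new_plan is not None:
            pending.append(new_plan)
        return []
    if op == "execute":
        # Execute the requested instants if given, otherwise all pending plans.
        targets = [i for i in instants if i in pending] if instants else list(pending)
    elif op == "scheduleandexecute":
        if new_plan is not None:
            # Current behavior: only the newly scheduled plan is executed.
            pending.append(new_plan)
            targets = [new_plan]
        else:
            targets = list(pending)
    else:
        raise ValueError(f"unknown op: {op}")
    if fail:
        return []
    for instant in targets:
        pending.remove(instant)
    return targets


pending: List[str] = []
# First run schedules plan "001" but fails while executing it.
first = run_compaction_current("scheduleandexecute", pending, new_plan="001", fail=True)
# The retry schedules "002" and executes only that plan.
retry = run_compaction_current("scheduleandexecute", pending, new_plan="002")
print(first, retry, pending)  # [] ['002'] ['001']
```

In this model, plan "001" stays pending after the retry, which is the stranded-plan situation the ticket describes.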
Without user intervention, some log files may keep growing, making compaction
more computationally expensive.
This ticket therefore proposes changing the `scheduleandexecute` op to the
following behavior:
# schedule a new plan and then execute {color:#de350b}*ALL*{color} pending
plans; if no plan is generated during scheduling, execute all pending plans
This would make RunCompactionProcedure more user-friendly for workflows that
are triggered at a fixed frequency.
Users who want to limit the number of pending plans executed during ad-hoc runs
(for example, to 1) can still do so with the LIMIT keyword.
Note that limiting to 1 will not execute the compaction that was just
scheduled; instead, it will execute the oldest pending plan.
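
The proposed behavior and the LIMIT caveat can be sketched the same way. This is again a toy model under stated assumptions rather than the actual implementation: the function `schedule_and_execute_proposed` and its `limit` parameter are hypothetical stand-ins for the LIMIT keyword, and the model assumes the limit is applied to pending plans ordered oldest first, as the note above implies.

```python
from typing import List, Optional


def schedule_and_execute_proposed(
    pending: List[str],
    new_plan: Optional[str] = None,
    limit: Optional[int] = None,
) -> List[str]:
    """Toy model of the proposed scheduleandexecute (hypothetical names).

    pending  -- pending plan instants (mutated in place)
    new_plan -- instant that scheduling would produce, or None if no plan
    limit    -- assumed stand-in for LIMIT: max number of plans to execute
    Returns the instants that were executed.
    """
    if new_plan is not None:
        pending.append(new_plan)
    # Assumption: plans are picked oldest instant first.
    targets = sorted(pending)
    if limit is not None:
        targets = targets[:limit]
    for instant in targets:
        pending.remove(instant)
    return targets


# "001" was left pending by an earlier failed run.
pending_all = ["001"]
ran_all = schedule_and_execute_proposed(pending_all, new_plan="002")

# Same starting state, but limited to one plan.
pending_limited = ["001"]
ran_limited = schedule_and_execute_proposed(pending_limited, new_plan="002", limit=1)

print(ran_all, pending_all)          # ['001', '002'] []
print(ran_limited, pending_limited)  # ['001'] ['002']
```

With a limit of 1, the older plan "001" is executed and the freshly scheduled "002" stays pending, matching the caveat above.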
> Improve RunCompaction Procedure does not run for all pending compactions when
> op is scheduleAndExecute
> ------------------------------------------------------------------------------------------------------
>
> Key: HUDI-8970
> URL: https://issues.apache.org/jira/browse/HUDI-8970
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: voon
> Assignee: voon
> Priority: Minor