[jira] [Updated] (HUDI-8970) Improve RunCompaction Procedure does not run for all pending compactions when op is scheduleAndExecute

voon (Jira) Thu, 06 Feb 2025 02:50:56 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


voon updated HUDI-8970:
-----------------------
    Description: 
The current op modes for RunCompactionProcedure are as follows:

 
 # schedule - schedule a new plan
 # execute - if specific instants exist, execute them, otherwise execute all 
pending plans
 # scheduleandexecute - schedule a new plan and then execute it, if no plan is 
generated during schedule, execute all pending plans
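
As a rough illustration of the three op modes listed above, the following sketch models which compaction instants each mode would execute. It is not the Hudi implementation: the function name `select_plans_current`, its parameters, and the `schedule` callback are all hypothetical, and instants are treated as plain strings that sort oldest-first.

```python
from typing import Callable, List, Optional


def select_plans_current(
    op: str,
    pending: List[str],
    requested: Optional[List[str]] = None,
    schedule: Callable[[], Optional[str]] = lambda: None,
) -> List[str]:
    """Return the instants that would be executed under the current semantics.

    Hypothetical model only: `pending` holds the plans that were pending
    before this call, and `schedule` returns the new instant or None when
    no plan is generated.
    """
    op = op.lower()
    if op == "schedule":
        schedule()  # only creates a plan; nothing is executed
        return []
    if op == "execute":
        # Specific instants if given, otherwise every pending plan.
        return list(requested) if requested else sorted(pending)
    if op == "scheduleandexecute":
        new_instant = schedule()
        # Only the freshly scheduled plan runs; if none was generated,
        # fall back to all pending plans.
        return [new_instant] if new_instant else sorted(pending)
    raise ValueError(f"unsupported op: {op}")
```

With one older plan pending and a new plan scheduled successfully, `scheduleandexecute` in this model returns only the new instant, which is the behaviour this ticket questions.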

 

While the current implementation holds true to the above specification, 
it is not very user-friendly.

 

There is no option to schedule a new compaction plan and then execute all 
pending compaction plans. If a previous `scheduleandexecute` fails and the 
user's job retries, the retry generates a new compaction plan and executes only 
that one, leaving the earlier plan pending on the table. 

 

If there is no user intervention, some log files may grow exponentially, making 
compaction more computationally expensive. 

 

So, this ticket is proposing to change the `scheduleandexecute` op to the 
following implementation:
 # schedule a new plan and then execute {color:#de350b}*ALL*{color} pending 
plans, if no plan is generated during schedule, execute all pending plans
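
The proposed behaviour can be sketched as follows, assuming instants are strings that sort oldest-first. The name `select_plans_proposed`, the `schedule` callback, and the timestamps are hypothetical and only serve to illustrate the retry scenario described in this ticket.

```python
from typing import Callable, List, Optional


def select_plans_proposed(
    pending: List[str],
    schedule: Callable[[], Optional[str]],
) -> List[str]:
    """Return the instants executed by the proposed `scheduleandexecute`.

    Hypothetical model: schedule a new plan, then execute ALL pending
    plans, including the new one if a plan was generated.
    """
    new_instant = schedule()
    candidates = list(pending)
    if new_instant is not None:
        candidates.append(new_instant)
    return sorted(candidates)  # oldest plan first


# Retry scenario: an earlier run scheduled a plan and then failed,
# leaving that plan pending (timestamps are made up).
leftover = ["20250206010000"]
plans = select_plans_proposed(leftover, lambda: "20250206020000")
```

In this model the retry picks up the leftover plan together with the newly scheduled one, so no pending plan is left behind without user intervention.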

 

This would make the RunCompactionProcedure more user-friendly under workflows 
that are triggered on a set frequency.

 

If users would like to limit the number of pending plans executed during 
ad-hoc runs (for example, to 1), they can still do so by using the 
LIMIT keyword.

 

Note that limiting to 1 will not execute the compaction that was just 
scheduled when older pending plans exist; instead, it will execute the oldest 
pending plan.
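
A minimal sketch of the LIMIT interaction, assuming that plans are ordered oldest-first before the limit is applied. The helper `apply_limit` and the instant values are hypothetical.

```python
from typing import List, Optional


def apply_limit(plans: List[str], limit: Optional[int]) -> List[str]:
    """Keep at most `limit` plans, oldest first (hypothetical model)."""
    ordered = sorted(plans)
    return ordered if limit is None else ordered[:limit]


# "001" was left pending by an earlier run; "002" was just scheduled.
selected = apply_limit(["002", "001"], limit=1)
```

Here `selected` contains only `"001"`: the older pending plan is executed and the newly scheduled plan remains pending until a later run.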

 

  was:
The current op modes for RunCompactionProcedure are as follows:

 
 # schedule - schedule a new plan
 # execute - if specific instants exist, execute them, otherwise execute all 
pending plans
 # scheduleandexecute - schedule a new plan and then execute it, if no plan is 
generated during schedule, execute all pending plans

 

While the current implementation holds true to the above specification, 
it is not very user-friendly.

 

There is no option to schedule a new compaction plan and then execute all 
pending compaction plans. If a previous `scheduleandexecute` fails and the 
user's scheduled job retries, the retry generates a new compaction plan and 
executes only that one, leaving the earlier plan pending on the table. 

 

If there is no user intervention, some log files may grow exponentially, making 
compaction more computationally expensive. 

 

So, this ticket is proposing to change the `scheduleandexecute` op to the 
following implementation:
 # schedule a new plan and then execute {color:#de350b}*ALL*{color} pending 
plans, if no plan is generated during schedule, execute all pending plans

 

This would make the RunCompactionProcedure more user-friendly under workflows 
that are triggered on a set frequency.

 

If users would like to limit the number of pending plans executed during 
ad-hoc runs (for example, to 1), they can still do so by using the 
LIMIT keyword.

 

Note that limiting to 1 will not execute the compaction that was just 
scheduled when older pending plans exist; instead, it will execute the oldest 
pending plan.

 


> Improve RunCompaction Procedure does not run for all pending compactions when 
> op is scheduleAndExecute
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-8970
>                 URL: https://issues.apache.org/jira/browse/HUDI-8970
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: voon
>            Assignee: voon
>            Priority: Minor
>
> The current op modes for RunCompactionProcedure are as follows:
>  
>  # schedule - schedule a new plan
>  # execute - if specific instants exist, execute them, otherwise execute all 
> pending plans
>  # scheduleandexecute - schedule a new plan and then execute it, if no plan 
> is generated during schedule, execute all pending plans
>  
> While the current implementation holds true to the above specification, 
> it is not very user-friendly.
>  
> There is no option to schedule a new compaction plan and then execute all 
> pending compaction plans. If a previous `scheduleandexecute` fails and the 
> user's job retries, the retry generates a new compaction plan and executes 
> only that one, leaving the earlier plan pending on the table. 
>  
> If there is no user intervention, some log files may grow exponentially, 
> making compaction more computationally expensive. 
>  
> So, this ticket is proposing to change the `scheduleandexecute` op to the 
> following implementation:
>  # schedule a new plan and then execute {color:#de350b}*ALL*{color} pending 
> plans, if no plan is generated during schedule, execute all pending plans
>  
> This would make the RunCompactionProcedure more user-friendly under workflows 
> that are triggered on a set frequency.
>  
> If users would like to limit the number of pending plans executed during 
> ad-hoc runs (for example, to 1), they can still do so by using the 
> LIMIT keyword.
>  
> Note that limiting to 1 will not execute the compaction that was just 
> scheduled when older pending plans exist; instead, it will execute the oldest 
> pending plan.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
