voon created HUDI-8970:
--------------------------

             Summary: Improve RunCompaction Procedure does not run for all 
pending compactions when op is scheduleAndExecute
                 Key: HUDI-8970
                 URL: https://issues.apache.org/jira/browse/HUDI-8970
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: voon


The current op modes for RunCompactionProcedure are as follows:

 
 # schedule - schedule a new plan
 # execute - if specific instants exist, execute them; otherwise, execute all 
pending plans
 # scheduleandexecute - schedule a new plan and then execute it; if no plan is 
generated during schedule, execute all pending plans
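
The three modes above can be sketched as a small selection function. This is a toy Python model, not Hudi code: the function name `plans_to_execute_current`, its parameters, and the sample instant times are assumptions made for illustration, and it only models which instants a call would execute.

```python
from typing import List, Optional


def plans_to_execute_current(
    op: str,
    pending: List[str],
    new_plan: Optional[str] = None,
    requested: Optional[List[str]] = None,
) -> List[str]:
    """Return the instants a call would execute under the current op modes.

    pending:   instants of plans already pending before the call (hypothetical input)
    new_plan:  instant created by the schedule step, or None if no plan was generated
    requested: specific instants supplied by the caller (only used by "execute")
    """
    mode = op.lower()
    if mode == "schedule":
        # Only a new plan is scheduled; nothing is executed.
        return []
    if mode == "execute":
        # Specific instants win; otherwise every pending plan is executed.
        return list(requested) if requested else sorted(pending)
    if mode == "scheduleandexecute":
        # A freshly scheduled plan is executed alone; older pending plans are
        # only picked up when scheduling produced nothing.
        return [new_plan] if new_plan is not None else sorted(pending)
    raise ValueError(f"unknown op: {op}")


# One older plan is pending and the schedule step produces a new one.
print(plans_to_execute_current("scheduleandexecute", ["20250101"], new_plan="20250102"))
# prints ['20250102']
```

In this model the older instant 20250101 is not returned, which is the gap the rest of this ticket describes.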

 

While the current implementation holds true to the above specification, it is 
not very user-friendly.

 

There is no option to schedule a new compaction plan and then execute all 
pending compaction plans. If a previous `scheduleandexecute` fails and the 
user's scheduled job retries, the retry will generate a new compaction plan and 
execute only that plan, leaving the earlier pending compaction plan on the table. 

 

If there is no user intervention, some log files may grow exponentially, making 
compaction more computationally expensive. 

 

So, this ticket is proposing to change the `scheduleandexecute` op to the 
following implementation:
 # schedule a new plan and then execute *ALL* pending plans; if no plan is 
generated during schedule, execute all pending plans
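
As a rough illustration of the proposed behaviour, the selection can be modelled as below. This is a minimal Python sketch, not the actual change to RunCompactionProcedure: `plans_to_execute_proposed`, its parameters, and the sample instant times are hypothetical, and the oldest-first ordering assumes instant times sort lexicographically.

```python
from typing import List, Optional


def plans_to_execute_proposed(pending: List[str], new_plan: Optional[str] = None) -> List[str]:
    """Return the instants the proposed `scheduleandexecute` would execute.

    pending:  instants of plans already pending before the call (hypothetical input)
    new_plan: instant created by the schedule step, or None if no plan was generated
    """
    plans = list(pending)
    if new_plan is not None:
        # The freshly scheduled plan joins the existing pending plans.
        plans.append(new_plan)
    # Assumption: instant times sort lexicographically, so this is oldest first.
    return sorted(plans)


# A failed run left 20250101 pending; the retry schedules 20250102.
print(plans_to_execute_proposed(["20250101"], new_plan="20250102"))
# prints ['20250101', '20250102']
```

When the schedule step generates no plan, the sketch falls back to the pending plans alone, matching the second half of the proposed wording.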

 

This would make the RunCompactionProcedure more user-friendly under workflows 
that are triggered on a set frequency.

 

If users would like to limit the number of pending plans executed during ad-hoc 
runs (for example, to 1), they can still do so by using the LIMIT keyword.

 

Note that limiting to 1 will not execute the compaction that was just 
scheduled; instead, it will execute the oldest pending plan.
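
The effect of the limit can be illustrated with a small Python sketch. The helper `apply_limit` and the sample instant times are hypothetical and only model the ordering described here; they are not the procedure's real interface, and the oldest-first order assumes instant times sort lexicographically.

```python
from typing import List, Optional


def apply_limit(plans: List[str], limit: Optional[int] = None) -> List[str]:
    """Keep at most `limit` plans, oldest first (hypothetical helper)."""
    ordered = sorted(plans)  # assumption: lexicographic order == oldest first
    return ordered if limit is None else ordered[:limit]


# 20250101 was left pending earlier; 20250102 was scheduled by this run.
print(apply_limit(["20250102", "20250101"], limit=1))
# prints ['20250101']
```

With a limit of 1, the plan that was just scheduled (20250102 in this sketch) stays pending until a later run.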

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
