[
https://issues.apache.org/jira/browse/FLINK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora updated FLINK-37411:
-------------------------------
Fix Version/s: (was: kubernetes-operator-1.14.0)
> Introduce the rollback mechanism for Autoscaler
> ------------------------------------------------
>
> Key: FLINK-37411
> URL: https://issues.apache.org/jira/browse/FLINK-37411
> Project: Flink
> Issue Type: New Feature
> Reporter: Rui Fan
> Priority: Major
>
> h1. Background & Motivation
> In some cases, a job becomes unhealthy (cannot run normally) after it is
> scaled by the autoscaler.
> One option is to roll back the job when it cannot run normally after scaling.
> h1. Examples (Which scenarios need a rollback mechanism?)
> h2. Example 1: The network memory is insufficient after scaling up.
> A Flink task requests more network memory after scaling up. The Flink job
> cannot start (it fails over indefinitely) if network memory is insufficient.
> The job may have had lag before scaling up, but it cannot run at all after
> scaling. We have 2 solutions for this case:
> * Autotuning is enabled: increase the TM network memory and restart the
> Flink cluster.
> * Autotuning is disabled (in-place rescaling): failing over (retrying)
> indefinitely is useless; it is better to roll back the job to the last
> parallelisms or the first parallelisms.
> h2. Example 2: GC pressure or heap usage is high
> Currently, autoscaling is paused if the GC pressure or the heap usage
> exceeds its threshold. (See the
> job.autoscaler.memory.gc-pressure.threshold and
> job.autoscaler.memory.heap-usage.threshold options for more details.)
> This case might happen after scaling down. There are 2 solutions as well:
> * Autotuning is enabled: increase the TM heap memory. (The TM total memory
> may also need to be increased; currently Autotuning never increases the TM
> total memory, it only decreases it.)
> * Autotuning is disabled (in-place rescaling): roll back the job to the last
> parallelisms or the first parallelisms.
> h1. Proposed change
> Note: autotuning could be integrated with these examples in the future.
> This Jira introduces the JobUnrecoverableErrorChecker plugin interface,
> and we could define 2 built-in checkers in the first version (for
> Example 1 and Example 2).
> {code:java}
> /**
>  * Check whether the job encountered an unrecoverable error.
>  *
>  * @param <KEY> The job key.
>  * @param <Context> Instance of JobAutoScalerContext.
>  */
> @Experimental
> public interface JobUnrecoverableErrorChecker<KEY, Context extends JobAutoScalerContext<KEY>> {
>     /**
>      * @return True means job encountered an unrecoverable error, the scaling will be rolled
>      *     back. Otherwise, the job ran normally or encountered a recoverable error.
>      */
>     boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
> }
> {code}
> The job is rolled back when any checker returns true, and scaling is
> paused until the cluster is restarted.
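The "roll back when any checker returns true" rule can be illustrated with a minimal, self-contained sketch. It does not use the Flink classes: CheckerSketch, Metrics, UnrecoverableErrorChecker, memoryPressure, restartLoop and shouldRollBack are hypothetical stand-ins, and the threshold values are illustrative, not Flink defaults.

```java
import java.util.List;

/** Hypothetical stand-ins for the proposal's types; these are not Flink classes. */
final class CheckerSketch {

    /** Simplified metrics snapshot; the field names are assumptions. */
    static final class Metrics {
        final double gcPressure;
        final double heapUsage;
        final boolean restartLoop;

        Metrics(double gcPressure, double heapUsage, boolean restartLoop) {
            this.gcPressure = gcPressure;
            this.heapUsage = heapUsage;
            this.restartLoop = restartLoop;
        }
    }

    /** Simplified form of the proposed checker interface (no KEY or Context). */
    interface UnrecoverableErrorChecker {
        boolean check(Metrics metrics);
    }

    /** Example 1: the job keeps failing over after scaling. */
    static UnrecoverableErrorChecker restartLoop() {
        return m -> m.restartLoop;
    }

    /** Example 2: GC pressure or heap usage exceeds its threshold. */
    static UnrecoverableErrorChecker memoryPressure(double gcThreshold, double heapThreshold) {
        return m -> m.gcPressure > gcThreshold || m.heapUsage > heapThreshold;
    }

    /** Roll back as soon as any checker reports an unrecoverable error. */
    static boolean shouldRollBack(List<UnrecoverableErrorChecker> checkers, Metrics metrics) {
        return checkers.stream().anyMatch(c -> c.check(metrics));
    }

    public static void main(String[] args) {
        List<UnrecoverableErrorChecker> checkers =
                List.of(restartLoop(), memoryPressure(0.5, 0.9));
        // Heap usage 0.95 is above the illustrative 0.9 threshold.
        System.out.println(shouldRollBack(checkers, new Metrics(0.1, 0.95, false))); // true
        // No checker fires for a healthy job.
        System.out.println(shouldRollBack(checkers, new Metrics(0.1, 0.40, false))); // false
    }
}
```

Keeping each rule in its own checker means a new failure scenario only adds one entry to the list, without touching the rollback decision itself.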
> h2. What needs to be discussed:
> Should the job be rolled back to the parallelism initially set by the user,
> or to the last parallelism before scaling?
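To make the two rollback targets in the question above concrete, here is a small sketch that keeps a parallelism history and resolves either target. RollbackTargetSketch, Target, recordScaling and rollbackTo are hypothetical names chosen for illustration; nothing here is taken from the autoscaler code base, and the vertex names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that tracks per-vertex parallelism across scalings. */
final class RollbackTargetSketch {

    /** The two candidate rollback targets under discussion. */
    enum Target { INITIAL, LAST }

    /** Index 0 is the user's setting; the final entry is the current parallelism. */
    private final List<Map<String, Integer>> history = new ArrayList<>();

    RollbackTargetSketch(Map<String, Integer> initialParallelism) {
        history.add(Map.copyOf(initialParallelism));
    }

    /** Record the parallelism applied by one scaling decision. */
    void recordScaling(Map<String, Integer> newParallelism) {
        history.add(Map.copyOf(newParallelism));
    }

    /** Resolve the parallelism the job would be rolled back to. */
    Map<String, Integer> rollbackTo(Target target) {
        if (history.size() < 2) {
            return history.get(0); // nothing has been scaled yet
        }
        return target == Target.INITIAL
                ? history.get(0)
                : history.get(history.size() - 2); // the entry before the current one
    }

    public static void main(String[] args) {
        RollbackTargetSketch job = new RollbackTargetSketch(Map.of("source", 2, "sink", 2));
        job.recordScaling(Map.of("source", 4, "sink", 4));
        job.recordScaling(Map.of("source", 8, "sink", 4)); // this scaling made the job unhealthy
        System.out.println(job.rollbackTo(Target.LAST).get("source"));    // 4
        System.out.println(job.rollbackTo(Target.INITIAL).get("source")); // 2
    }
}
```

In this sketch, LAST undoes only the most recent scaling, while INITIAL discards every earlier decision as well. The two targets coincide only when the job has been scaled once.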
--
This message was sent by Atlassian Jira
(v8.20.10#820010)