[ 
https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Fraison updated FLINK-34131:
------------------------------------
    Description: 
When the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is 
enabled to define cluster health, the operator relies on detecting whether a 
checkpoint has been performed during the 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, this window must be larger than the checkpointing 
interval.

But this is a manual configuration, which can lead to misconfiguration and 
unwanted restarts of the Flink cluster if the checkpointing interval is larger 
than the window.

The operator must check that the config is valid before relying on this 
check. If it is not set correctly, the operator should skip the check (return 
true on 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Also, Flink jobs have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout 
and execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
max(execution.checkpointing.interval, execution.checkpointing.timeout * 
execution.checkpointing.tolerable-failed-checkpoints)
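
The proposed validation could look roughly like the sketch below. It is only an 
illustration of the inequality above, not the operator's actual code: the class 
CheckpointWindowCheck, its method names and the checkpointSeenInWindow flag are 
hypothetical, the three settings are assumed to be already parsed into 
Duration and int values, and java.util.logging stands in for whatever logger 
the operator uses.

```java
import java.time.Duration;
import java.util.logging.Logger;

/** Hypothetical helper, not part of flink-kubernetes-operator. */
final class CheckpointWindowCheck {

    private static final Logger LOG =
            Logger.getLogger(CheckpointWindowCheck.class.getName());

    private CheckpointWindowCheck() {}

    /** Smallest window allowed: max(interval, timeout * tolerableFailedCheckpoints). */
    static Duration minimumWindow(
            Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        Duration failureBudget = timeout.multipliedBy(tolerableFailedCheckpoints);
        return interval.compareTo(failureBudget) >= 0 ? interval : failureBudget;
    }

    /** True when the configured window satisfies window >= minimumWindow(...). */
    static boolean isWindowValid(
            Duration window,
            Duration interval,
            Duration timeout,
            int tolerableFailedCheckpoints) {
        Duration minimum = minimumWindow(interval, timeout, tolerableFailedCheckpoints);
        return window.compareTo(minimum) >= 0;
    }

    /**
     * Skips the health check (reports healthy) and logs a warning when the
     * window is too small; otherwise returns the observed checkpoint progress.
     * checkpointSeenInWindow is an assumed input standing in for the real
     * progress detection.
     */
    static boolean evaluateCheckpoints(
            Duration window,
            Duration interval,
            Duration timeout,
            int tolerableFailedCheckpoints,
            boolean checkpointSeenInWindow) {
        if (!isWindowValid(window, interval, timeout, tolerableFailedCheckpoints)) {
            LOG.warning(
                    "Checkpoint progress window " + window
                            + " is smaller than the required minimum "
                            + minimumWindow(interval, timeout, tolerableFailedCheckpoints)
                            + "; skipping the checkpoint progress check");
            return true;
        }
        return checkpointSeenInWindow;
    }
}
```

For example, with an interval of 1 minute, a timeout of 10 minutes and 3 
tolerable failed checkpoints, the minimum window under this formula is 30 
minutes, so a 5 minute window would be reported as misconfigured and the check 
skipped instead of triggering a restart.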

  was:
When the checkpoint progress check 
(kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is 
enabled to define cluster health, the operator relies on detecting whether a 
checkpoint has been performed during the 
kubernetes.operator.cluster.health-check.checkpoint-progress.window.

As indicated in the doc, this window must be larger than the checkpointing 
interval.

But this is a manual configuration, which can lead to misconfiguration and 
unwanted restarts of the Flink cluster if the checkpointing interval is larger 
than the window.

The operator must check that the config is valid before relying on this 
check. If it is not set correctly, the operator should skip the check (return 
true on 
[evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
 and log a WARN message.

Also, Flink jobs have other checkpointing parameters that should be taken into 
account for this window configuration: execution.checkpointing.timeout 
and execution.checkpointing.tolerable-failed-checkpoints.

The idea would be to check that 
kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
(execution.checkpointing.interval + execution.checkpointing.timeout) * 
execution.checkpointing.tolerable-failed-checkpoints


> Checkpoint check window should take in account checkpoint job configuration
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-34131
>                 URL: https://issues.apache.org/jira/browse/FLINK-34131
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Nicolas Fraison
>            Priority: Minor
>
> When the checkpoint progress check 
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is 
> enabled to define cluster health, the operator relies on detecting whether a 
> checkpoint has been performed during the 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, this window must be larger than the checkpointing 
> interval.
> But this is a manual configuration, which can lead to misconfiguration and 
> unwanted restarts of the Flink cluster if the checkpointing interval is larger 
> than the window.
> The operator must check that the config is valid before relying on this 
> check. If it is not set correctly, the operator should skip the check (return 
> true on 
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
>  and log a WARN message.
> Also, Flink jobs have other checkpointing parameters that should be taken into 
> account for this window configuration: 
> execution.checkpointing.timeout and 
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that 
> kubernetes.operator.cluster.health-check.checkpoint-progress.window >= 
> max(execution.checkpointing.interval, execution.checkpointing.timeout * 
> execution.checkpointing.tolerable-failed-checkpoints)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
