[
https://issues.apache.org/jira/browse/FLINK-34131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicolas Fraison updated FLINK-34131:
------------------------------------
Priority: Minor (was: Major)
> Checkpoint check window should take in account checkpoint job configuration
> ---------------------------------------------------------------------------
>
> Key: FLINK-34131
> URL: https://issues.apache.org/jira/browse/FLINK-34131
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Nicolas Fraison
> Priority: Minor
>
> When the checkpoint progress check
> (kubernetes.operator.cluster.health-check.checkpoint-progress.enabled) is
> enabled to define cluster health, the operator relies on detecting whether a
> checkpoint has been performed during the
> kubernetes.operator.cluster.health-check.checkpoint-progress.window.
> As indicated in the doc, the window must be larger than the checkpointing
> interval. But this is a manual configuration, which can lead to
> misconfiguration and unwanted restarts of the Flink cluster if the
> checkpointing interval is larger than the window.
> The operator must check that the config is valid before relying on this
> check. If it is not set correctly, the operator should skip the check (return true in
> [evaluateCheckpoints|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/ClusterHealthEvaluator.java#L197C1-L199C50])
> and log a WARN message.
> Flink jobs also have other checkpointing parameters that should be taken into
> account for this window configuration:
> execution.checkpointing.timeout and
> execution.checkpointing.tolerable-failed-checkpoints.
> The idea would be to check that
> kubernetes.operator.cluster.health-check.checkpoint-progress.window is >=
> (execution.checkpointing.interval + execution.checkpointing.timeout) *
> execution.checkpointing.tolerable-failed-checkpoints
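
The proposed validation could be sketched roughly as follows. This is only an illustration of the formula above, not operator code: the class CheckpointWindowCheck and its methods are hypothetical names, and the values are passed in directly instead of being read from the Flink or operator configuration.

```java
import java.time.Duration;

// Hypothetical helper illustrating the proposed check; not part of
// flink-kubernetes-operator.
final class CheckpointWindowCheck {

    private CheckpointWindowCheck() {}

    /**
     * Minimum window implied by the formula in this issue:
     * (interval + timeout) * tolerable-failed-checkpoints.
     */
    static Duration requiredWindow(
            Duration interval, Duration timeout, int tolerableFailedCheckpoints) {
        if (tolerableFailedCheckpoints < 0) {
            throw new IllegalArgumentException(
                    "tolerableFailedCheckpoints must be >= 0");
        }
        return interval.plus(timeout).multipliedBy(tolerableFailedCheckpoints);
    }

    /** Returns true when the configured window is >= the required window. */
    static boolean isWindowSufficient(
            Duration window,
            Duration interval,
            Duration timeout,
            int tolerableFailedCheckpoints) {
        Duration required =
                requiredWindow(interval, timeout, tolerableFailedCheckpoints);
        return window.compareTo(required) >= 0;
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(5);
        Duration interval = Duration.ofMinutes(1);
        Duration timeout = Duration.ofMinutes(1);
        int tolerableFailedCheckpoints = 3;

        if (!isWindowSufficient(window, interval, timeout, tolerableFailedCheckpoints)) {
            // In the operator, this is where the checkpoint progress check
            // would be skipped (treated as healthy) and a WARN logged.
            System.out.println(
                    "WARN: checkpoint-progress window " + window
                            + " is smaller than required "
                            + requiredWindow(interval, timeout, tolerableFailedCheckpoints)
                            + "; skipping checkpoint progress check");
        }
    }
}
```

Note that with execution.checkpointing.tolerable-failed-checkpoints set to 0 the formula as written yields a required window of zero, so any window passes; an implementation may want to handle that case separately.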