gaborgsomogyi opened a new pull request, #513:
URL: https://github.com/apache/flink-kubernetes-operator/pull/513
## What is the purpose of the change
Some workloads get stuck in such a way that they stay in the RUNNING state
most of the time but cannot make progress or complete checkpoints. The operator
must detect such cases. This PR adds the option to have the operator watch the
number of successful checkpoints. If the feature is enabled via
`cluster.health-check.completed-checkpoints.enabled` and no checkpoint
completes successfully within the window defined by
`cluster.health-check.completed-checkpoints.window`, the operator considers
the deployment unhealthy and re-creates it.
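
The window check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `CheckpointHealthSketch` and its `observe` method are hypothetical names, and it assumes the operator periodically reads a completed-checkpoint counter for the job and compares it with the previously seen value.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch: tracks whether the completed-checkpoint count progresses within a window. */
final class CheckpointHealthSketch {
    private final Duration window;      // stands in for ...completed-checkpoints.window
    private long lastSeenCompleted;     // counter value at the last observed change
    private Instant lastProgress;       // when the counter last changed (null before first observation)

    CheckpointHealthSketch(Duration window) {
        this.window = window;
    }

    /**
     * Records one observation and returns true if the deployment still looks healthy,
     * false if no new checkpoint completed for at least the configured window.
     */
    boolean observe(long completedCheckpoints, Instant now) {
        // The first observation, or any change of the counter (an increase, or a
        // reset after a restart), starts a fresh window.
        if (lastProgress == null || completedCheckpoints != lastSeenCompleted) {
            lastSeenCompleted = completedCheckpoints;
            lastProgress = now;
            return true;
        }
        // Counter unchanged: healthy only while the window has not yet elapsed.
        return Duration.between(lastProgress, now).compareTo(window) < 0;
    }
}
```

With a five-minute window, an unchanged counter is still reported as healthy after four minutes and as unhealthy once five minutes have passed; a new completed checkpoint starts the window again.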
## Brief change log
* Added config `cluster.health-check.completed-checkpoints.enabled`
* Added config `cluster.health-check.completed-checkpoints.window`
* Added watching of the number of successful checkpoints
## Verifying this change
Changed/added automated tests and verified manually on Minikube (a stateless
job without checkpoints was restarted continuously).
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: no
## Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? docs