gaborgsomogyi opened a new pull request, #394: URL: https://github.com/apache/flink-kubernetes-operator/pull/394
## What is the purpose of the change

Flink has its own restart strategies, which generally work fine. Under certain circumstances, however, Flink can get stuck in a restart loop. For example, when checkpointing is activated and no restart strategy has been configured, the fixed-delay strategy is used with `Integer.MAX_VALUE` restart attempts. Even when the JobManager (JM from now on) is able to resolve its temporary issue, a permanent issue may appear on the TaskManager (TM from now on) side, for example a TM that has a memory leak and just crashes. In such a case the Flink job requires a restart from the outside, which can be done by the Flink k8s operator.

In this PR I've added a job health check feature. Please be aware that the implementation is simple and has the following caveats:
* The restart count is watched in a normal, non-sliding window
* When the last valid observed health info timestamp is outside of the watched window, the algorithm assumes an even restart count distribution

## Brief change log

* Added `kubernetes.operator.job.health-check.enabled` config (default: false)
* Added `kubernetes.operator.job.health-check.duration-window` config (default: 2 minutes)
* Added `kubernetes.operator.job.health-check.threshold` config (default: 64)
* Added `JobHealthInfo` field to `JobStatus` as a string field (the internal structure may change without notice at this stage)
* Added `JobHealthObserver`, which is responsible for fetching job health information from the submitted job
* Added `JobHealthChecker`, which is responsible for deciding whether the job is healthy
* Added job restart functionality (with the same spec) for when the job is considered unhealthy
* Added several unit tests
* Some simplifications/refactors

## Verifying this change

* Existing unit tests
* Additional unit tests
* Manually with my newly created [chaos monkey](https://github.com/gaborgsomogyi/flink-chaos-monkey-java-job) job
  * Job submitted with health check
  * Executed in the TM shell: `touch /tmp/throwExceptionInUDF`
  * Waited for job recovery

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`: yes, new configs added
- Core observer or reconciler logic that is regularly executed: yes

## Documentation

- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? newly added configs documented
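
To make the two caveats above more concrete, here is a minimal sketch of how a non-sliding window check with the even-distribution assumption could look. It is not the code from this PR: the class `RestartWindowCheck`, its method names, and the choice to treat "more than the threshold" as unhealthy are all assumptions made for illustration. Only the default values (2 minute window, threshold 64) come from the change log.

```java
import java.time.Duration;

/** Hypothetical sketch; not the operator's actual JobHealthChecker. */
final class RestartWindowCheck {
    private final long windowMs;
    private final int threshold;

    // Observation that opened the current (non-sliding) window; -1 until the first call.
    private long baselineTimestampMs = -1;
    private long baselineRestarts;

    RestartWindowCheck(Duration window, int threshold) {
        this.windowMs = window.toMillis();
        this.threshold = threshold;
    }

    /**
     * @param nowMs         timestamp of the current observation
     * @param totalRestarts total restart count reported by the job so far
     * @return true if the job is considered healthy
     */
    boolean isHealthy(long nowMs, long totalRestarts) {
        if (baselineTimestampMs < 0 || totalRestarts < baselineRestarts) {
            // First observation, or the counter went backwards: open a new window.
            baselineTimestampMs = nowMs;
            baselineRestarts = totalRestarts;
            return true;
        }

        long elapsedMs = nowMs - baselineTimestampMs;
        long restartsInWindow = totalRestarts - baselineRestarts;

        if (elapsedMs > windowMs) {
            // The baseline is older than the window. Assume the restarts were
            // evenly distributed over the elapsed time and scale them down to
            // one window length, then open a new window at this observation.
            restartsInWindow = (long) ((double) restartsInWindow * windowMs / elapsedMs);
            baselineTimestampMs = nowMs;
            baselineRestarts = totalRestarts;
        }

        // Assumption: exceeding the threshold (strictly greater) means unhealthy.
        return restartsInWindow <= threshold;
    }
}
```

With the defaults, 70 restarts observed 90 seconds after the baseline would be reported as unhealthy, while 100 restarts observed 4 minutes after the baseline would be scaled to 50 per window and still be treated as healthy. This is the trade-off of the even-distribution assumption: a short burst inside a long gap between observations can go unnoticed.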
