gaborgsomogyi opened a new pull request, #394:
URL: https://github.com/apache/flink-kubernetes-operator/pull/394

   ## What is the purpose of the change
   
   Flink's built-in restart strategies work well in general, but under certain 
circumstances a job can get stuck in a restart loop. For example, when 
checkpointing is enabled and no restart strategy has been configured, the 
fixed-delay strategy is used with `Integer.MAX_VALUE` restart attempts. Even 
if the JobManager (JM from now on) is able to resolve its temporary issue, a 
permanent issue may appear on the TaskManager (TM from now on) side, for 
example a TM that has a memory leak and simply crashes. In such a case the 
Flink job requires a restart from the outside, which can be done by the Flink 
k8s operator.
   
   This PR adds a job health check feature. Please be aware that the 
implementation is simple and has the following caveats:
   * The restart count is watched in a plain, non-sliding window
   * When the last valid observed health info timestamp falls outside of the 
watched window, the algorithm assumes an even restart count distribution
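
To illustrate the two caveats above, here is a minimal sketch of how such a check could work. It is not the code from this PR: the class `RestartWindowSketch`, the `Observation` type and both method names are hypothetical, and the sketch assumes that the job is treated as unhealthy only when the estimated restart count exceeds the threshold.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of a windowed restart check; not the PR's JobHealthChecker. */
final class RestartWindowSketch {

    /** One observation of the job's cumulative restart counter. */
    static final class Observation {
        final Instant timestamp;
        final long restartCount;

        Observation(Instant timestamp, long restartCount) {
            this.timestamp = timestamp;
            this.restartCount = restartCount;
        }
    }

    /** Estimates how many restarts happened inside the watched window. */
    static double restartsInWindow(Observation last, Observation current, Duration window) {
        long delta = current.restartCount - last.restartCount;
        Duration elapsed = Duration.between(last.timestamp, current.timestamp);
        if (delta <= 0 || elapsed.isZero() || elapsed.isNegative()) {
            // Counter did not grow (or was reset), or the timestamps are unusable.
            return Math.max(delta, 0);
        }
        if (elapsed.compareTo(window) <= 0) {
            // The last observation is still inside the window: use the raw difference.
            return delta;
        }
        // The last observation is older than the window: assume the restarts were
        // spread evenly over the elapsed time and keep only the window's share.
        return (double) delta * window.toMillis() / elapsed.toMillis();
    }

    /** Assumption: unhealthy only when the estimate exceeds the threshold. */
    static boolean isHealthy(
            Observation last, Observation current, Duration window, int threshold) {
        return restartsInWindow(last, current, window) <= threshold;
    }
}
```

With a 2 minute window and a threshold of 64, 200 restarts observed over 4 minutes would be estimated as 100 restarts inside the window, so this sketch would report the job as unhealthy.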
   
   ## Brief change log
   
   * Added `kubernetes.operator.job.health-check.enabled` config (default: 
false)
   * Added `kubernetes.operator.job.health-check.duration-window` config 
(default: 2 minutes)
   * Added `kubernetes.operator.job.health-check.threshold` config (default: 64)
   * Added a `JobHealthInfo` field to `JobStatus` as a string field (its 
internal structure may change without notice at this stage)
   * Added `JobHealthObserver`, which is responsible for fetching job health 
information from the submitted job
   * Added `JobHealthChecker`, which is responsible for deciding whether the job 
is healthy or not
   * Added job restart functionality (with the same spec) when the job is 
considered unhealthy
   * Added several unit tests
   * Some simplifications/refactors
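
For reference, the three new options and their defaults can be summarized in a small sketch. The option keys and default values are taken from the list above; everything else is an assumption made for illustration. The class `HealthCheckOptionsSketch` is hypothetical, it reads from a plain `Map` rather than the operator's configuration classes, and it parses the window as an ISO-8601 duration (for example `PT5M`), which is a simplification and not necessarily the format the operator accepts.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical holder for the three health check options; not the PR's code. */
final class HealthCheckOptionsSketch {
    static final String ENABLED = "kubernetes.operator.job.health-check.enabled";
    static final String WINDOW = "kubernetes.operator.job.health-check.duration-window";
    static final String THRESHOLD = "kubernetes.operator.job.health-check.threshold";

    final boolean enabled;
    final Duration window;
    final int threshold;

    private HealthCheckOptionsSketch(boolean enabled, Duration window, int threshold) {
        this.enabled = enabled;
        this.window = window;
        this.threshold = threshold;
    }

    /** Reads the options from a key/value map, falling back to the documented defaults. */
    static HealthCheckOptionsSketch from(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLED, "false"));
        // Assumption: ISO-8601 text such as "PT5M"; the default is 2 minutes.
        Duration window = conf.containsKey(WINDOW)
                ? Duration.parse(conf.get(WINDOW))
                : Duration.ofMinutes(2);
        int threshold = Integer.parseInt(conf.getOrDefault(THRESHOLD, "64"));
        return new HealthCheckOptionsSketch(enabled, window, threshold);
    }
}
```

Because the feature is disabled by default, an empty map yields `enabled == false`, a 2 minute window and a threshold of 64 in this sketch.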
   
   ## Verifying this change
   
   * Existing unit tests
   * Additional unit tests
   * Manually with my newly created [chaos 
monkey](https://github.com/gaborgsomogyi/flink-chaos-monkey-java-job) job
     * Job submitted with health check
     * Executed in the TM shell: `touch /tmp/throwExceptionInUDF`
     * Waited for job recovery
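
The manual test relies on a marker file to inject failures. The linked job's actual implementation is not reproduced here; the following is only a sketch of how a trigger of this kind could work, under the assumption that the user function fails whenever the marker file exists. The class `MarkerFileFailureSketch` and its `process` method are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical failure injection helper; not taken from the chaos monkey job. */
final class MarkerFileFailureSketch {

    /**
     * Returns the value unchanged, unless the marker path exists,
     * in which case it throws to simulate a failing user function.
     */
    static String process(String value, Path marker) {
        if (Files.exists(marker)) {
            throw new IllegalStateException("Injected failure: " + marker + " exists");
        }
        return value;
    }
}
```

In such a setup, creating the file (for example with `touch /tmp/throwExceptionInUDF`) makes every call fail until the file is removed, which produces the kind of restart loop the health check is meant to detect.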
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the 
`CustomResourceDescriptors`: yes, new configs added
     - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? newly added configs documented
   

