pvary commented on code in PR #407:
URL: https://github.com/apache/flink-kubernetes-operator/pull/407#discussion_r1002975203


##########
docs/content/docs/custom-resource/job-management.md:
##########
@@ -241,6 +241,21 @@ In order for this feature to work one must enable [recovery of missing job deploymen
 At the moment a deployment is considered unhealthy when Flink's restart count reaches `kubernetes.operator.cluster.health-check.restarts.threshold` (default: `64`)
 within the time window of `kubernetes.operator.cluster.health-check.restarts.window` (default: 2 minutes).
 
+## Restart failed job deployments
+
+The operator can restart a failed Flink cluster deployment. This can be useful when the job's main task is able to
+reconfigure the job to handle such failures.
+
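+To enable the feature, set `kubernetes.operator.cluster.restart.failed` to `"true"`. A minimal sketch of a
+`FlinkDeployment` resource, assuming the operator option is overridden per deployment through
+`spec.flinkConfiguration` (the metadata below is hypothetical; the option can also be set in the operator
+configuration):
+
+```yaml
+apiVersion: flink.apache.org/v1beta1
+kind: FlinkDeployment
+metadata:
+  name: restart-on-failure-example  # hypothetical deployment name
+spec:
+  flinkConfiguration:
+    kubernetes.operator.cluster.restart.failed: "true"
+```
+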
+For example, a job could dynamically create its DAG based on a job configuration that changes over time. When a task
+detects a record that cannot be handled with the current configuration, the task should throw a
+`SuppressRestartsException` to fail the job. If `kubernetes.operator.cluster.restart.failed` is set to `true`
+(default: `false`), the operator detects the failed job and restarts it. When the job restarts, it reads the new job
+configuration and creates a new DAG based on it, so the new deployment can handle the incoming records without
+manual intervention.
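+
+A minimal sketch of such a task (the `ConfigAwareMapper` class and its comma-separated record format are
+hypothetical; `MapFunction` and `SuppressRestartsException` are existing Flink classes):
+
+```java
+import java.util.Set;
+
+import org.apache.flink.api.common.functions.MapFunction;
+import org.apache.flink.runtime.execution.SuppressRestartsException;
+
+/** Fails the job permanently when it sees a record type outside the current configuration. */
+public class ConfigAwareMapper implements MapFunction<String, String> {
+
+    // Hypothetical "job configuration": the record types the current DAG can handle.
+    private final Set<String> handledTypes;
+
+    public ConfigAwareMapper(Set<String> handledTypes) {
+        this.handledTypes = handledTypes;
+    }
+
+    @Override
+    public String map(String value) {
+        String type = value.split(",", 2)[0];
+        if (!handledTypes.contains(type)) {
+            // SuppressRestartsException suppresses Flink's own restart strategy, so the job
+            // moves to a terminal FAILED state and the operator can redeploy it with the
+            // updated configuration.
+            throw new SuppressRestartsException(
+                    new IllegalStateException("Unhandled record type: " + type));
+        }
+        return value;
+    }
+}
+```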

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
