Gyula Fora created FLINK-27500:
----------------------------------
Summary: Validation error handling inside controller blocks reconciliation
Key: FLINK-27500
URL: https://issues.apache.org/jira/browse/FLINK-27500
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.0.0
Reporter: Gyula Fora
Currently, when the operator is used without the webhook (validating only
within the controller), the way we handle validation errors completely blocks
reconciliation.
The reason is that validation happens between the observe and reconcile
steps, and an error short-circuits the controller flow. This skips the
reconciler, which would otherwise be able to execute actions such as
rollbacks, deployment recovery, etc.
We also return an UpdateControl without a reschedule after an error, which
makes this even worse.
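To make the problem concrete, here is a minimal sketch of the flow described
above. It is not the operator's code: ShortCircuitSketch, Result,
reconcileOnce and the "parallelism must be positive" rule are all hypothetical
stand-ins, and Result only models whether a reschedule was requested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the controller flow; not the operator's real classes.
class ShortCircuitSketch {

    /** Outcome of one controller pass: the steps that ran and whether a reschedule was requested. */
    static final class Result {
        final List<String> steps;
        final boolean rescheduled;

        Result(List<String> steps, boolean rescheduled) {
            this.steps = steps;
            this.rescheduled = rescheduled;
        }
    }

    /** Toy validation rule (an assumption): a spec is valid when parallelism is positive. */
    static boolean isValid(int parallelism) {
        return parallelism > 0;
    }

    /** One pass of the loop: observe, validate, then reconcile only if validation passed. */
    static Result reconcileOnce(int parallelism) {
        List<String> steps = new ArrayList<>();
        steps.add("observe");
        if (!isValid(parallelism)) {
            // Short-circuit: the reconciler never runs, so rollbacks and recovery
            // are skipped, and no reschedule is requested either.
            steps.add("report-validation-error");
            return new Result(steps, false);
        }
        steps.add("reconcile");
        return new Result(steps, true);
    }

    public static void main(String[] args) {
        Result broken = reconcileOnce(0);
        System.out.println(broken.steps + " rescheduled=" + broken.rescheduled);
        // prints: [observe, report-validation-error] rescheduled=false
    }
}
```

With an invalid spec the pass ends after the error is reported, so nothing
triggers the reconciler again until the resource itself changes.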
There are a few ways to get around this, some more complex than others.
One possible solution:
If a validation error occurs, simply use the "old" FlinkDeployment in the
rest of the controller loop. We can restore the old valid deployment from the
lastReconciledSpec field; we just need to make sure to only update the status
at the end. From the observer's and reconciler's perspective, this would work
as if the new broken spec had never been submitted.
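A rough sketch of the fallback idea, under stated assumptions: only the
lastReconciledSpec name comes from the description above, while Spec,
Deployment, statusError, validate and effectiveSpec are hypothetical names
used for illustration, and the validation rule is a toy one.

```java
import java.util.Optional;

// Hypothetical, simplified types; only the lastReconciledSpec name comes from the issue text.
class SpecFallbackSketch {

    static final class Spec {
        final String image;
        final int parallelism;

        Spec(String image, int parallelism) {
            this.image = image;
            this.parallelism = parallelism;
        }
    }

    static final class Deployment {
        Spec spec;               // spec currently submitted by the user
        Spec lastReconciledSpec; // last spec that passed validation and was reconciled (may be null)
        String statusError;      // in-memory status, written back once at the end of the loop
    }

    /** Toy validation rule (an assumption): parallelism must be positive. */
    static Optional<String> validate(Spec spec) {
        if (spec.parallelism > 0) {
            return Optional.empty();
        }
        return Optional.of("parallelism must be positive");
    }

    /**
     * Picks the spec the observer and reconciler should work with. On a validation
     * error the message is recorded in the in-memory status and the last reconciled
     * spec is used instead; the result is empty if there is nothing to fall back to.
     */
    static Optional<Spec> effectiveSpec(Deployment d) {
        Optional<String> error = validate(d.spec);
        if (!error.isPresent()) {
            d.statusError = null;
            return Optional.of(d.spec);
        }
        d.statusError = error.get();
        return Optional.ofNullable(d.lastReconciledSpec);
    }

    public static void main(String[] args) {
        Deployment d = new Deployment();
        d.lastReconciledSpec = new Spec("job:v1", 2);
        d.spec = new Spec("job:v2", 0); // broken update
        Spec used = effectiveSpec(d).get();
        System.out.println(used.image + " / " + d.statusError);
        // prints: job:v1 / parallelism must be positive
    }
}
```

The rest of the loop would then run against the returned spec and write the
status back a single time at the end, so the broken spec is never acted on.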
Going this way, we have to avoid repeatedly reporting the same validation
error as we reschedule again and again.
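One possible way to avoid the repeated reports, sketched under assumptions:
remember the last reported message and only emit again when it changes.
ErrorReportDedupSketch, report and emittedEvents are hypothetical names, and
the list merely stands in for whatever event or log mechanism is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper; the list stands in for a real event or logging mechanism.
class ErrorReportDedupSketch {

    // Stand-in for an error field kept in the resource status.
    private String lastReportedError;

    final List<String> emittedEvents = new ArrayList<>();

    /**
     * Records the current validation error (null means the spec is valid).
     * Returns true only when the error differs from the one already recorded,
     * so rescheduled passes over the same broken spec emit nothing new.
     */
    boolean report(String error) {
        if (Objects.equals(error, lastReportedError)) {
            return false;
        }
        lastReportedError = error;
        if (error != null) {
            emittedEvents.add(error);
        }
        return true;
    }

    public static void main(String[] args) {
        ErrorReportDedupSketch reporter = new ErrorReportDedupSketch();
        for (int pass = 0; pass < 3; pass++) {
            reporter.report("parallelism must be positive"); // same error on every reschedule
        }
        System.out.println(reporter.emittedEvents.size());
        // prints: 1
    }
}
```

Clearing the recorded error once the spec becomes valid again means a later
reappearance of the same mistake is reported once more.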