Gyula Fora created FLINK-27500:
----------------------------------

             Summary: Validation error handling inside controller blocks 
reconciliation
                 Key: FLINK-27500
                 URL: https://issues.apache.org/jira/browse/FLINK-27500
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.0.0
            Reporter: Gyula Fora


Currently, when using the operator without the Webhook (validating only within 
the controller), the way we handle validation errors completely blocks 
reconciliation.

The reason for this is that validation happens between observe and 
reconciliation, and an error short-circuits the controller flow, skipping 
the reconciler, which would otherwise be able to execute actions such as 
rollbacks, deployment recovery, etc.

We also return an UpdateControl without reschedule after an error, which makes 
this even worse.

There are a few ways to get around this, some more complex than others. 
One possible solution:

If a validation error occurs, simply use the "old" FlinkDeployment option in the 
rest of the controller loop. We can restore the old valid deployment from the 
lastReconciledSpec field; we just need to make sure to only update the status 
at the end. From the observer/reconciler's perspective, this would work as if 
the new broken spec had never been submitted.
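
A minimal sketch of this fallback idea, not the operator's actual code. The 
Spec and Deployment classes, validate() and reconcileOnce() are hypothetical 
stand-ins invented for illustration; only the lastReconciledSpec name comes 
from the description above. It assumes the status is written exactly once, at 
the end of the loop.

```java
import java.util.Optional;

final class ValidationFallbackSketch {

    /** Hypothetical, heavily simplified stand-in for a FlinkDeployment spec. */
    static final class Spec {
        final String image;
        final int parallelism;

        Spec(String image, int parallelism) {
            this.image = image;
            this.parallelism = parallelism;
        }
    }

    /** Hypothetical stand-in for the custom resource: submitted spec plus status fields. */
    static final class Deployment {
        Spec spec;                // spec currently submitted by the user
        Spec lastReconciledSpec;  // last valid spec that was reconciled, or null
        String error;             // status field, written only at the end of the loop

        Deployment(Spec spec, Spec lastReconciledSpec) {
            this.spec = spec;
            this.lastReconciledSpec = lastReconciledSpec;
        }
    }

    /** Hypothetical validator: returns an error message, or empty if the spec is valid. */
    static Optional<String> validate(Spec spec) {
        if (spec.parallelism <= 0) {
            return Optional.of("parallelism must be positive");
        }
        return Optional.empty();
    }

    /**
     * One pass of the controller loop. Returns the spec that observe/reconcile
     * worked with, or empty if the new spec is invalid and no earlier valid
     * spec exists to fall back to.
     */
    static Optional<Spec> reconcileOnce(Deployment dep) {
        Optional<String> validationError = validate(dep.spec);

        // On a validation error, continue with the last valid spec instead of
        // short-circuiting the whole loop.
        Spec effective = validationError.isPresent() ? dep.lastReconciledSpec : dep.spec;

        if (effective != null) {
            // observe(effective) and reconcile(effective) would run here, so
            // rollbacks and deployment recovery still get a chance to execute.
        }

        // Update the status only at the end of the loop.
        if (!validationError.isPresent()) {
            dep.lastReconciledSpec = dep.spec;
        }
        dep.error = validationError.orElse(null);
        return Optional.ofNullable(effective);
    }
}
```

With a broken new spec and a valid lastReconciledSpec, reconcileOnce() acts on 
the old spec and records the validation error in the status, leaving 
lastReconciledSpec untouched.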

Going this way, we have to avoid repeatedly reporting the error caused by 
validation as we reschedule again and again.
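
One way to avoid the repeated reporting, sketched under assumptions: remember 
the last reported validation error and only emit a new event when the message 
changes. The ValidationErrorReporterSketch class and its methods are 
hypothetical; in the operator the remembered error would more likely live in 
the resource status than in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class ValidationErrorReporterSketch {

    // Stand-in for events that would be emitted to Kubernetes or the log.
    private final List<String> emittedEvents = new ArrayList<>();

    // Last validation error that was reported; null means "no error".
    private String lastReportedError;

    /**
     * Reports the current validation result (null means valid). Returns true
     * if the reported state changed, false if this is a repeat of the last
     * report and was suppressed.
     */
    boolean report(String validationError) {
        if (Objects.equals(validationError, lastReportedError)) {
            return false; // same result as last loop: do not report again
        }
        lastReportedError = validationError;
        if (validationError != null) {
            emittedEvents.add(validationError);
        }
        return true;
    }

    List<String> emittedEvents() {
        return emittedEvents;
    }
}
```

Calling report() with the same message on every rescheduled loop emits a 
single event; a fixed spec (null) resets the state so a later failure is 
reported again.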



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
