[ 
https://issues.apache.org/jira/browse/FLINK-29959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633948#comment-17633948
 ] 

Gyula Fora edited comment on FLINK-29959 at 11/14/22 5:10 PM:
--------------------------------------------------------------

There are 2 main problems that we target here:
 # Status updates by zombie operators that have lost leadership but have not yet realized it or died.
 # Stale status observed when a new leader starts processing.

*Why would these happen?*

Zombie operator:
It could in theory happen that an operator loses leadership in the middle of a 
reconciliation due to a very long GC pause (or a network issue, etc.) and the 
current CR reconcile loop keeps running while the new leader has already 
started reconciling the same resource. This is very unlikely, but it can happen 
with leader election and a standby operator. In these cases we don't want the 
old operator that lost leadership to be able to make any status updates. The 
new logic guarantees that once the new leader has made any status update, the 
old operator can never make one again.
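The fencing guarantee above follows from the API server's optimistic concurrency: a status patch carries the resourceVersion it was based on, and is rejected if the resource has moved on. Here is a minimal in-memory sketch in plain Java (no Kubernetes client); the `FakeApiServer` class and its method names are illustrative stand-ins, not the operator's actual API.

```java
import java.util.concurrent.atomic.AtomicReference;

public class FencingSketch {

    // Simplified resource: just a version counter and a status string.
    record VersionedStatus(long resourceVersion, String status) {}

    static class FakeApiServer {
        private final AtomicReference<VersionedStatus> stored =
                new AtomicReference<>(new VersionedStatus(1, "DEPLOYED"));

        VersionedStatus get() {
            return stored.get();
        }

        // The patch succeeds only if the caller based it on the current version.
        boolean patchStatus(long basedOnVersion, String newStatus) {
            VersionedStatus cur = stored.get();
            if (cur.resourceVersion() != basedOnVersion) {
                return false; // conflict: someone else updated in the meantime
            }
            return stored.compareAndSet(
                    cur, new VersionedStatus(basedOnVersion + 1, newStatus));
        }
    }

    public static void main(String[] args) {
        FakeApiServer server = new FakeApiServer();

        // The zombie operator reads the resource, then stalls (long GC pause).
        long zombieVersion = server.get().resourceVersion();

        // Meanwhile the new leader reconciles and patches the status.
        boolean leaderOk =
                server.patchStatus(server.get().resourceVersion(), "UPGRADING");

        // The zombie wakes up and patches based on its stale version: fenced off.
        boolean zombieOk = server.patchStatus(zombieVersion, "STABLE");

        System.out.println(leaderOk + " " + zombieOk);
    }
}
```

Once the leader's update bumps the version, every later patch from the zombie fails the version check, which is exactly the "never able to update again" property.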

Stale status:
When the new leader starts processing (if it was on standby), there is no 
guarantee that the status/spec it reconciles the first time is up to date. This 
can happen due to unlucky cache update timing or a zombie operator submitting 
late status updates. The current operator logic relies heavily on seeing the 
last status; otherwise we can hit some very weird corner cases that would 
definitely cause problems for the resources.

*How the new logic tackles this in a safe way*

The new logic only allows a status update to go through when the operator has 
the latest status information, effectively an optimistic lock on the current 
status. If anyone else changed the status in the meantime, we simply throw an 
error and retrigger the reconciliation. This is safe to do because the 
operator reconcile logic already runs under the assumption that the operator 
can fail at any time before the status update, and we always use the status as 
a "write-ahead log" of the actions we are taking. With this in place, zombie 
operators that have already lost leadership can never reconcile again (leader 
election guarantees that), and in all other cases the retriggered 
reconciliation gives us the latest version of the resource.
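The fail-and-retrigger flow can be sketched as follows, again against a simulated version counter rather than a real API server; `reconcileOnce`, `observe` and `patchStatus` are hypothetical names for the purpose of this sketch. The key point is that aborting before the status patch is harmless, because unrecorded work is simply redone by the retriggered pass.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RetriggerSketch {

    static class ConflictException extends RuntimeException {}

    // Simulated API-server-side version; a patch must quote it exactly.
    static final AtomicLong serverVersion = new AtomicLong(1);

    static long observe() {
        return serverVersion.get();
    }

    static void patchStatus(long basedOnVersion) {
        if (!serverVersion.compareAndSet(basedOnVersion, basedOnVersion + 1)) {
            throw new ConflictException();
        }
    }

    // One reconcile pass: observe, act, then record the action in status.
    // Failing anywhere before patchStatus is safe: the status acts as a
    // write-ahead log, so work not yet recorded is redone on the next pass.
    static boolean reconcileOnce(boolean simulateConcurrentUpdate) {
        long seen = observe();
        if (simulateConcurrentUpdate) {
            serverVersion.incrementAndGet(); // another actor wins the race
        }
        try {
            patchStatus(seen);
            return true;
        } catch (ConflictException e) {
            return false; // abort this pass; reconciliation is retriggered
        }
    }

    public static void main(String[] args) {
        boolean first = reconcileOnce(true);   // conflict: pass aborted
        boolean second = reconcileOnce(false); // retriggered pass sees latest
        System.out.println(first + " " + second);
    }
}
```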

Additional notes:
When the upstream PR is opened, I will loop in Attila from the JOSDK team to 
help review it. I discussed these cases with him previously and we arrived at 
this solution together.

> Use optimistic locking when patching resource status
> ----------------------------------------------------
>
>                 Key: FLINK-29959
>                 URL: https://issues.apache.org/jira/browse/FLINK-29959
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Assignee: Gyula Fora
>            Priority: Critical
>
> The operator currently does not use optimistic locking on the CR when 
> patching status. This worked because we always wanted to overwrite the status.
> With leader election and potentially two operators running at the same time, 
> we are now exposed to some race conditions that were not previously present 
> with the status update logic.
> To ensure that the operator always sees the latest status we should change 
> our logic to optimistic locking with retries. If we get a lock error 
> (resource updated) we check if only the spec changed and then retry locking 
> on the new version.
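The retry rule in the issue description above (on a lock error, retry only if nothing but the spec changed) comes down to comparing the status we based our update on against the freshly fetched one. A minimal sketch, assuming a simplified resource with string-valued spec and status; `shouldRetry` is an illustrative name, not the operator's actual method:

```java
import java.util.Objects;

public class SpecOnlyRetrySketch {

    record Resource(long resourceVersion, String spec, String status) {}

    // On a version conflict, retry the status patch on the new version only
    // when the status itself is untouched (i.e. only the spec changed). A
    // changed status means another operator instance wrote it, and this
    // instance must not blindly overwrite that.
    static boolean shouldRetry(Resource basedOn, Resource latest) {
        return Objects.equals(basedOn.status(), latest.status());
    }

    public static void main(String[] args) {
        Resource seen = new Resource(5, "v1", "DEPLOYED");
        Resource specOnly = new Resource(6, "v2", "DEPLOYED");
        Resource statusToo = new Resource(6, "v2", "UPGRADING");
        System.out.println(
                shouldRetry(seen, specOnly) + " " + shouldRetry(seen, statusToo));
    }
}
```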



--
This message was sent by Atlassian Jira
(v8.20.10#820010)