markap14 commented on PR #6779:
URL: https://github.com/apache/nifi/pull/6779#issuecomment-1353410957

   So I created a two-node NiFi cluster using GKE to test this. On startup, things work well: both nodes join the cluster, and I can see that state is getting stored/recovered properly using ListGCSBucket. If I then disconnect the node that is the Primary/Coordinator, I see that the other node is elected. But if I then reconnect the disconnected node, it gets into a bad state.
   Running `bin/nifi.sh diagnostics diag1.txt` on both nodes shows that each node believes it is both the Cluster Coordinator AND the Primary Node.
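   As a sanity check, the Lease objects themselves can also be inspected to see which identity Kubernetes currently records as the holder. A minimal fabric8 sketch (a hypothetical helper, not code from this PR; it assumes the `nifi` namespace and the `cluster-coordinator`/`primary-node` lease names that appear in the log below):
   ```java
   import io.fabric8.kubernetes.api.model.coordination.v1.Lease;
   import io.fabric8.kubernetes.client.KubernetesClient;
   import io.fabric8.kubernetes.client.KubernetesClientBuilder;

   public class LeaseHolderCheck {
       public static void main(String[] args) {
           // Assumes kubeconfig / in-cluster config is available to the fabric8 client
           try (KubernetesClient client = new KubernetesClientBuilder().build()) {
               for (String leaseName : new String[] { "cluster-coordinator", "primary-node" }) {
                   Lease lease = client.resources(Lease.class)
                           .inNamespace("nifi")          // namespace taken from the log below
                           .withName(leaseName)
                           .get();
                   // Each lease should name exactly one holder; after the reconnect scenario
                   // above, compare this against what each node's diagnostics report
                   System.out.printf("%s -> holder=%s, renewTime=%s%n",
                           leaseName,
                           lease == null ? null : lease.getSpec().getHolderIdentity(),
                           lease == null ? null : lease.getSpec().getRenewTime());
               }
           }
       }
   }
   ```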
   Looking at the logs of the disconnected node, I see:
   ```
   2022-12-15 16:50:42,065 ERROR [KubernetesLeaderElectionManager] i.f.k.c.e.leaderelection.LeaderElector Exception occurred while releasing lock 'LeaseLock: nifi - cluster-coordinator (10.31.1.4:4423)'
   io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
           at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:102)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$0(LeaderElector.java:92)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.run(LeaderElector.java:70)
           at org.apache.nifi.kubernetes.leader.election.command.LeaderElectionCommand.run(LeaderElectionCommand.java:78)
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
           at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
           at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:517)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:551)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:347)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:680)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:167)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:172)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:113)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:41)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:1043)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:88)
           at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:100)
           ... 19 common frames omitted
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:709)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:689)
           at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
           at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$4.onResponse(OkHttpClientImpl.java:277)
           at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
           ... 3 common frames omitted
   2022-12-15 16:50:42,066 ERROR [KubernetesLeaderElectionManager] i.f.k.c.e.leaderelection.LeaderElector Exception occurred while releasing lock 'LeaseLock: nifi - primary-node (10.31.1.4:4423)'
   io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.stopLeading(LeaderElector.java:120)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$1(LeaderElector.java:94)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$0(LeaderElector.java:92)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.run(LeaderElector.java:70)
           at org.apache.nifi.kubernetes.leader.election.command.LeaderElectionCommand.run(LeaderElectionCommand.java:78)
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
           at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
           at java.base/java.lang.Thread.run(Unknown Source)
   Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.31.128.1/apis/coordination.k8s.io/v1/namespaces/nifi/leases/primary-node. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=primary-node, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
           at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:517)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:551)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:347)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:680)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:167)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:172)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:113)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:41)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:1043)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:88)
           at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:100)
           ... 19 common frames omitted
   Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.31.128.1/apis/coordination.k8s.io/v1/namespaces/nifi/leases/primary-node. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=primary-node, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:709)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:689)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:640)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:576)
           at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$retryWithExponentialBackoff$2(OperationSupport.java:618)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
           at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$4.onResponse(OkHttpClientImpl.java:277)
           at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
           ... 3 common frames omitted
   ```
   
   So it looks like the disconnected node is not properly relinquishing ownership of the lease. I presume this is what causes both nodes to believe that they are the coordinator/primary.
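   For reference (and purely as a hypothetical sketch, not what this PR or fabric8's LeaderElector does), the "apply your changes to the latest version and try again" that the API server is asking for looks roughly like this with the fabric8 client: operate on a freshly read copy of the Lease so the PUT carries the current `resourceVersion` rather than a stale one:
   ```java
   import io.fabric8.kubernetes.api.model.coordination.v1.Lease;
   import io.fabric8.kubernetes.client.KubernetesClient;

   // Hypothetical helper, only to illustrate the conflict-safe release pattern.
   public class LeaseRelease {
       public static void release(KubernetesClient client, String namespace, String leaseName, String localIdentity) {
           client.resources(Lease.class)
                 .inNamespace(namespace)
                 .withName(leaseName)
                 // edit() applies the function to a freshly fetched copy of the Lease,
                 // so the update is sent with the current resourceVersion
                 .edit(lease -> {
                     if (localIdentity.equals(lease.getSpec().getHolderIdentity())) {
                         lease.getSpec().setHolderIdentity(null); // relinquish only if this node still holds it
                     }
                     return lease;
                 });
       }
   }
   ```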

