[jira] [Commented] (NIFI-12232) Frequent "failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption"

Joe Witt (Jira) Fri, 16 Feb 2024 12:21:04 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818091#comment-17818091
 ]


Joe Witt commented on NIFI-12232:
---------------------------------

Also hit by

https://apachenifi.slack.com/archives/C0L9VCD47/p1708113098305609

Roman Wesołowski
  29 minutes ago
Hi all,
I have a 3-node NiFi cluster running version 2.0.0-M1. Until today everything
was working correctly, but during my development something strange happened.
For some reason 2 nodes disconnected from the cluster, and I am not able to
reconnect them. I have restarted the nodes, but without success. All machines
are up but cannot connect to each other. Any help would be appreciated.
2024-02-16 15:14:36,663 ERROR [Reconnect to Cluster] 
o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.120.8.252:8080 -- 
Node disconnected from cluster due to 
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
to connect node to cluster because local flow controller partially updated. 
Administrator should disconnect node and review flow for corruption.

> Frequent "failed to connect node to cluster because local flow controller 
> partially updated. Administrator should disconnect node and review flow for 
> corruption"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12232
>                 URL: https://issues.apache.org/jira/browse/NIFI-12232
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration Management
>    Affects Versions: 1.23.2
>            Reporter: John Joseph
>            Assignee: Mark Payne
>            Priority: Major
>         Attachments: image-2023-10-16-16-12-31-027.png, 
> image-2024-02-14-13-33-44-354.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is an issue that we have been observing in NiFi 1.23.2 when we try to 
> upgrade.
> Since rolling upgrades are not supported in NiFi, we scale out the revision 
> that is running and {_}run a helm upgrade{_}.
> We have NiFi running in k8s in cluster mode, and there is a post job that 
> calls the Tenants and Policies APIs. A successful run looks like this:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id: 
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' 
> entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '200'
> set_policies() Action: 'read' Resource: '/tenants' entity_id: 
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' 
> entity_type: 'USER'
> set_policies() status: '200'{code}
> *_This job was running fine in 1.23.0, 1.22 and other previous versions._* In 
> {*}{{1.23.2}}{*}, we are noticing that the job fails very frequently with 
> these error logs:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id: 
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' 
> entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '400'
> An error occurred getting 'read' '/flow' policy: 'This node is disconnected 
> from its configured cluster. The requested change will only be allowed if the 
> flag to acknowledge the disconnected node is set.'{code}
> {{_*'This node is disconnected from its configured cluster. The requested 
> change will only be allowed if the flag to acknowledge the disconnected node 
> is set.'*_}}
> The job is configured to run only after all the pods are up and running. 
> Although the pods are up, we see this exception inside the pods:
> {code:java}
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and review flow for corruption.
> at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
> at 
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
> at 
> org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
> at 
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: 
> java.lang.IllegalStateException: Cannot change destination of Connection 
> because the current destination is running
> at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
> at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
> at 
> org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
> at 
> org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
> at 
> org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
> at 
> org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
> at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
> ... 4 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of 
> Connection because the current destination is running
> at 
> org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
> at 
> org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
> at 
> org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
> at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
> ... 10 common frames omitted{code}
> Attaching screenshots of the UI as well. We also see this issue often when 
> checking with the CLI command:
> {code:java}
> ./cli.sh nifi cluster-summary -u 
> https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts 
> /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks 
> /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
> Total node count: 0
> Connected node count: 0
> Clustered: true
> Connected to cluster: false{code}
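
The `cluster-summary` output quoted above could be checked by a script before the post job runs. A minimal Python sketch, assuming the four `Key: value` lines shown in the report are the only fields of interest; the function names `parse_cluster_summary` and `node_is_connected` are hypothetical and not part of the NiFi CLI:

```python
def parse_cluster_summary(text: str) -> dict:
    """Parse 'Key: value' lines from cluster-summary output into a dict."""
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not 'Key: value'
        key, _, value = line.partition(":")
        summary[key.strip()] = value.strip()
    return summary


def node_is_connected(summary: dict) -> bool:
    """True only if the node reports being connected and sees a connected node."""
    connected = summary.get("Connected to cluster", "false").lower() == "true"
    count = int(summary.get("Connected node count", "0"))
    return connected and count > 0


# Sample taken verbatim from the output quoted in the issue description.
SAMPLE = """Total node count: 0
Connected node count: 0
Clustered: true
Connected to cluster: false"""

if __name__ == "__main__":
    print(node_is_connected(parse_cluster_summary(SAMPLE)))
```

A post job could poll this check and only call the policies API once it returns true, instead of relying on pod readiness alone.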
>  
> We tried this workaround:
> {code:java}
> 1. Exec into the pod that has the flow file issue and delete the flow file so 
> that it is removed from the PVC
> 2. Exit from the pod
> 3. Delete the pod that had the problem{code}
> The pod will respawn, and the cluster coordinator will recreate the flow file 
> from the connected nodes.
> This connected all the nodes, but it does not feel like an ideal solution, 
> as we're seeing this issue quite often and cannot run this workaround every time.
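
The three workaround steps could be scripted roughly as follows. This is only a sketch under stated assumptions: the pod name, namespace, and flow file path are hypothetical placeholders (the report does not give them), and the script only builds and prints the `kubectl` commands rather than running them.

```python
import shlex


def workaround_commands(pod: str, namespace: str, flow_files: list) -> list:
    """Build the kubectl commands for the manual workaround described above."""
    # Step 1: delete the local flow file(s) inside the affected pod,
    # which removes them from the PVC mounted into that pod.
    remove = ["kubectl", "-n", namespace, "exec", pod, "--", "rm", "-f", *flow_files]
    # Step 3: delete the pod so that it respawns and rejoins the cluster.
    delete = ["kubectl", "-n", namespace, "delete", "pod", pod]
    return [remove, delete]


if __name__ == "__main__":
    # All three values below are assumptions for illustration only.
    commands = workaround_commands(
        pod="nifi-0",
        namespace="my-namespace",
        flow_files=["/opt/nifi/nifi-current/conf/flow.json.gz"],
    )
    for cmd in commands:
        print(shlex.join(cmd))  # dry run: print instead of executing
```

Printing the commands first leaves the decision to run them with the administrator, since deleting the flow file discards that node's local copy of the flow.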
> !image-2023-10-16-16-12-31-027.png!
>  
> We also sometimes see this exception:
> {code:java}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /nifi/leaders/Cluster Coordinator
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
>         at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
>         at 
> org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
>         at 
> org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
>         at 
> org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
>         at 
> org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
>         at 
> org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
>         at 
> org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
>         at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
