[ https://issues.apache.org/jira/browse/NIFI-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817377#comment-17817377 ]
René Zeidler commented on NIFI-12232:
-------------------------------------

I've encountered the same issue. It has happened since at least 1.23.2, and I can reliably reproduce it on 1.25.0 and 2.0.0-M2 as well.

I've been able to create minimal reproduction steps that do not require any non-standard setup. The issue is independent of any specific processors or any complicated flow setup. It _always_ occurs when a node that contains a process group that hasn't been "fully synced" disconnects from the cluster and then tries to reconnect. I'll explain what that means in the reproduction steps.

h2. Minimal Reproduction Steps

# Set up a NiFi {*}cluster with at least 3 nodes{*}, using all default settings. You may adjust {{nifi.cluster.flow.election.max.wait.time}} and {{nifi.cluster.flow.election.max.candidates}} to make the node connection process faster, but this isn't necessary to reproduce the bug.
# I'll call the nodes {*}Node A{*}, {*}Node B{*}, and {*}Node C{*}. Open the web interface for Node A and Node B.
# On {*}Node A{*}, create a new {*}process group{*}. In that process group, create a very simple flow: GenerateFlowFile going into UpdateAttribute going into a funnel. Start the UpdateAttribute processor. Like this: !image-2024-02-14-13-33-44-354.png! The exact flow doesn't matter; all that's necessary to reproduce the bug is a *running processor* with an {*}incoming and outgoing connection{*}.
# On {*}Node B{*}, observe that the process group has _automatically synced_ (Right click -> Refresh if you don't want to wait).
# On {*}Node A{*}, go to *Menu -> Cluster* (top right hamburger menu). {*}Disconnect Node B{*}. Click refresh (bottom left) until the node has disconnected.
# Right after it was disconnected, *connect Node B* again. Click refresh to see the status change. It will change to CONNECTING and quickly back to DISCONNECTED. Check the log file for Node B.
You will see the following exception:
{{o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.}}
[...]
{{Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running}}
# On {*}Node B{*}, you will get the warning that the node is disconnected from the cluster ({_}This node is currently not connected to the cluster. Any modifications to the data flow made here will not replicate across the cluster.{_}). Go into the process group. Observe that the UpdateAttribute processor is {*}running{*}, which is the direct cause of the exception.

h3. Temporary fix

# On {*}Node B{*}, *stop* the UpdateAttribute processor.
# On {*}Node A{*}, *connect Node B* again. This time it will work and Node B successfully reconnects to the cluster.
# However, this only allows Node B to reconnect once. The process group on Node B is still in an inconsistent state, and the node will fail to reconnect the next time. Repeat steps 5 - 7 above to confirm that the issue persists.

h3. Permanent fix

# On {*}Node B{*}, stop the UpdateAttribute processor and then {*}delete the whole process group{*}. Since Node B is currently disconnected from the cluster, this will only delete the process group locally on this node.
# On {*}Node A{*}, *connect Node B* again. The reconnection will be successful and the deleted process group will sync back to Node B. Since the whole process group was missing, this will now be a "full sync".
# This specific process group on this specific node (Node B) is now "fixed". It will not cause this issue anymore. To confirm, repeat steps 5 and 6 above. You can disconnect and reconnect Node B without issues.
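When repeating the disconnect/reconnect steps across several nodes, it can help to check each node's log automatically for this exact failure. The following is only a minimal sketch, not part of NiFi: the function name {{classify_reconnect_failure}} and its return labels are hypothetical, and the only facts it relies on are the two message fragments copied from the exception quoted above. It assumes you have already read the relevant part of the node's application log into a string.

```python
# Message fragments copied verbatim from the exception shown in this comment.
PARTIAL_UPDATE = (
    "Failed to connect node to cluster because local flow controller "
    "partially updated"
)
RUNNING_DESTINATION = (
    "Cannot change destination of Connection because the current "
    "destination is running"
)


def classify_reconnect_failure(log_text: str) -> str:
    """Classify a chunk of NiFi log text (hypothetical helper, labels are made up).

    Returns:
      "no-reconnect-failure"  - the partial-update message is absent
      "running-destination"   - partial-update message caused by a running
                                connection destination (the case in this comment)
      "other-partial-update"  - partial-update message with a different cause
    """
    if PARTIAL_UPDATE not in log_text:
        return "no-reconnect-failure"
    if RUNNING_DESTINATION in log_text:
        return "running-destination"
    return "other-partial-update"
```

A result of {{running-destination}} right after a reconnect attempt corresponds to the situation in step 6; the other two labels only mean the log text did not contain that combination of messages.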
h2. Further notes

* Instead of deleting the process group, you can also stop the disconnected node completely, delete the flow.json/flow.xml, and start it again. It will join the cluster again, and all process groups will be "fully synced". This fix was described in previous comments, but is not necessary to reproduce the issue.
* The fix applies per process group and per node. After fixing the issue for Node B with the "permanent fix" above, it will still affect Node C. If you disconnect and try to reconnect Node C, it will throw the same exception.
* Also, the node where you initially created the flow (in this example Node A) is _not_ exempt. If you go to Node C, disconnect and try to reconnect Node A, it will throw the same exception.

h2. Full error log

{code:java}
2024-02-14 12:49:40,487 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
2024-02-14 12:49:40,487 INFO [Process Cluster Protocol Request-13] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 1b2d4350-0982-4548-8aa7-10df3d50ced7 (type=RECONNECTION_REQUEST, length=19212 bytes) from nifi-2b:8443 in 7 millis
2024-02-14 12:49:40,487 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator Resetting cluster node statuses from {685d125a-67a1-4f49-b2ea-1062c99bcafd=NodeConnectionStatus[nodeId=nifi-2a:8443, state=CONNECTED, updateId=12], fab3e7e2-3b39-444a-b37e-4924dbd74999=NodeConnectionStatus[nodeId=nifi-2c:8443, state=CONNECTED, updateId=39], 17414631-7ac5-425e-8ce0-d962186017f5=NodeConnectionStatus[nodeId=nifi-2b:8443, state=CONNECTING, updateId=48]} to {nifi-2c:8443=NodeConnectionStatus[nodeId=nifi-2c:8443, state=CONNECTED, updateId=39], nifi-2b:8443=NodeConnectionStatus[nodeId=nifi-2b:8443, state=CONNECTING, updateId=48], nifi-2a:8443=NodeConnectionStatus[nodeId=nifi-2a:8443, state=CONNECTED, updateId=12]}
2024-02-14 12:49:40,488 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Setting 
Flow Controller's Node ID: nifi-2b:8443
2024-02-14 12:49:40,488 INFO [Reconnect to Cluster] o.a.n.c.s.VersionedFlowSynchronizer Synchronizing FlowController with proposed flow: Controller Already Synchronized = true
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] o.a.n.c.s.VersionedFlowSynchronizer In order to inherit proposed dataflow, will stop any components that will be affected by the update
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] o.a.n.c.s.AffectedComponentSet Stopping the following components: AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[], remoteOutputPorts=[], processors=[], parameterProviders=[], flowRegistryClients=[], controllerServices=[], reportingTasks=[], flowAnalysisRules=[], statelessProcessGroups=[]]
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] o.a.n.c.s.AffectedComponentSet Successfully stopped all components in 0 milliseconds
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] o.apache.nifi.controller.FlowController [Timer Driven] Maximum Thread Count updated [10] previous [10]
2024-02-14 12:49:40,502 INFO [Reconnect to Cluster] o.a.n.f.s.StandardVersionedComponentSynchronizer No differences between current flow and proposed flow for StandardProcessGroup[identifier=a734e715-018d-1000-6784-b2e925615966,name=NiFi Flow]
2024-02-14 12:49:40,502 INFO [Reconnect to Cluster] o.a.nifi.groups.StandardProcessGroup StandardFunnel[id=a7a9bd0e-018d-1000-0000-00003ac8000f-temp-funnel] added to StandardProcessGroup[identifier=a7a9bd0e-018d-1000-0000-00003ac8000f,name=Cluster Reconnect Bug Test]
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster] o.a.n.c.s.AffectedComponentSet Starting the following components: AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[], remoteOutputPorts=[], processors=[], parameterProviders=[], flowRegistryClients=[], controllerServices=[], reportingTasks=[], flowAnalysisRules=[], statelessProcessGroups=[]]
2024-02-14 12:49:40,503 ERROR [Reconnect to Cluster] 
o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
    at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
    at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
    at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
    at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
    at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
    at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
    at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
    at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
    at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
    ... 
3 common frames omitted
Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
    at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:295)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:549)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:445)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
    at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
    at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
    at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
    at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
    ... 8 common frames omitted
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator nifi-2b:8443 requested disconnection from cluster due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator Status of nifi-2b:8443 changed from NodeConnectionStatus[nodeId=nifi-2b:8443, state=CONNECTING, updateId=48] to NodeConnectionStatus[nodeId=nifi-2b:8443, state=DISCONNECTED, Disconnect Code=Node's Flow did not Match Cluster Flow, Disconnect Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption., updateId=48] {code} > Frequent "failed to connect node to cluster because local flow controller > partially updated. Administrator should disconnect node and review flow for > corruption" > ----------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: NIFI-12232 > URL: https://issues.apache.org/jira/browse/NIFI-12232 > Project: Apache NiFi > Issue Type: Bug > Components: Configuration Management > Affects Versions: 1.23.2 > Reporter: John Joseph > Priority: Major > Attachments: image-2023-10-16-16-12-31-027.png, > image-2024-02-14-13-33-44-354.png > > > This is an issue that we have been observing in the 1.23.2 version of NiFi > when we try upgrade, > Since Rolling upgrade is not supported in NiFi, we scale out the revision > that is running and {_}run a helm upgrade{_}. > We have NIFI running in k8s cluster mode, there is a post job that call the > Tenants and policies API. On a successful run it would run like this > {code:java} > set_policies() Action: 'read' Resource: '/flow' entity_id: > 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' > entity_type: 'USER' > set_policies() status: '200' > 'read' '/flow' policy already exists. It will be updated... 
> set_policies() fetching policy inside -eq 200 status: '200' > set_policies() after update PUT: '200' > set_policies() Action: 'read' Resource: '/tenants' entity_id: > 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' > entity_type: 'USER' > set_policies() status: '200'{code} > *_This job was running fine in 1.23.0, 1.22 and other previous versions._* In > {*}{{1.23.2}}{*}, we are noticing that the job is failing very frequently > with the error logs; > {code:java} > set_policies() Action: 'read' Resource: '/flow' entity_id: > 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' > entity_type: 'USER' > set_policies() status: '200' > 'read' '/flow' policy already exists. It will be updated... > set_policies() fetching policy inside -eq 200 status: '200' > set_policies() after update PUT: '400' > An error occurred getting 'read' '/flow' policy: 'This node is disconnected > from its configured cluster. The requested change will only be allowed if the > flag to acknowledge the disconnected node is set.'{code} > {{_*'This node is disconnected from its configured cluster. The requested > change will only be allowed if the flag to acknowledge the disconnected node > is set.'*_}} > The job is configured to run only after all the pods are up and running. > Though the pods are up we see exception is the inside pods > {code:java} > org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed > to connect node to cluster because local flow controller partially updated. > Administrator should disconnect node and review flow for corruption. 
> at > org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059) > at > org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667) > at > org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107) > at > org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396) > at java.base/java.lang.Thread.run(Thread.java:833) > Caused by: > org.apache.nifi.controller.serialization.FlowSynchronizationException: > java.lang.IllegalStateException: Cannot change destination of Connection > because the current destination is running > at > org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448) > at > org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206) > at > org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42) > at > org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530) > at > org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104) > at > org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817) > at > org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028) > ... 
4 common frames omitted > Caused by: java.lang.IllegalStateException: Cannot change destination of > Connection because the current destination is running > at > org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310) > at > org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700) > at > org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405) > at > org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543) > at > org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427) > at > org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266) > at > org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550) > at > org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261) > at > org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977) > at > org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439) > ... 10 common frames omitted{code} > Attaching screenshots of the UI as well. this issue is observed a lot > checking with CLI command. 
> {code:java} > ./cli.sh nifi cluster-summary -u > https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts > /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks > /opt/nifi/cert_mgr/keystore.j > ks -kst jks -ksp changeit > Total node count: 0 > Connected node count: 0 > Clustered: true > Connected to cluster: false{code} > > We tried Workaround > {code:java} > 1.Exec to the pod that has the flow file issue, delete the flow file so that > it deletes from the PVC > 2. Exit from pod > 3. Delete the pod that had the problem{code} > Pod will respwan, cluster coordinator will recreate the flowfile from the > connected nodes > This connected all the nodes. But this does not feel like an ideal solution > as we're seeing this issue quite often and cannot run this WA every time > !image-2023-10-16-16-12-31-027.png! > > we also see this Exception sometimes > {code:java} > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for /nifi/leaders/Cluster Coordinator > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480) > at > org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243) > at > org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232) > at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94) > at > org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229) > at > org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220) > at > org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42) > at > org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155) > at > 
org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135) > at > org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170) > at > org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336) > at > org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281) > at > org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572) > at > org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526) > at > org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467) > at > org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262) > at > org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824) > at > org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132) > at > org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84) > at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110){code}