[jira] [Commented] (NIFI-12232) Frequent "failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption"

Jira Wed, 14 Feb 2024 05:11:04 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817377#comment-17817377
 ]


René Zeidler commented on NIFI-12232:
-------------------------------------

I've encountered the same issue. It has happened since at least 1.23.2, and I 
can reliably reproduce it on 1.25.0 and 2.0.0-M2 as well.

I've been able to create minimal reproduction steps that do not require any 
non-standard setup. The issue is independent of any specific processors or any 
complicated flow setup. It _always_ occurs when a node disconnects from the 
cluster and then tries to reconnect while it holds a process group that hasn't 
been "fully synced". I'll explain what that means in the reproduction steps.
h2. Minimal Reproduction Steps
 # Set up a NiFi {*}cluster with at least 3 nodes{*}, using all default settings.
You may adjust {{nifi.cluster.flow.election.max.wait.time}} and 
{{nifi.cluster.flow.election.max.candidates}} to make the node connection 
process faster, but this isn't necessary to reproduce the bug.
 # I'll call the nodes {*}Node A{*}, {*}Node B{*}, and {*}Node C{*}.
Open the web interface for Node A and Node B.
 # On {*}Node A{*}, create a new {*}process group{*}. In that process group, 
create a very simple flow: GenerateFlowFile going into UpdateAttribute going 
into a funnel. Start the UpdateAttribute processor. Like this:
!image-2024-02-14-13-33-44-354.png!
The exact flow doesn't matter; all that's necessary to reproduce the bug is a 
*running processor* with an {*}incoming and outgoing connection{*}.
 # On {*}Node B{*}, observe that the process group has _automatically synced_ 
(Right click -> Refresh if you don't want to wait).
 # On {*}Node A{*}, go to *Menu -> Cluster* (top right hamburger menu). 
{*}Disconnect Node B{*}. Click refresh (bottom left) until the node has 
disconnected.
 # Right after it was disconnected, *connect Node B* again. Click refresh to 
see the status change. It will change to CONNECTING and quickly back to 
DISCONNECTED. Check the log file for Node B. You will see the following 
exception:
{{o.a.nifi.controller.StandardFlowService Handling reconnection request failed 
due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: 
Failed to connect node to cluster because local flow controller partially 
updated. Administrator should disconnect node and review flow for corruption.}}
[...]
{{Caused by: 
org.apache.nifi.controller.serialization.FlowSynchronizationException: 
java.lang.IllegalStateException: Cannot change destination of Connection 
because the current destination is running}}
 # On {*}Node B{*}, you will get the warning that the node is disconnected from 
the cluster ({_}This node is currently not connected to the cluster. Any 
modifications to the data flow made here will not replicate across the 
cluster.{_})
Go into the process group. Observe that the UpdateAttribute processor is 
{*}running{*}, which is the direct cause of the exception.
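
The failure in the last two steps can be illustrated with a toy model. This is only a sketch of my reading of the exception and the log, not NiFi's actual code: {{Component}}, {{Connection}}, {{sync_group}} and {{demo}} are hypothetical names, and {{RuntimeError}} stands in for Java's {{IllegalStateException}}. It assumes the synchronizer re-points connections at a temporary funnel and that the guard in {{setDestination}} rejects the change while the current destination is running.

```python
class Component:
    """A processor or funnel; only the running flag matters here."""

    def __init__(self, name, running=False):
        self.name = name
        self.running = running


class Connection:
    def __init__(self, source, destination):
        self.source = source
        self.destination = destination

    def set_destination(self, new_destination):
        # Mirrors the guard reported in the stack trace. RuntimeError
        # stands in for Java's IllegalStateException.
        if self.destination.running:
            raise RuntimeError(
                "Cannot change destination of Connection "
                "because the current destination is running"
            )
        self.destination = new_destination


def sync_group(connections, components_to_stop, temp_funnel):
    """Hypothetical stand-in for the synchronizer's re-pointing step."""
    for component in components_to_stop:
        component.running = False
    for connection in connections:
        connection.set_destination(temp_funnel)


def demo(stop_update_attribute):
    generate = Component("GenerateFlowFile")
    update = Component("UpdateAttribute", running=True)
    connection = Connection(generate, update)
    # An empty list models the case where no component is stopped first.
    to_stop = [update] if stop_update_attribute else []
    try:
        sync_group([connection], to_stop, Component("temp-funnel"))
        return "synced"
    except RuntimeError as error:
        return f"failed: {error}"


print(demo(stop_update_attribute=False))
print(demo(stop_update_attribute=True))
```

In this model the sync only succeeds when the running processor is part of the set that gets stopped first, which matches the observation that stopping UpdateAttribute by hand lets the node reconnect.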

h3. Temporary fix
 # On {*}Node B{*}, *stop* the UpdateAttribute processor.
 # On {*}Node A{*}, *connect Node B* again. This time it will work and Node B 
successfully reconnects to the cluster.
 # However, this only allows Node B to reconnect once. The process group on 
Node B is still in an inconsistent state and will fail to reconnect the next 
time. Repeat steps 5 - 7 above to confirm that the issue persists.
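
If you want to script step 1 instead of using the UI, the following is a rough sketch of the request involved. It rests on assumptions that should be checked against the REST API documentation for your NiFi version: that the processor run status is changed with a PUT to {{/nifi-api/processors/<id>/run-status}}, and that the body carries a {{revision}}, a {{state}}, and the {{disconnectedNodeAcknowledged}} flag (the "flag to acknowledge the disconnected node" mentioned in the error message quoted in the issue description). The function name, base URL, and processor ID are placeholders; authentication and TLS setup are left out, and the request is only built, not sent.

```python
import json
import urllib.request


def build_stop_request(base_url, processor_id, revision_version):
    """Build (but do not send) a request that asks a node to stop a processor.

    Endpoint path and field names are assumptions, see the note above.
    """
    body = {
        # The current revision of the processor, as reported by the node.
        "revision": {"version": revision_version},
        "state": "STOPPED",
        # Needed because the node is disconnected from its cluster.
        "disconnectedNodeAcknowledged": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/nifi-api/processors/{processor_id}/run-status",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder values for illustration only.
request = build_stop_request("https://node-b.example:8443", "processor-id", 3)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The request has to go to the disconnected node itself (Node B), since the change is meant to apply only to that node's local flow.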

h3. Permanent fix
 # On {*}Node B{*}, stop the UpdateAttribute processor and then {*}delete the 
whole process group{*}. Since Node B is currently disconnected from the 
cluster, this will only delete the process group locally on this node.
 # On {*}Node A{*}, *connect Node B* again. The reconnection will be 
successful and the deleted process group will sync back to Node B. Since the 
whole process group was missing, this will now be a "full sync".
 # This specific process group on this specific node (Node B) is now "fixed". 
It will not cause this issue anymore.
To confirm, repeat steps 5 and 6 above. You can disconnect and reconnect Node B 
without issues.

h2. Further notes
 * Instead of deleting the process group, you can also stop the disconnected 
node completely, delete the flow.json/flow.xml, and start it again. It will 
join the cluster again, and all process groups will be "fully synced". This fix 
was described in previous comments, but is not necessary to reproduce the issue.
 * The fix applies per process group and per node. After fixing the issue for 
Node B with the "permanent fix" above, it will still affect Node C. If you 
disconnect and try to reconnect Node C it will throw the same exception.
 * Also, the node where you initially created the flow (in this example Node 
A) is _not_ exempt. If you go to Node C, disconnect and try to reconnect Node 
A, it will throw the same exception.

h2. Full error log

 
{code:java}
2024-02-14 12:49:40,487 INFO [Reconnect to Cluster] 
o.a.nifi.controller.StandardFlowService Processing reconnection request from 
cluster coordinator.
2024-02-14 12:49:40,487 INFO [Process Cluster Protocol Request-13] 
o.a.n.c.p.impl.SocketProtocolListener Finished processing request 
1b2d4350-0982-4548-8aa7-10df3d50ced7 (type=RECONNECTION_REQUEST, length=19212 
bytes) from nifi-2b:8443 in 7 millis
2024-02-14 12:49:40,487 INFO [Reconnect to Cluster] 
o.a.n.c.c.node.NodeClusterCoordinator Resetting cluster node statuses from 
{685d125a-67a1-4f49-b2ea-1062c99bcafd=NodeConnectionStatus[nodeId=nifi-2a:8443, 
state=CONNECTED, updateId=12], 
fab3e7e2-3b39-444a-b37e-4924dbd74999=NodeConnectionStatus[nodeId=nifi-2c:8443, 
state=CONNECTED, updateId=39], 
17414631-7ac5-425e-8ce0-d962186017f5=NodeConnectionStatus[nodeId=nifi-2b:8443, 
state=CONNECTING, updateId=48]} to 
{nifi-2c:8443=NodeConnectionStatus[nodeId=nifi-2c:8443, state=CONNECTED, 
updateId=39], nifi-2b:8443=NodeConnectionStatus[nodeId=nifi-2b:8443, 
state=CONNECTING, updateId=48], 
nifi-2a:8443=NodeConnectionStatus[nodeId=nifi-2a:8443, state=CONNECTED, 
updateId=12]}
2024-02-14 12:49:40,488 INFO [Reconnect to Cluster] 
o.a.nifi.controller.StandardFlowService Setting Flow Controller's Node ID: 
nifi-2b:8443
2024-02-14 12:49:40,488 INFO [Reconnect to Cluster] 
o.a.n.c.s.VersionedFlowSynchronizer Synchronizing FlowController with proposed 
flow: Controller Already Synchronized = true
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] 
o.a.n.c.s.VersionedFlowSynchronizer In order to inherit proposed dataflow, will 
stop any components that will be affected by the update
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] 
o.a.n.c.s.AffectedComponentSet Stopping the following components: 
AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[], 
remoteOutputPorts=[], processors=[], parameterProviders=[], 
flowRegistryClients=[], controllerServices=[], reportingTasks=[], 
flowAnalysisRules=[], statelessProcessGroups=[]]
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] 
o.a.n.c.s.AffectedComponentSet Successfully stopped all components in 0 
milliseconds
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster] 
o.apache.nifi.controller.FlowController [Timer Driven] Maximum Thread Count 
updated [10] previous [10]
2024-02-14 12:49:40,502 INFO [Reconnect to Cluster] 
o.a.n.f.s.StandardVersionedComponentSynchronizer No differences between current 
flow and proposed flow for 
StandardProcessGroup[identifier=a734e715-018d-1000-6784-b2e925615966,name=NiFi 
Flow]
2024-02-14 12:49:40,502 INFO [Reconnect to Cluster] 
o.a.nifi.groups.StandardProcessGroup 
StandardFunnel[id=a7a9bd0e-018d-1000-0000-00003ac8000f-temp-funnel] added to 
StandardProcessGroup[identifier=a7a9bd0e-018d-1000-0000-00003ac8000f,name=Cluster
 Reconnect Bug Test]
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster] 
o.a.n.c.s.AffectedComponentSet Starting the following components: 
AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[], 
remoteOutputPorts=[], processors=[], parameterProviders=[], 
flowRegistryClients=[], controllerServices=[], reportingTasks=[], 
flowAnalysisRules=[], statelessProcessGroups=[]]
2024-02-14 12:49:40,503 ERROR [Reconnect to Cluster] 
o.a.nifi.controller.StandardFlowService Handling reconnection request failed 
due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: 
Failed to connect node to cluster because local flow controller partially 
updated. Administrator should disconnect node and review flow for corruption.
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
to connect node to cluster because local flow controller partially updated. 
Administrator should disconnect node and review flow for corruption.
        at 
org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
        at 
org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
        at 
org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: 
org.apache.nifi.controller.serialization.FlowSynchronizationException: 
java.lang.IllegalStateException: Cannot change destination of Connection 
because the current destination is running
        at 
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
        at 
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
        at 
org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
        at 
org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
        at 
org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
        at 
org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
        ... 3 common frames omitted
Caused by: java.lang.IllegalStateException: Cannot change destination of 
Connection because the current destination is running
        at 
org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:295)
        at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
        at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
        at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:549)
        at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:445)
        at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
        at 
org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
        at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
        at 
org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
        at 
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
        ... 8 common frames omitted
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster] 
o.a.n.c.c.node.NodeClusterCoordinator nifi-2b:8443 requested disconnection from 
cluster due to 
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
to connect node to cluster because local flow controller partially updated. 
Administrator should disconnect node and review flow for corruption.
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster] 
o.a.n.c.c.node.NodeClusterCoordinator Status of nifi-2b:8443 changed from 
NodeConnectionStatus[nodeId=nifi-2b:8443, state=CONNECTING, updateId=48] to 
NodeConnectionStatus[nodeId=nifi-2b:8443, state=DISCONNECTED, Disconnect 
Code=Node's Flow did not Match Cluster Flow, Disconnect 
Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: 
Failed to connect node to cluster because local flow controller partially 
updated. Administrator should disconnect node and review flow for corruption., 
updateId=48] {code}
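
When scanning node logs for this problem, the innermost "Caused by" entry is the useful part. Below is a small sketch for pulling the cause chain out of a trace that may have been line-wrapped, as in this email. The function name and regular expression are my own and only tuned to traces shaped like the one above.

```python
import re

# A stack frame looks like " at some.package.Class.method(" once the text
# has been flattened to single spaces.
FRAME = r"\s+at\s+[\w.$/<>]+\("
CAUSE = re.compile(
    r"Caused by:\s+(.+?)(?=" + FRAME + r"|\s+\.\.\. \d+ common frames|$)"
)


def caused_by_chain(log_text):
    """Return the messages of all 'Caused by:' entries, outermost first."""
    # Undo line wrapping by collapsing every run of whitespace.
    flat = " ".join(log_text.split())
    return [match.group(1).strip() for match in CAUSE.finditer(flat)]


sample = """
Caused by:
org.apache.nifi.controller.serialization.FlowSynchronizationException:
java.lang.IllegalStateException: Cannot change destination of Connection
because the current destination is running
        at
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
        ... 3 common frames omitted
Caused by: java.lang.IllegalStateException: Cannot change destination of
Connection because the current destination is running
        at
org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:295)
"""

chain = caused_by_chain(sample)
print(chain[-1])
```

The last element of the returned list is the root cause, here the {{IllegalStateException}} raised from {{StandardConnection.setDestination}}.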
 

> Frequent "failed to connect node to cluster because local flow controller 
> partially updated. Administrator should disconnect node and review flow for 
> corruption"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-12232
>                 URL: https://issues.apache.org/jira/browse/NIFI-12232
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Configuration Management
>    Affects Versions: 1.23.2
>            Reporter: John Joseph
>            Priority: Major
>         Attachments: image-2023-10-16-16-12-31-027.png, 
> image-2024-02-14-13-33-44-354.png
>
>
> This is an issue that we have been observing in NiFi 1.23.2 when we try to 
> upgrade.
> Since rolling upgrades are not supported in NiFi, we scale out the revision 
> that is running and {_}run a helm upgrade{_}.
> We have NiFi running in k8s cluster mode; there is a post job that calls the 
> Tenants and Policies API. A successful run looks like this:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id: 
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' 
> entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '200'
> set_policies() Action: 'read' Resource: '/tenants' entity_id: 
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' 
> entity_type: 'USER'
> set_policies() status: '200'{code}
> *_This job was running fine in 1.23.0, 1.22 and other previous versions._* In 
> {*}{{1.23.2}}{*}, we are noticing that the job fails very frequently with the 
> following error logs:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id: 
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' 
> entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '400'
> An error occurred getting 'read' '/flow' policy: 'This node is disconnected 
> from its configured cluster. The requested change will only be allowed if the 
> flag to acknowledge the disconnected node is set.'{code}
> {{_*'This node is disconnected from its configured cluster. The requested 
> change will only be allowed if the flag to acknowledge the disconnected node 
> is set.'*_}}
> The job is configured to run only after all the pods are up and running. 
> Although the pods are up, we see this exception inside the pods:
> {code:java}
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and review flow for corruption.
> at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
> at 
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
> at 
> org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
> at 
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: 
> java.lang.IllegalStateException: Cannot change destination of Connection 
> because the current destination is running
> at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
> at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
> at 
> org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
> at 
> org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
> at 
> org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
> at 
> org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
> at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
> ... 4 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of 
> Connection because the current destination is running
> at 
> org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
> at 
> org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
> at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
> at 
> org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
> at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
> ... 10 common frames omitted{code}
> Attaching screenshots of the UI as well. We observe this issue often when 
> checking with the CLI command:
> {code:java}
> ./cli.sh nifi cluster-summary -u 
> https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts 
> /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks 
> /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
> Total node count: 0
> Connected node count: 0
> Clustered: true
> Connected to cluster: false{code}
>  
> We tried the following workaround:
> {code:java}
> 1. Exec into the pod that has the flow file issue and delete the flow file 
> so that it is removed from the PVC
> 2. Exit from the pod
> 3. Delete the pod that had the problem{code}
> The pod will respawn, and the cluster coordinator will recreate the flow file 
> from the connected nodes.
> This connected all the nodes. But this does not feel like an ideal solution, 
> as we're seeing this issue quite often and cannot run this workaround every 
> time.
> !image-2023-10-16-16-12-31-027.png!
>  
> We also see this exception sometimes:
> {code:java}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /nifi/leaders/Cluster Coordinator
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
>         at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
>         at 
> org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
>         at 
> org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
>         at 
> org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
>         at 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
>         at 
> org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
>         at 
> org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
>         at 
> org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
>         at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
