[
https://issues.apache.org/jira/browse/NIFI-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817377#comment-17817377
]
René Zeidler commented on NIFI-12232:
-------------------------------------
I've encountered the same issue. It has been happening since at least 1.23.2,
and I can reliably reproduce it on 1.25.0 and 2.0.0-M2 as well.
I've been able to create minimal reproduction steps that do not require any
non-standard setup. The issue is independent of any specific processors or any
complicated flow setup. It _always_ occurs when a node disconnects from the
cluster while it contains a process group that hasn't been "fully synced", and
then tries to reconnect. I'll explain what that means in the reproduction steps.
h2. Minimal Reproduction Steps
# Set up a NiFi {*}cluster with at least 3 nodes{*}, using all default settings.
You may adjust {{nifi.cluster.flow.election.max.wait.time}} and
{{nifi.cluster.flow.election.max.candidates}} to make the node connection
process faster, but this isn't necessary to reproduce the bug.
# I'll call the nodes {*}Node A{*}, {*}Node B{*}, and {*}Node C{*}.
Open the web interface for Node A and Node B.
# On {*}Node A{*}, create a new {*}process group{*}. In that process group,
create a very simple flow: GenerateFlowFile going into UpdateAttribute going
into a funnel. Start the UpdateAttribute processor. Like this:
!image-2024-02-14-13-33-44-354.png!
The exact flow doesn't matter; all that's necessary to reproduce the bug is a
*running processor* with an {*}incoming and outgoing connection{*}.
# On {*}Node B{*}, observe that the process group has _automatically synced_
(Right click -> Refresh if you don't want to wait).
# On {*}Node A{*}, go to *Menu -> Cluster* (top right hamburger menu).
{*}Disconnect Node B{*}. Click refresh (bottom left) until the node has
disconnected.
# Right after it was disconnected, *connect Node B* again. Click refresh to
see the status change. It will change to CONNECTING and quickly back to
DISCONNECTED. Check the log file for Node B. You will see the following
exception:
{{o.a.nifi.controller.StandardFlowService Handling reconnection request failed
due to: org.apache.nifi.controller.serialization.FlowSynchronizationException:
Failed to connect node to cluster because local flow controller partially
updated. Administrator should disconnect node and review flow for corruption.}}
[...]
{{Caused by:
org.apache.nifi.controller.serialization.FlowSynchronizationException:
java.lang.IllegalStateException: Cannot change destination of Connection
because the current destination is running}}
# On {*}Node B{*}, you will get the warning that the node is disconnected from
the cluster ({_}This node is currently not connected to the cluster. Any
modifications to the data flow made here will not replicate across the
cluster.{_})
Go into the process group. Observe that the UpdateAttribute processor is
{*}running{*}, which is the direct cause of the exception.
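Steps 5 and 6 can also be scripted instead of clicked through. The following is a minimal sketch, assuming the NiFi REST API endpoint {{PUT /nifi-api/controller/cluster/nodes/\{id\}}} accepts a node entity whose status is set to {{DISCONNECTING}} or {{CONNECTING}}. The base URL, the node UUID, and the helper names are hypothetical placeholders, and TLS client setup is omitted. The node UUIDs would first have to be looked up, for example with {{GET /nifi-api/controller/cluster}}.

```python
import json
import urllib.request

# Assumption: base URL of a node that stays connected (hypothetical host).
BASE_URL = "https://node-a.example:8443/nifi-api"


def node_status_request(base_url, node_uuid, status):
    """Build (url, body) for changing a cluster node's status.

    Assumes PUT /controller/cluster/nodes/{id} with a node entity.
    Use "DISCONNECTING" for step 5 and "CONNECTING" for step 6.
    """
    if status not in ("DISCONNECTING", "CONNECTING"):
        raise ValueError(f"unsupported status: {status}")
    url = f"{base_url}/controller/cluster/nodes/{node_uuid}"
    body = {"node": {"nodeId": node_uuid, "status": status}}
    return url, body


def send_put(url, body, token=None):
    """Send the request. Not called below because it needs a live cluster."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical UUID standing in for Node B.
    node_b = "00000000-0000-0000-0000-000000000000"
    for status in ("DISCONNECTING", "CONNECTING"):
        url, body = node_status_request(BASE_URL, node_b, status)
        print("PUT", url, json.dumps(body))
```

Run as-is, the script only prints the two requests. Passing each pair to {{send_put}} against a real cluster should trigger the disconnect and the failing reconnect described above.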
h3. Temporary fix
# On {*}Node B{*}, *stop* the UpdateAttribute processor.
# On {*}Node A{*}, *connect Node B* again. This time it will work and Node B
successfully reconnects to the cluster.
# However, this only allows Node B to reconnect once. The process group on
Node B is still in an inconsistent state and will fail to reconnect the next
time. Repeat steps 5 - 7 above to confirm that the issue persists.
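Stopping the processor on the disconnected node can be scripted as well. This is a minimal sketch, assuming the REST API's {{PUT /nifi-api/processors/\{id\}/run-status}} endpoint and its {{disconnectedNodeAcknowledged}} field, which appears to be the flag that NiFi's "node is disconnected" error message asks for. The processor ID, the revision version, and the helper name are hypothetical placeholders.

```python
import json


def stop_processor_request(base_url, processor_id, revision_version):
    """Build (url, body) to stop a processor on a disconnected node.

    Assumes PUT /processors/{id}/run-status. The revision version must
    match the processor's current revision, which would first be read
    with GET /processors/{id} (not shown here).
    """
    url = f"{base_url}/processors/{processor_id}/run-status"
    body = {
        "revision": {"version": revision_version},
        "state": "STOPPED",
        # Node B is disconnected, so the change must be acknowledged.
        "disconnectedNodeAcknowledged": True,
    }
    return url, body


if __name__ == "__main__":
    # Hypothetical values; send the request to Node B itself.
    url, body = stop_processor_request(
        "https://node-b.example:8443/nifi-api", "processor-id-placeholder", 1
    )
    print("PUT", url, json.dumps(body))
```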
h3. Permanent fix
# On {*}Node B{*}, stop the UpdateAttribute processor and then {*}delete the
whole process group{*}. Since Node B is currently disconnected from the
cluster, this will only delete the process group locally on this node.
# On {*}Node A{*}, *connect Node B* again. The reconnection will be
successful and the deleted process group will sync back to Node B. Since the
whole process group was missing, this will now be a "full sync".
# This specific process group on this specific node (Node B) is now "fixed".
It will not cause this issue anymore.
To confirm, repeat steps 5 and 6 above. You can disconnect and reconnect Node B
without issues.
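The local delete in step 1 could be issued over REST too. This is a minimal sketch, assuming {{DELETE /nifi-api/process-groups/\{id\}}} takes the client revision and the disconnected-node acknowledgement as query parameters. The group ID, the revision version, and the helper name are hypothetical placeholders.

```python
from urllib.parse import urlencode


def delete_group_request(base_url, group_id, revision_version):
    """Build the DELETE URL for removing a process group locally.

    Assumes the query parameters `version` (current revision of the
    group) and `disconnectedNodeAcknowledged`. The request must go to
    the disconnected node (Node B), not to the cluster.
    """
    query = urlencode(
        {"version": revision_version, "disconnectedNodeAcknowledged": "true"}
    )
    return f"{base_url}/process-groups/{group_id}?{query}"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print("DELETE", delete_group_request(
        "https://node-b.example:8443/nifi-api", "group-id-placeholder", 1
    ))
```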
h2. Further notes
* Instead of deleting the process group, you can also stop the disconnected
node completely, delete the flow.json/flow.xml, and start it again. It will
join the cluster again, and all process groups will be "fully synced". This fix
was described in previous comments, but is not necessary to reproduce the issue.
* The fix applies per process group and per node. After fixing the issue for
Node B with the "permanent fix" above, it will still affect Node C. If you
disconnect and try to reconnect Node C it will throw the same exception.
* Also, the node where you initially created the flow (in this example Node
A) is _not_ exempt. If you go to Node C, disconnect and try to reconnect Node
A, it will throw the same exception.
h2. Full error log
{code:java}
2024-02-14 12:49:40,487 INFO [Reconnect to Cluster]
o.a.nifi.controller.StandardFlowService Processing reconnection request from
cluster coordinator.
2024-02-14 12:49:40,487 INFO [Process Cluster Protocol Request-13]
o.a.n.c.p.impl.SocketProtocolListener Finished processing request
1b2d4350-0982-4548-8aa7-10df3d50ced7 (type=RECONNECTION_REQUEST, length=19212
bytes) from nifi-2b:8443 in 7 millis
2024-02-14 12:49:40,487 INFO [Reconnect to Cluster]
o.a.n.c.c.node.NodeClusterCoordinator Resetting cluster node statuses from
{685d125a-67a1-4f49-b2ea-1062c99bcafd=NodeConnectionStatus[nodeId=nifi-2a:8443,
state=CONNECTED, updateId=12],
fab3e7e2-3b39-444a-b37e-4924dbd74999=NodeConnectionStatus[nodeId=nifi-2c:8443,
state=CONNECTED, updateId=39],
17414631-7ac5-425e-8ce0-d962186017f5=NodeConnectionStatus[nodeId=nifi-2b:8443,
state=CONNECTING, updateId=48]} to
{nifi-2c:8443=NodeConnectionStatus[nodeId=nifi-2c:8443, state=CONNECTED,
updateId=39], nifi-2b:8443=NodeConnectionStatus[nodeId=nifi-2b:8443,
state=CONNECTING, updateId=48],
nifi-2a:8443=NodeConnectionStatus[nodeId=nifi-2a:8443, state=CONNECTED,
updateId=12]}
2024-02-14 12:49:40,488 INFO [Reconnect to Cluster]
o.a.nifi.controller.StandardFlowService Setting Flow Controller's Node ID:
nifi-2b:8443
2024-02-14 12:49:40,488 INFO [Reconnect to Cluster]
o.a.n.c.s.VersionedFlowSynchronizer Synchronizing FlowController with proposed
flow: Controller Already Synchronized = true
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster]
o.a.n.c.s.VersionedFlowSynchronizer In order to inherit proposed dataflow, will
stop any components that will be affected by the update
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster]
o.a.n.c.s.AffectedComponentSet Stopping the following components:
AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[],
remoteOutputPorts=[], processors=[], parameterProviders=[],
flowRegistryClients=[], controllerServices=[], reportingTasks=[],
flowAnalysisRules=[], statelessProcessGroups=[]]
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster]
o.a.n.c.s.AffectedComponentSet Successfully stopped all components in 0
milliseconds
2024-02-14 12:49:40,501 INFO [Reconnect to Cluster]
o.apache.nifi.controller.FlowController [Timer Driven] Maximum Thread Count
updated [10] previous [10]
2024-02-14 12:49:40,502 INFO [Reconnect to Cluster]
o.a.n.f.s.StandardVersionedComponentSynchronizer No differences between current
flow and proposed flow for
StandardProcessGroup[identifier=a734e715-018d-1000-6784-b2e925615966,name=NiFi
Flow]
2024-02-14 12:49:40,502 INFO [Reconnect to Cluster]
o.a.nifi.groups.StandardProcessGroup
StandardFunnel[id=a7a9bd0e-018d-1000-0000-00003ac8000f-temp-funnel] added to
StandardProcessGroup[identifier=a7a9bd0e-018d-1000-0000-00003ac8000f,name=Cluster
Reconnect Bug Test]
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster]
o.a.n.c.s.AffectedComponentSet Starting the following components:
AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[],
remoteOutputPorts=[], processors=[], parameterProviders=[],
flowRegistryClients=[], controllerServices=[], reportingTasks=[],
flowAnalysisRules=[], statelessProcessGroups=[]]
2024-02-14 12:49:40,503 ERROR [Reconnect to Cluster]
o.a.nifi.controller.StandardFlowService Handling reconnection request failed
due to: org.apache.nifi.controller.serialization.FlowSynchronizationException:
Failed to connect node to cluster because local flow controller partially
updated. Administrator should disconnect node and review flow for corruption.
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed
to connect node to cluster because local flow controller partially updated.
Administrator should disconnect node and review flow for corruption.
at
org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
at
org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
at
org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by:
org.apache.nifi.controller.serialization.FlowSynchronizationException:
java.lang.IllegalStateException: Cannot change destination of Connection
because the current destination is running
at
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
at
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
at
org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
at
org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
at
org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
at
org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
... 3 common frames omitted
Caused by: java.lang.IllegalStateException: Cannot change destination of
Connection because the current destination is running
at
org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:295)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:549)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:445)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
at
org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
at
org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
at
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
... 8 common frames omitted
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster]
o.a.n.c.c.node.NodeClusterCoordinator nifi-2b:8443 requested disconnection from
cluster due to
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed
to connect node to cluster because local flow controller partially updated.
Administrator should disconnect node and review flow for corruption.
2024-02-14 12:49:40,503 INFO [Reconnect to Cluster]
o.a.n.c.c.node.NodeClusterCoordinator Status of nifi-2b:8443 changed from
NodeConnectionStatus[nodeId=nifi-2b:8443, state=CONNECTING, updateId=48] to
NodeConnectionStatus[nodeId=nifi-2b:8443, state=DISCONNECTED, Disconnect
Code=Node's Flow did not Match Cluster Flow, Disconnect
Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException:
Failed to connect node to cluster because local flow controller partially
updated. Administrator should disconnect node and review flow for corruption.,
updateId=48] {code}
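To check whether a node's log shows this specific failure rather than some other flow mismatch, the two signature messages can be searched for together. This is a minimal sketch: the phrases are copied from the log above, while the function name and the result keys are made up for illustration. Whitespace is normalized first so that phrases wrapped across lines still match.

```python
import re

# Signature phrases copied from the error log above.
SYMPTOM = "local flow controller partially updated"
ROOT_CAUSE = (
    "Cannot change destination of Connection "
    "because the current destination is running"
)


def find_reconnect_failure(log_text):
    """Report whether the log contains the symptom and the root cause."""
    # Collapse newlines and repeated spaces so wrapped phrases match.
    flat = re.sub(r"\s+", " ", log_text)
    symptom = SYMPTOM in flat
    root_cause = ROOT_CAUSE in flat
    return {
        "symptom": symptom,
        "root_cause": root_cause,
        "matches_this_issue": symptom and root_cause,
    }


if __name__ == "__main__":
    sample = (
        "Failed to connect node to cluster because local flow controller\n"
        "partially updated.\n"
        "Caused by: java.lang.IllegalStateException: Cannot change destination of\n"
        "Connection because the current destination is running\n"
    )
    print(find_reconnect_failure(sample))
```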
> Frequent "failed to connect node to cluster because local flow controller
> partially updated. Administrator should disconnect node and review flow for
> corruption"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-12232
> URL: https://issues.apache.org/jira/browse/NIFI-12232
> Project: Apache NiFi
> Issue Type: Bug
> Components: Configuration Management
> Affects Versions: 1.23.2
> Reporter: John Joseph
> Priority: Major
> Attachments: image-2023-10-16-16-12-31-027.png,
> image-2024-02-14-13-33-44-354.png
>
>
> We have been observing this issue in NiFi 1.23.2 when we try to upgrade.
> Since rolling upgrades are not supported in NiFi, we scale out the revision
> that is running and {_}run a helm upgrade{_}.
> We have NiFi running in k8s cluster mode, and there is a post job that calls
> the Tenants and Policies API. A successful run looks like this:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id:
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin'
> entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '200'
> set_policies() Action: 'read' Resource: '/tenants' entity_id:
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin'
> entity_type: 'USER'
> set_policies() status: '200'{code}
> *_This job was running fine in 1.23.0, 1.22 and other previous versions._* In
> {*}{{1.23.2}}{*}, we are noticing that the job fails very frequently with
> the following error logs:
> {code:java}
> set_policies() Action: 'read' Resource: '/flow' entity_id:
> 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin'
> entity_type: 'USER'
> set_policies() status: '200'
> 'read' '/flow' policy already exists. It will be updated...
> set_policies() fetching policy inside -eq 200 status: '200'
> set_policies() after update PUT: '400'
> An error occurred getting 'read' '/flow' policy: 'This node is disconnected
> from its configured cluster. The requested change will only be allowed if the
> flag to acknowledge the disconnected node is set.'{code}
> {{_*'This node is disconnected from its configured cluster. The requested
> change will only be allowed if the flag to acknowledge the disconnected node
> is set.'*_}}
> The job is configured to run only after all the pods are up and running.
> Although the pods are up, we see this exception inside the pods:
> {code:java}
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed
> to connect node to cluster because local flow controller partially updated.
> Administrator should disconnect node and review flow for corruption.
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
> at
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
> at
> org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
> at
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
> at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by:
> org.apache.nifi.controller.serialization.FlowSynchronizationException:
> java.lang.IllegalStateException: Cannot change destination of Connection
> because the current destination is running
> at
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
> at
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
> at
> org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
> at
> org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
> at
> org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
> at
> org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
> ... 4 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of
> Connection because the current destination is running
> at
> org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
> at
> org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
> at
> org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
> at
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
> ... 10 common frames omitted{code}
> Attaching screenshots of the UI as well. We observe this issue often when
> checking with the CLI command:
> {code:java}
> ./cli.sh nifi cluster-summary -u
> https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts
> /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks
> /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
> Total node count: 0
> Connected node count: 0
> Clustered: true
> Connected to cluster: false{code}
>
> We tried this workaround:
> {code:java}
> 1. Exec into the pod that has the flow file issue and delete the flow file
> so that it is removed from the PVC
> 2. Exit from pod
> 3. Delete the pod that had the problem{code}
> The pod will respawn, and the cluster coordinator will recreate the flow file
> from the connected nodes.
> This connected all the nodes, but it does not feel like an ideal solution,
> as we're seeing this issue quite often and cannot run this workaround every
> time.
> !image-2023-10-16-16-12-31-027.png!
>
> We also see this exception sometimes:
> {code:java}
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
> = ConnectionLoss for /nifi/leaders/Cluster Coordinator
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
> at
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
> at
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
> at
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
> at
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
> at
> org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
> at
> org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
> at
> org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
> at
> org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
> at
> org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
> at
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
> at
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
> at
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
> at
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
> at
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
> at
> org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
> at
> org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
> at
> org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
> at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110){code}