[
https://issues.apache.org/jira/browse/NIFI-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833586#comment-17833586
]
Peter Gyori commented on NIFI-12969:
------------------------------------
Thank you [~Nissim Shiman] for the detailed steps to reproduce the issue.
I managed to reproduce it with your steps even on a very recent build of NiFi,
where the latest commit on the main branch is
{noformat}
commit b574a7e4: NIFI-12987 allow controller service type to be
searchable{noformat}
I used a 3-node cluster. On Apple Silicon one might need a little stricter
settings for heartbeats:
{code:java}
nifi.cluster.protocol.heartbeat.interval=1
secnifi.cluster.protocol.heartbeat.missable.max=1 {code}
I needed two GenerateFlowFile processors, both with 2 simultaneous threads (and
Run Schedule=0 min as you mentioned).
Screenshot of the test flow.
!simple_flow.png|width=645,height=323!
With the following log settings, I captured the attached [^nifi-app.log]
{code:java}
<logger name="org.apache.nifi.connectable" level="TRACE"/>
<logger name="org.apache.nifi.flow.synchronization" level="DEBUG"/>
<logger name="org.apache.nifi.controller" level="DEBUG"/> {code}
I paste here the log entries that I find the most important. All of these
happen within 2 milliseconds:
{code:java}
2024-04-03 13:48:53,896 TRACE [Timer-Driven Process Thread-2]
org.apache.nifi.connectable.LocalPort
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input,
group=pg] obtained claim for FlowFileGate
2024-04-03 13:48:53,896 DEBUG [Reconnect to Cluster]
o.a.n.f.s.StandardVersionedComponentSynchronizer Changing destination of
Connection Connection[ID=a3c84edc-018e-1000-0000-000013bbf65e, Source
ID=a6393e3a-3f79-1d9a-b4eb-2a25ca912ab9, Dest
ID=a3c770bd-018e-1000-0000-00000d6b782a] from
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input,
group=pg] to StandardFunnel[id=a34c81f0-018e-1000-7faa-412adc4de192-temp-funnel]
2024-04-03 13:48:53,896 DEBUG [Timer-Driven Process Thread-2]
org.apache.nifi.connectable.LocalPort
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input,
group=pg] Transferred 1 FlowFiles
2024-04-03 13:48:53,896 TRACE [Timer-Driven Process Thread-2]
org.apache.nifi.connectable.LocalPort
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input,
group=pg] released claim for FlowFileGate
2024-04-03 13:48:53,896 INFO [Reconnect to Cluster]
o.a.n.c.s.AffectedComponentSet Starting the following components:
AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[],
remoteOutputPorts=[], processors=[], parameterProviders=[],
flowRegistryClients=[], controllerServices=[], reportingTasks=[],
flowAnalysisRules=[], statelessProcessGroups=[]]
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster]
o.a.nifi.controller.StandardFlowService Disconnecting node due to Failed to
properly handle Reconnection request due to
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed
to connect node to cluster because local flow controller partially updated.
Administrator should disconnect node and review flow for corruption.
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster]
o.apache.nifi.controller.FlowController Will no longer send heartbeats
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster]
o.apache.nifi.controller.FlowController FlowController will stop sending
heartbeats to Cluster Coordinator
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster]
o.apache.nifi.controller.FlowController Cluster State changed from Clustered to
Not Clustered
2024-04-03 13:48:53,897 ERROR [Reconnect to Cluster] [...]
Caused by: java.lang.IllegalStateException: Cannot change destination of
Connection because FlowFiles from this Connection are currently held by
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input,
group=pg]
at
org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:706)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
at
org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
at
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
at
org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3863)
at
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464){code}
That is: we have an input port with id a3c770bd-018e-1000-0000-00000d6b782a and
the following events happen:
# obtained claim for FlowFileGate (the port's onTrigger is started and
readLock is active)
# Changing destination of a connection whose destination is
a3c770bd-018e-1000-0000-00000d6b782a
# 1 flowfile is transferred via the port
# released claim for FlowFileGate (the port's onTrigger finished and readLock
is unlocked)
# Disconnecting node.
# Error: Cannot change destination of Connection because FlowFiles from this
Connection are currently held by
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a
Apparently, the system tries to change the destination of the port before the
readLock is unlocked and it causes issues.
Once my cluster was in this erroneous state where one node was disconnected I
stopped the flow on all nodes and restarted NiFi and saw that all nodes
synchronized to the flow with the temp funnel. My flow looked like this on all
nodes after the restart:
!simple_flow_with_temp-funnel.png|width=788,height=401!
I do not yet have a good idea on how to avoid the issue. A naive approach would
be to wait a little in StandardConnection and check again if there is still any
unacknowledged flowfile but it would be a brute-force solution that would
introduce unnecessary slowness.
> Under heavy load, nifi node unable to rejoin cluster, graph modified with
> temp funnel
> -------------------------------------------------------------------------------------
>
> Key: NIFI-12969
> URL: https://issues.apache.org/jira/browse/NIFI-12969
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.24.0, 2.0.0-M2
> Reporter: Nissim Shiman
> Assignee: Nissim Shiman
> Priority: Major
> Attachments: nifi-app.log, simple_flow.png,
> simple_flow_with_temp-funnel.png
>
>
> Under heavy load, if a node leaves the cluster (due to heartbeat time out),
> many times it is unable to rejoin the cluster.
> The nodes' graph will have been modified with a temp-funnel as well.
> Appears to be some sort of [timing
> issue|https://github.com/apache/nifi/blob/rel/nifi-2.0.0-M2/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/StandardConnection.java#L298]
> # To reproduce, on a nifi cluster of three nodes, set up:
> 2 GenerateFlowFile processors -> PG
> Inside PG:
> inputPort -> UpdateAttribute
> # Keep all defaults except for the following:
> For UpdateAttribute terminate the success relationship
> One of the GenerateFlowFile processors can be disabled,
> the other one should have Run Schedule to be 0 min (this will allow for the
> heavy load)
> # In nifi.properties (on all 3 nodes) to allow for nodes to fall out of the
> cluster, set: nifi.cluster.protocol.heartbeat.interval=2 sec (default is 5)
> nifi.cluster.protocol.heartbeat.missable.max=1 (default is 8)
> Restart nifi. Start flow. The nodes will quickly fall out and rejoin cluster.
> After a few minutes one will likely not be able to rejoin. The graph for
> that node will have the disabled GenerateFlowFile now pointing to a funnel (a
> temp-funnel) instead of the PG
> Stack trace on that nodes nifi-app.log will look like this: (this is from
> 2.0.0-M2):
> {code:java}
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster]
> o.a.nifi.controller.StandardFlowService Node disconnected due to Failed to
> properly handle Reconnection request due to org.apache.nifi.control
> ler.serialization.FlowSynchronizationException: Failed to connect node to
> cluster because local flow controller partially updated. Administrator should
> disconnect node and review flow for corrup
> tion.
> 2024-03-28 13:55:19,395 ERROR [Reconnect to Cluster]
> o.a.nifi.controller.StandardFlowService Handling reconnection request failed
> due to: org.apache.nifi.controller.serialization.FlowSynchroniza
> tionException: Failed to connect node to cluster because local flow
> controller partially updated. Administrator should disconnect node and review
> flow for corruption.
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed
> to connect node to cluster because local flow controller partially updated.
> Administrator should disconnect node and
> review flow for corruption.
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
> at
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
> at
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by:
> org.apache.nifi.controller.serialization.FlowSynchronizationException:
> java.lang.IllegalStateException: Cannot change destination of Connection
> because FlowFiles from this Connection
> are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b,
> type=INPUT_PORT, name=inputPort, group=innerPG]
> at
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
> at
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
> at
> org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
> at
> org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
> at
> org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
> at
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
> ... 3 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of
> Connection because FlowFiles from this Connection are currently held by
> LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b
> , type=INPUT_PORT, name=inputPort, group=innerPG]
> at
> org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
> at
> org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
> at
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
> at
> org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
> at
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
> ... 8 common frames omitted
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster]
> o.a.n.c.c.node.NodeClusterCoordinator machine-name-2.organization.org:8443
> requested disconnection from cluster due to org.apache.nifi.c
> ontroller.serialization.FlowSynchronizationException: Failed to connect node
> to cluster because local flow controller partially updated. Administrator
> should disconnect node and review flow for
> corruption.
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster]
> o.a.n.c.c.node.NodeClusterCoordinator Status of
> <machine-name-2.organization>.org:8443 changed from
> NodeConnectionStatus[nodeId=<machine-name-
> 2.organization>.org:8443, state=CONNECTING, updateId=852] to
> NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443,
> state=DISCONNECTED, Disconnect Code=Node's Flow did n
> ot Match Cluster Flow, Disconnect
> Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException:
> Failed to connect node to cluster because local flow controller partially
> updated.
> Administrator should disconnect node and review flow for corruption.,
> updateId=854]
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)