[jira] [Commented] (NIFI-12969) Under heavy load, nifi node unable to rejoin cluster, graph modified with temp funnel

Peter Gyori (Jira) Wed, 03 Apr 2024 07:23:15 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833586#comment-17833586
 ]


Peter Gyori commented on NIFI-12969:
------------------------------------

Thank you [~Nissim Shiman] for the detailed steps to reproduce the issue.

I managed to reproduce it with your steps even on a very recent build of NiFi, 
where the latest commit on the main branch is 
{noformat}
commit b574a7e4: NIFI-12987 allow controller service type to be 
searchable{noformat}
I used a 3-node cluster. On Apple Silicon one might need a little stricter 
settings for heartbeats:

 

 
{code:java}
nifi.cluster.protocol.heartbeat.interval=1
secnifi.cluster.protocol.heartbeat.missable.max=1 {code}
I needed two GenerateFlowFile processors, both with 2 simultaneous threads (and 
Run Schedule=0 min as you mentioned).

 

Screenshot of the test flow.

!simple_flow.png|width=645,height=323!

With the following log settings, I captured the attached [^nifi-app.log]

 
{code:java}
<logger name="org.apache.nifi.connectable" level="TRACE"/>
<logger name="org.apache.nifi.flow.synchronization" level="DEBUG"/>
<logger name="org.apache.nifi.controller" level="DEBUG"/> {code}
I paste here the log entries that I find the most important. All of these 
happen within 2 milliseconds:

 

 
{code:java}
2024-04-03 13:48:53,896 TRACE [Timer-Driven Process Thread-2] 
org.apache.nifi.connectable.LocalPort 
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input, 
group=pg] obtained claim for FlowFileGate
2024-04-03 13:48:53,896 DEBUG [Reconnect to Cluster] 
o.a.n.f.s.StandardVersionedComponentSynchronizer Changing destination of 
Connection Connection[ID=a3c84edc-018e-1000-0000-000013bbf65e, Source 
ID=a6393e3a-3f79-1d9a-b4eb-2a25ca912ab9, Dest 
ID=a3c770bd-018e-1000-0000-00000d6b782a] from 
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input, 
group=pg] to StandardFunnel[id=a34c81f0-018e-1000-7faa-412adc4de192-temp-funnel]
2024-04-03 13:48:53,896 DEBUG [Timer-Driven Process Thread-2] 
org.apache.nifi.connectable.LocalPort 
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input, 
group=pg] Transferred 1 FlowFiles
2024-04-03 13:48:53,896 TRACE [Timer-Driven Process Thread-2] 
org.apache.nifi.connectable.LocalPort 
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input, 
group=pg] released claim for FlowFileGate
2024-04-03 13:48:53,896 INFO [Reconnect to Cluster] 
o.a.n.c.s.AffectedComponentSet Starting the following components: 
AffectedComponentSet[inputPorts=[], outputPorts=[], remoteInputPorts=[], 
remoteOutputPorts=[], processors=[], parameterProviders=[], 
flowRegistryClients=[], controllerServices=[], reportingTasks=[], 
flowAnalysisRules=[], statelessProcessGroups=[]]
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster] 
o.a.nifi.controller.StandardFlowService Disconnecting node due to Failed to 
properly handle Reconnection request due to 
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
to connect node to cluster because local flow controller partially updated. 
Administrator should disconnect node and review flow for corruption.
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster] 
o.apache.nifi.controller.FlowController Will no longer send heartbeats
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster] 
o.apache.nifi.controller.FlowController FlowController will stop sending 
heartbeats to Cluster Coordinator
2024-04-03 13:48:53,897 INFO [Reconnect to Cluster] 
o.apache.nifi.controller.FlowController Cluster State changed from Clustered to 
Not Clustered
2024-04-03 13:48:53,897 ERROR [Reconnect to Cluster] [...] 
    Caused by: java.lang.IllegalStateException: Cannot change destination of 
Connection because FlowFiles from this Connection are currently held by 
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a, type=INPUT_PORT, name=input, 
group=pg]
    at 
org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
    at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:706)
    at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
    at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
    at 
org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
    at 
org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
    at 
org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3863)
    at 
org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464){code}
That is: we have an input port with id a3c770bd-018e-1000-0000-00000d6b782a and 
the following events happen:
 # obtained claim for FlowFileGate (the port's onTrigger is started and 
readLock is active)
 # Changing destination of a connection whose destination is 
a3c770bd-018e-1000-0000-00000d6b782a
 # 1 flowfile is transferred via the port
 # released claim for FlowFileGate (the port's onTrigger finished and readLock 
is unlocked)
 # Disconnecting node.
 # Error: Cannot change destination of Connection because FlowFiles from this 
Connection are currently held by 
LocalPort[id=a3c770bd-018e-1000-0000-00000d6b782a

Apparently, the system tries to change the destination of the port before the 
readLock is unlocked and it causes issues.

Once my cluster was in this erroneous state where one node was disconnected I 
stopped the flow on all nodes and restarted NiFi and saw that all nodes 
synchronized to the flow with the temp funnel. My flow looked like this on all 
nodes after the restart:

!simple_flow_with_temp-funnel.png|width=788,height=401!

I do not yet have a good idea on how to avoid the issue. A naive approach would 
be to wait a little in StandardConnection and check again if there is still any 
unacknowledged flowfile but it would be a brute-force solution that would 
introduce unnecessary slowness.

 

 

> Under heavy load, nifi node unable to rejoin cluster, graph modified with 
> temp funnel
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-12969
>                 URL: https://issues.apache.org/jira/browse/NIFI-12969
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.24.0, 2.0.0-M2
>            Reporter: Nissim Shiman
>            Assignee: Nissim Shiman
>            Priority: Major
>         Attachments: nifi-app.log, simple_flow.png, 
> simple_flow_with_temp-funnel.png
>
>
> Under heavy load, if a node leaves the cluster (due to heartbeat time out), 
> many times it is unable to rejoin the cluster.
> The nodes' graph will have been modified with a temp-funnel as well.
> Appears to be some sort of [timing 
> issue|https://github.com/apache/nifi/blob/rel/nifi-2.0.0-M2/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/StandardConnection.java#L298]
>  # To reproduce, on a nifi cluster of three nodes, set up:
> 2 GenerateFlowFile processors -> PG
> Inside PG:
> inputPort -> UpdateAttribute
>  # Keep all defaults except for the following:
> For UpdateAttribute terminate the success relationship
> One of the GenerateFlowFile processors can be disabled,
> the other one should have Run Schedule to be 0 min (this will allow for the 
> heavy load)
>  # In nifi.properties (on all 3 nodes) to allow for nodes to fall out of the 
> cluster, set: nifi.cluster.protocol.heartbeat.interval=2 sec  (default is 5) 
> nifi.cluster.protocol.heartbeat.missable.max=1   (default is 8)
> Restart nifi. Start flow. The nodes will quickly fall out and rejoin cluster. 
> After a few minutes one will likely not be able to rejoin.  The graph for 
> that node will have the disabled GenerateFlowFile now pointing to a funnel (a 
> temp-funnel) instead of the PG
> Stack trace on that nodes nifi-app.log will look like this: (this is from 
> 2.0.0-M2):
> {code:java}
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] 
> o.a.nifi.controller.StandardFlowService Node disconnected due to Failed to 
> properly handle Reconnection request due to org.apache.nifi.control
> ler.serialization.FlowSynchronizationException: Failed to connect node to 
> cluster because local flow controller partially updated. Administrator should 
> disconnect node and review flow for corrup
> tion.
> 2024-03-28 13:55:19,395 ERROR [Reconnect to Cluster] 
> o.a.nifi.controller.StandardFlowService Handling reconnection request failed 
> due to: org.apache.nifi.controller.serialization.FlowSynchroniza
> tionException: Failed to connect node to cluster because local flow 
> controller partially updated. Administrator should disconnect node and review 
> flow for corruption.
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and
>  review flow for corruption.
>         at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
>         at 
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
>         at 
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
>         at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: 
> java.lang.IllegalStateException: Cannot change destination of Connection 
> because FlowFiles from this Connection
> are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, 
> type=INPUT_PORT, name=inputPort, group=innerPG]
>         at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
>         at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
>         at 
> org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
>         at 
> org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
>         at 
> org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
>         at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
>         ... 3 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of 
> Connection because FlowFiles from this Connection are currently held by 
> LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b
> , type=INPUT_PORT, name=inputPort, group=innerPG]
>         at 
> org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
>         at 
> org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
>         at 
> org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
>         at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
>         ... 8 common frames omitted
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] 
> o.a.n.c.c.node.NodeClusterCoordinator machine-name-2.organization.org:8443 
> requested disconnection from cluster due to org.apache.nifi.c
> ontroller.serialization.FlowSynchronizationException: Failed to connect node 
> to cluster because local flow controller partially updated. Administrator 
> should disconnect node and review flow for
> corruption.
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] 
> o.a.n.c.c.node.NodeClusterCoordinator Status of 
> <machine-name-2.organization>.org:8443 changed from 
> NodeConnectionStatus[nodeId=<machine-name-
> 2.organization>.org:8443, state=CONNECTING, updateId=852] to 
> NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, 
> state=DISCONNECTED, Disconnect Code=Node's Flow did n
> ot Match Cluster Flow, Disconnect 
> Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: 
> Failed to connect node to cluster because local flow controller partially 
> updated.
>  Administrator should disconnect node and review flow for corruption., 
> updateId=854]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NIFI-12969) Under heavy load, nifi node unable to rejoin cluster, graph modified with temp funnel

Reply via email to