[
https://issues.apache.org/jira/browse/NIFI-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Gyori updated NIFI-12969:
-------------------------------
Attachment: simple_flow.png
> Under heavy load, nifi node unable to rejoin cluster, graph modified with
> temp funnel
> -------------------------------------------------------------------------------------
>
> Key: NIFI-12969
> URL: https://issues.apache.org/jira/browse/NIFI-12969
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.24.0, 2.0.0-M2
> Reporter: Nissim Shiman
> Assignee: Nissim Shiman
> Priority: Major
> Attachments: simple_flow.png
>
>
> Under heavy load, if a node leaves the cluster (due to a heartbeat timeout),
> it is often unable to rejoin the cluster.
> The node's graph will also have been modified with a temp-funnel.
> Appears to be some sort of [timing
> issue|https://github.com/apache/nifi/blob/rel/nifi-2.0.0-M2/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/StandardConnection.java#L298]
> # To reproduce, on a NiFi cluster of three nodes, set up:
> 2 GenerateFlowFile processors -> PG
> Inside PG:
> inputPort -> UpdateAttribute
> # Keep all defaults except for the following:
> For UpdateAttribute, terminate the success relationship.
> One of the GenerateFlowFile processors can be disabled;
> the other one should have its Run Schedule set to 0 min (this produces the
> heavy load).
> # In nifi.properties (on all 3 nodes), to allow nodes to fall out of the
> cluster, set:
> nifi.cluster.protocol.heartbeat.interval=2 sec (default is 5 sec)
> nifi.cluster.protocol.heartbeat.missable.max=1 (default is 8)
> Restart NiFi. Start the flow. The nodes will quickly fall out of and rejoin
> the cluster.
> After a few minutes, one will likely be unable to rejoin. The graph for
> that node will have the disabled GenerateFlowFile now pointing to a funnel (a
> temp-funnel) instead of the PG.
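
To see why these two settings make nodes drop out so readily, here is a rough sketch of the arithmetic. It assumes the coordinator treats a node as disconnected once no heartbeat has arrived for roughly interval x missable.max; that product rule, and the class and method names below, are illustrative assumptions rather than NiFi code.

```java
import java.time.Duration;

public class HeartbeatWindow {
    // Assumed model (not taken from NiFi source): a node is marked
    // disconnected after interval * missableMax without a heartbeat.
    static Duration disconnectThreshold(Duration interval, int missableMax) {
        return interval.multipliedBy(missableMax);
    }

    public static void main(String[] args) {
        // Defaults quoted in the report: 5 sec interval, 8 missable heartbeats.
        Duration defaults = disconnectThreshold(Duration.ofSeconds(5), 8);
        // Reproduction settings: 2 sec interval, 1 missable heartbeat.
        Duration repro = disconnectThreshold(Duration.ofSeconds(2), 1);

        System.out.println("defaults: " + defaults.getSeconds() + " s");
        System.out.println("repro:    " + repro.getSeconds() + " s");
    }
}
```

Under that assumption the tolerance shrinks from about 40 seconds to about 2 seconds, so any brief stall on a heavily loaded node is enough to trigger a disconnect and a reconnection attempt.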
> The stack trace in that node's nifi-app.log will look like this (this is from
> 2.0.0-M2):
> {code:java}
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Node disconnected due to Failed to properly handle Reconnection request due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 ERROR [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Handling reconnection request failed due to: org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
>     at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
>     at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because FlowFiles from this Connection are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, name=inputPort, group=innerPG]
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
>     at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
>     at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
>     at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
>     at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
>     ... 3 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because FlowFiles from this Connection are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, name=inputPort, group=innerPG]
>     at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
>     at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
>     at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
>     at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
>     at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
>     ... 8 common frames omitted
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator machine-name-2.organization.org:8443 requested disconnection from cluster due to org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] o.a.n.c.c.node.NodeClusterCoordinator Status of <machine-name-2.organization>.org:8443 changed from NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, state=CONNECTING, updateId=852] to NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, state=DISCONNECTED, Disconnect Code=Node's Flow did not Match Cluster Flow, Disconnect Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption., updateId=854]
> {code}
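
The innermost exception in the trace above comes from a guard that refuses to re-point a connection while its current destination still holds FlowFiles. A minimal sketch of that kind of guard follows. The class `SketchConnection`, its in-flight counter, and the `poll`/`acknowledge` methods are hypothetical stand-ins for illustration; this is not NiFi's `StandardConnection`, and how NiFi actually tracks held FlowFiles is an assumption here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DestinationGuardSketch {

    /** Hypothetical stand-in for a connection; not NiFi's implementation. */
    static class SketchConnection {
        // Assumption: held FlowFiles are modeled as a simple in-flight counter.
        private final AtomicInteger heldFlowFiles = new AtomicInteger();
        private String destination;

        SketchConnection(String destination) {
            this.destination = destination;
        }

        // The destination pulls a FlowFile and has not finished with it yet.
        void poll() {
            heldFlowFiles.incrementAndGet();
        }

        // The destination finishes with a FlowFile it pulled earlier.
        void acknowledge() {
            heldFlowFiles.decrementAndGet();
        }

        // Refuse to change the destination while any FlowFile is still held.
        void setDestination(String newDestination) {
            if (heldFlowFiles.get() > 0) {
                throw new IllegalStateException(
                    "Cannot change destination of Connection because FlowFiles "
                    + "from this Connection are currently held by " + destination);
            }
            destination = newDestination;
        }

        String getDestination() {
            return destination;
        }
    }

    public static void main(String[] args) {
        SketchConnection connection = new SketchConnection("inputPort");

        connection.poll(); // a FlowFile is in flight, as under heavy load
        try {
            connection.setDestination("new-destination");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        connection.acknowledge(); // nothing in flight any more
        connection.setDestination("new-destination");
        System.out.println("destination now: " + connection.getDestination());
    }
}
```

In this model the outcome depends only on whether a FlowFile happens to be in flight at the moment `setDestination` is called, which is consistent with the report's observation that the failure is intermittent and tied to heavy load.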
--
This message was sent by Atlassian Jira
(v8.20.10#820010)