[ 
https://issues.apache.org/jira/browse/NIFI-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833705#comment-17833705
 ] 

Nissim Shiman commented on NIFI-12969:
--------------------------------------

Thank you very much [~pgyori] for the testing, diagrams, and logs, and for 
bringing out the finer details of this.
This is greatly appreciated.

I am looking at a solution like the one you suggested: wait briefly in 
StandardConnection and, if the destination still cannot be changed after a 
short period, give up and have the calling method undo the temp connections 
(and remove the temp funnel), so that the graph stays the way it was before the 
temp connection(s) were made.

There doesn't appear to be an easy way to predict this situation and so avoid 
making the temp connections in the first place. I've tried checking for 
unacknowledged flowfiles on all of a group's connections before making 
temp-funnel/temp destinations (for any connection). But even if that check 
passes, by the time we get to setting an individual connection's destination, 
there could be unacknowledged flowfiles on the connection again.
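
To make the idea concrete, here is a minimal sketch of the "wait briefly, then give up and undo" approach. It is not NiFi code: the class DestinationChangeSketch and its methods are hypothetical, and the change itself (for example, a call that sets a connection's destination) is passed in as a Runnable that is assumed to throw IllegalStateException while flowfiles are still held. Because a separate pre-check can go stale, the sketch simply attempts the change and retries until a deadline.

```java
import java.util.Deque;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not NiFi code: retry a change briefly, then undo temporary graph edits. */
final class DestinationChangeSketch {

    private DestinationChangeSketch() {
    }

    /**
     * Attempts the change until it succeeds or the timeout elapses. The change is
     * assumed to throw IllegalStateException while flowfiles are still held, so
     * there is no separate pre-check that could go stale before the change runs.
     */
    static boolean retryUntilDeadline(Runnable change, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                change.run();
                return true;
            } catch (IllegalStateException stillHeld) {
                if (System.nanoTime() - deadline >= 0) {
                    return false; // give up after the short wait
                }
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
        }
    }

    /**
     * Applies the change, or on failure runs the undo actions newest-first
     * (for example: remove the temp connection, then remove the temp funnel).
     */
    static boolean applyOrRollback(Runnable change, Deque<Runnable> undoActions,
                                   long timeoutMillis, long pollMillis) {
        if (retryUntilDeadline(change, timeoutMillis, pollMillis)) {
            return true;
        }
        while (!undoActions.isEmpty()) {
            undoActions.pop().run();
        }
        return false;
    }
}
```

In this sketch the caller would push one undo action onto the deque each time it makes a temporary edit, so that a failed attempt unwinds the edits in reverse order. The timeout and poll interval are placeholders; suitable values would need to be worked out against a real cluster under load.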

> Under heavy load, nifi node unable to rejoin cluster, graph modified with 
> temp funnel
> -------------------------------------------------------------------------------------
>
>                 Key: NIFI-12969
>                 URL: https://issues.apache.org/jira/browse/NIFI-12969
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.24.0, 2.0.0-M2
>            Reporter: Nissim Shiman
>            Assignee: Nissim Shiman
>            Priority: Major
>         Attachments: nifi-app.log, simple_flow.png, 
> simple_flow_with_temp-funnel.png
>
>
> Under heavy load, if a node leaves the cluster (due to heartbeat timeout), 
> it is often unable to rejoin the cluster.
> The node's graph will also have been modified with a temp-funnel.
> Appears to be some sort of [timing 
> issue|https://github.com/apache/nifi/blob/rel/nifi-2.0.0-M2/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-components/src/main/java/org/apache/nifi/connectable/StandardConnection.java#L298]
>  # To reproduce, on a nifi cluster of three nodes, set up:
> 2 GenerateFlowFile processors -> PG
> Inside PG:
> inputPort -> UpdateAttribute
>  # Keep all defaults except for the following:
> For UpdateAttribute, terminate the success relationship.
> One of the GenerateFlowFile processors can be disabled;
> the other one should have its Run Schedule set to 0 min (this allows for the 
> heavy load).
>  # In nifi.properties (on all 3 nodes), to allow nodes to fall out of the 
> cluster, set:
> nifi.cluster.protocol.heartbeat.interval=2 sec  (default is 5 sec)
> nifi.cluster.protocol.heartbeat.missable.max=1   (default is 8)
> Restart nifi. Start the flow. The nodes will quickly fall out of and rejoin 
> the cluster. After a few minutes, one will likely not be able to rejoin. The 
> graph for that node will have the disabled GenerateFlowFile now pointing to a 
> funnel (a temp-funnel) instead of the PG.
> The stack trace in that node's nifi-app.log will look like this (this is from 
> 2.0.0-M2):
> {code:java}
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] 
> o.a.nifi.controller.StandardFlowService Node disconnected due to Failed to 
> properly handle Reconnection request due to 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 ERROR [Reconnect to Cluster] 
> o.a.nifi.controller.StandardFlowService Handling reconnection request failed 
> due to: 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and review flow for corruption.
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and review flow for corruption.
>         at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:985)
>         at 
> org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:655)
>         at 
> org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:384)
>         at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: 
> java.lang.IllegalStateException: Cannot change destination of Connection 
> because FlowFiles from this Connection
> are currently held by LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, 
> type=INPUT_PORT, name=inputPort, group=innerPG]
>         at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:472)
>         at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:223)
>         at 
> org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1740)
>         at 
> org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:91)
>         at 
> org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:805)
>         at 
> org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:954)
>         ... 3 common frames omitted
> Caused by: java.lang.IllegalStateException: Cannot change destination of 
> Connection because FlowFiles from this Connection are currently held by 
> LocalPort[id=99213c00-78ca-4848-112f-5454cc20656b, type=INPUT_PORT, 
> name=inputPort, group=innerPG]
>         at 
> org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:299)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:705)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:423)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:248)
>         at 
> org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:638)
>         at 
> org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:243)
>         at 
> org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3860)
>         at 
> org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:464)
>         ... 8 common frames omitted
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] 
> o.a.n.c.c.node.NodeClusterCoordinator machine-name-2.organization.org:8443 
> requested disconnection from cluster due to 
> org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed 
> to connect node to cluster because local flow controller partially updated. 
> Administrator should disconnect node and review flow for corruption.
> 2024-03-28 13:55:19,395 INFO [Reconnect to Cluster] 
> o.a.n.c.c.node.NodeClusterCoordinator Status of 
> <machine-name-2.organization>.org:8443 changed from 
> NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, 
> state=CONNECTING, updateId=852] to 
> NodeConnectionStatus[nodeId=<machine-name-2.organization>.org:8443, 
> state=DISCONNECTED, Disconnect Code=Node's Flow did not Match Cluster Flow, 
> Disconnect 
> Reason=org.apache.nifi.controller.serialization.FlowSynchronizationException: 
> Failed to connect node to cluster because local flow controller partially 
> updated. Administrator should disconnect node and review flow for corruption., 
> updateId=854]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
