[ 
https://issues.apache.org/jira/browse/NIFI-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644967#comment-17644967
 ] 

Joe Witt commented on NIFI-9559:
--------------------------------

Any news on the issue https://issues.apache.org/jira/browse/NIFI-9559 ?
We also have a 3-node NiFi cluster (1.19.0) on AWS ECS Fargate with an external 
ZooKeeper ensemble (3.8.0, 3 ZK nodes) and see the same issue.
Under high load the leading node loses its connection to the cluster and does 
not reconnect correctly.



Was the load high on the ZK node or on the NiFi node?


Only on the NiFi node. I start the process on NiFi and after 20-30 minutes the 
connection breaks down.
But memory and CPU on the node are okay.
The task on ECS is still running and I can access the node via its direct IP, 
but in Cluster Management it is displayed as disconnected.


and once it breaks down it is unable to restore?


Right, the task is running. And when I restart the task, it breaks down again 
and cannot recover.


yeesh that is brutal.  How do you get it back in?


Not really back into a working state. If I shut down the node and delete 
flow.xml.gz, I can start the node and it reconnects to the cluster.
But the whole canvas is lost, so at the moment I have no procedure for recovery.


Why would you have to delete the flow.xml.gz and/or why wouldn't that node 
rejoin the cluster and inherit the flow...  ?


After the connection to the cluster is lost, the same thing happens to all 
nodes in turn:
Node 1 is leader => processes load => after a few minutes it loses its 
connection to the cluster
Node 2 becomes leader => after a few minutes it loses its connection


So restarting the node with an untouched flow.xml.gz leads to a task that does 
not start.


I could see an error message in Flow Initialization

can you share/attach those logs?


and this desc in the jira

yes, I will do. But I have to do it tomorrow :smile:


but thanks for your time and questions

Hi Joe, I want to update you. The problem is solved. The main issue was a 
throttled throughput mode on AWS EFS. We use EFS as storage for the NiFi data 
that has to persist (state, content_repository, flow.file, database_repository 
and so on). It was wrongly configured in bursting mode, and the limit was 
reached very quickly during processing. Because of the throttling, the node 
lost its connection to the cluster. And then there was a ping-pong effect 
because every node uses the same EFS filesystem (but a different folder).
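
For anyone who wants to check their own setup for the same misconfiguration, here is a minimal sketch in Python. It assumes boto3 is installed and AWS credentials are configured; the function names are made up for illustration and are not part of NiFi or AWS tooling. It only reads the `ThroughputMode` field that the EFS DescribeFileSystems API returns, and it looks at the first page of results only.

```python
from typing import Any, Dict, List


def find_bursting_file_systems(response: Dict[str, Any]) -> List[str]:
    """Return the IDs of file systems whose ThroughputMode is 'bursting'.

    `response` is expected to have the shape of an EFS DescribeFileSystems
    result: {"FileSystems": [{"FileSystemId": ..., "ThroughputMode": ...}]}.
    """
    return [
        fs["FileSystemId"]
        for fs in response.get("FileSystems", [])
        if fs.get("ThroughputMode") == "bursting"
    ]


def check_account() -> List[str]:
    """Query EFS in the default region and list bursting-mode file systems.

    Assumption: a single page of results is enough; pagination is not handled.
    """
    # Imported lazily so the helper above can be used without AWS access.
    import boto3

    efs = boto3.client("efs")
    return find_bursting_file_systems(efs.describe_file_systems())
```

A file system reported here is not necessarily a problem. In our case bursting mode only became an issue because the processing load used up the available throughput very quickly.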

> Zookeeper Client Can't Reconnect - Session timeout has elapsed while SUSPENDED
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-9559
>                 URL: https://issues.apache.org/jira/browse/NIFI-9559
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Shawn Weeks
>            Assignee: Matt Burgess
>            Priority: Minor
>         Attachments: nifi_and_zookeeper_logs.txt, nifi_error.log
>
>
> After a loss of connection to Zookeeper a NiFi node never successfully 
> reconnects to the Zookeeper or the Cluster and instead returns errors about 
> no Cluster Coordinator and a Session timeout has elapsed while SUSPENDED 
> repeatedly until you restart NiFi.
> The error described is the same one at 
> https://issues.apache.org/jira/browse/CURATOR-405 however that patch has been 
> in NiFi for several versions now.
> NiFi version is 1.15.3 and Zookeeper 3.6.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
