[
https://issues.apache.org/jira/browse/NIFI-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644967#comment-17644967
]
Joe Witt commented on NIFI-9559:
--------------------------------
Any news on the issue https://issues.apache.org/jira/browse/NIFI-9559 ?
We also have a 3-node NiFi cluster (1.19.0) on AWS ECS Fargate with an external
ZooKeeper (3.8.0, 3 ZK nodes) and see the same issue.
Under high load the leading node loses its connection to the cluster and does
not reconnect correctly.
Was the load high on the ZK node or on the NiFi node?
Only on the NiFi node. I start the process on NiFi and after 20-30 minutes the
connection breaks down.
But memory and CPU on the node are okay.
The task on ECS is still running and I can access the node via its direct IP. In
Cluster Management it is displayed as disconnected.
and once it breaks down it is unable to restore?
Right, the task is running. And when I restart the task, it breaks down and
cannot restore.
yeesh that is brutal. How do you get it back in?
Not really back into a working state. If I shut down the node and delete
flow.xml.gz, I can start the node and it reconnects to the cluster.
But the whole canvas is lost. So at the moment I have no recovery procedure.
Why would you have to delete the flow.xml.gz and/or why wouldn't that node
rejoin the cluster and inherit the flow... ?
After the first connection loss, the same thing happens to all nodes in
turn:
Node 1 is leader => processes load => after a few minutes it loses its
connection to the cluster
Node 2 becomes leader => after a few minutes it loses its connection
So restarting the node with an untouched flow.xml.gz leads to a task that does
not start. I could see an error message in Flow Initialization.
Can you share/attach those logs,
and put this description in the Jira?
yes, I will do. But I have to do it tomorrow :smile:
but thanks for your time and questions
Hi Joe, I want to update you. The problem is solved. The main issue was a
throttled throughput mode on AWS EFS. We are using EFS as storage for the NiFi
data that has to persist (state, content_repository, flow.file,
database_repository and so on). It was wrongly configured as bursting, and the
limit was reached very quickly during processing. Because of the throttling,
the node lost its connection to the cluster. And then there was a ping-pong
effect because every node uses the same EFS filesystem (but a different
folder).
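
To illustrate why a bursting file system can look healthy at first and then
stall partway through a run, here is a minimal sketch of a simplified
token-bucket model. It is not AWS's exact credit accounting, and the function
name, parameters, and all numbers are hypothetical illustrations, not
measurements from this cluster.

```python
import math


def seconds_until_throttled(credit_bytes: float,
                            baseline_bytes_per_s: float,
                            workload_bytes_per_s: float) -> float:
    """Estimate how long a sustained workload can run before burst credits
    are exhausted, under a simplified token-bucket model (assumption).

    Credits refill at the baseline rate and are spent at the workload rate,
    so the net drain is the difference between the two.
    """
    drain = workload_bytes_per_s - baseline_bytes_per_s
    if drain <= 0:
        # The workload fits within the baseline rate; credits never run out.
        return math.inf
    return credit_bytes / drain


# Hypothetical values chosen only for illustration.
GIB = 1024 ** 3
MIB = 1024 ** 2
remaining = seconds_until_throttled(
    credit_bytes=60 * GIB,
    baseline_bytes_per_s=5 * MIB,
    workload_bytes_per_s=45 * MIB,
)
print(f"credits last about {remaining / 60:.1f} minutes")
```

Under this model, a sustained workload above the baseline rate runs at full
speed until the credits are gone and is then limited to the baseline rate,
which would be consistent with a node working normally for a while before
disk I/O slows down.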
> Zookeeper Client Can't Reconnect - Session timeout has elapsed while SUSPENDED
> ------------------------------------------------------------------------------
>
> Key: NIFI-9559
> URL: https://issues.apache.org/jira/browse/NIFI-9559
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Shawn Weeks
> Assignee: Matt Burgess
> Priority: Minor
> Attachments: nifi_and_zookeeper_logs.txt, nifi_error.log
>
>
> After a loss of connection to Zookeeper a NiFi node never successfully
> reconnects to the Zookeeper or the Cluster and instead returns errors about
> no Cluster Coordinator and a Session timeout has elapsed while SUSPENDED
> repeatedly until you restart NiFi.
> The error described is the same one at
> https://issues.apache.org/jira/browse/CURATOR-405 however that patch has been
> in NiFi for several versions now.
> NiFi version is 1.15.3 and Zookeeper 3.6.3
--
This message was sent by Atlassian Jira
(v8.20.10#820010)