Chris McKeever created NIFI-8477:
------------------------------------
Summary: If a node completely dies, can not delete it from the
cluster; AKA Zombie Node
Key: NIFI-8477
URL: https://issues.apache.org/jira/browse/NIFI-8477
Project: Apache NiFi
Issue Type: Bug
Affects Versions: 1.13.2
Environment: Dockerized AWS ECS Instance
Reporter: Chris McKeever
Our nodes are ephemeral. Once they fall over, they don't come back in any
stateful manner. We are aware that this causes data loss.
The issue we are seeing is that when they fall over (right now we are
forcefully knocking them over to test resiliency), the cluster heartbeat
flags them as disconnected, but there is no way to then delete them; the
attempt fails with:
ERROR: Error executing command 'delete-node' : Error deleting node:
java.net.SocketTimeoutException: timeout
We have increased the read/connect timeouts to 20s (from the default 5s), which
changes the error to a `read timeout`.
Increasing those values to anything greater than 30s makes usage unstable
across the board, with responses such as:
{ "servlet":"jerseySpring", "message":"Service Unavailable",
"url":"/nifi-api/flow/current-user", "status":"503" }
ERROR: Error executing command 'get-nodes' : Read timed out
Occasionally, when the stars align, we are able to delete the node via the
toolkit CLI. This happens few and far between, which suggests some kind of
timing issue.
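Since the delete occasionally succeeds, one stopgap is to retry it until it goes through. The following is only a minimal sketch of that idea, not a fix for the underlying bug. The helper name `retry_until_success` and its parameters are hypothetical, and it assumes the caller supplies an `action` callable that performs the delete (for example by invoking the toolkit CLI) and raises `TimeoutError` when the attempt times out.

```python
import time
from typing import Callable


def retry_until_success(
    action: Callable[[], None],
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `action` until it stops raising TimeoutError.

    Returns the number of attempts used. Re-raises the last TimeoutError
    if every attempt fails. `action` is a hypothetical caller-supplied
    function that performs the node delete.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            action()
            return attempt
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts: surface the timeout to the caller
            sleep(delay_s)  # wait before the next try
    raise AssertionError("unreachable")
```

Only timeouts are retried here; any other exception propagates immediately, so a genuine failure (bad node id, auth error) is not hidden by the loop.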
We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned
"I know something super bad happened and well - crap happens - can you help us
clean the cluster back up and get on with life?"
and ...
we need a nicer option. I'm not sure if the CLI does something smart here