Chris McKeever created NIFI-8477:
------------------------------------

             Summary: If a node completely dies, can not delete it from the 
cluster; AKA Zombie Node
                 Key: NIFI-8477
                 URL: https://issues.apache.org/jira/browse/NIFI-8477
             Project: Apache NiFi
          Issue Type: Bug
    Affects Versions: 1.13.2
         Environment: Dockerized AWS ECS Instance
            Reporter: Chris McKeever


Our nodes are ephemeral. Once they fall over, they don't come back in any 
stateful manner. We are aware that this can cause data loss. 

The issue we are seeing is that when they fall over (right now we are 
forcefully knocking them over to test resiliency), the cluster heartbeat 
flags them as disconnected, but there is then no way to delete them. The 
delete attempt fails with:
ERROR: Error executing command 'delete-node' : Error deleting node: 
java.net.SocketTimeoutException: timeout


We have increased the read/connect timeouts to 20s (from the default 5s), and 
that changes the error to a `read timeout`.

Increasing those values to anything greater than 30s makes the cluster 
unstable across the board, with responses such as:
{ "servlet":"jerseySpring", "message":"Service Unavailable", 
"url":"/nifi-api/flow/current-user", "status":"503" } 
ERROR: Error executing command 'get-nodes' : Read timed out
Occasionally, when some stars align, we are able to delete the node via the 
toolkit CLI, but this is few and far between, which suggests some kind of 
timing issue.
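
Since the delete only succeeds now and then, one stopgap is to retry it with a 
growing delay until it goes through. The following is a minimal sketch of that 
idea, not a fix for the underlying bug. The name `retry_on_timeout` and its 
parameters are made up for illustration, and `action` stands in for whatever 
callable wraps the actual delete-node call and raises `TimeoutError` when it 
times out.

```python
import time
from typing import Callable, Optional


def retry_on_timeout(
    action: Callable[[], None],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `action` until it stops raising TimeoutError.

    Returns the 1-based number of the attempt that succeeded.
    `action` is a hypothetical wrapper around the delete-node call.
    """
    last_error: Optional[TimeoutError] = None
    for attempt in range(1, attempts + 1):
        try:
            action()
            return attempt
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The `sleep` parameter is injectable only so the backoff can be exercised 
without actually waiting. Other failures (anything that is not a 
`TimeoutError`) are deliberately not retried and propagate to the caller.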

 

We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned 
"I know something super bad happened and well - crap happens - can you help us 
clean the cluster back up and get on with life?" 
and ... 
we need a nicer option.  I'm not sure if the CLI does something smart here
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
