[ 
https://issues.apache.org/jira/browse/NIFI-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Payne updated NIFI-8477:
-----------------------------
    Fix Version/s: 1.14.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> If a node completely dies, can not delete it from the cluster; AKA Zombie Node
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-8477
>                 URL: https://issues.apache.org/jira/browse/NIFI-8477
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.13.2
>         Environment: Dockerized AWS ECS Instance
>            Reporter: Chris McKeever
>            Assignee: Mark Payne
>            Priority: Blocker
>              Labels: cluster, disconnection
>             Fix For: 1.14.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Our nodes are ephemeral. Once they fall over, they don't come back in any 
> stateful manner. This is known to cause data loss, which we are aware of. 
> The issue we are seeing is that when they fall over (right now we are 
> forcefully knocking them over to test resiliency), the cluster heartbeat 
> flags them as disconnected, but there is then no way to delete them, because 
> we get:
> ERROR: Error executing command 'delete-node' : Error deleting node: 
> java.net.SocketTimeoutException: timeout
> We have increased the read/connect timeouts to 20s (from the default 5s), and 
> that changes the error to a `read timeout`.
> Increasing those values to anything greater than 30s gives us unstable usage 
> across the board:
> { "servlet":"jerseySpring", "message":"Service Unavailable", 
> "url":"/nifi-api/flow/current-user", "status":"503" } 
> ERROR: Error executing command 'get-nodes' : Read timed out
> Occasionally, when the stars align, we are able to delete the node via the 
> toolkit CLI. This happens few and far between, but it does point to some 
> timing issue.
>  
> We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned 
> "I know something super bad happened and well - crap happens - can you help 
> us clean the cluster back up and get on with life?" 
> and ... 
> we need a nicer option.  I'm not sure if the CLI does something smart here
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
