[ https://issues.apache.org/jira/browse/NIFI-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Payne updated NIFI-8477:
-----------------------------
Fix Version/s: 1.14.0
Resolution: Fixed
Status: Resolved (was: Patch Available)
> If a node completely dies, can not delete it from the cluster; AKA Zombie Node
> ------------------------------------------------------------------------------
>
> Key: NIFI-8477
> URL: https://issues.apache.org/jira/browse/NIFI-8477
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.13.2
> Environment: Dockerized AWS ECS Instance
> Reporter: Chris McKeever
> Assignee: Mark Payne
> Priority: Blocker
> Labels: cluster, disconnection
> Fix For: 1.14.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Our nodes are ephemeral. Once they fall over, they don't come back in any
> stateful manner. This is known to cause data loss, which we are aware of.
> The issue we are seeing is that when they fall over (right now we are
> forcefully knocking them over to test resiliency), the cluster heartbeat
> flags them as disconnected, but there is then no way to delete them, as we
> get:
> ERROR: Error executing command 'delete-node' : Error deleting node:
> java.net.SocketTimeoutException: timeout
> We have increased the read/connect timeouts to 20s (from the default 5s), and
> that changes the error to a `read timeout`.
> Increasing those values to anything greater than 30s gives us unstable usage
> across the board:
> { "servlet":"jerseySpring", "message":"Service Unavailable",
> "url":"/nifi-api/flow/current-user", "status":"503" }
> ERROR: Error executing command 'get-nodes' : Read timed out
> Occasionally, when the stars align, we are able to delete the node via the
> toolkit CLI, but this is few and far between, which suggests a timing
> issue.
>
> We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned:
> "I know something super bad happened and well - crap happens - can you help
> us clean the cluster back up and get on with life?"
> and ...
> we need a nicer option. I'm not sure if the CLI does something smart here
>