[ https://issues.apache.org/jira/browse/NIFI-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342122#comment-17342122 ]
ASF subversion and git services commented on NIFI-8477:
-------------------------------------------------------
Commit 1645886e5a42579b21ccc9c44f4db8b08fc2e8df in nifi's branch
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=1645886 ]
NIFI-8477: If interrupted while waiting for Node Status Update to be replicated
to other nodes, do not throw ProtocolException; instead just log a warning and
return.
This closes #5039
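
For readers unfamiliar with the pattern the commit message describes, here is
a minimal sketch in Java. It is not the code from the commit: the class name
NodeStatusNotifier, the Outcome enum, and the use of a CountDownLatch to stand
in for "replication finished" are all assumptions made for illustration.
Restoring the thread's interrupt flag is a common convention and is likewise
an assumption, not something the commit message states.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical class; illustrates "log a warning and return" on interrupt.
final class NodeStatusNotifier {
    private static final Logger logger =
            Logger.getLogger(NodeStatusNotifier.class.getName());

    // Hypothetical result type describing how the wait ended.
    enum Outcome { REPLICATED, TIMED_OUT, INTERRUPTED }

    // Waits until the latch reaches zero (standing in for "status update
    // replicated to other nodes") or the timeout elapses.
    static Outcome awaitReplication(CountDownLatch replicated, long timeoutMillis) {
        try {
            boolean done = replicated.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return done ? Outcome.REPLICATED : Outcome.TIMED_OUT;
        } catch (InterruptedException e) {
            // Assumption: restore the interrupt flag so callers can see it.
            Thread.currentThread().interrupt();
            // Instead of wrapping the interrupt in an exception and
            // propagating it, log a warning and return normally.
            logger.warning("Interrupted while waiting for Node Status Update "
                    + "to be replicated to other nodes");
            return Outcome.INTERRUPTED;
        }
    }

    public static void main(String[] args) {
        // Simulate an interrupt arriving before the wait begins.
        Thread.currentThread().interrupt();
        Outcome outcome = awaitReplication(new CountDownLatch(1), 1000L);
        System.out.println(outcome);
        Thread.interrupted(); // clear the flag before exiting
    }
}
```

In this sketch the caller receives a normal return value rather than an
exception, so an interrupted wait no longer aborts the surrounding operation.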
> If a node completely dies, can not delete it from the cluster; AKA Zombie Node
> ------------------------------------------------------------------------------
>
> Key: NIFI-8477
> URL: https://issues.apache.org/jira/browse/NIFI-8477
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.13.2
> Environment: Dockerized AWS ECS Instance
> Reporter: Chris McKeever
> Assignee: Mark Payne
> Priority: Blocker
> Labels: cluster, disconnection
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Our nodes are ephemeral. Once they fall over, they don't come back in any
> stateful manner. This is known to create data loss, which we are aware of.
> The issue we are seeing is that when they fall over (right now we are
> forcefully knocking them over to test resiliency), the cluster heartbeat
> flags them as disconnected, but there is then no way to delete them, as we
> get:
> ERROR: Error executing command 'delete-node' : Error deleting node:
> java.net.SocketTimeoutException: timeout
> We have increased the read/connect timeouts to 20s (from the default 5s),
> and that changes the error to a `read timeout`.
> Increasing those values to anything greater than 30s gives us instability
> across the board:
> { "servlet":"jerseySpring", "message":"Service Unavailable",
> "url":"/nifi-api/flow/current-user", "status":"503" }
> ERROR: Error executing command 'get-nodes' : Read timed out
> Occasionally, when the stars align, we are able to delete the node via the
> toolkit CLI. This happens few and far between, which suggests some timing
> issue.
>
> We have discussed this thoroughly on the NiFi Slack, and Joe W. mentioned:
> "I know something super bad happened and well - crap happens - can you help
> us clean the cluster back up and get on with life?"
> and ...
> we need a nicer option. I'm not sure if the CLI does something smart here
>