[jira] [Commented] (NIFI-8477) If a node completely dies, can not delete it from the cluster; AKA Zombie Node

ASF subversion and git services (Jira) Mon, 10 May 2021 13:16:29 -0700


    [ https://issues.apache.org/jira/browse/NIFI-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342122#comment-17342122 ]


ASF subversion and git services commented on NIFI-8477:
-------------------------------------------------------

Commit 1645886e5a42579b21ccc9c44f4db8b08fc2e8df in nifi's branch 
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=1645886 ]

NIFI-8477: If interrupted while waiting for Node Status Update to be replicated 
to other nodes, do not throw ProtocolException; instead just log a warning and 
return.

This closes #5039
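
For readers unfamiliar with the pattern the commit message describes, the following is a minimal sketch, not the actual NiFi change. The class name StatusUpdateWaiter, the method awaitReplication, and the use of a CountDownLatch to represent "the status update has been replicated" are all assumptions made for illustration. Restoring the thread's interrupt flag is likewise an assumption (it is common Java practice); the commit message itself only states that a warning is logged and the method returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; this is not a NiFi class.
final class StatusUpdateWaiter {
    private static final Logger LOG =
            Logger.getLogger(StatusUpdateWaiter.class.getName());

    private StatusUpdateWaiter() {
    }

    /**
     * Waits until the latch reaches zero (standing in for "the node status
     * update has been replicated to the other nodes").
     *
     * @return true if the latch reached zero within the timeout; false if the
     *         wait timed out or the calling thread was interrupted
     */
    static boolean awaitReplication(CountDownLatch replicated, long timeout, TimeUnit unit) {
        try {
            return replicated.await(timeout, unit);
        } catch (InterruptedException e) {
            // Assumption: restore the interrupt flag so callers can still see it.
            Thread.currentThread().interrupt();
            // Log and return instead of wrapping the interrupt in an exception.
            LOG.warning("Interrupted while waiting for node status update to be replicated");
            return false;
        }
    }
}
```

In this sketch a caller treats a false return as "replication not confirmed" and carries on, rather than having the whole operation fail with an exception because the waiting thread happened to be interrupted.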


> If a node completely dies, can not delete it from the cluster; AKA Zombie Node
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-8477
>                 URL: https://issues.apache.org/jira/browse/NIFI-8477
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.13.2
>         Environment: Dockerized AWS ECS Instance
>            Reporter: Chris McKeever
>            Assignee: Mark Payne
>            Priority: Blocker
>              Labels: cluster, disconnection
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Our nodes are ephemeral. Once they fall over, they do not come back in any 
> stateful manner. This is known to cause data loss, which we are aware of. 
> The issue we are seeing is that when they fall over (right now we are 
> forcefully knocking them over to test resiliency), the cluster heartbeat 
> flags them as disconnected, but there is then no way to delete them, because 
> we get:
> ERROR: Error executing command 'delete-node' : Error deleting node: 
> java.net.SocketTimeoutException: timeout
> We have increased the read/connect timeouts to 20s (from the default 5s), and 
> that changes the error to a `read timeout`.
> Increasing those values to anything greater than 30s makes usage unstable 
> across the board:
> { "servlet":"jerseySpring", "message":"Service Unavailable", 
> "url":"/nifi-api/flow/current-user", "status":"503" } 
> ERROR: Error executing command 'get-nodes' : Read timed out
> Occasionally, when the stars align, we are able to delete the node via the 
> toolkit CLI, but this happens few and far between, which does point to a 
> timing issue.
>  
> We have discussed this thoroughly on the NiFi Slack, and Joe W. has mentioned 
> "I know something super bad happened and well - crap happens - can you help 
> us clean the cluster back up and get on with life?" 
> and ... 
> we need a nicer option.  I'm not sure if the CLI does something smart here
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
