[
https://issues.apache.org/jira/browse/CASSANDRA-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732494#comment-17732494
]
Stefan Miklosovic commented on CASSANDRA-18555:
-----------------------------------------------
[~brandon.williams]
If you read the whole thread (which you probably haven't, and I don't blame
you, that is understandable), I was thinking about propagating this to
"nodetool status" too, so it would be visible from every node in the cluster.
That would be the purest solution, but I abandoned it in favor of saving the
state because I did not want to touch Gossip, which seems far more complex; we
would need to add a bunch of logic around it. Persisting the state seems much
simpler: we just write it into an already existing column in a table, and it
seems like a natural fit. Why do you think there is no logic behind that? If we
save that a node has decommissioned, what is wrong with saving the information
that it failed to do so?
If it is not persisted, and we fail to decommission, kill the instance, and
start it again, what state is that node actually in? Right now its bootstrap
state would still be "completed", yet the node might have partially
decommissioned itself, which is quite dangerous, isn't it?
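
To make the restart question concrete, here is a minimal sketch of the two
behaviours. It is not Cassandra's code: the class, the enum values, and the
persistFailure flag are all hypothetical, and a plain field stands in for the
persisted column.

```java
// Illustrative model only; none of these names are Cassandra's actual API.
class DecommissionStateSketch {

    // Hypothetical states; DECOMMISSION_FAILED is the value under discussion.
    enum BootstrapState { COMPLETED, DECOMMISSIONED, DECOMMISSION_FAILED }

    static final class Node {
        // Assumption: toggles whether a failed decommission is written down.
        private final boolean persistFailure;

        // Stand-in for the already existing persisted column. It survives a
        // restart, unlike an exception thrown back to the original caller.
        private BootstrapState persisted = BootstrapState.COMPLETED;

        Node(boolean persistFailure) {
            this.persistFailure = persistFailure;
        }

        // Simulates a decommission attempt that either succeeds or fails.
        void decommission(boolean succeeds) {
            if (succeeds) {
                persisted = BootstrapState.DECOMMISSIONED;
            } else if (persistFailure) {
                persisted = BootstrapState.DECOMMISSION_FAILED;
            }
            // Otherwise the failure is only visible to the caller that was
            // waiting, and nothing is recorded on the node.
        }

        // What the node reads back from storage after being killed and started.
        BootstrapState stateAfterRestart() {
            return persisted;
        }
    }

    public static void main(String[] args) {
        Node withoutPersistence = new Node(false);
        withoutPersistence.decommission(false);
        System.out.println("not persisted: " + withoutPersistence.stateAfterRestart());

        Node withPersistence = new Node(true);
        withPersistence.decommission(false);
        System.out.println("persisted:     " + withPersistence.stateAfterRestart());
    }
}
```

In this model the node that does not persist the failure comes back reporting
the same state as a node that never attempted to decommission, which is the
ambiguity described above.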
> A new nodetool/JMX command that tells whether node's decommission failed or
> not
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-18555
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18555
> Project: Cassandra
> Issue Type: Task
> Components: Observability/JMX
> Reporter: Jaydeepkumar Chovatia
> Assignee: Jaydeepkumar Chovatia
> Priority: Normal
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Currently, if a failure happens while a node is being decommissioned, an
> exception is thrown back to the caller.
> But Cassandra's decommission takes considerable time, ranging from minutes to
> hours to days. There are various scenarios in which the caller may need to
> probe the status again:
> * The caller times out
> * It is not possible to keep the caller hanging for such a long time
> And if the caller does not know what happened internally, then it cannot
> retry, etc., leading to other issues.
> So, in this ticket, I am going to add a new nodetool/JMX command that can be
> invoked by the caller anytime, and it will return the correct status.
> It might look like a small change, but when we need to operate Cassandra at
> scale in a large fleet, this becomes a bottleneck and requires constant
> operator intervention.