[ https://issues.apache.org/jira/browse/CASSANDRA-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Jirsa reassigned CASSANDRA-13308:
--------------------------------------
Assignee: Jeff Jirsa
> Gossip breaks, Hint files not being deleted on nodetool decommission
> --------------------------------------------------------------------
>
> Key: CASSANDRA-13308
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13308
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: Using Cassandra version 3.0.9
> Reporter: Arijit
> Assignee: Jeff Jirsa
> Attachments: 28207.stack, logs, logs_decommissioned_node
>
>
> How to reproduce the issue I'm seeing:
> Shut down Cassandra on one node of the cluster and wait until a large number
> of hints have accumulated. Start Cassandra on the node and immediately run
> "nodetool decommission" on it.
> The node streams its replicas and marks itself as DECOMMISSIONED, but other
> nodes do not seem to see this message. "nodetool status" shows the
> decommissioned node in state "UL" on all other nodes (it is also present in
> system.peers), and Cassandra logs show that gossip tasks on nodes are not
> proceeding (the number of pending tasks keeps increasing). Jstack suggests
> that a gossip task is blocked on hints dispatch (I can provide traces if this
> is not obvious). Because the cluster is large and there are a lot of hints,
> this is taking a while.
> On inspecting "/var/lib/cassandra/hints" on the nodes, I see a bunch of hint
> files for the decommissioned node. Documentation seems to suggest that these
> hints should be deleted during "nodetool decommission", but it does not seem
> to be the case here. This is the bug being reported.
> When I try to recover from this scenario by manually deleting the hint files
> on the nodes, the hints dispatcher threads throw a number of exceptions and
> the decommissioned node changes to state "DL" (perhaps it missed some gossip
> messages?). The node is still in my "system.peers" table.
> Restarting Cassandra on all nodes after this step does not fix the issue (the
> node remains in the peers table). In fact, after this point the
> decommissioned node is in state "DN".
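
For reference, the two-letter codes that "nodetool status" prints combine a
status letter (U = Up, D = Down) with a state letter (N = Normal, L = Leaving,
J = Joining, M = Moving). The sequence reported above is therefore Up/Leaving,
then Down/Leaving, then Down/Normal. A minimal Python sketch of that decoding
follows. The helper name decode_state is hypothetical and is not part of
Cassandra or nodetool.

```python
# Legend printed at the top of "nodetool status" output.
STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}


def decode_state(code):
    """Split a two-letter nodetool status code into (status, state).

    decode_state is a hypothetical helper for illustration only.
    """
    if len(code) != 2:
        raise ValueError(f"expected a two-letter code, got {code!r}")
    status, state = code[0], code[1]
    if status not in STATUS or state not in STATE:
        raise ValueError(f"unknown status/state code: {code!r}")
    return STATUS[status], STATE[state]


if __name__ == "__main__":
    # The three codes mentioned in the report, in the order they appeared.
    for code in ("UL", "DL", "DN"):
        print(code, "->", " / ".join(decode_state(code)))
```

Read this way, the final "DN" means the other nodes still treat the
decommissioned node as a normal ring member that is merely down, rather than
as a node that has left.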
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)