[
https://issues.apache.org/jira/browse/CASSANDRA-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934942#comment-14934942
]
Stefania commented on CASSANDRA-10231:
--------------------------------------
I noticed two major differences between the dtest logs and the attached logs:
* After n5 restarts, it logs no GOSSIP information in {{applyStateLocally}},
so the GOSSIP information for the decommissioned node is missing entirely.
* The tokens for the decommissioned node are still there even though they were
deleted before the crash. The crash could explain this if the commit log
was not replayed.
Ultimately it shouldn't matter whether the tokens are still there after the
restart: if we had received a GOSSIP message with status LEFT, we should have
been able to clear them. We would need a full TRACE log to work out why the
GOSSIP entries are missing: either the other nodes are not gossiping about the
decommissioned node (unlikely, since the expiry time is 3 days) or for some
reason node 5 ignores the GOSSIP entry for the decommissioned node.
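To make the reasoning above concrete, here is a toy model of the expected behaviour. It is only a sketch: {{ToyNode}}, {{apply_gossip}} and {{stale_endpoints}} are hypothetical names invented for illustration, not Cassandra's actual classes or methods, and the model ignores generations, versions and expiry.

```python
# Toy model only: every name here is hypothetical, not Cassandra's API.
from dataclasses import dataclass, field

STATUS_LEFT = "LEFT"


@dataclass
class ToyNode:
    # endpoint -> set of tokens this node believes the endpoint owns
    tokens: dict = field(default_factory=dict)
    # endpoint -> last gossiped status (absent if never gossiped about)
    status: dict = field(default_factory=dict)

    def apply_gossip(self, endpoint, status):
        """Record a gossiped status; drop the endpoint's tokens if it has left."""
        self.status[endpoint] = status
        if status == STATUS_LEFT:
            self.tokens.pop(endpoint, None)


def stale_endpoints(node):
    """Endpoints that still own tokens but have no gossip status at all."""
    return sorted(ep for ep in node.tokens if node.status.get(ep) is None)


# Restarted node: tokens survived the crash, but no gossip was applied yet.
n5 = ToyNode(tokens={"decommissioned": {1, 2, 3}})
print(stale_endpoints(n5))   # the decommissioned endpoint is listed as stale
n5.apply_gossip("decommissioned", STATUS_LEFT)
print(stale_endpoints(n5))   # nothing stale once LEFT has been applied
```

In this model the leftover tokens are harmless as long as a LEFT status eventually arrives; the logs suggest that step never happens on n5.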
I tried running my dtest on the same commit but could not reproduce this.
However, there is one big difference: the dtest does not involve any
hinting or streaming of data. So I probably need to install Jepsen.
I would suggest fixing the MV issue that is preventing us from running on the
latest 3.0, or at a minimum running on a commit where hinting works fine and
the batch log can be replayed. Also, we will probably need to run with
DEBUG=true, if not TRACE=true.
Is there anything I can do to help track down the MV issue?
> Null status entries on nodes that crash during decommission of a different
> node
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-10231
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10231
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Stefania
> Fix For: 3.0.0 rc2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that
> crashes and decommissions nodes throughout the test.
> In a 5 node cluster, if a node crashes at a certain point (unknown) during
> the decommission of a different node, it may start with a null entry for the
> decommissioned node like so:
> DN 10.0.0.5 ? 256 ? null rack1
> This entry does not get updated/cleared by gossip. This entry is removed upon
> a restart of the affected node.
> This issue is further detailed in ticket
> [10068|https://issues.apache.org/jira/browse/CASSANDRA-10068].
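A test could flag entries like the one in the description above automatically. The following is a minimal sketch, assuming the status output has the same column layout as the quoted line (state, address, load, tokens, owns, host ID, rack) and that the host ID and rack are always the last two columns; the function names are hypothetical.

```python
def parse_status_line(line):
    """Parse one node line of status output; return None for non-node lines."""
    fields = line.split()
    if len(fields) < 4:
        return None
    state = fields[0]
    # Node lines start with a two-letter code such as UN or DN.
    if len(state) != 2 or state[0] not in "UD" or state[1] not in "NLJM":
        return None
    # Load may span two fields (e.g. "103.5 KB"), so index from the end.
    return {
        "state": state,
        "address": fields[1],
        "host_id": fields[-2],
        "rack": fields[-1],
    }


def null_host_entries(output):
    """Return addresses of nodes whose host ID column reads 'null'."""
    entries = (parse_status_line(line) for line in output.splitlines())
    return [e["address"] for e in entries if e and e["host_id"] == "null"]


print(null_host_entries("DN 10.0.0.5 ? 256 ? null rack1"))
```

Indexing the host ID from the end of the line keeps the check working whether the load column is a single {{?}} or a value with a unit.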
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)