[
https://issues.apache.org/jira/browse/CASSANDRA-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949812#comment-14949812
]
Stefania commented on CASSANDRA-10231:
--------------------------------------
I should have read this before commenting on CASSANDRA-10089; I left a note
there to move the discussion here.
I think you're correct: we'll end up with stale entries if we populate the
token metadata before recovering the commit log and some entries were
previously deleted but not yet flushed. So if we must populate the token
metadata before commit log replay (cc [~krummas] regarding CASSANDRA-6696),
then we have no other choice but to force a blocking flush when we delete
entries in system {{PEERS}}. At this point I would suggest being consistent and
forcing a blocking flush in {{updateTokens}} as well. Note that this is already
done in {{updatePreferredIP}} so we are not introducing something totally new.
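To make the failure mode concrete, here is a toy model of it. It is only a sketch and not Cassandra code: {{PeersTableSketch}} and all of its methods are hypothetical names, and the "commit log" and "flushed rows" are plain in-memory collections standing in for the real storage engine. It assumes a delete that was logged but not flushed is invisible to anything that reads the table before replay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not Cassandra's storage engine.
class PeersTableSketch {
    // Rows already flushed to disk; these are all a reader sees before replay.
    private final Set<String> flushed = new TreeSet<>();
    // Durable log of mutations not yet flushed: "+peer" adds, "-peer" deletes.
    private final List<String> commitLog = new ArrayList<>();

    void addPeer(String peer, boolean blockingFlush) {
        commitLog.add("+" + peer);
        if (blockingFlush) flush();
    }

    void removePeer(String peer, boolean blockingFlush) {
        commitLog.add("-" + peer);
        if (blockingFlush) flush();
    }

    // Apply every logged mutation to the flushed rows, then clear the log.
    // Commit log replay at startup has the same effect in this model.
    void flush() {
        for (String entry : commitLog) {
            String peer = entry.substring(1);
            if (entry.charAt(0) == '+') flushed.add(peer);
            else flushed.remove(peer);
        }
        commitLog.clear();
    }

    // What a startup step would read if it ran before commit log replay.
    Set<String> readBeforeReplay() {
        return new TreeSet<>(flushed);
    }

    // Peers visible at restart when the node crashes right after a delete.
    static Set<String> peersSeenAtStartup(boolean flushOnDelete) {
        PeersTableSketch peers = new PeersTableSketch();
        peers.addPeer("10.0.0.5", true);             // peer row is on disk
        peers.removePeer("10.0.0.5", flushOnDelete); // node is decommissioned
        // Crash here: flushed rows and the commit log survive the restart.
        return peers.readBeforeReplay();
    }

    public static void main(String[] args) {
        System.out.println("delete not flushed: " + peersSeenAtStartup(false));
        System.out.println("delete flushed:     " + peersSeenAtStartup(true));
    }
}
```

In this model the stale peer is still visible at startup when the delete was not flushed, and it is gone when the delete forced a blocking flush, which is the argument for flushing on delete.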
In an ideal world, I'd say we should not rely on the content of any tables
(system or not) before recovering the commit log, but if this is not possible I
guess we have to be pragmatic.
Really well done on deducing this by the way!
> Null status entries on nodes that crash during decommission of a different
> node
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-10231
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10231
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Stefania
> Fix For: 3.0.0 rc2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that
> crashes and decommissions nodes throughout the test.
> In a 5 node cluster, if a node crashes at a certain point (unknown) during
> the decommission of a different node, it may start with a null entry for the
> decommissioned node like so:
> DN 10.0.0.5 ? 256 ? null rack1
> This entry does not get updated/cleared by gossip. This entry is removed upon
> a restart of the affected node.
> This issue is further detailed in ticket
> [10068|https://issues.apache.org/jira/browse/CASSANDRA-10068].