Joseph Lynch created CASSANDRA-13968:
----------------------------------------
Summary: Cannot replace a live node on large clusters
Key: CASSANDRA-13968
URL: https://issues.apache.org/jira/browse/CASSANDRA-13968
Project: Cassandra
Issue Type: Bug
Environment: Cassandra 2.1.17, Ubuntu Trusty/Xenial (Linux 3.13, 4.4)
Reporter: Joseph Lynch
Attachments:
0001-During-node-replacement-check-for-updates-in-the-tim.patch,
0002-Only-fail-replacement-if-we-_know_-the-node-is-up.patch
During forced node replacements we very frequently (~every time for large
clusters) see:
{noformat}
ERROR [main] 2017-10-17 06:54:35,680 CassandraDaemon.java:583 - Exception
encountered during startup
java.lang.UnsupportedOperationException: Cannot replace a live node...
{noformat}
The old node is dead: the new node that is replacing it sees it as dead (DN
state), and so do all other nodes in the cluster (all report DN). However, I
believe there are two bugs in the "is live" check that can cause this error,
namely that:
1. We sleep for
[BROADCAST_INTERVAL|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L905]
(hardcoded to 60s on 2.1; configurable on later versions but still 60s by
default), yet we
[check|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L919]
for an update in the last RING_DELAY seconds (typically 30s). When a fresh node
is joining, in my experience [the
schema|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L859]
check returns true almost immediately after gossiping with the seeds, so in
reality we do not even sleep for RING_DELAY. If operators increase ring delay
past broadcast_interval (as you might do if you think you are a victim of the
second bug below), then you are guaranteed to hit this exception, because a
gossip update is virtually certain to have arrived within the last RING_DELAY
seconds when you only slept for the (shorter) broadcast interval. For example,
if an operator sets ring delay to 300s, the check effectively says "the last
update was 59 seconds ago, which is sooner than 300s, so fail". A condensed
sketch of this timing follows the list below.
2. We don't actually check that the node is alive; we only check that a gossip
update has happened in the last X seconds. On large clusters, nodes are
sometimes still converging on the proper generation/version of a dead node,
and the "is live" check prevents an operator from replacing the node until
gossip about the dead node has settled across the cluster, which for large
clusters can take a very long time. This can really hurt availability in cloud
environments, and every time I've seen this error the new node believed that
the old node was down (since
[markAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L954]
[marks
dead|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L962]
first and then registers a callback to
[realMarkAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L975],
which never fires because the old node really is down).
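
To make the interplay concrete, here is a condensed, self-contained sketch of
the check as I read the linked 2.1 code. The class, the constants and the
{{endpointStateUpdateNanos}} parameter are simplified stand-ins for
illustration, not the real StorageService/Gossiper API:

{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Condensed sketch of the replacement "is live" check as I read the linked
 * 2.1 code; names and parameters are simplified stand-ins, not the real
 * StorageService/Gossiper API.
 */
public class ReplaceLivenessSketch
{
    // Hardcoded 60s on 2.1; configurable on later versions but still 60s by default.
    static final long BROADCAST_INTERVAL_MS = 60_000;

    // Typically 30s; operators can raise it, e.g. via -Dcassandra.ring_delay_ms.
    static final long RING_DELAY_MS = Long.getLong("cassandra.ring_delay_ms", 30_000);

    /**
     * @param schemaSeen               true once we have gossiped a schema with the seeds
     * @param endpointStateUpdateNanos nanoTime of the last gossip update we saw
     *                                 for the node being replaced
     */
    static void checkCanReplace(boolean schemaSeen, long endpointStateUpdateNanos)
            throws InterruptedException
    {
        // The RING_DELAY sleep exits early once the schema check passes, which
        // in practice happens almost immediately after gossiping with the seeds.
        for (long slept = 0; slept < RING_DELAY_MS && !schemaSeen; slept += 1_000)
            Thread.sleep(1_000);

        // Bug #1: we then sleep for BROADCAST_INTERVAL (60s)...
        Thread.sleep(BROADCAST_INTERVAL_MS);

        // ...but ask whether a gossip update arrived within the last RING_DELAY ms.
        // If RING_DELAY > BROADCAST_INTERVAL (say 300s vs 60s), an update received
        // during or just before the sleep is always "recent" and the check fails.
        long updateAgeNanos = System.nanoTime() - endpointStateUpdateNanos;
        boolean updatedWithinRingDelay =
                updateAgeNanos < TimeUnit.MILLISECONDS.toNanos(RING_DELAY_MS);

        // Bug #2: "updated recently" is treated as "alive". On a large cluster the
        // update may just be peers converging on the dead node's generation/version;
        // the failure detector is never consulted.
        if (updatedWithinRingDelay)
            throw new UnsupportedOperationException("Cannot replace a live node...");
    }
}
{code}

With ring delay forced to 300s, the update seen ~59s earlier during the
broadcast sleep is always "recent", which is exactly what the log at the
bottom of this ticket shows.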
I think that #1 is definitely a bug. #2 might be considered an extra safety
feature (not allowing replacement during gossip convergence), but given that
the operator took the effort to supply the replace_address flag, I think it's
prudent to only fail if we really know something is wrong.
I've attached two patches against 2.1, one that fixes bug #1 and one that fixes
(imo) bug #2. For #1 I was thinking we may want to prevent the schema check
from exiting the RING_DELAY sleep early, but maybe it's just better to backport
the configurable broadcast_interval and pick the maximum of the two delays. If
we don't like the way I've worked around #2, maybe I could make it an option
that operators could turn on if they wanted? If folks are happy with the
approach I can attach patches for 2.2, 3.0, and 3.11. A rough sketch of what
these ideas would look like is below.
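
To illustrate the direction only (this is a sketch of the ideas discussed
above using the same stand-in names as the earlier sketch, not the contents of
the attached patches):

{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the two ideas above, not the attached patches: sleep at least as
 * long as the window we check, and only refuse the replacement when we have a
 * positive signal that the node is up. All names are illustrative stand-ins.
 */
public class ReplaceLivenessFixSketch
{
    static final long BROADCAST_INTERVAL_MS = 60_000;
    static final long RING_DELAY_MS = Long.getLong("cassandra.ring_delay_ms", 30_000);

    /**
     * @param nodeReportedAlive        stand-in for an actual liveness signal
     *                                 (e.g. the failure detector's view)
     * @param endpointStateUpdateNanos nanoTime of the last gossip update we saw
     *                                 for the node being replaced
     */
    static void checkCanReplace(boolean nodeReportedAlive, long endpointStateUpdateNanos)
            throws InterruptedException
    {
        // Idea for #1: make the sleep and the check window agree by taking the
        // maximum of the two delays, so an update that predates the sleep can
        // no longer count as "recent".
        long windowMs = Math.max(RING_DELAY_MS, BROADCAST_INTERVAL_MS);
        Thread.sleep(windowMs);

        long updateAgeNanos = System.nanoTime() - endpointStateUpdateNanos;
        boolean updatedWithinWindow =
                updateAgeNanos < TimeUnit.MILLISECONDS.toNanos(windowMs);

        // Idea for #2: only fail when we positively believe the node is up,
        // not merely because its gossip state was touched recently.
        if (updatedWithinWindow && nodeReportedAlive)
            throw new UnsupportedOperationException("Cannot replace a live node...");
    }
}
{code}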
A relevant example log showing the first bug (in this case the node being
replaced was drained, moving it to the shutdown state, before replacement, and
ring delay was forced to 300s because the cluster is very large):
{noformat}
INFO [ScheduledTasks:1] 2017-10-17 06:41:21,325 StorageService.java:189 -
Overriding RING_DELAY to 300000ms
INFO [main] 2017-10-17 06:41:24,240 StorageService.java:508 - Gathering node
replacement information for /OLD_ADDRESS
INFO [GossipStage:1] 2017-10-17 06:41:25,198 Gossiper.java:1032 - Node
/OLD_ADDRESS is now part of the cluster
INFO [GossipStage:1] 2017-10-17 06:41:25,200 Gossiper.java:1011 - InetAddress
/OLD_ADDRESS is now DOWN
INFO [main] 2017-10-17 06:41:25,617 StorageService.java:1164 - JOINING:
waiting for ring information
INFO [main] 2017-10-17 06:41:29,618 StorageService.java:1164 - JOINING:
schema complete, ready to bootstrap
INFO [main] 2017-10-17 06:41:29,618 StorageService.java:1164 - JOINING:
waiting for pending range calculation
INFO [main] 2017-10-17 06:41:29,718 StorageService.java:1164 - JOINING:
calculation complete, ready to bootstrap
INFO [GossipStage:1] 2017-10-17 06:41:30,606 Gossiper.java:1032 - Node
/OLD_ADDRESS is now part of the cluster
INFO [GossipStage:1] 2017-10-17 06:41:30,606 TokenMetadata.java:464 -
Updating topology for /OLD_ADDRESS
INFO [GossipStage:1] 2017-10-17 06:41:30,607 TokenMetadata.java:464 -
Updating topology for /OLD_ADDRESS
INFO [GossipStage:1] 2017-10-17 06:41:30,614 Gossiper.java:1011 - InetAddress
/OLD_ADDRESS is now DOWN
ERROR [main] 2017-10-17 06:42:29,722 CassandraDaemon.java:583 - Exception
encountered during startup
java.lang.UnsupportedOperationException: Cannot replace a live node...
{noformat}