Tom van der Woerdt created CASSANDRA-15243:
----------------------------------------------
Summary: removenode can cause QUORUM write queries to fail
Key: CASSANDRA-15243
URL: https://issues.apache.org/jira/browse/CASSANDRA-15243
Project: Cassandra
Issue Type: Bug
Components: Consistency/Coordination
Reporter: Tom van der Woerdt
It looks like nobody has reported this yet
([google|https://www.google.com/search?q="cassandra"+"removenode"+"quorum"+site%3Aissues.apache.org]),
so this may be a ticking time bomb for some... :(
This happened to me earlier today. On a Cassandra 3.11.4 cluster with three
DCs, three servers in one DC failed due to unexpected external circumstances.
Replication was NetworkTopologyStrategy, configured 2:2:2.
Cassandra dealt with the failures just fine - great! However, they failed in a
way that made bringing them back impossible, so I tried to remove them
using 'removenode'.
Suddenly, the application started experiencing a large number of QUORUM write
timeouts. My first reflex was to lower the streaming throughput and compaction
throughput, since timeouts indicated some overload was happening. No luck,
though.
I tried a bunch of other things to reroute queries away from the affected
datacenter, like changing the Severity field on the dynamic snitch. Still, no
luck.
After a while I noticed one strange thing: the WriteTimeoutException reported
that five replicas were required, instead of the four you would expect with
a 2:2:2 replication configuration. I shrugged it off as some weird
inconsistency, probably caused by the use of batches.
Skipping ahead a bit: since nothing I did was working, I decided to let the
streams run again and just wait the issue out, hoping that letting the streams
finish would resolve the overload. Magically, as soon as the streams finished,
the errors stopped.
----
There are two issues here, both in AbstractWriteResponseHandler.java.
h3. Cassandra sometimes waits for too many replicas on writes
In
[totalBlockFor|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L124]
Cassandra will *always* include pending nodes in `blockfor`. For a QUORUM
query on a 2:2:2 replication configuration (total RF 6, so a quorum of 4), one
pending replica raises `blockfor` to 5. If that pending replica is also down (as
can happen in a case where removenode is used and not all destination hosts are
up), only 4 of the 5 required hosts are available, and quorum queries will never succeed.
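The arithmetic can be sketched as follows. This is an illustrative Java sketch of the `totalBlockFor`-style calculation, not Cassandra's actual code; all names here are hypothetical:

```java
// Illustrative sketch of how pending replicas inflate the write blockfor.
// Not Cassandra's actual implementation; names are made up for clarity.
class BlockForSketch {
    // QUORUM over the total replication factor: floor(rf/2) + 1
    static int quorumFor(int replicationFactor) {
        return (replicationFactor / 2) + 1;
    }

    // totalBlockFor-style logic: pending replicas (e.g. the stream targets
    // created by removenode) are always added on top of the quorum.
    static int totalBlockFor(int replicationFactor, int pendingReplicas) {
        return quorumFor(replicationFactor) + pendingReplicas;
    }

    public static void main(String[] args) {
        int rf = 6;        // NTS 2:2:2
        int pending = 1;   // one pending replica from removenode
        int blockFor = totalBlockFor(rf, pending);
        System.out.println(blockFor);             // 5, matching the exception
        int live = 4;      // two replicas down, and the pending one is down too
        System.out.println(live >= blockFor);     // false: the write can never succeed
    }
}
```

With only 4 live hosts against a `blockfor` of 5, every QUORUM write is doomed before it starts.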
h3. UnavailableException not thrown
While debugging this, I spent all my time treating the issue as a timeout.
However, Cassandra was issuing queries that could never succeed, because
insufficient hosts were available; throwing an UnavailableException would have
been far more helpful. The issue here is caused by
[assureSufficientLiveNodes|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L155],
which merely concatenates the lists of available nodes, and does not consider
the special-case behavior of a pending node that is down.
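The check being argued for could look roughly like this. A minimal sketch under the assumption that the handler knows `blockfor` and the liveness of each target endpoint; the class and method names are illustrative, not Cassandra's API:

```java
import java.util.List;

// Sketch of an up-front availability check that would fail fast with an
// UnavailableException instead of letting the write time out.
// Hypothetical names; not Cassandra's actual implementation.
class AvailabilitySketch {
    static class UnavailableException extends RuntimeException {
        UnavailableException(int required, int alive) {
            super("Cannot achieve consistency: " + alive + " alive, " + required + " required");
        }
    }

    // Before dispatching the write, verify that enough endpoints -- natural
    // and pending alike -- are actually alive to ever reach blockfor acks.
    static void assureSufficientLiveNodes(int blockFor, List<Boolean> endpointAlive) {
        long alive = endpointAlive.stream().filter(a -> a).count();
        if (alive < blockFor)
            throw new UnavailableException(blockFor, (int) alive);
    }
}
```

In the scenario above (`blockfor` 5, only 4 live endpoints because the pending replica is down), this would throw immediately, turning a confusing stream of WriteTimeoutExceptions into a clear UnavailableException.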