[ https://issues.apache.org/jira/browse/CASSANDRA-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom van der Woerdt updated CASSANDRA-15243:
-------------------------------------------
    Description: 
Looks like nobody found this yet so this may be a ticking time bomb for some... 
:(

This happened to me earlier today. On a Cassandra 3.11.4 cluster with three 
DCs, one DC had three servers fail due to unexpected external circumstances. 
Replication was NTS configured with 2:2:2.

Cassandra dealt with the failures just fine - great! However, they failed in a 
way that made bringing them back impossible, so I tried to remove them using 
`removenode`.

Suddenly, the application started experiencing a large number of QUORUM write 
timeouts. My first reflex was to lower the streaming throughput and compaction 
throughput, since timeouts indicated some overload was happening. No luck, 
though.

I tried a bunch of other things to reroute queries away from the affected 
datacenter, like changing the Severity field on the dynamic snitch. Still, no 
luck.

After a while I noticed one strange thing: the WriteTimeoutException listed 
that five replicas were required, instead of the four you would expect in a 
2:2:2 replication configuration. I shrugged it off as some weird inconsistency 
that was probably caused by the use of batches.

Skipping ahead a bit: since nothing I did was working, I decided to let the 
streams run again and just wait the issue out, hoping that letting the streams 
finish would resolve the overload. Magically, as soon as the streams finished, 
the errors stopped.

----

There are two issues here, both in AbstractWriteResponseHandler.java.

h3. Cassandra sometimes waits for too many replicas on writes

In 
[totalBlockFor|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L124]
 Cassandra will *always* include pending nodes in `blockFor`. For a QUORUM 
write against a 2:2:2 replication configuration the total RF is 6, so a quorum 
needs 4 acks; one pending node (as created by `removenode`) pushes `blockFor` 
to 5. If that pending replica is also down (as can happen when `removenode` is 
used and not all destination hosts are up), only 4 of the required 5 replicas 
can ever respond, and quorum queries will never succeed.
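
To make the arithmetic concrete, here is a minimal sketch of that blockFor 
computation under the scenario above. The class, helper names, and address are 
illustrative stand-ins, not the actual Cassandra code:

{code:java}
import java.util.List;

// Minimal sketch of the blockFor arithmetic described above.
// Helper names are illustrative, not the real Cassandra methods.
public class BlockForSketch
{
    static int quorum(int replicationFactor)
    {
        return replicationFactor / 2 + 1;              // quorum(6) == 4
    }

    // Mirrors the described behavior of totalBlockFor(): pending endpoints
    // are unconditionally added on top of the consistency-level requirement.
    static int totalBlockFor(int replicationFactor, List<String> pendingEndpoints)
    {
        return quorum(replicationFactor) + pendingEndpoints.size();
    }

    public static void main(String[] args)
    {
        int rf = 2 + 2 + 2;                            // NTS 2:2:2 -> total RF 6
        List<String> pending = List.of("10.0.0.9");    // removenode target, currently down
        int blockFor = totalBlockFor(rf, pending);     // 4 + 1 == 5
        int liveReplicas = rf - 2;                     // two natural replicas down -> 4 live
        System.out.printf("blockFor=%d, live=%d -> %s%n", blockFor, liveReplicas,
                          liveReplicas < blockFor ? "write can never be acked" : "ok");
    }
}
{code}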

h3. UnavailableException not thrown

While debugging this, I spent all my time treating the issue as if it were a 
timeout. However, Cassandra was executing queries that could never succeed, 
because insufficient hosts were available; throwing an UnavailableException 
would have been far more helpful. The issue here is caused by 
[assureSufficientLiveNodes|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L155]
 which merely concatenates the lists of available natural and pending nodes, 
and doesn't account for the special case of a pending node that is down.
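
For illustration, here is a minimal sketch of how that mismatch can play out. 
The names and shapes are mine, not the real signatures; it assumes the 
availability check counts live endpoints against the plain quorum while the 
response handler blocks for quorum plus pending:

{code:java}
import java.util.List;
import java.util.stream.Stream;

// Minimal sketch (assumed names/shapes) of the availability-check mismatch:
// live endpoints are counted against the plain quorum, while the handler
// actually blocks for quorum + pending acks.
public class AvailabilityCheckSketch
{
    static class UnavailableException extends Exception
    {
        UnavailableException(int required, int alive)
        {
            super("Required " + required + " but only " + alive + " alive");
        }
    }

    // Approximates the described assureSufficientLiveNodes(): natural and
    // pending live nodes are simply concatenated and counted, but the +1 that
    // each pending node added to blockFor is not considered here.
    static void assureSufficientLiveNodes(List<String> liveNatural,
                                          List<String> livePending,
                                          int quorum) throws UnavailableException
    {
        long alive = Stream.concat(liveNatural.stream(), livePending.stream()).count();
        if (alive < quorum)
            throw new UnavailableException(quorum, (int) alive);
    }

    public static void main(String[] args) throws UnavailableException
    {
        List<String> liveNatural = List.of("a", "b", "c", "d"); // 4 of 6 natural replicas up
        List<String> livePending = List.of();                   // the pending node is down
        int quorum = 4;                                         // quorum(6)
        int totalBlockFor = quorum + 1;                         // +1 for the pending node

        assureSufficientLiveNodes(liveNatural, livePending, quorum); // passes: 4 >= 4
        // ...but the handler waits for totalBlockFor == 5 acks from only 4
        // reachable replicas, so the query times out instead of failing fast.
        System.out.println("check passed, yet blockFor=" + totalBlockFor + " is unreachable");
    }
}
{code}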



> removenode can cause QUORUM write queries to fail
> -------------------------------------------------
>
>                 Key: CASSANDRA-15243
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15243
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Tom van der Woerdt
>            Priority: Normal
>


