[ 
https://issues.apache.org/jira/browse/CASSANDRA-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229944#comment-15229944
 ] 

Sylvain Lebresne commented on CASSANDRA-10041:
----------------------------------------------

bq. In this case, it is not possible to identify in which phase the counter 
mutation failed.

That's right, you can't identify in which phase the counter mutation failed. 
But given how counters currently work, we can't send you that information: the 
timeout is sent by the coordinator, which only gets acks once everything is 
finished, so if it doesn't get acks, it doesn't know which phase we're in. We'd 
need to change the protocol used internally, as suggested a long time ago in 
CASSANDRA-3199, but we've so far decided that the ROI for that wasn't good 
enough (mostly due to the huge headache of making this change while 
maintaining backward compatibility/rolling upgrades). Note in 
particular that even doing that wouldn't _avoid_ the timeout; it would just 
make a tiny bit more info available to the coordinator when it happens, and that 
info might not even be enough to tell whether the counter update has been 
persisted or not.
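
To make the point above concrete, here is a minimal toy model, not Cassandra code. It assumes only what is said above: the coordinator receives a single ack once everything is finished. The phase names, function names and the timeout message are hypothetical and exist only for illustration.

```python
import enum


class Phase(enum.Enum):
    # Hypothetical phase names, for illustration only.
    FIRST = 1
    SECOND = 2


class WriteTimeout(Exception):
    """What the coordinator reports; it carries no phase information."""


def replica_side_write(stalls_in=None):
    """Run every phase and return one ack at the very end, or None if it stalled."""
    for phase in Phase:
        if phase is stalls_in:
            return None  # nothing is sent back, so the coordinator sees only silence
    return "ack"


def coordinator_write(stalls_in=None):
    """Wait for the single final ack; silence is turned into a timeout."""
    if replica_side_write(stalls_in) is None:
        raise WriteTimeout("timeout: 0 acknowledged the write")
    return "ack"


def observed(stalls_in=None):
    """Return what the client sees for a given (hidden) failure phase."""
    try:
        return coordinator_write(stalls_in)
    except WriteTimeout as exc:
        return str(exc)
```

In this model `observed(Phase.FIRST)` and `observed(Phase.SECOND)` return the same string, which is the whole problem: the client cannot distinguish the two cases from the error alone.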

Overall, closing this issue as not a problem. Yes, whenever a node dies, some 
counter inserts can time out during the window it takes for the failure 
detector to mark that node dead, even if you have, in theory, enough 
nodes alive to fulfill the CL requirements. And yes, that's sad. But it's 
unfortunately an intrinsic limitation of the counter design for which we don't 
have a solution.

Or to put it another way, this is working as designed, which doesn't mean we 
disagree that this is a weakness of said design.
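
The window described above (between a node dying and the failure detector marking it dead) can be sketched roughly as follows. This is a toy model under a loud assumption that is not spelled out in this comment: each counter write is routed through a single replica, called the "leader" here, picked among the nodes currently believed alive. Node names and function names are hypothetical.

```python
REPLICAS = ("n1", "n2", "n3")        # replication factor 3, as in the report
QUORUM = len(REPLICAS) // 2 + 1      # 2 of 3


def counter_write(leader, really_up):
    """Outcome of one counter write routed through `leader` (assumed mechanism)."""
    if leader not in really_up:
        return "timeout"             # the chosen leader is dead and never answers
    acks = sum(1 for node in REPLICAS if node in really_up)
    return "ok" if acks >= QUORUM else "timeout"


def outcomes(believed_up, really_up):
    """Outcome for every leader the coordinator could pick from its own view."""
    return {leader: counter_write(leader, really_up) for leader in sorted(believed_up)}


really_up = {"n1", "n2"}                                  # n3 was just killed
during_window = outcomes({"n1", "n2", "n3"}, really_up)   # n3 not yet marked dead
after_detection = outcomes({"n1", "n2"}, really_up)       # n3 now marked dead
```

During the window, writes that happen to go through `n3` time out even though two live nodes would satisfy QUORUM; once `n3` is marked dead it is no longer picked and every write succeeds again.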

> "timeout during write query at consistency ONE" when updating counter at 
> consistency QUORUM and 2 of 3 nodes alive
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10041
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10041
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: centos 6.6 server, java version "1.8.0_45", cassandra 
> 2.1.8, 3 machines, keyspace with replication factor 3
>            Reporter: Anton Lebedevich
>             Fix For: 2.1.x
>
>
> Test scenario is: kill -9 one node, wait 60 seconds, start it back, wait till 
> it becomes available, wait 120 seconds (during that time all 3 nodes are up), 
> repeat with the next node. Application reads from one table and updates 
> counters in another table with consistency QUORUM. When one node out of 3 is 
> killed application logs this exception for several seconds:
> {noformat}
> Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: 
> Cassandra timeout during write query at consistency ONE (1 replica were 
> required but only 0 acknowledged the write)
>         at 
> com.datastax.driver.core.Responses$Error$1.decode(Responses.java:57) 
> ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>         at 
> com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37) 
> ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>         at 
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:204) 
> ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>         at 
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:195) 
> ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>         at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
>  [io.netty.netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         ... 13 common frames omitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
