[
https://issues.apache.org/jira/browse/CASSANDRA-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229944#comment-15229944
]
Sylvain Lebresne commented on CASSANDRA-10041:
----------------------------------------------
bq. In this case, it is not possible to identify in which phase the counter
mutation failed.
That's right, you can't (identify in which phase the counter mutation failed).
But given how counters currently work, we can't send you that information: the
timeout is sent by the coordinator, which only gets acks once everything is
finished, so if it doesn't get acks, it doesn't know which phase we're in. We'd
need to change the protocol used internally, as suggested a long time ago in
CASSANDRA-3199, but we've so far decided that the ROI for that wasn't good
enough (mostly due to the huge headache of making this change while
maintaining backward compatibility and rolling upgrades). Note in
particular that even doing that wouldn't _avoid_ the timeout; it would just
make a tiny bit more info available to the coordinator when it happens, and that
info might not even be enough to tell whether the counter update has been
persisted or not.
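To illustrate why the coordinator cannot report the phase, here is a toy model in Python. It is only a sketch and not Cassandra code: the phase names, the function names, and the rule that an update counts as durable once the leader has applied it are all simplifying assumptions made for the example. It shows that a single end-of-work ack produces the same timeout whichever phase stalled, even though the persistence outcome differs.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    # Hypothetical phase names, used by this toy model only.
    LEADER_APPLY = "leader replica applies the increment"
    REPLICATE = "leader replicates the result to the other replicas"


def coordinator_observes(stalled_in: Optional[Phase]) -> str:
    """The coordinator receives one ack, and only after every phase has finished."""
    return "ack" if stalled_in is None else "timeout"


def increment_persisted(stalled_in: Optional[Phase]) -> bool:
    """Simplifying assumption: the update is durable once the leader has applied it."""
    return stalled_in is not Phase.LEADER_APPLY


# Compare what the coordinator sees with what actually happened, per stalled phase.
for phase in Phase:
    seen = coordinator_observes(phase)
    persisted = increment_persisted(phase)
    print(f"stalled in {phase.name}: coordinator sees {seen}, persisted={persisted}")
```

In both cases the coordinator sees only a timeout, so a client that receives it cannot tell whether the increment was applied.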
Overall, closing this issue as not a problem. Yes, whenever a node dies, some
counter inserts can time out during the window it takes for the failure
detector to mark that node dead, and this happens even if you have, in theory,
enough nodes alive to fulfill the CL requirements. And yes, that's sad. But it's
unfortunately an intrinsic limitation of the counter design for which we don't
have a solution.
Or to put it another way, this is working as designed, which doesn't mean we
disagree that this is a weakness of said design.
> "timeout during write query at consistency ONE" when updating counter at
> consistency QUORUM and 2 of 3 nodes alive
> ------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-10041
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10041
> Project: Cassandra
> Issue Type: Bug
> Environment: centos 6.6 server, java version "1.8.0_45", cassandra
> 2.1.8, 3 machines, keyspace with replication factor 3
> Reporter: Anton Lebedevich
> Fix For: 2.1.x
>
>
> Test scenario is: kill -9 one node, wait 60 seconds, start it back, wait till
> it becomes available, wait 120 seconds (during that time all 3 nodes are up),
> repeat with the next node. The application reads from one table and updates
> counters in another table with consistency QUORUM. When one node out of 3 is
> killed, the application logs this exception for several seconds:
> {noformat}
> Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency ONE (1 replica were required but only 0 acknowledged the write)
>     at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:57) ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>     at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37) ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>     at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:204) ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>     at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:195) ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) [io.netty.netty-codec-4.0.27.Final.jar:4.0.27.Final]
>     ... 13 common frames omitted
> {noformat}