[ https://issues.apache.org/jira/browse/CASSANDRA-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913077#comment-17913077 ]
Benedict Elliott Smith edited comment on CASSANDRA-20205 at 1/14/25 9:38 PM:
-----------------------------------------------------------------------------

* Which version of Paxos?
* I presume you mean that attempts to write that particular partition fail, rather than all writes for the table?
* Are the replicas all in the same location, or in different regions?
* Do future queries fail, or time out?

If you can pick a specific partition that is failing, and provide a dump of the relevant system.paxos state data from each replica, I can take a look and see what additional information we might want to see. You can at least initially screen out the {{_commit}} blobs if you like, so that no user data is provided; if we need any information from there, we can explore options later.

> Failed lightweight transaction leaves Paxos in apparently unresolvable state
> ----------------------------------------------------------------------------
>
>                  Key: CASSANDRA-20205
>                  URL: https://issues.apache.org/jira/browse/CASSANDRA-20205
>              Project: Apache Cassandra
>           Issue Type: Bug
>             Reporter: Peter Machon
>             Priority: Normal
>
> In a three-node Cassandra cluster, I am consistently facing the same kind of
> fatal situation on tables that are written solely using Cassandra's
> lightweight transactions (CAS).
> Whenever a lightweight transaction fails to reach quorum (1/2), e.g.
> due to high load, any subsequent attempt to write data within a transaction
> fails, i.e. does not return {{"[applied]"=true}}.
> Using {{select * from system.paxos where cf_id=<id of table>}}, I see
> that there are entries, which I assume to be pending transactions.
> Further, in {{/var/log/Cassandra/system.log}} I see logs like:
> {quote}INFO [ScheduledTasks:1] 2025-01-12 21:46:53,005
> UncommittedTableData.java:567 - Scheduling uncommitted paxos data merge task
> for {{<any other table>}}
> {quote}
> {quote}INFO [OptionalTasks:1] 2025-01-12 21:46:53,006
> PaxosCleanupLocalCoordinator.java:89 - Completing uncommitted paxos instances
> for {{<table in stalled state>}} on ranges
> {quote}
> However, I can't figure out how to resolve this state: neither
> {{nodetool repair -full <keyspace>}} (and variations) nor restarting all
> nodes resolved the issue.
> _Further information:_
> * Cassandra version: 4.1.5
> * OS: Ubuntu 22.04
> * replication strategy: SimpleStrategy
> * replication factor: 3
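A minimal sketch of the screening suggested in the comment, assuming each replica's rows from {{select * from system.paxos where cf_id=<id of table>}} have already been read into Python dictionaries keyed by column name. The helper name {{screen_paxos_rows}} and the placeholder values are hypothetical; the only rule it applies is dropping every column whose name ends in {{_commit}} (such as the {{most_recent_commit}} blob), while keeping the ballot and timestamp columns that are useful for diagnosis.

```python
from typing import Any, Dict, Iterable, List, Tuple


def screen_paxos_rows(
    rows: Iterable[Dict[str, Any]],
    suffixes: Tuple[str, ...] = ("_commit",),
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return copies of the rows without columns whose
    name ends in one of the given suffixes (by default only "_commit")."""
    screened = []
    for row in rows:
        # str.endswith accepts a tuple, so several suffixes can be screened at once.
        screened.append(
            {col: val for col, val in row.items() if not col.endswith(suffixes)}
        )
    return screened


if __name__ == "__main__":
    # Placeholder values only; a real dump would be taken from each replica.
    replica_rows = [{
        "row_key": "<partition key>",
        "cf_id": "<id of table>",
        "in_progress_ballot": "<timeuuid>",
        "proposal_ballot": "<timeuuid>",
        "most_recent_commit_at": "<timeuuid>",
        "most_recent_commit": b"<user data>",
    }]
    for row in screen_paxos_rows(replica_rows):
        print(sorted(row))
```

Because the suffix rule keeps {{most_recent_commit_at}} (it ends in {{_at}}), the commit timestamps remain available for comparison across replicas. Other columns can be screened by passing additional suffixes.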