[ 
https://issues.apache.org/jira/browse/CASSANDRA-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913077#comment-17913077
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-20205 at 1/14/25 9:38 PM:
-----------------------------------------------------------------------------

* Which version of Paxos? 
* I presume you mean attempts to write that particular partition fail, rather 
than all writes for the table? 
* Are the replicas all in the same location, or different regions? 
* Do future queries fail, or time out?

If you can pick a specific partition that is failing, and provide a dump of the 
relevant system.paxos state data from each replica, I can take a look and see 
what additional information we might want to see. You can at least initially 
screen out the {{_commit}} blobs if you like, so no user data is provided - if 
we need any information from there, we can explore options later.
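
A minimal sketch of that screening step, assuming each system.paxos row has 
already been read from a replica into a plain dict. The helper name 
{{screen_paxos_row}} is hypothetical, and the column names in the usage note 
are assumptions to check against the actual schema.

```python
def screen_paxos_row(row, drop_suffix="_commit"):
    """Return a copy of one row without the columns whose name ends in
    drop_suffix; remaining binary values are hex-encoded for sharing."""
    screened = {}
    for name, value in row.items():
        # Skip the commit blobs, which may carry user data.
        if name.endswith(drop_suffix):
            continue
        # Hex-encode any remaining binary values so the dump is plain text.
        if isinstance(value, (bytes, bytearray)):
            value = bytes(value).hex()
        screened[name] = value
    return screened
```

For example, a row dict with the (assumed) keys {{row_key}}, 
{{most_recent_commit}} and {{most_recent_commit_at}} would keep {{row_key}} and 
{{most_recent_commit_at}} and drop {{most_recent_commit}}. Running this over 
the rows read from each replica in turn gives one screened dump per replica.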


was (Author: benedict):
* Which version of Paxos? 
* I presume you mean attempts to write that particular partition fail, rather 
than all writes for the table? 
* Are the replicas all in the same location, or different regions? 
* Do future queries fail, or time out?

If you can pick a specific partition that is failing, and provide a dump of the 
relevant system.paxos state data I can take a look and see what additional 
information we might want to see. You can at least initially screen out the 
{{_commit}} blobs if you like, so no user data is provided - if we need any 
information from there, we can explore options later.

> Failed lightweight transaction leaves Paxos in apparently unresolvable state
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20205
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20205
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Peter Machon
>            Priority: Normal
>
> In a three-node Cassandra cluster I am consistently facing the same kind of 
> fatal situation on tables that are solely written using Cassandra's 
> lightweight transactions (CAS).
> Whenever a lightweight transaction fails to reach quorum (1/2), e.g. due to 
> high load, any following attempt to write data within a transaction fails, 
> i.e. does not return {{"[applied]"=true}}.
> Using {{select * from system.paxos where cf_id=<id of table>}}, I see 
> that there are entries, which I assume to be pending transactions.
> Further, in {{/var/log/Cassandra/system.log}} I see logs like:
> {quote}INFO [ScheduledTasks:1] 2025-01-12 21:46:53,005 
> UncommittedTableData.java:567 - Scheduling uncommitted paxos data merge task 
> for {{<any other table>}}
> {quote}
> {quote}INFO [OptionalTasks:1] 2025-01-12 21:46:53,006 
> PaxosCleanupLocalCoordinator.java:89 - Completing uncommitted paxos instances 
> for {{<table in stalled state>}} on ranges
> {quote}
> However, I can't figure out how to resolve this state. Neither 
> {{nodetool repair -full <keyspace>}} (and variations) nor restarting all 
> nodes resolved the issue.
> _Further information:_
>  * Cassandra version: 4.1.5
>  * OS: Ubuntu 22.04
>  * replication strategy: SimpleStrategy
>  * replication factor: 3


