[ https://issues.apache.org/jira/browse/CASSANDRA-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913489#comment-17913489 ]
Benedict Elliott Smith edited comment on CASSANDRA-20205 at 1/15/25 11:07 PM:
------------------------------------------------------------------------------

Oh, I am sorry. I misread your reply.

If the column is missing entirely, I am not sure how best to proceed, as this should not be affected by any lower-level issues that might be caused by a failed Paxos operation. The column is inserted at a higher level, and should only be missing if the transaction fails due to some exception, which should be reported instead of any result set.

Have you actually tried reading any of these rows to see if the data is being updated? It is possible the transactions are failing to apply due to some lower-level issue, but that would not by itself explain the behaviour you are seeing.

A few things you could try to see if they help:

1) Disable auto Paxos repairs ({{-Dcassandra.disable_paxos_auto_repairs}}) or simply all Paxos repairs ({{paxos_repair_enabled: false}}). Only turn off the latter if you have not updated your {{paxos_state_purging}} setting from the default.
2) Upgrade to Paxos v2 ({{paxos_variant: 'v2'}}), and play with the {{ContentionStrategy}} settings.

I will have a think about how else we might proceed with diagnosing the issue remotely, but to warn you, I don't have a lot of free time to assist this investigation. In the meantime, if you want to provide whatever other information you can, such as full logs (sanitised however you like) and the names of the affected tables, it might provide some clues to unpick.

> Failed lightweight transaction leaves Paxos in apparently unresolvable state
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20205
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20205
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Peter Machon
>            Priority: Normal
>         Attachments: paxos_1.csv, paxos_2.csv, paxos_3.csv
>
>
> In a three-node Cassandra cluster I am consistently facing the same kind of
> fatal situation on tables that are solely written using Cassandra's
> lightweight transactions (CAS).
> Whenever a lightweight transaction fails to reach quorum (1/2), e.g. due to
> high load, any following attempt to write data within a transaction fails,
> i.e. does not return {{"[applied]"=true}}.
> Using {{select * from system.paxos where cf_id=<id of table>}}, I see
> that there are entries, which I assume to be pending transactions.
> Further, in {{/var/log/Cassandra/system.log}} I see logs like:
> {quote}INFO [ScheduledTasks:1] 2025-01-12 21:46:53,005
> UncommittedTableData.java:567 - Scheduling uncommitted paxos data merge task
> for {{<any other table>}}
> {quote}
> {quote}INFO [OptionalTasks:1] 2025-01-12 21:46:53,006
> PaxosCleanupLocalCoordinator.java:89 - Completing uncommitted paxos instances
> for {{<table in stalled state>}} on ranges
> {quote}
> However, I can't figure out how to resolve the state: {{nodetool repair -full
> <keyspace>}} (and variations), as well as restarting all nodes, did not
> resolve the issue.
> _Further information:_
> * Cassandra version: 4.1.5
> * OS: Ubuntu 22.04
> * replication strategy: SimpleStrategy
> * replication factor: 3
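
The comment above separates two situations that are easy to conflate when inspecting results by hand: a result row whose {{[applied]}} column is false, and a result with no {{[applied]}} column at all. The following is a minimal Python sketch of that distinction, not part of Cassandra or of any driver. It assumes each result row is available as a plain mapping from column name to value (for example, from a driver configured to return dict-like rows); the names {{CasOutcome}} and {{classify_cas_row}} are hypothetical and introduced only for illustration.

```python
from enum import Enum
from typing import Any, Mapping, Optional

APPLIED_COLUMN = "[applied]"  # column name reported for conditional (IF ...) statements


class CasOutcome(Enum):
    """Hypothetical labels for the three cases discussed in the thread."""
    APPLIED = "applied"
    NOT_APPLIED = "not applied"
    COLUMN_MISSING = "column missing"


def classify_cas_row(row: Optional[Mapping[str, Any]]) -> CasOutcome:
    """Classify the first row returned by a lightweight transaction.

    `row` is assumed to be a dict-like mapping of column names to values,
    or None when the result set was empty.
    """
    if row is None or APPLIED_COLUMN not in row:
        # The case the comment calls "missing entirely": no [applied] column.
        return CasOutcome.COLUMN_MISSING
    if row[APPLIED_COLUMN] is True:
        return CasOutcome.APPLIED
    return CasOutcome.NOT_APPLIED


if __name__ == "__main__":
    # Illustrative rows only; these are not taken from the reporter's cluster.
    samples = [
        {"[applied]": True},
        {"[applied]": False, "owner": "someone-else"},
        {"owner": "someone-else"},
        None,
    ]
    for sample in samples:
        print(sample, "->", classify_cas_row(sample).value)
```

Logging the classified outcome next to the statement and partition key for each failed write would make it clearer which of the two situations the affected tables are actually in.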