[jira] [Comment Edited] (CASSANDRA-20205) Failed lightweight transaction leaves Paxos in apparently unresolvable state

Benedict Elliott Smith (Jira) Wed, 15 Jan 2025 15:16:01 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913489#comment-17913489
 ]


Benedict Elliott Smith edited comment on CASSANDRA-20205 at 1/15/25 11:07 PM:
------------------------------------------------------------------------------

Oh, I am sorry. I misread your reply. If the column is missing entirely, I am 
not sure how best to proceed, as this should not be affected by any lower-level 
issues that might be caused by a failed Paxos operation. The column is inserted 
at a higher level, and should only be missing if the transaction fails due to 
some exception, in which case the exception should be reported instead of any 
result set. 
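
To make the distinction concrete, here is a minimal sketch in Python of telling 
"column missing" apart from "not applied". It is an illustration only, not taken 
from the reporter's code: the names {{classify_lwt_row}} and {{LwtOutcome}} are 
hypothetical, and rows are assumed to be plain dicts (for example, what the 
Python driver would return with its {{dict_factory}} row factory).

```python
from enum import Enum
from typing import Any, Mapping

# Name of the column Cassandra adds to the result of a conditional write.
APPLIED_COLUMN = "[applied]"


class LwtOutcome(Enum):
    """Hypothetical classification of one lightweight-transaction result row."""
    APPLIED = "applied"
    NOT_APPLIED = "not applied"
    COLUMN_MISSING = "column missing"


def classify_lwt_row(row: Mapping[str, Any]) -> LwtOutcome:
    """Classify a result row (assumed to be a dict keyed by column name)."""
    # A row without the column is a different symptom from "[applied]" = false.
    if APPLIED_COLUMN not in row:
        return LwtOutcome.COLUMN_MISSING
    # Anything other than a literal True is treated as "not applied".
    if row[APPLIED_COLUMN] is True:
        return LwtOutcome.APPLIED
    return LwtOutcome.NOT_APPLIED
```

Logging the outcome per statement would show whether the failing writes return 
{{"[applied]"=false}} or return a row without the column at all.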

Have you actually tried reading any of these rows to see if the data is being 
updated?

It is possible the transactions are failing to apply due to some lower-level 
issue, but that would not by itself explain the behaviour you are seeing. A few 
things you could try to see if they help:

1) Disable automatic Paxos repairs ({{-Dcassandra.disable_paxos_auto_repairs}}) 
or simply all Paxos repairs ({{paxos_repair_enabled: false}}). Only turn off the 
latter if you have not changed your {{paxos_state_purging}} setting from the 
default.
2) Upgrade to Paxos v2 ({{paxos_variant: 'v2'}}), and experiment with the 
{{ContentionStrategy}} settings.

I will have a think about how else we might proceed with diagnosing the issue 
remotely, but I should warn you that I don't have a lot of free time to assist 
with this investigation.

In the meantime, if you can provide any other information, such as full logs 
(sanitised however you like) and the names of the affected tables, it might 
give us some clues to unpick.


> Failed lightweight transaction leaves Paxos in apparently unresolvable state
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20205
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20205
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Peter Machon
>            Priority: Normal
>         Attachments: paxos_1.csv, paxos_2.csv, paxos_3.csv
>
>
> In a three-node Cassandra cluster, I am consistently facing the same kind of 
> fatal situation on tables that are written solely using Cassandra's 
> lightweight transactions (CAS).
> Whenever a lightweight transaction fails to reach quorum (1/2), e.g. due to 
> high load, any following attempt to write data within a transaction fails, 
> i.e. does not return {{"[applied]"=true}}.
> Using {{select * from system.paxos where cf_id=<id of table>}}, I see 
> that there are entries, which I assume to be pending transactions.
> Further, in {{/var/log/Cassandra/system.log}} I see logs like:
> {quote}INFO [ScheduledTasks:1] 2025-01-12 21:46:53,005 
> UncommittedTableData.java:567 - Scheduling uncommitted paxos data merge task 
> for {{<any other table>}}
> {quote}
> {quote}INFO [OptionalTasks:1] 2025-01-12 21:46:53,006 
> PaxosCleanupLocalCoordinator.java:89 - Completing uncommitted paxos instances 
> for {{<table in stalled state>}} on ranges
> {quote}
> However, I can't figure out how to resolve the state. Neither 
> {{nodetool repair -full <keyspace>}} (and variations) nor restarting all 
> nodes resolved the issue.
> _Further information:_
>  * Cassandra version: 4.1.5
>  * OS: Ubuntu 22.04
>  * replication strategy: SimpleStrategy
>  * replication factor: 3



