[jira] [Comment Edited] (CASSANDRA-19130) Implement transactional table truncation

Abe Ratnofsky (Jira) Fri, 27 Sep 2024 11:58:03 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885454#comment-17885454
 ]


Abe Ratnofsky edited comment on CASSANDRA-19130 at 9/27/24 6:57 PM:
--------------------------------------------------------------------

I think we should allow "legacy" truncations coordinated by non-TCM instances 
to work as they do now, but reject truncations coordinated by TCM instances 
when the cluster contains non-TCM instances. To reject truncations coordinated 
by non-TCM instances, we'd have to ship a flag in a pre-TCM release that lets 
those instances reject truncation, and users don't always upgrade to the 
latest minor before upgrading to a TCM version. So maintaining the legacy 
behavior when the coordinator is pre-TCM feels appropriate to me.

TCM coordinators rejecting truncate during mixed-version mode should be fine 
too, since truncate currently does not work if any nodes are down, and 
mixed-version mode happens during an upgrade, when nodes are being bounced 
anyway.
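
A minimal sketch of the coordinator-side policy described above, written as a 
toy model. The class, enum, and method names are hypothetical and do not exist 
in Cassandra; the only inputs are whether the coordinator runs TCM and whether 
every instance in the cluster does.

```java
import java.util.List;

/** Toy model of the proposed admission check; none of these names exist in Cassandra. */
final class TruncateAdmission {
    enum Decision { LEGACY_TRUNCATE, REJECT, TCM_TRUNCATE }

    /**
     * @param coordinatorRunsTcm whether the node coordinating the TRUNCATE runs TCM
     * @param clusterRunsTcm     one flag per instance in the cluster, true if it runs TCM
     */
    static Decision decide(boolean coordinatorRunsTcm, List<Boolean> clusterRunsTcm) {
        // Pre-TCM coordinators cannot be taught to reject, so they keep today's behavior.
        if (!coordinatorRunsTcm)
            return Decision.LEGACY_TRUNCATE;

        // A TCM coordinator rejects while any non-TCM instance remains (mixed-version mode).
        boolean allTcm = clusterRunsTcm.stream().allMatch(Boolean::booleanValue);
        return allTcm ? Decision.TCM_TRUNCATE : Decision.REJECT;
    }
}
```

For example, decide(true, List.of(true, false)) yields REJECT, while 
decide(false, List.of(true, false)) yields LEGACY_TRUNCATE.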

That leaves the path when all instances are running TCM.

First, we'll be able to set the truncation timestamp in the coordinator and 
propagate it with TCM, so all replicas store the same truncation timestamp in 
SystemKeyspace and skip the same mutations during commit log replay. If users 
run mutations concurrently with a truncation, though, the outcome is 
undefined: replicas that receive a mutation before the truncation will drop 
it, while replicas that receive it after the truncation will keep it, and 
then drop it the next time they recover from the commitlog. We have no way to 
agree on an order between mutation timestamps and the truncation timestamp, 
since mutation timestamps can be custom but truncation timestamps are always 
wall-clock on the coordinator.
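
To make the divergence concrete, here is a toy model, not Cassandra code. 
ToyReplica and its methods are hypothetical, and it assumes a truncation simply 
discards whatever the replica has applied so far, whatever the write 
timestamp. Commit log replay is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replica: holds the write timestamps of the mutations it currently stores. */
final class ToyReplica {
    private final List<Long> liveWriteTimestamps = new ArrayList<>();

    void applyMutation(long writeTimestamp) {
        liveWriteTimestamps.add(writeTimestamp);
    }

    /** Assumption: truncation drops everything applied so far, ignoring write timestamps. */
    void applyTruncation() {
        liveWriteTimestamps.clear();
    }

    List<Long> read() {
        return List.copyOf(liveWriteTimestamps);
    }

    /** Delivers the same mutation and truncation to two replicas in opposite orders. */
    static boolean replicasAgree(long writeTimestamp) {
        ToyReplica a = new ToyReplica();
        a.applyMutation(writeTimestamp);   // mutation arrives first ...
        a.applyTruncation();               // ... and is dropped by the truncation

        ToyReplica b = new ToyReplica();
        b.applyTruncation();               // truncation arrives first ...
        b.applyMutation(writeTimestamp);   // ... so the mutation survives

        return a.read().equals(b.read());
    }
}
```

In this model replicasAgree returns false for any write timestamp, because 
only arrival order decides whether the mutation survives.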

If we really want transactional TRUNCATE support, we should have TRUNCATE work 
via mutation timestamps (rather than wall-clock timestamps), and support 
TRUNCATE USING TIMESTAMP. Users could then expect that, after any TRUNCATE, no 
future read returns data with a mutation timestamp earlier than truncated_at. 
Using mutation timestamps would also permit easier integration with Accord's 
transactional_mode=full.
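
A rough sketch of the read-side guarantee that a timestamp-based TRUNCATE 
would give, again as a toy model with hypothetical names. It assumes data is 
represented only by its mutation timestamps, and that a timestamp equal to 
truncated_at stays visible, since only earlier timestamps are excluded above.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Toy model of filtering by mutation timestamp; not how Cassandra's read path works today. */
final class TimestampTruncation {
    /**
     * Returns the mutation timestamps still visible after TRUNCATE USING TIMESTAMP truncatedAt.
     * Assumption: anything strictly earlier than truncatedAt is hidden, ties are kept.
     */
    static List<Long> visibleAfterTruncate(List<Long> writeTimestamps, long truncatedAt) {
        return writeTimestamps.stream()
                              .filter(ts -> ts >= truncatedAt)
                              .collect(Collectors.toList());
    }
}
```

Because the filter depends only on the timestamps, the result is the same 
regardless of the order in which a replica receives the mutations and the 
truncation.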

 

From [~samt]'s earlier comment:

> The way truncation works is that it writes a timestamp into a system table on 
> each node, associated with the table being truncated (and a commitlog 
> position). Then, when local reads and writes are done against that table, any 
> cells with a timestamp earlier than the truncation is essentially discarded.

I'm not seeing this, at least on trunk. SystemKeyspace.getTruncatedAt isn't 
called on the read path. This test fails because a single row is returned:
{code:java}
    @Test
    public void testTruncate()
    {
        createTable("CREATE TABLE %s (k int PRIMARY KEY, v int)");
        SystemKeyspace.saveTruncationRecord(getCurrentColumnFamilyStore(), 101, CommitLogPosition.NONE);
        execute("INSERT INTO %s (k, v) VALUES (1, 1) USING TIMESTAMP ?", 100L);
        // Should be empty since truncated after write
        assertRowCount(execute("SELECT * FROM %s"), 0);
    }
{code}


> Implement transactional table truncation
> ----------------------------------------
>
>                 Key: CASSANDRA-19130
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19130
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Consistency/Coordination
>            Reporter: Marcus Eriksson
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 5.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> TRUNCATE table should leverage cluster metadata to ensure consistent 
> truncation timestamps across all replicas. The current implementation depends 
> on all nodes being available, but this could be reimplemented as a 
> {{Transformation}}.


