[
https://issues.apache.org/jira/browse/CASSANDRA-19130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885454#comment-17885454
]
Abe Ratnofsky edited comment on CASSANDRA-19130 at 9/27/24 6:57 PM:
--------------------------------------------------------------------
I think we should allow "legacy" truncations coordinated by non-TCM instances
to work as they do now, but reject truncations coordinated by TCM instances
when the cluster contains non-TCM instances. To reject truncations coordinated
by non-TCM instances, we'd have to roll out a flag that lets pre-TCM instances
reject truncation, and users don't always upgrade to the latest minor before a
major upgrade. So maintaining the legacy behavior when the coordinator is
pre-TCM feels appropriate to me.
TCM coordinators rejecting truncate during mixed-version mode should be fine
too, since currently truncate does not work if any nodes are down, and
mixed-version mode happens during an upgrade when nodes are restarting for
bounces anyway.
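To make the proposed policy concrete, here is a minimal sketch of the coordinator-side decision. It is a toy model, not Cassandra code: the class TruncateGate, the Decision enum and the boolean flags are all hypothetical names introduced for illustration.

```java
import java.util.List;

// Hypothetical model of the proposed policy; none of these names exist in Cassandra.
final class TruncateGate
{
    enum Decision { LEGACY, REJECT, TCM }

    // coordinatorRunsTcm: whether the node coordinating the TRUNCATE runs TCM.
    // peersRunTcm: one flag per other instance in the cluster.
    static Decision decide(boolean coordinatorRunsTcm, List<Boolean> peersRunTcm)
    {
        // A pre-TCM coordinator keeps today's behavior, since it cannot be taught to reject.
        if (!coordinatorRunsTcm)
            return Decision.LEGACY;

        // A TCM coordinator rejects while any non-TCM instance remains (mixed-version mode).
        boolean allTcm = peersRunTcm.stream().allMatch(Boolean::booleanValue);
        return allTcm ? Decision.TCM : Decision.REJECT;
    }
}
```

Only the third outcome, where every instance runs TCM, needs new truncation logic.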
That leaves the path when all instances are running TCM.
First, we'll be able to set the truncation timestamp on the coordinator and
propagate it with TCM, so all replicas store the same truncation timestamp in
SystemKeyspace and skip the same mutations during commit log replay. If users
run mutations and truncations concurrently, the behavior is undefined:
replicas that receive a mutation before the truncation will drop the mutation,
while replicas that receive it after the truncation will keep it, then drop it
when they next recover from the commitlog. We have no way to agree on an order
between mutation timestamps and the truncation timestamp, since mutation
timestamps can be custom but truncation timestamps are always wall-clock time
on the coordinator.
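The divergence before any commitlog recovery can be shown with a toy replica that only tracks which mutations are live. This is a sketch under simplifying assumptions: ToyReplica and its methods are hypothetical, and commitlog replay is deliberately not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy replica; it models arrival order only, not Cassandra's storage engine.
final class ToyReplica
{
    private final List<Long> liveMutationTimestamps = new ArrayList<>();
    private long truncatedAt = Long.MIN_VALUE;

    // A mutation is kept regardless of its (possibly custom) timestamp.
    void applyMutation(long mutationTimestamp)
    {
        liveMutationTimestamps.add(mutationTimestamp);
    }

    // A truncation discards whatever has arrived so far and records the coordinator's timestamp.
    void applyTruncation(long wallClockTruncatedAt)
    {
        liveMutationTimestamps.clear();
        truncatedAt = wallClockTruncatedAt;
    }

    long truncatedAt()
    {
        return truncatedAt;
    }

    int liveRowCount()
    {
        return liveMutationTimestamps.size();
    }
}
```

Applying a mutation with timestamp 100 and a truncation with timestamp 101 in opposite orders leaves one toy replica empty and the other holding the row, even though both record the same truncation timestamp.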
If we really want transactional TRUNCATE support, we should have TRUNCATE work
via mutation timestamps (rather than wall-clock timestamps), and support
TRUNCATE USING TIMESTAMP. Users could then expect that, after any TRUNCATE, no
future read returns data with a mutation timestamp earlier than truncated_at.
Using mutation timestamps would also permit easier integration with Accord's
transactional_mode=full.
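A minimal sketch of that guarantee, assuming truncation is evaluated purely against mutation timestamps: the class and method below are hypothetical, and keeping a mutation whose timestamp equals truncated_at is my assumption, since only "earlier than" is specified above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of timestamp-based truncation; not an existing Cassandra API.
final class TimestampTruncation
{
    // Returns the mutation timestamps still visible to reads after a truncation at truncatedAt.
    // The result depends only on timestamps, not on the order in which replicas saw the events.
    static List<Long> visibleAfterTruncate(List<Long> mutationTimestamps, long truncatedAt)
    {
        return mutationTimestamps.stream()
                                 .filter(ts -> ts >= truncatedAt) // assumption: equal timestamps survive
                                 .collect(Collectors.toList());
    }
}
```

Because the filter ignores arrival order, every replica holding the same mutations would answer reads identically.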
From [~samt]'s earlier comment:
> The way truncation works is that it writes a timestamp into a system table on
> each node, associated with the table being truncated (and a commitlog
> position). Then, when local reads and writes are done against that table, any
> cells with a timestamp earlier than the truncation are essentially discarded.
I'm not seeing this, at least on trunk. SystemKeyspace.getTruncatedAt isn't
called on the read path. This test fails because a single row is returned:
{code:java}
@Test
public void testTruncate()
{
    createTable("CREATE TABLE %s (k int PRIMARY KEY, v int)");
    // Record a truncation at timestamp 101, with no commitlog position
    SystemKeyspace.saveTruncationRecord(getCurrentColumnFamilyStore(),
                                        101, CommitLogPosition.NONE);
    // Write with mutation timestamp 100, earlier than the truncation record
    execute("INSERT INTO %s (k, v) VALUES (1, 1) USING TIMESTAMP ?", 100L);
    // Should be empty since truncated after write
    assertRowCount(execute("SELECT * FROM %s"), 0);
}
{code}
> Implement transactional table truncation
> ----------------------------------------
>
> Key: CASSANDRA-19130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19130
> Project: Cassandra
> Issue Type: New Feature
> Components: Consistency/Coordination
> Reporter: Marcus Eriksson
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 5.x
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> TRUNCATE table should leverage cluster metadata to ensure consistent
> truncation timestamps across all replicas. The current implementation depends
> on all nodes being available, but this could be reimplemented as a
> {{Transformation}}.