[
https://issues.apache.org/jira/browse/CASSANDRA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233045#comment-17233045
]
Caleb Rackliffe commented on CASSANDRA-16226:
---------------------------------------------
bq. I don't think that the temporary performance issue after DROP COMPACT
STORAGE can be fixed without tracking the dropped status.
One proposal I've been rolling around in my head is the following:
1.) Whenever we execute {{DROP COMPACT STORAGE}}, we record the keyspace,
table, and timestamp of the operation in a new table called
{{drop_compact_history}} (i.e. something similar to {{compaction_history}}) in
the {{system}} keyspace.
2.) Much like we do w/ dropped columns, we have the {{TableMetadata}} builder
pull information about when, if ever, CS was dropped. By default, we use
{{Long.MIN_VALUE}} to indicate that it has never been dropped. (If it's never
been dropped and we have a CQL table, the table has never been compact, etc.)
3.) At read time, in {{SinglePartitionReadCommand#canRemoveRow()}}, we do the
following in order:
a.) Short-circuit and return {{false}} if we have a non-empty primary key
timestamp that is {{<=}} the current SSTable max timestamp. (We can do this
check regardless of how the SSTable was created.) This preserves this
optimization for tables that have never been compact, but also allows it when a
table that was created as compact, and no longer is, has written new SSTables w/
real primary key liveness info.
b.) If the primary key liveness information is empty, and we have a CQL table
that has never had {{DROP COMPACT STORAGE}} executed, we return {{false}},
which preserves the current optimization.
c.) Otherwise, we fall through to the column-level logic. (i.e. Empty primary
key liveness info for any table *created* w/ {{COMPACT STORAGE}} is ignored.)
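The order of checks in step 3 can be sketched roughly as follows. This is a minimal, self-contained model and not the actual {{SinglePartitionReadCommand}} code: {{Table}}, {{Row}}, {{NO_TIMESTAMP}}, and {{NEVER_DROPPED}} are hypothetical stand-ins for table metadata, a row's liveness info and cells, empty liveness info, and the {{Long.MIN_VALUE}} default from step 2.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CanRemoveRowSketch {
    /** Hypothetical marker for empty primary key liveness info. */
    static final long NO_TIMESTAMP = Long.MIN_VALUE;
    /** Sentinel from step 2: DROP COMPACT STORAGE was never executed. */
    static final long NEVER_DROPPED = Long.MIN_VALUE;

    /** Simplified table metadata (an assumption, not the real TableMetadata). */
    static final class Table {
        final boolean isCql;                // false while the table is still compact
        final long compactStorageDroppedAt; // NEVER_DROPPED unless a drop was recorded

        Table(boolean isCql, long compactStorageDroppedAt) {
            this.isCql = isCql;
            this.compactStorageDroppedAt = compactStorageDroppedAt;
        }
    }

    /** Simplified row: primary key liveness timestamp plus one timestamp per live cell. */
    static final class Row {
        final long pkLivenessTimestamp;
        final Map<String, Long> cellTimestamps;

        Row(long pkLivenessTimestamp, Map<String, Long> cellTimestamps) {
            this.pkLivenessTimestamp = pkLivenessTimestamp;
            this.cellTimestamps = cellTimestamps;
        }
    }

    static boolean canRemoveRow(Table table, Row row, List<String> requestedColumns,
                                long sstableMaxTimestamp) {
        boolean pkEmpty = row.pkLivenessTimestamp == NO_TIMESTAMP;

        // 3a: non-empty primary key liveness info that is not newer than the SSTable.
        if (!pkEmpty && row.pkLivenessTimestamp <= sstableMaxTimestamp)
            return false;

        // 3b: empty liveness info on a CQL table that has never been compact.
        if (pkEmpty && table.isCql && table.compactStorageDroppedAt == NEVER_DROPPED)
            return false;

        // 3c: column-level logic; every requested column needs a cell newer than the SSTable.
        for (String column : requestedColumns) {
            Long timestamp = row.cellTimestamps.get(column);
            if (timestamp == null || timestamp <= sstableMaxTimestamp)
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Table that was created compact and had compact storage dropped at t=1500.
        Table droppedCompact = new Table(true, 1500L);
        // Row upgraded from a pre-3.0 SSTable: no primary key liveness info, one cell at t=2500.
        Row row = new Row(NO_TIMESTAMP, Collections.singletonMap("v", 2500L));
        System.out.println(canRemoveRow(droppedCompact, row, Arrays.asList("v"), 2000L)); // prints true
    }
}
```

In this model, a row with empty liveness info is only eligible for the skip when the table is, or once was, compact; dropping the 3b branch would extend that to all tables.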
The one slight oddity of this proposal is that it could ignore primary key
liveness info for SSTables that were flushed or compacted after compact storage
was dropped. This seems benign, but I need to think through it a bit more.
Specifically...
- If there is a single empty or "out-of-date" column, we _won't_ be able to
skip the SSTable no matter what the primary key liveness info looks like.
- If all the column values are present and newer than anything that could be in
the SSTable we're inspecting (or no regular columns are selected), we should be
able to skip the SSTable with empty primary key liveness info. My assumption is
the current logic doesn't allow the skip on empty liveness info as an
optimization because it is _assumed_ for CQL tables that there are no live
cells in that case. (The opposite, non-empty liveness info with no live cells,
_is_ possible.)
The other thing is that, assuming the above is correct, I'm not sure step 3b is
even necessary. If it isn't, recording the drop isn't necessary either.
> COMPACT STORAGE SSTables created before 3.0 are not correctly skipped by
> timestamp due to missing primary key liveness info
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16226
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Labels: perfomance, upgrade
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This was discovered while tracking down a spike in the number of SSTables
> per read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0,
> there is no direct analog of 3.0's primary key liveness info. When we upgrade
> 2.1 COMPACT STORAGE SSTables to the mf format, we simply don't write row
> timestamps, even if the original mutations were INSERTs. On read, when we
> look at SSTables in order from newest to oldest max timestamp, we expect to
> have this primary key liveness information to determine whether we can skip
> older SSTables after finding completely populated rows.
> ex. I have three SSTables in a COMPACT STORAGE table with max timestamps
> 1000, 2000, and 3000. There are many rows in a particular partition, making
> filtering on the min and max clustering effectively a no-op. All data is
> inserted, and there are no partial updates. A fully specified row with
> timestamp 2500 exists in the SSTable with a max timestamp of 3000. With a
> proper row timestamp in hand, we can easily ignore the SSTables w/ max
> timestamps of 1000 and 2000. Without it, we read 3 SSTables instead of 1,
> which likely means a significant performance regression.
> The following test illustrates this difference in behavior between 2.1 and
> 3.0:
> https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800
> A solution here might be as simple as having
> {{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key
> liveness information for non-compact/CQL tables. Tombstones seem to be
> handled at a level above that anyway. (One potential problem with that is
> whether or not the distinction will continue to exist in 4.0, and dropping
> compact storage from a table doesn't magically make pk liveness information
> appear.)
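The three-SSTable example in the quoted description can be reproduced with a toy model. The sketch below is an assumption-laden simplification, not Cassandra code: it tracks a single fully populated row found in the newest SSTable, uses the hypothetical {{NO_TIMESTAMP}} marker for missing primary key liveness info, and counts how many SSTables are visited when walking from newest to oldest max timestamp.

```java
final class SSTableSkipSketch {
    /** Hypothetical marker for missing primary key liveness info. */
    static final long NO_TIMESTAMP = Long.MIN_VALUE;

    /** Mirrors the current behaviour described in the issue: empty liveness info blocks the skip. */
    static boolean canRemoveRow(long pkLivenessTimestamp, long[] cellTimestamps,
                                long nextSSTableMaxTimestamp) {
        if (pkLivenessTimestamp == NO_TIMESTAMP || pkLivenessTimestamp <= nextSSTableMaxTimestamp)
            return false;
        for (long timestamp : cellTimestamps) {
            if (timestamp <= nextSSTableMaxTimestamp)
                return false;
        }
        return true;
    }

    /**
     * Counts the SSTables read for one row that is fully present in the newest SSTable.
     * The newest SSTable is always read; each older one is read unless the row can be removed.
     */
    static int sstablesRead(long[] maxTimestampsNewestFirst, long pkLivenessTimestamp,
                            long[] cellTimestamps) {
        int read = 0;
        for (long maxTimestamp : maxTimestampsNewestFirst) {
            if (read > 0 && canRemoveRow(pkLivenessTimestamp, cellTimestamps, maxTimestamp))
                break; // nothing older can change the result for this row
            read++;
        }
        return read;
    }

    public static void main(String[] args) {
        long[] sstables = {3000L, 2000L, 1000L};
        long[] cells = {2500L};
        System.out.println(sstablesRead(sstables, 2500L, cells));        // prints 1
        System.out.println(sstablesRead(sstables, NO_TIMESTAMP, cells)); // prints 3
    }
}
```

With a row timestamp of 2500 the walk stops after the first SSTable; with no row timestamp, as in upgraded 2.1 COMPACT STORAGE SSTables, all three are read.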
--
This message was sent by Atlassian Jira
(v8.3.4#803005)