[
https://issues.apache.org/jira/browse/CASSANDRA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Caleb Rackliffe updated CASSANDRA-16226:
----------------------------------------
Test and Documentation Plan:
The patch includes a series of new tests in {{SSTablesIteratedTest}} that
verify expected numbers of SSTables read for several query types across compact
and non-compact tables. They should serve as reasonable documentation and
guardrails against further regression.
The official [docs on compact
storage|https://cassandra.apache.org/doc/latest/cql/appendices.html#appendix-c-dropping-compact-storage]
will need some rework as well, both to indicate that it will live on in 4.0,
and to take into account the concerns in this issue.
was:
The patch includes a new in-JVM upgrade test that exercises the 2.2 -> 3.0
upgrade path were an existing table uses COMPACT STORAGE and we hit the
particular scenario outlined in the description. Outside that, the interesting
parts of the regression suite are the tests around row deletions and cell
deletions where only a primary key remains.
Performance should also be improved significantly in this scenario, so a smoke
test to verify that would be helpful. (There should be zero impact on
non-compact tables.)
Status: Patch Available (was: In Progress)
I've posted (the 3.0 version of) an alternative solution here, which should
merge up cleanly, given {{TableMetadata#isCQLTable()}} is still present on
trunk after CASSANDRA-16217: https://github.com/apache/cassandra/pull/823
(CircleCI is having difficulty with the unit tests, but there is a partial run
[here|https://app.circleci.com/pipelines/github/maedhroz/cassandra/155/workflows/cc54c092-cac1-4144-b9be-577ea4c8a78e].)
My goals w/ this patch were the following:
1.) Maintain the empty row semantics of tables that have either always been
compact or always been non-compact (see previous comment ^), and preserve as
much of them as possible after dropping compact storage, given the historical
limitations of the compact format.
2.) Either improve/reduce or leave unchanged the number of SSTables read in all
cases for tables that have never been compact.
3.) Fix the regression around the number of SSTables read in the original
description of this issue for compact tables, and avoid having the regression
re-introduced when and if compact storage is dropped.
4.) Avoid creating a new SSTable format or burdening operators with running
{{upgradesstables}} again to make all of this above work.
Although there are some issues w/ the unit tests on CI, the relevant local
tests are looking clean, including {{operations.DeleteTest}} and
{{operations.UpgradeTest}}, which verify item #1 above, and the other points
are by and large satisfied. For point #2, there was actually a case where pure
non-compact tables appeared to be reading too many SSTables w/ updates present
(although that [needs to be
reviewed|https://github.com/apache/cassandra/pull/823/files?file-filters%5B%5D=.java#r524809272]).
One drawback of this approach is that when a compact table with existing
SSTables has compact storage dropped, those SSTables still do not contain
primary key liveness info, and therefore may continue to return zero rows
rather than rows w/ primary keys and null regular columns. (Data written after
the drop will, of course, have liveness info, and so mixed behavior is
possible.) I've discussed a few ideas w/ [~jwest] and [~ifesdjeen] around this,
and I'm not convinced any of them would be valuable. (Keep in mind that
applications written against compact storage tables will already be handling
this case, and good documentation will help even further.)
The first is having enough information, either via a sequence number or a new
piece of metadata, to identify whether an SSTable was created while a table was
still compact. Having this data might be useful in some other way, but it still
does not insert primary key liveness info into pre-drop SSTables. DELETE and
UPDATE operations never have primary key liveness information attached to them,
even for pure non-compact tables, so an insert pre-drop in one SSTable,
followed by a delete in a post-drop SSTable, will still translate to the
absence of a row rather than a row with null regular cells.
The second broad idea is to somehow insert primary key liveness information
into pre-drop SSTables, but this is problematic for at least a couple reasons.
If we make it a prerequisite for running DROP COMPACT STORAGE, it means
physically rewriting SSTables. (To be fair, this may not be a huge additional
burden for those still on 2.x, who would have to do this anyway.) The other
oddity is that there doesn't appear to be any existing procedure for
synthesizing primary key liveness info. In the non-compact world, it is carried
only by INSERT operations, not UPDATE or DELETE operations. It's not clear to
me how we would determine where to add it during a {{runsstableupgrade}} run,
particularly how we would know the difference between INSERT and UPDATE
starting with just a compact SSTable on disk.
To summarize, I think tables that are purely compact or non-compact should be
handled pretty well w/ this patch (and will strictly git all the goals above,
where applicable). The interesting case is the mixed compact/non-compact
SSTable set caused by dropping compact storage. In this case, I think it makes
reasonable trade-offs and avoids all the major pain points for operators.
Whatever direction we go, the official docs around compact storage are going to
need updating.
> COMPACT STORAGE SSTables created before 3.0 are not correctly skipped by
> timestamp due to missing primary key liveness info
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16226
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Local Write-Read Paths
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Labels: perfomance, upgrade
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> This was discovered while tracking down a spike in the number of SSTables
> per read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0,
> there is no direct analog of 3.0's primary key liveness info. When we upgrade
> 2.1 COMPACT STORAGE SSTables to the mf format, we simply don't write row
> timestamps, even if the original mutations were INSERTs. On read, when we
> look at SSTables in order from newest to oldest max timestamp, we expect to
> have this primary key liveness information to determine whether we can skip
> older SSTables after finding completely populated rows.
> ex. I have three SSTables in a COMPACT STORAGE table with max timestamps
> 1000, 2000, and 3000. There are many rows in a particular partition, making
> filtering on the min and max clustering effectively a no-op. All data is
> inserted, and there are no partial updates. A fully specified row with
> timestamp 2500 exists in the SSTable with a max timestamp of 3000. With a
> proper row timestamp in hand, we can easily ignore the SSTables w/ max
> timestamps of 1000 and 2000. Without it, we read 3 SSTables instead of 1,
> which likely means a significant performance regression.
> The following test illustrates this difference in behavior between 2.1 and
> 3.0:
> https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800
> A solution here might be as simple as having
> {{SinglePartitionReadCommand#canRemoveRow()}} only inspect primary key
> liveness information for non-compact/CQL tables. Tombstones seem to be
> handled at a level above that anyway. (One potential problem with that is
> whether or not the distinction will continue to exist in 4.0, and dropping
> compact storage from a table doesn't magically make pk liveness information
> appear.)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]