In September, the community chose to freeze trunk to begin working on
Quality and Stability with the goal of releasing the most stable Cassandra
major in the project’s history. While lots of work has been ongoing and
folks could follow along with progress on JIRA, I thought it would be useful
to cover what has been accomplished so far, since I’ve spent a good amount
of time working with others on various testing projects.

During this time we have made significant progress on improving the Quality
and Stability of Cassandra — not only Cassandra 4.0 but also the Cassandra
3.x series and future Cassandra releases. Additionally, testing has
provided the opportunity for new community members and committers to
contribute. While the count is not comprehensive, the community has found at
least 25 bugs that can be classified as Data Loss, Corruption, Incorrect
Response, Loss of Stability, Loss of Availability, Concurrency Issues,
Performance Issues, or Lack of Safety. These bugs have been found by a
variety of methodologies including commonly used ones like unit testing and
canary deployments. However, the majority of the bugs have been found or
confirmed using new methodologies like the ones described in some recent
blog posts [1] [2].

Additionally, the state of the test suites and test tooling has improved.
CASSANDRA-14806 [3] brought some much-welcomed improvements to the CircleCI
workflow and made it easier for people to run (d)tests on supported
platforms (jdk8/11). The work to get upgrade tests running also found
several bugs, including CASSANDRA-14958 [4].

While we have made significant progress, there is still more to do before we
can be truly confident in a Cassandra 4.0 release. Some ongoing and
outstanding work includes:

* Improving the state of the cqlsh tests [5]
* There is ongoing discussion on the new MessagingService [6], which will
require significant review and testing
* Additional upgrade testing for Cassandra 4.0 including additional support
for upgrade testing using in-jvm dtests [7]
* Work to increase coverage of important areas and new features in
Cassandra 4.0 [8]

While the list above may seem short, the last item contains a long list of
important areas the community has previously discussed adding coverage to.
If you are looking for areas to contribute to, this is a great starting
point. If there is a name down on an area you are interested in, I would
encourage you to reach out to that person to discuss how you can help further
increase the community’s confidence in the Quality and Stability of Cassandra.

Below is an incomplete list of many of the severe bugs found during this
part of the release cycle. Thanks again to all of the community members who
contributed to finding these bugs and improving Cassandra for everyone.

CASSANDRA-15004: Anti-compaction briefly removes sstables from the read path
CASSANDRA-14958: Counters fail to increment on 2.X to 3.X mixed version
clusters
CASSANDRA-14936: Anticompaction should throw exceptions on errors, not just
log them
CASSANDRA-14672: After deleting data in 3.11.3, reads fail: "open marker
and close marker have different deletion times"
CASSANDRA-14912: LegacyLayout errors on collection tombstones from dropped
columns
CASSANDRA-14843: Drop/add column name with different Kind can result in
corruption
CASSANDRA-14568: CorruptSSTableExceptions in 3.0.17.1 (CASSANDRA-14568 v2)
Static collection deletions are corrupted in 3.0 <-> 2.{1,2} messages
CASSANDRA-14749: Collection Deletions for Dropped Columns in 2.1/3.0
mixed-mode can delete rows
CASSANDRA-14568: Static collection deletions are corrupted in 3.0 ->
2.{1,2} messages
CASSANDRA-14861: Inaccurate sstable min/max metadata can cause data loss
CASSANDRA-14823: Legacy sstables with range tombstones spanning multiple
index blocks create invalid bound sequences on 3.0+ (#1193)
CASSANDRA-14873: Missing rows when reading 2.1 SSTables in 3.0
CASSANDRA-14838: Dropped columns can cause reverse sstable iteration to
return prematurely
CASSANDRA-14803: Rows that cross index block boundaries can cause
incomplete reverse reads in some cases.
CASSANDRA-14766: DESC order reads can fail to return the last Unfiltered in
the partition (#1170)
CASSANDRA-14991: SSL Cert Hot Reloading should defensively check for sanity
of the new keystore/truststore before loading it
CASSANDRA-14794: Avoid calling iter.next() in a loop when notifying
indexers about range tombstones
CASSANDRA-14780: Avoid creating empty compaction tasks after truncate
CASSANDRA-14657: Handle failures in upgradesstables/cleanup/relocate
CASSANDRA-14638: Column result order can change in 'SELECT *' results when
upgrading from 2.1 to 3.0 causing response corruption for queries using
prepared statements when static columns are used
CASSANDRA-14919: Regression in paging queries in mixed version clusters
CASSANDRA-14554: LifecycleTransaction encounters
ConcurrentModificationException when used in multi-threaded context
CASSANDRA-14935: PendingAntiCompaction should be more judicious in the
compactions it cancels
CASSANDRA-14894: RangeTombstoneList doesn't properly clean up mergeable or
superseded rts in some cases
CASSANDRA-14824: Expand range tombstone validation checks to multiple
interim request stages
CASSANDRA-14763: Fail incremental repair prepare phase if it encounters
sstables from un-finalized sessions
CASSANDRA-14920: Some comparisons used for verifying paging queries in
dtests only test the column names and not values

Jordan

[1]
http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html
[2]
http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html
[3] https://issues.apache.org/jira/browse/CASSANDRA-14806
[4] https://issues.apache.org/jira/browse/CASSANDRA-14958
[5] https://issues.apache.org/jira/browse/CASSANDRA-14951
[6] https://issues.apache.org/jira/browse/CASSANDRA-15066
[7] https://issues.apache.org/jira/browse/CASSANDRA-15078
[8]
https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Quality%3A+Components+and+Test+Plans
