In September, the community chose to freeze trunk to begin working on Quality and Stability, with the goal of releasing the most stable Cassandra major release in the project’s history. While lots of work has been ongoing and folks could follow along with progress on JIRA, I thought it would be useful to cover what has been accomplished so far, since I’ve spent a good amount of time working with others on various testing projects.
During this time we have made significant progress on improving the Quality and Stability of Cassandra, not only for Cassandra 4.0 but also for the Cassandra 3.x series and future Cassandra releases. Additionally, testing has provided the opportunity for new community members and committers to contribute.

Although the count is not comprehensive, the community has found at least 25 bugs that can be classified as Data Loss, Corruption, Incorrect Response, Loss of Stability, Loss of Availability, Concurrency Issues, Performance Issues, or Lack of Safety. These bugs have been found by a variety of methodologies, including commonly used ones like unit testing and canary deployments. However, the majority of the bugs have been found or confirmed using new methodologies like the ones described in some recent blog posts [1] [2].

Additionally, the state of the test suites and test tooling has improved. CASSANDRA-14806 [3] brought some much-welcomed improvements to the CircleCI workflow and made it easier for people to run (d)tests on supported platforms (jdk8/11), and the work to get upgrade tests running found several bugs, including CASSANDRA-14958 [4].

While we have made significant progress, there is still more to do before we can be truly confident in a Cassandra 4.0 release. Some ongoing and outstanding work includes:

* Improving the state of the cqlsh tests [5]
* Ongoing discussion on the new MessagingService [6], which will require significant review and testing
* Additional upgrade testing for Cassandra 4.0, including additional support for upgrade testing using in-jvm dtests [7]
* Work to increase coverage of important areas and new features in Cassandra 4.0 [8]

While the list above may seem short, the last item contains a long list of important areas the community has previously discussed adding coverage to. If you are looking for areas to contribute, this is a great starting point.
If a name is listed next to an area you are interested in, I would encourage you to reach out to that person to discuss how you can help further increase the community’s confidence in the Quality and Stability of Cassandra.

Below is an incomplete list of many of the severe bugs found during this part of the release cycle. Thanks again to all of the community members who contributed to finding these bugs and improving Cassandra for everyone.

CASSANDRA-15004: Anti-compaction briefly removes sstables from the read path
CASSANDRA-14958: Counters fail to increment on 2.X to 3.X mixed version clusters
CASSANDRA-14936: Anticompaction should throw exceptions on errors, not just log them
CASSANDRA-14672: After deleting data in 3.11.3, reads fail: "open marker and close marker have different deletion times"
CASSANDRA-14912: LegacyLayout errors on collection tombstones from dropped columns
CASSANDRA-14843: Drop/add column name with different Kind can result in corruption
CASSANDRA-14568: CorruptSSTableExceptions in 3.0.17.1 (CASSANDRA-14568 v2) Static collection deletions are corrupted in 3.0 <-> 2.{1,2} messages
CASSANDRA-14749: Collection Deletions for Dropped Columns in 2.1/3.0 mixed-mode can delete rows
CASSANDRA-14568: Static collection deletions are corrupted in 3.0 -> 2.{1,2} messages
CASSANDRA-14861: Inaccurate sstable min/max metadata can cause data loss
CASSANDRA-14823: Legacy sstables with range tombstones spanning multiple index blocks create invalid bound sequences on 3.0+ (#1193)
CASSANDRA-14873: Missing rows when reading 2.1 SSTables in 3.0
CASSANDRA-14838: Dropped columns can cause reverse sstable iteration to return prematurely
CASSANDRA-14803: Rows that cross index block boundaries can cause incomplete reverse reads in some cases.
CASSANDRA-14766: DESC order reads can fail to return the last Unfiltered in the partition (#1170)
CASSANDRA-14991: SSL Cert Hot Reloading should defensively check for sanity of the new keystore/truststore before loading it
CASSANDRA-14794: Avoid calling iter.next() in a loop when notifying indexers about range tombstones
CASSANDRA-14780: Avoid creating empty compaction tasks after truncate
CASSANDRA-14657: Handle failures in upgradesstables/cleanup/relocate
CASSANDRA-14638: Column result order can change in 'SELECT *' results when upgrading from 2.1 to 3.0 causing response corruption for queries using prepared statements when static columns are used
CASSANDRA-14919: Regression in paging queries in mixed version clusters
CASSANDRA-14554: LifecycleTransaction encounters ConcurrentModificationException when used in multi-threaded context
CASSANDRA-14935: PendingAntiCompaction should be more judicious in the compactions it cancels
CASSANDRA-14894: RangeTombstoneList doesn't properly clean up mergeable or superseded rts in some cases
CASSANDRA-14824: Expand range tombstone validation checks to multiple interim request stages
CASSANDRA-14763: Fail incremental repair prepare phase if it encounters sstables from un-finalized sessions
CASSANDRA-14920: Some comparisons used for verifying paging queries in dtests only test the column names and not values

Jordan

[1] http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html
[2] http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html
[3] https://issues.apache.org/jira/browse/CASSANDRA-14806
[4] https://issues.apache.org/jira/browse/CASSANDRA-14958
[5] https://issues.apache.org/jira/browse/CASSANDRA-14951
[6] https://issues.apache.org/jira/browse/CASSANDRA-15066
[7] https://issues.apache.org/jira/browse/CASSANDRA-15078
[8] https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Quality%3A+Components+and+Test+Plans