I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes
and stability improvements to internode messaging code that Benedict, I,
and others have been working on for the past couple of months.

First, some context.   This work started off as a review of CASSANDRA-14503
(Internode connection management is race-prone [2]), CASSANDRA-13630
(Support large internode messages with netty) [3], and a pre-4.0
confirmatory review of such a major new feature.

However, as we dug in, we realized this was insufficient. With more than 50
bugs uncovered [4] - dozens of them critical to correctness and/or
stability of the system - a substantial rework was necessary to guarantee a
solid internode messaging subsystem for the 4.0 release.

In addition to addressing all of the uncovered bugs [4] that were unique to
trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some
long-existing, pre-4.0 bugs and stability issues. For the complete list of
notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like
to highlight a few.

# Lack of message integrity checks

It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be
relied upon [6] for integrity. With sufficient scale or time, you will hit
bit flips. Sadly, most of the time these go undetected.  Cassandra’s
replication model makes this issue much more serious, as the faulty data
can infect the cluster.

We recognised this problem, and recently introduced a fix for server-client
messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the
native protocol) [7].

But until CASSANDRA-15066 [1] lands, this is also a critical flaw
internode. We have addressed it by ensuring that no matter what, whether
you use SSL or not, whether you use internode compression or not, a
protocol level CRC is always present, for every message frame. It’s our
deep and sincere belief that shipping a new implementation of the messaging
protocol without application-level data integrity checks would be
unacceptable for a modern database.

# Lack of back-pressure and memory usage constraints

As it stands today, it’s far too easy for a single slow node to become
overwhelmed by messages from its peers.  Conversely, multiple coordinators
can be made unstable by the backlog of messages to deliver to just one
struggling node.

To address this problem, we have introduced strict memory usage constraints
that apply TCP-level back-pressure, on both outbound and inbound.  It is
now impossible for a node to be swamped on inbound, and on outbound it is
made significantly harder to overcommit resources.  It’s a simple, reliable
mechanism that drastically improves cluster stability under load, and
especially overload.

Cassandra is a mature system, and introducing an entirely new messaging
implementation without resolving this fundamental stability issue is
difficult to justify in our view.

# State of the patch, feature freeze and 4.0 timeline concerns

The patch is essentially complete, with much improved unit tests all
passing, dtests green, and extensive fuzz testing underway - with initial
results all positive.  We intend to further improve in-code documentation
and test coverage in the next week or two, and do some minor additional
code review, but we believe it will be basically good to commit in a couple
weeks.

The patch could also use some performance testing. We are hoping that our
colleagues at Netflix and new Cassandra committers Joey and Vinay will help
us with this aspect.  However, this work needs to be done regardless, so
provides no significant delay.

I would classify absolutely most of the work done in this patch as
necessary bug fixes and stability improvements - in line with the stated
goals of the feature freeze.

The only clear “feature” introduced is the expanded metrics, and virtual
tables to observe them.  If the freeze is to be strictly interpreted these
can be removed, but we think this would be unwise.

We hope that the community will appreciate and welcome these changes.
We’ve worked hard to deliver this promptly, to minimise delays to 4.0 were
these issues to be addressed piecemeal, and we sincerely believe this work
will have a positive impact on stability and performance of Cassandra
clusters for years to come.

P.S. I believe that once it’s committed, we should cut our first 4.0 alpha.
It’s about time we started this train (:

[1] https://issues.apache.org/jira/browse/CASSANDRA-15066
[2] https://issues.apache.org/jira/browse/CASSANDRA-14503
[3] https://issues.apache.org/jira/browse/CASSANDRA-13630
[4] https://gist.github.com/belliottsmith/0d12df678d8e9ab06776e29116d56b91
(incomplete list)
[5] https://www.evanjones.ca/tcp-checksums.html
[6] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
[7] https://issues.apache.org/jira/browse/CASSANDRA-13304

Reply via email to