Re: Stabilising Internode Messaging in 4.0

Jon Haddad Thu, 04 Apr 2019 17:56:31 -0700

Given the number of issues that are addressed, I definitely think it's
worth strongly considering merging this in.  I think it might be a
little unrealistic to cut the first alpha after the merge though.
Being realistic, any 20K+ LOC change is going to introduce its own
bugs, and we should be honest with ourselves about that.  It seems
likely the issues the patch addressed would have affected the 4.0
release in some form *anyways* so the question might be do we fix them
now or after someone's cluster burns down because there's no inbound /
outbound message load shedding.


Giving it a quick code review and going through the JIRA comments
(well written, thanks guys) there seem to be some pretty important bug
fixes in here as well as paying off a bit of technical debt.

Jon

On Thu, Apr 4, 2019 at 1:37 PM Pavel Yaskevich <xe...@apache.org> wrote:
>
> Great to see such a significant progress made in the area!
>
> On Thu, Apr 4, 2019 at 1:13 PM Aleksey Yeschenko <alek...@apache.org> wrote:
>
> > I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes
> > and stability improvements to internode messaging code that Benedict, I,
> > and others have been working on for the past couple of months.
> >
> > First, some context.   This work started off as a review of CASSANDRA-14503
> > (Internode connection management is race-prone [2]), CASSANDRA-13630
> > (Support large internode messages with netty) [3], and a pre-4.0
> > confirmatory review of such a major new feature.
> >
> > However, as we dug in, we realized this was insufficient. With more than 50
> > bugs uncovered [4] - dozens of them critical to correctness and/or
> > stability of the system - a substantial rework was necessary to guarantee a
> > solid internode messaging subsystem for the 4.0 release.
> >
> > In addition to addressing all of the uncovered bugs [4] that were unique to
> > trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some
> > long-existing, pre-4.0 bugs and stability issues. For the complete list of
> > notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like
> > to highlight a few.
> >
> > # Lack of message integrity checks
> >
> > It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be
> > relied upon [6] for integrity. With sufficient scale or time, you will hit
> > bit flips. Sadly, most of the time these go undetected.  Cassandra’s
> > replication model makes this issue much more serious, as the faulty data
> > can infect the cluster.
> >
> > We recognised this problem, and recently introduced a fix for server-client
> > messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the
> > native protocol) [7].
> >
> > But until CASSANDRA-15066 [1] lands, this is also a critical flaw
> > internode. We have addressed it by ensuring that no matter what, whether
> > you use SSL or not, whether you use internode compression or not, a
> > protocol level CRC is always present, for every message frame. It’s our
> > deep and sincere belief that shipping a new implementation of the messaging
> > protocol without application-level data integrity checks would be
> > unacceptable for a modern database.
> >
>
> I'm all for introducing more correctness checks at all levels especially in
> communication.
> Having dealt with multiple data corruption bugs that could have been easily
> prevented by
> having a checksum, it's great to see that we are moving in this direction.
>
>
> > # Lack of back-pressure and memory usage constraints
> >
> > As it stands today, it’s far too easy for a single slow node to become
> > overwhelmed by messages from its peers.  Conversely, multiple coordinators
> > can be made unstable by the backlog of messages to deliver to just one
> > struggling node.
> >
> > To address this problem, we have introduced strict memory usage constraints
> > that apply TCP-level back-pressure, on both outbound and inbound.  It is
> > now impossible for a node to be swamped on inbound, and on outbound it is
> > made significantly harder to overcommit resources.  It’s a simple, reliable
> > mechanism that drastically improves cluster stability under load, and
> > especially overload.
> >
> > Cassandra is a mature system, and introducing an entirely new messaging
> > implementation without resolving this fundamental stability issue is
> > difficult to justify in our view.
> >
>
> I'd say that this is required to be able to ship 4.0 as a release focused
> on stability.
> I personally have been waiting for this to happen for years. Significant
> step forward in our QoS story.
>
>
> >
> > # State of the patch, feature freeze and 4.0 timeline concerns
> >
> > The patch is essentially complete, with much improved unit tests all
> > passing, dtests green, and extensive fuzz testing underway - with initial
> > results all positive.  We intend to further improve in-code documentation
> > and test coverage in the next week or two, and do some minor additional
> > code review, but we believe it will be basically good to commit in a couple
> > weeks.
> >
> > The patch could also use some performance testing. We are hoping that our
> > colleagues at Netflix and new Cassandra committers Joey and Vinay will help
> > us with this aspect.  However, this work needs to be done regardless, so
> > provides no significant delay.
> >
> > I would classify absolutely most of the work done in this patch as
> > necessary bug fixes and stability improvements - in line with the stated
> > goals of the feature freeze.
> >
> > The only clear “feature” introduced is the expanded metrics, and virtual
> > tables to observe them.  If the freeze is to be strictly interpreted these
> > can be removed, but we think this would be unwise.
> >
> > We hope that the community will appreciate and welcome these changes.
> > We’ve worked hard to deliver this promptly, to minimise delays to 4.0 were
> > these issues to be addressed piecemeal, and we sincerely believe this work
> > will have a positive impact on stability and performance of Cassandra
> > clusters for years to come.
> >
> > P.S. I believe that once it’s committed, we should cut our first 4.0 alpha.
> > It’s about time we started this train (:
> >
> > [1] https://issues.apache.org/jira/browse/CASSANDRA-15066
> > [2] https://issues.apache.org/jira/browse/CASSANDRA-14503
> > [3] https://issues.apache.org/jira/browse/CASSANDRA-13630
> > [4] https://gist.github.com/belliottsmith/0d12df678d8e9ab06776e29116d56b91
> > (incomplete list)
> > [5] https://www.evanjones.ca/tcp-checksums.html
> > [6] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
> > [7] https://issues.apache.org/jira/browse/CASSANDRA-13304
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org

Re: Stabilising Internode Messaging in 4.0

Reply via email to