Re: Stabilising Internode Messaging in 4.0

Dinesh Joshi Wed, 10 Apr 2019 13:55:33 -0700

> On Apr 10, 2019, at 9:06 AM, Sankalp Kohli <[email protected]> wrote:
> 
> I think we should wait for testing doc on confluence to come up and discuss 
> what all needs to be added there to gain confidence. 
> 
> If the work is more to split the patch into smaller parts and delays 4.0 even 
> more, can we use time in adding more test coverage, documentation and design 
> docs to this component?  Will that be a middle ground here ?


+1 for finding the right middle ground.

If splitting is an issue, we can look through opportunities to split this. Here 
are the benefits –

1. It makes verifying and testing the code a lot easier.
2. It also helps us pin-point all sorts of regressions (git bisect anybody?) 
precisely. 
3. It also gives us the opportunity to add the correct interfaces. More on this 
later.
4. It helps make the review quicker.

TBH, one of the big reasons we tend to drop such large PRs is the fact that 
Cassandra's code is highly intertwined and it makes it hard to precisely change 
things. We need to iterate towards interfaces that allow us to iterate quickly 
and reduce the amount of highly intertwined code. It helps with testing as 
well. I want us to have a meaningful discussion around it before we drop a big 
PR.

> Examples of large changes not going well is due to lack of testing, 
> documentation and design docs not because they were big. Being big adds to 
> the complexity and increased the total bug count but small changes can be 
> equally worse in terms of impact. 

+1 so many times, Sankalp and I recognize this as well. Thank you for calling 
it out.

I was working with some of the PMCs to develop a document with concrete 
guidelines to help us standardize and avoid issues like these. Due to various 
reasons, that document did not go anywhere. We had similar debate on the 
sidecar and we used CIP to help with it. It did help make the discussion more 
meaningful.

Lack of documentation as many people have called out makes it really hard for 
us to review things especially with such a large change.

Dinesh


> 
>> On Apr 10, 2019, at 8:53 AM, Jordan West <[email protected]> wrote:
>> 
>> There is a lot of discuss here so I’ll try to keep my opinions brief:
>> 
>> 1. The bug fixes are a requirement in order to have a stable 4.0. Whether
>> they come from this patch or the original I have less of an opinion. I do
>> think its important to minimize code changes at this time in the
>> development/freeze cycle — including large refactors which add risk despite
>> how they are being discussed here. From that perspective, I would prefer to
>> see more targeted fixes but since we don’t have them and we have this patch
>> the decision is different.
>> 
>> 2. We should not be measuring complexity in LoC with the exception that all
>> 20k lines *do need to be review* (not just the important parts and because
>> code refactoring tools have bugs too) and more lines take more time.
>> Otherwise, its a poor metric for how long this will take to review.
>> Further, it seems odd that the authors are projecting how long it will take
>> to review — this should be the charge of the reviewers and I would like to
>> hear from them on this. Its clear this a complex patch — as risky as
>> something like 8099 (and the original rewrite by Jason). We should treat it
>> as such and not try to merge it in quickly for the sake of it, repeating
>> past mistakes. The goal of reviewing the messaging service work was to do
>> just that. It would be a disservice to rush in these changes now. If the
>> goal is to get the patch in that should be the priority, not completing it
>> “in two weeks”. Its clear several community members have pushed back on
>> that and are not comfortable with the time frame.
>> 
>> 3. If we need to add special forks of Netty classes to the code because of
>> “how we use Netty” that is a concern to me w.r.t to quality, stability, and
>> maintenance. Is there a place that documents/justifies our non-traditional
>> usage (I saw some in JavaDocs but found it lacking in *why* but had a lot
>> of how/what was changed). Given folks in the community have decent
>> relationships with the Netty team perhaps we should leverage that as well.
>> Have we reached out to them?
>> 
>> 4. In principle, I agree with the technical improvements you mention
>> (backpressure / checksumming / etc). These things should be there. Are they
>> a hard requirement for 4.0? In my opinion no and Dinesh has done a good job
>> of providing some reasons as to why so I won’t go into much detail here. In
>> short, a bug and a missing safety mechanism are not the same thing. Its
>> also important to consider how many users a change like that covers and for
>> what risk — we found a bug in 13304 late in review, had it slipped through
>> it would have subjected users to silent corruption they thought couldn’t
>> occur anymore because we included the feature in a prod Cassandra release.
>> 
>> 5. The patch could seriously benefit from some commit hygiene that would
>> make it easier for folks to review. Had this been done not only would
>> review be easier but also the piecemeal breakup of features/fixes could
>> have been done more easily. I don’t buy the premise that this wasn’t
>> possible. If we had to add the feature/fix later it would have been
>> possible. I’m sure there was a smart way we could have organized it, if it
>> was a priority.
>> 
>> 6. Have any upgrade tests been done/added? I would also like to see some
>> performance benchmarks before merging so that the patch is in a similar
>> state as 14503 in terms of performance testing.
>> 
>> I’m sure I have more thoughts but these seem like the important ones for
>> now.
>> 
>> Jordan
>> 
>> On Wed, Apr 10, 2019 at 8:21 AM Dinesh Joshi <[email protected]>
>> wrote:
>> 
>>> Here's are my 2¢.
>>> 
>>> I think the general direction of this work is valuable but I have a few
>>> concerns I’d like to address. More inline.
>>> 
>>>> On Apr 4, 2019, at 1:13 PM, Aleksey Yeschenko <[email protected]>
>>> wrote:
>>>> 
>>>> I would like to propose CASSANDRA-15066 [1] - an important set of bug
>>> fixes
>>>> and stability improvements to internode messaging code that Benedict, I,
>>>> and others have been working on for the past couple of months.
>>>> 
>>>> First, some context.   This work started off as a review of
>>> CASSANDRA-14503
>>>> (Internode connection management is race-prone [2]), CASSANDRA-13630
>>>> (Support large internode messages with netty) [3], and a pre-4.0
>>>> confirmatory review of such a major new feature.
>>>> 
>>>> However, as we dug in, we realized this was insufficient. With more than
>>> 50
>>>> bugs uncovered [4] - dozens of them critical to correctness and/or
>>>> stability of the system - a substantial rework was necessary to
>>> guarantee a
>>>> solid internode messaging subsystem for the 4.0 release.
>>>> 
>>>> In addition to addressing all of the uncovered bugs [4] that were unique
>>> to
>>>> trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some
>>>> long-existing, pre-4.0 bugs and stability issues. For the complete list
>>> of
>>>> notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like
>>>> to highlight a few.
>>> 
>>> Do you have regression tests that will fail if these bugs are reintroduced
>>> at a later point?
>>> 
>>>> # Lack of message integrity checks
>>>> 
>>>> It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be
>>>> relied upon [6] for integrity. With sufficient scale or time, you will
>>> hit
>>>> bit flips. Sadly, most of the time these go undetected.  Cassandra’s
>>>> replication model makes this issue much more serious, as the faulty data
>>>> can infect the cluster.
>>>> 
>>>> We recognised this problem, and recently introduced a fix for
>>> server-client
>>>> messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the
>>>> native protocol) [7].
>>> 
>>> This was discussed in my review and Jason created a ticket[1] to track it.
>>> We explicitly decided to defer this work not only due to the feature freeze
>>> in the community but also for technical reasons detailed below.
>>> 
>>> Regarding new features during the feature freeze window, we have had such
>>> discussions in the past. The most recent being the one I initiated on Zstd
>>> Compressor which went positively and we have moved forward after assessing
>>> risk & community consensus.
>>> 
>>> Regarding checksumming, please scroll down to the comments section in the
>>> link[2] you provided. You'll notice this discussion –
>>> 
>>>>> Daniel Fox Franke:
>>>>> Please don't design new network protocols that don't either run over
>>> TLS or do some other kind of cryptographic authentication. If you have
>>> cryptographic authentication, then CRC is redundant.
>>>> 
>>>> Evan Jones:
>>>> Good point. Even internal applications inside a data center should be
>>> using encryption, and today the performance impact is probably small (I
>>> haven't actually measured it myself these days).
>>> 
>>> Enabling TLS & internode compression are mitigation strategies to avoid
>>> data corruption in transit. By your own admission in CASSANDRA-13304[3], we
>>> don't require checksumming if TLS is present. Here's your full quote –
>>> 
>>>> Aleksey Yeschenko:
>>>> 
>>> 
>>>> Checksums and TLS are orthogonal. It just so happens that you don't need
>>> the former if you already have the latter.
>>> 
>>> I want to be fair, later you did say that we don't want to force people to
>>> pay the cost of TLS overhead. However, I would also like to point out that
>>> with introduction of Netty for internode communication, we have 4-5x the
>>> TLS performance thanks to OpenSSL JNI bindings. You can refer to Norman or
>>> my talks on the topic. So TLS is practical & compression is necessary for
>>> performance. Both strategies work fine to protect against data corruption
>>> making checksumming redundant. With SSL certificate hot reloading, it also
>>> avoids unnecessary cluster restarts providing maximum availability.
>>> 
>>> In the same vein, it's 2019 and if people are not using TLS for internode,
>>> then it is really really bad for data security in our industry and we
>>> should not be encouraging it. In fact, I would go so far as to make TLS as
>>> the default.
>>> 
>>> Managing TLS infrastructure is beyond Cassandra's scope and operators
>>> should figure it out by now for their & their user's sake. Cassandra makes
>>> it super easy & performant to have TLS enabled. People should be using it.
>>> 
>>>> But until CASSANDRA-15066 [1] lands, this is also a critical flaw
>>>> internode. We have addressed it by ensuring that no matter what, whether
>>>> you use SSL or not, whether you use internode compression or not, a
>>>> protocol level CRC is always present, for every message frame. It’s our
>>>> deep and sincere belief that shipping a new implementation of the
>>> messaging
>>>> protocol without application-level data integrity checks would be
>>>> unacceptable for a modern database.
>>> 
>>> My previous technical arguments have provided enough evidence that
>>> protocol level checksumming is not a show stopper.
>>> 
>>> The only reason I see for adding checksums in the protocol is when some
>>> user doesn't want to enable TLS and internode compression. As it stands,
>>> from your comments[7] it seems to be mandatory and adds unnecessary
>>> overhead when TLS and/or Compression is enabled. Frankly I don't think we
>>> need to risk destabilizing trunk for these use-cases. I want to reiterate
>>> that I believe in doing the right thing but we have to make acceptable
>>> tradeoffs – as a community.
>>> 
>>>> # Lack of back-pressure and memory usage constraints
>>>> 
>>>> As it stands today, it’s far too easy for a single slow node to become
>>>> overwhelmed by messages from its peers.  Conversely, multiple
>>> coordinators
>>>> can be made unstable by the backlog of messages to deliver to just one
>>>> struggling node.
>>> 
>>> This is a known issue and it could have been addressed a separate bug fix
>>> – one that could be independently verified.
>>> 
>>>> To address this problem, we have introduced strict memory usage
>>> constraints
>>>> that apply TCP-level back-pressure, on both outbound and inbound.  It is
>>>> now impossible for a node to be swamped on inbound, and on outbound it is
>>>> made significantly harder to overcommit resources.  It’s a simple,
>>> reliable
>>>> mechanism that drastically improves cluster stability under load, and
>>>> especially overload.
>>>> 
>>>> Cassandra is a mature system, and introducing an entirely new messaging
>>>> implementation without resolving this fundamental stability issue is
>>>> difficult to justify in our view.
>>> 
>>> This is great in theory. I would really like to see objective measurements
>>> like Chris did in CASSANDRA-14654[4]. Netflix engineers tested the Netty
>>> refactor with a 200 node cluster[5], Zero Copy Streaming[6] and reported
>>> their results. It's invaluable work. It would be great to see something
>>> similar.
>>> 
>>>> # State of the patch, feature freeze and 4.0 timeline concerns
>>>> 
>>>> The patch is essentially complete, with much improved unit tests all
>>>> passing, dtests green, and extensive fuzz testing underway - with initial
>>>> results all positive.  We intend to further improve in-code documentation
>>>> and test coverage in the next week or two, and do some minor additional
>>>> code review, but we believe it will be basically good to commit in a
>>> couple
>>>> weeks.
>>> 
>>> A 20K LoC patch is unverifiable especially without much documentation. It
>>> places undue burden on reviewers. It also messes up everyone's branches
>>> once you commit such a large refactor. Let's be considerate to others in
>>> the community. It is a good engineering practice to limit patches to a size
>>> that is reasonable.
>>> 
>>> More importantly such large refactors lend themselves to new bugs that
>>> cannot be caught easily unless you have a very strong regression test suite
>>> which Cassandra arguably lacks. Therefore I am of the opinion that the bug
>>> fixes can be applied piecemeal into the codebase. They should be small
>>> enough that can be individually reviewed and independently verified.
>>> 
>>> I also noticed that you have replaced Netty's classes[7]. I am of the
>>> opinion that they should be upstreamed if they're better so the wider Netty
>>> community benefits from it and we don't have to maintain our own classes.
>>> 
>>>> P.S. I believe that once it’s committed, we should cut our first 4.0
>>> alpha.
>>>> It’s about time we started this train (:
>>> 
>>> +1 on working towards an alpha but that is a separate discussion from this
>>> issue.
>>> 
>>> Thanks,
>>> 
>>> Dinesh
>>> 
>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-14578
>>> [2] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
>>> [3]
>>> https://issues.apache.org/jira/browse/CASSANDRA-13304?focusedCommentId=16183034&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16183034
>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-14654
>>> [5] https://issues.apache.org/jira/browse/CASSANDRA-14747
>>> [6] https://issues.apache.org/jira/browse/CASSANDRA-14765
>>> [7]
>>> https://issues.apache.org/jira/browse/CASSANDRA-15066?focusedCommentId=16801277&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16801277
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
>>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Stabilising Internode Messaging in 4.0

Reply via email to