[ https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163372#comment-17163372 ]

Alex Petrov commented on CASSANDRA-15299:
-----------------------------------------

Thanks Olivier, you're right, this one is fixed. I was using a different 
branch to hunt down the leaks and never switched back.

I did some basic performance tests, and it looks like the new v5 always 
outperforms the old v4 and v5, and performs either on par with the new v4, or 
better than it, in most scenarios. The only scenarios I was able to produce 
where v4 seems to perform better were with a single connected client: there, 
with a small number of concurrent async operations, v4 was performing better. 
However, this might not be such an important scenario. In general, the more we 
can use batching, the better the results seem to be. At the same time, 
checksumming doesn't seem to negatively affect V5 performance.

This is a very basic benchmark: the server and client run on the same machine, 
so there's definitely some interference. What might be good about this test is 
that the Cassandra read/write path is mocked out to avoid further interference.
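
Each scenario below reports variance, median, and tail percentiles over the 
collected samples. Purely to illustrate how such a summary is computed (this 
is not the actual harness code, and all names here are made up), the 
aggregation looks roughly like:
{code:java}
import java.util.Arrays;

// Illustrative sketch only: turns raw per-operation samples into the
// statistics reported in the scenarios below (variance, median, 90/95/99p).
public final class PercentileSummary
{
    public static void summarize(double[] samples)
    {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);

        double mean = Arrays.stream(sorted).average().orElse(0);
        double variance = Arrays.stream(sorted)
                                .map(s -> (s - mean) * (s - mean))
                                .average().orElse(0);

        System.out.println("Variance: " + variance);
        System.out.println("Median:   " + percentile(sorted, 0.50));
        System.out.println("90p:      " + percentile(sorted, 0.90));
        System.out.println("95p:      " + percentile(sorted, 0.95));
        System.out.println("99p:      " + percentile(sorted, 0.99));
    }

    // Nearest-rank percentile over an already sorted array
    private static double percentile(double[] sorted, double p)
    {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }
}
{code}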

Scenario 1: requestSize = 1197, responseSize = 234, 10 connected clients, 1 
async operation per batch:
{code:java}
V5
Variance: 135333.13542032443
Median:   155.0
90p:      159.0
95p:      160.0
99p:      160.0

V4
Variance: 107837.257938666
Median:   153.0
90p:      158.0
95p:      158.0
99p:      158.0
{code}
Scenario 2: requestSize = 1197, responseSize = 234, 10 connected clients, 5 
async operations per batch:
{code:java}
V5
Variance: 2.3154951046883754E7
Median:   183.0
90p:      190.0
95p:      191.0
99p:      192.0

V4
Variance: 9145882.574275833
Median:   196.0
90p:      203.0
95p:      204.0
99p:      204.0
{code}
Scenario 3: requestSize = 1197, responseSize = 234, 10 connected clients, 10 
async operations per batch:
{code:java}
V5
Variance: 6.080661284308405E7
Median:   182.0
90p:      190.0
95p:      191.0
99p:      191.0

V4
Variance: 7.45228262686024E7
Median:   218.0
90p:      229.0
95p:      230.0
99p:      231.0
{code}
Scenario 4: requestSize = 1515743, responseSize = 415145, 10 connected clients, 
1 async operation per batch:
{code:java}
V5
Variance: 3.809988628626683E7
Median:   2442.0
90p:      2482.0
95p:      2487.0
99p:      2489.2995

V4
Variance: 8.865722070719789E7
Median:   3083.0
90p:      3129.0
95p:      3134.0
99p:      3136.0
{code}
Here we see an even larger boost from V5, which is great, and I think this is 
exactly what was aimed for. I haven't tested multiple large ops per batch, but 
I also don't see many use cases for such a scenario.

One of the things we probably want to improve is avoiding the copy from 
{{Frame}} to outgoing buffers. We could use something in the spirit of the 
{{SegmentBuilder}} that the driver uses, and simply write frames directly into 
the outgoing buffer. It looks like this may also simplify memory management, 
since we'll have one less thing to worry about. A similar improvement can be 
made on the incoming path, but there it might be trickier, since we're reusing 
code from the {{net}} package. It looks like reusing code from {{net}} has 
actually made some things a bit harder to follow.

Speaking of {{Segment}} in the driver, we probably want to use naming that is 
consistent with the server.
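
To make the copy-avoidance idea above a bit more concrete, here is a rough 
sketch of the direction I have in mind. Everything in it is hypothetical 
({{OutboundSegmentBuilder}}, {{encodedSize}}, {{encodeInto}}, the payload 
limit); the point is just that messages get encoded straight into the outgoing 
payload buffer rather than into an intermediate {{Frame}} that we then copy:
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.channel.Channel;

// Hypothetical sketch: accumulate encoded messages directly in the outgoing
// payload buffer (in the spirit of the driver's SegmentBuilder), flushing a
// segment whenever the payload limit would be exceeded.
final class OutboundSegmentBuilder
{
    private static final int MAX_PAYLOAD_SIZE = 128 * 1024; // illustrative limit

    private final Channel channel;
    private final ByteBufAllocator allocator;
    private ByteBuf payload;

    OutboundSegmentBuilder(Channel channel)
    {
        this.channel = channel;
        this.allocator = channel.alloc();
        this.payload = allocator.directBuffer(MAX_PAYLOAD_SIZE);
    }

    // encodedSize/encodeInto stand in for the serialization we already have;
    // note that no intermediate per-message buffer is allocated or copied.
    void append(EncodableMessage message)
    {
        if (payload.readableBytes() + message.encodedSize() > MAX_PAYLOAD_SIZE)
            flush();
        message.encodeInto(payload);
    }

    void flush()
    {
        if (payload.readableBytes() == 0)
            return;
        // A downstream handler is assumed to prepend the segment header/CRC.
        channel.writeAndFlush(payload);
        payload = allocator.directBuffer(MAX_PAYLOAD_SIZE);
    }

    interface EncodableMessage
    {
        int encodedSize();
        void encodeInto(ByteBuf out);
    }
}
{code}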

One other comment I wanted to leave is on {{FlushItem}} and releasing things. 
Since a {{FlushItem}} generally has two parts, the incoming message (the one 
we're responding to) and the response, it might make sense to release them 
together. I haven't attempted this in my patch, since I thought it might be 
better to combine this change with the copy-avoiding suggestion mentioned 
above. But in general, knowing that the two logical parts are released 
together might make it all easier to follow.
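
Just to illustrate the shape of that (a hypothetical sketch, not how 
{{FlushItem}} is written today): the request and its response share one 
lifetime, so a single release point covers both:
{code:java}
import io.netty.util.ReferenceCountUtil;

// Hypothetical sketch of the "release both parts together" idea: the item
// owns the inbound message it is responding to and the outbound response, and
// both are released in one place once the flush has completed (or failed).
final class RequestResponsePair
{
    private final Object request;   // e.g. the decoded inbound frame/message
    private final Object response;  // e.g. the encoded outbound buffer

    RequestResponsePair(Object request, Object response)
    {
        this.request = request;
        this.response = response;
    }

    Object response()
    {
        return response;
    }

    // Single release point for both logical parts, so their lifetimes cannot
    // drift apart.
    void release()
    {
        ReferenceCountUtil.release(request);
        ReferenceCountUtil.release(response);
    }
}
{code}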

That said, a simple measurement has shown that this copying does not dominate 
the picture. In fact, simple profiling hasn't shown any obvious hotspots. We 
can (and probably should) open a separate ticket aimed at performance 
improvements around the native protocol.

Some things that might be important to check, but that I unfortunately didn't 
have enough time to test:
 * rapid and frequent disconnects and reconnects 
 * sending and receiving messages other than {{QueryMessage}} (such as Options, etc.)
 * backpressure on overload
 * behaviour in the presence of delays, disconnects, and corruption (we also 
have a netty Proxy for that)

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Aleksey Yeschenko
>            Assignee: Alex Petrov
>            Priority: Normal
>              Labels: protocolv5
>             Fix For: 4.0-alpha
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> an explanation of the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value, as it'd 
> be larger than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.


