[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099833#comment-17099833
 ] 

Sam Tunnicliffe commented on CASSANDRA-15299:
---------------------------------------------

Hi [~omichallat] , thanks for the feedback this is really useful.
{quote}You're hinting at renaming at a later point. I think that will be most 
welcome, the names as they are now are pretty confusing. And IMHO you should 
rename the outer container: even though "frame" is better suited for it, it's 
more important to preserve familiarity for people coming from the legacy code.
{quote}
I was absolutely planning on renaming, I just imagined that there might be 
several opinions on this, so was holding off to give the discussion a chance. 
 My personal preference would be to rename the inner container; I totally 
appreciate your point about maintaining continuity for people coming from the 
legacy code, but there a few points that sway me in the opposite direction:

1) Internal consistency and compatibility. The proposal here is to reuse both 
the concepts and, where possible, the implementation from the internode su 
bsystem. 
 2) At some point legacy is legacy and we ought to pick the right names for 
things and concepts rather than be forever bound by previous decisions. This 
can definitely be painful, but is probably necessary at times. c.f. the use of 
{{Row}} in 2.2 and earlier and how that's changed wrt to {{Partition}} and CQL 
rows with 3.0.
 3) I wonder if it's actually more important to aim for clarity in naming in 
order to better accomodate people without pre-existing familiarity with the 
codebase. Those of us who have been working on the code for some time already 
have a mental model and for sure, re-mapping the terms therein is annoying, 
somewhat detrimental to productivity and in some cases, potentially dangerous 
(though I don't think that's such a concern here). For those coming to the code 
for the first time though, abiguous and/or outdated naming surely has an even 
greater impact in terms of aggravating the learning curve.
{quote}From a client point of view, a dropped frame will result in request 
timeouts. We have no way of providing a better error, since the stream ids of 
the failed requests are in the corrupt payload. I'm wondering if it might not 
be better to drop the connection all the time: at least the client gets 
immediate feedback (we could try to propagate a cause), instead of a bunch of 
requests timing out for no apparent reason.
{quote}
Yes, I agree. This is the major difference between the internode and 
client/server messaging schemes. Failing hard and fast whenever a corrupt frame 
is encountered seems the right thing to do at the moment. I suppose it might be 
possible to track the stream ids for a given message on the client side and 
have the server return an error, rather than closing the connection. But I'm 
not sure just how feasible that is on the client side.
{quote}When two large messages are sent in a row, how will you tell the last 
outer frame of the first message from the first outer frame of the second? 
Don't you need some sort of is_last_frame flag in the header?
{quote}
The first of outer frame of a large message contains the inner frame header, 
which specifies the length in bytes of the CQL message body. The processor 
accumulates outer frames until that message has been completely received. The 
single CQL message per large message constraint is required for this to work.
{quote}When compression is enabled, how are you going to test when the payload 
reaches the max size?...all the messages in the payload are compressed 
together, so you can't check the size as you go like you do for the 
uncompressed version
{quote}
That's correct, we keep accumulating messages into the payload until the max 
*uncompressed* size is reached, or we run out of messages. So there is some 
loss of density when using compression, but this is an implementation detail 
and could be improved later without requiring further changes to the protocol.

> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15299
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Aleksey Yeschenko
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>              Labels: protocolv5
>             Fix For: 4.0-beta
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to