[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Adam Holmberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231545#comment-17231545
 ] 

Adam Holmberg edited comment on CASSANDRA-15299 at 11/13/20, 2:57 PM:
--

{quote}This is the thing I'm unsure about - the python driver is not an asf 
project (yet), so it's not really up to me/us whether we update that branch ... 
TBH, I don't recall the full reasoning for using the cassandra-test branch in 
the first place ...{quote}

The {{cassandra-test}} branch was created explicitly to allow server-side 
testing of merged client-impacting features, independent of driver releases. It 
was born of a time when there were multiple client-impacting changes in flight 
so a single commit would not suffice. Normally we would coordinate a merge into 
that driver branch with a PR for the server, so CI could be tested with it.

Updating the branch should not be an issue. I or [~aboudreault] can facilitate.



> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
> Attachments: Process CQL Frame.png, V5 Flow Chart.png
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.
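The CRC32 byte-order point above can be illustrated outside Cassandra with just the JDK: for CRC-32, appending the checksum of the wire bytes in little-endian order preserves the self-checking property that re-running the CRC over payload-plus-checksum yields the fixed residue {{0x2144DF1C}}, regardless of payload. Writing the checksum in the other byte order loses this. A minimal stand-alone sketch (an illustration only, not Cassandra code):

{code:java}
import java.util.zip.CRC32;

public class CrcOrderDemo
{
    // CRC32 of (payload || crc-in-little-endian). For CRC-32 this is the
    // constant residue 0x2144DF1C for any payload, which is the property
    // a receiver can exploit to verify the whole stream in one pass.
    public static long residue(byte[] payload)
    {
        CRC32 crc = new CRC32();
        crc.update(payload);
        long c = crc.getValue();

        CRC32 check = new CRC32();
        check.update(payload);
        // append the checksum little-endian, one byte at a time
        for (int i = 0; i < 4; i++)
            check.update((int) ((c >>> (8 * i)) & 0xFF));
        return check.getValue();
    }

    public static void main(String[] args)
    {
        System.out.println(Long.toHexString(residue("any payload".getBytes())));
    }
}
{code}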



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-13 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17231462#comment-17231462
 ] 

Michael Semb Wever edited comment on CASSANDRA-15299 at 11/13/20, 1:48 PM:
---

bq. One thing that's a bit concerning is that the cassandra-test branch of the 
driver, which is what dtests are currently using, is currently 693 commits 
behind the master branch.

If we're updating to use a new version of the driver, does that mean 
the {{cassandra-test}} branch is being synced up to master in the process? 

{quote}Docker images:
- beobal/cassandra-testing-ubuntu1910-java11:2020
- beobal/cassandra-testing-ubuntu1910-java11-w-dependencies:2020{quote}

Is it time to start deploying these images under 
[{{apache/}}|https://hub.docker.com/u/apache] ?
If agreed, I can open an infra ticket to set up deployment of docker images.

bq. I'll open PRs to cassandra-builds and cassandra-dtest before going any 
further here.

Go for it! :-)





[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-11-04 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226048#comment-17226048
 ] 

Alex Petrov edited comment on CASSANDRA-15299 at 11/4/20, 1:36 PM:
---

Patch looks good to me; I have only a few comments: 

* in {{CQLMessageHeader#processLargeMessage}}, we seem to be creating a 
{{LargeMessage}} object to call {{processCqlFrame()}} over a frame that is 
assembled from a single byte buffer. Maybe we can just call {{processCqlFrame}} 
directly?
* this is probably something that we should address outside this ticket, but 
still: "corrupt frame recovered" seems to be slightly misleading wording, as 
does {{Frame#recoverable}}. We cannot recover the frame itself; we can only 
skip/drop it. Maybe we can rename these, along with the related metrics in 
Messaging, before they become public in 4.0.
* in {{CQLMessageHeader#processCqlFrame}}, we only call 
{{handleErrorAndRelease}}. However, it may theoretically happen that we fail 
before {{messageDecoder.decode(channel, frame)}} completes. Maybe we can do 
something like [1], to make it consistent with what we do in 
{{ProtocolDecoder#decode}}? 
* in {{CQLMessageHeader$LargeMessage#onComplete}}, we wrap the 
{{processCqlFrame(assembleFrame())}} call in a try/catch that is identical to 
the try/catch block in {{processCqlFrame}} itself. Just checking whether this 
is intended.
* in {{Dispatcher#processRequest}}, we could slightly simplify the code by 
declaring a {{FlushItem flushItem}} variable outside the try/catch block, 
assigning it in either the try or the catch, and calling {{flush}} only once.
* during server bootstrap initialization, we're using the deprecated low/high 
watermark child options; we should probably use 
{{.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK, new 
WriteBufferWaterMark(8 * 1024, 32 * 1024))}} instead.
* {{SimpleClientBurnTest#random}} is unused
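The {{Dispatcher#processRequest}} simplification suggested above amounts to the following shape (a sketch with hypothetical, simplified names; not the actual Cassandra code):

{code:java}
import java.util.function.Function;

public class DispatchSketch
{
    // Assign the "flush item" exactly once, in either the try or the
    // catch, then call flush() in a single place afterwards.
    static String process(String request, Function<String, String> handler)
    {
        String flushItem;   // declared outside the try/catch
        try
        {
            flushItem = handler.apply(request);
        }
        catch (Exception e)
        {
            flushItem = "ERROR: " + e.getMessage();
        }
        return flush(flushItem);  // the single flush call
    }

    static String flush(String item)
    {
        return item;  // stand-in for the real flush path
    }

    public static void main(String[] args)
    {
        System.out.println(process("ping", r -> r + "-ok"));
        System.out.println(process("boom", r -> { throw new RuntimeException("fail"); }));
    }
}
{code}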

* this seems to be test-only, but still might be good to fix: there seems to 
be a leak in the simple client (see [2])

If it's helpful, I've also put the nits mentioned above (and a couple of 
cleanups) in a commit 
[here|https://github.com/apache/cassandra/commit/76ee6e00f42446d94679e9c9001c81ebfa9418ab].

I've also put together a v5 flow chart, which might be useful to anyone who 
wants a quick overview of what's going on in v5.

[1]
{code}
protected void processCqlFrame(Frame frame)
{
    M message = null;
    try
    {
        message = messageDecoder.decode(channel, frame);
        dispatcher.accept(channel, message, this::toFlushItem);
    }
    catch (Exception e)
    {
        if (message == null)
            frame.release();
        handleErrorAndRelease(e, frame.header);
    }
}
{code}

[2]
{code}
ERROR [nioEventLoopGroup-2-2] 2020-11-02 14:52:12,101 ResourceLeakDetector.java:320 - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
    io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:363)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
    io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
    org.apache.cassandra.transport.Message.encode(Message.java:360)
    org.apache.cassandra.transport.SimpleClient$InitialHandler$3.write(SimpleClient.java:510)
    io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
    io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
    io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    java.lang.Thread.run(Thread.java:748)
{code}



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-10-29 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17220921#comment-17220921
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 10/29/20, 11:11 AM:
-

Here are some numbers from running tlp-stress against 4.0-beta2 vs the 15299 
branch ({{70923bae4}}) with protocol v4 & v5. The dataset was sized to be 
mostly in-memory, reading/writing at {{CL.ONE}}. Each workload ran for 10 
minutes, with a 1 minute warmup, on a 3-node cluster.

Both latency and throughput are pretty much on a par, with v5 showing a bit of 
an improvement where the workload is amenable to compression. Memory usage and 
GC were pretty much the same too; if anything, the servers spent less time in 
GC during the v5 runs, but the difference was too close to be significant.

There are definitely some places we can improve the performance of v5, and I 
don't think anything here indicates a serious regression in either v4 or v5 
performance. Given the additional integrity checks in v5, I think not 
regressing meets the perf bar here, at least in the first instance.
h4. 100% read workload, reads a single row at a time.
{code:java}
                              Reads
                       Count  Latency (p99)  1min (req/s)
4.0-beta2 V4       135449878          72.43     224222.86
15299 V4           138424112           68.3     229765.45
15299 V5           137618437          70.35     231602.36
4.0-beta2 V4 LZ4   103348953          66.68     173437.38
15299 V4 LZ4       105114560          68.83     176192.36
15299 V5 LZ4       131833462          70.19     222092.99
{code}
h4. Mixed r/w workload (50/50). K/V with blobs up to 65k
{code:java}
                              Writes                      |                  Reads
                       Count  Latency (p99)  1min (req/s) |     Count  Latency (p99)  1min (req/s)
4.0-beta2 V4        34455009          76.59      56557.78 |  34464217          74.85      56546.40
15299 V4            33368171          74.90      54859.94 |  33361940          67.77      54848.57
15299 V5            32991815          76.00      54780.74 |  33001153          76.72      54856.99
4.0-beta2 V4 LZ4    32152220          83.41      53306.03 |  32147206          83.22      53334.00
15299 V4 LZ4        31158895          71.01      51106.60 |  31153453          72.75      51087.01
15299 V5 LZ4        32634296          75.73      54370.71 |  32644765          76.70      54396.15
{code}
h4. Mixed r/w/d workload (60/30/10). Wide rows with values up to 200b and 
slicing on both the selects and deletions.
{code:java}
                              Writes                      |                 Reads                    |                Deletes
                       Count  Latency (p99)  1min (req/s) |    Count  Latency (p99)  1min (req/s)   |    Count  Latency (p99)  1min (req/s)
4.0-beta2 V4        18725688         197.47      25377.16 |  9357971         193.86      12687.94   |  3117394         178.44       4221.90
15299 V4            17975636         185.88      24160.78 |  8986125         184.97      12087.13   |  2995562         179.99       4023.88
15299 V5            18429252         192.91      25349.35 |  9223277         188.15      12678.46   |  3073312         184.24       4232.23
4.0-beta2 V4 LZ4    18407719         179.94      25160.16 |  9197664         178.25      12575.03   |  3068134         180.90       4195.38
15299 V4 LZ4        17678994         171.39      24073.09 |  8842952         196.06      12064.18   |  2947344         170.53       4026.37
15299 V5 LZ4        18274085         208.57      25127.02 |  9138491         163.23      12558.28   |  3045264         203.54       4188.88
{code}



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-10-28 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17222476#comment-17222476
 ] 

Caleb Rackliffe edited comment on CASSANDRA-15299 at 10/28/20, 10:01 PM:
-

Reading through the overallocation commit, it seems we might be able to factor 
a little helper out of {{Decoder#extractHeader()}} and {{decodeFrame()}} that 
decodes and validates the header flags.

ex.

{noformat}
private EnumSet<Header.Flag> decodeFlags(ProtocolVersion version, int flags)
{
    EnumSet<Header.Flag> decodedFlags = Header.Flag.deserialize(flags);

    if (version.isBeta() && !decodedFlags.contains(Header.Flag.USE_BETA))
        throw new ProtocolException(String.format("Beta version of the protocol used (%s), but USE_BETA flag is unset", version), version);

    return decodedFlags;
}
{noformat}

I've been able to get through the most recent commits, and things look pretty 
good. Just want to go over {{ClientResourceLimitsTest}} one more time...










[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-10-09 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211482#comment-17211482
 ] 

Caleb Rackliffe edited comment on CASSANDRA-15299 at 10/10/20, 12:28 AM:
-

[~samt] Here are the detailed/tactical points from my first pass at review. 
Some of these items, especially the testing bits, are things I could 
potentially branch and work on a bit myself, so let me know if you'd like a 
hand. I still want to make one more top-down pass at things Monday now that the 
smaller details are clearer.

*Correctness*

- {{AbstractMessageHandler}}
-- Are we incrementing {{receivedCount}} both on the first frame of a large 
message (in {{processFirstFrameOfLargeMessage()}}) and on the last frame (in 
{{processSubsequentFrameOfLargeMessage()}})?
-- {{forceOverAllocation()}} only allocates enough to hit the limit, but is it 
possible for the subsequent release to de-allocate the full frame size?
- {{Flusher}} uses a {{PayloadAllocator}} from either the CRC or LZ4 encoder. 
CRC includes the header and trailer size when it hits the {{BufferPool}}, while 
LZ4 uses the {{Payload}} constructor that doesn't take those into account. Is 
this correct?
- {{PostV5ExceptionHandler}} - in {{exceptionCaught()}}, does 
{{payload.release()}} need to be in a {{finally}} block?
- {{CQLMessageHandler}} - should {{processOneContainedMessage()}} release the 
frame in a {{finally}} block?
- It isn't really a correctness issue per se, but {{FrameSet}} would feel safer 
if we used composition rather than inheritance. (Was the motivation to avoid 
creating the additional object?)

*Testing*

- {{CQLTester}} - It might be a little awkward to follow the semantics of 
"require" -> "prepare" -> "initialize". Maybe we should just inline 
{{prepareNetwork()}}.
- The large frame accumulation state machine might be a good unit test target. 
The lowest level might be a test-only subclass of 
{{AbstractMessageHandler.LargeMessage}} that simulates the cases we're 
interested in. Another would be to go one level up and test 
{{CQLMessageHandler.LargeMessage}}, but that pulls all of {{CQLMessageHandler}} 
in, as it's not a static class. (We could make it static, create a little 
interface to abstract {{CQLMessageHandler}} away, etc.)
- It might be worth pulling up {{WaitQueue}} (which is already static) as a 
top-level class, then throw a few unit-level tests around it.
- {{StartupMessage}} - There's enough logic that it might be nice to unit test 
execute(). (Mocking {{Connection}} and checking replays, etc.)

*Documentation*

- Should we describe {{native_transport_receive_queue_capacity_in_bytes}} in 
{{cassandra.yaml}} and mention it in the Jira docs section?
- Is the plan to just merge the {{cql_protocol_V5_framing.asc}} content into 
{{native_protocol_v5.spec}} when things solidify?
- {{cql_protocol_V5_framing.asc}} - worth mentioning there (or wherever this 
ends up in the official V5 spec) that a non-self-contained frame still holds 
only part of one message, and can't contain the end of one large message and 
the start of another.
- {{OutboundMessageQueue}} - The race from CASSANDRA-15958 is fixed? (Noticed 
the comment was removed...)
- {{InboundMessageHandler}} - The class-level JavaDoc is now almost identical 
to {{AbstractMessageHandler}}'s, so we might want to narrow it down to the 
places where the former specializes (e.g. only it uses {{ProcessMessage}}). 
The same goes for the {{LargeMessage}} abstract class and its two subclasses.
- {{CQLMessageHandler}} - could use brief class-level JavaDoc on how it extends 
{{AbstractMessageHandler}}
- {{transport.Frame}} - brief class-level JavaDoc might help clarify its 
scope/usage (even if it's renamed)
- Would class-level JavaDoc for {{transport.Dispatcher}} and 
{{transport.Flusher}} be helpful? (e.g. how do they interact with each other?)
- {{Payload#finish()}} - It might be useful to have a bit of detail in the 
method JavaDoc explaining when we're expected to call it in the {{Payload}} 
lifecycle.
- Do we need a deprecation plan for {{native_transport_frame_block_size_in_kb}} 
...or not so much, because it was only added for 4.0 anyway?

*Zombies*

- {{transport.Frame.Header.BODY_LENGTH_SIZE}} is unused
- {{PasswordAuthenticatorTest}} - {{import java.net.InetSocketAddress}} is 
unused
- {{OutboundConnection}} - {{import java.util.stream.Stream}} is unused
- {{StressSettings}} - {{import com.datastax.driver.core.Metadata}} and 
{{import com.google.common.collect.ImmutableMap}} are unused
- {{FrameCompressor.LZ4Compressor}} - {{compress()}} doesn't need the 
{{throws}} clause, and {{outputOffset + 0}} could be simplified
- {{ProtocolNegotiationTest}} - can remove the dead {{throws}} clauses from a 
few tests
- {{ProtocolErrorTest}} - {{testErrorMessageWithNullString()}} doesn't need 
throws clause
- {{CQLMessageHandler}} - {{UNKNOWN_STREAM_ID}} and 

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-10-09 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211482#comment-17211482
 ] 

Caleb Rackliffe edited comment on CASSANDRA-15299 at 10/10/20, 12:27 AM:
-

[~samt] Here are the detailed/tactical points from my first pass at review. 
Some of these items, especially the testing bits, are things I could 
potentially branch and work on a bit myself, so let me know if you'd like a 
hand. I still want to make one more top-down pass at things Monday now that the 
smaller details are clearer.

*Correctness*

- {{AbstractMessageHandler}}
-- Are we incrementing {{receivedCount}} both on the first frame of a large 
message (in {{processFirstFrameOfLargeMessage()}}) and on the last frame (in 
{{processSubsequentFrameOfLargeMessage()}})?
-- {{forceOverAllocation()}} only allocates enough to hit the limit, but is it 
possible for the subsequent release to de-allocate the full frame size?
- {{Flusher}} uses a {{PayloadAllocator}} from either the CRC or LZ4 encoder. 
CRC includes the header and trailer size when it hits the {{BufferPool}}, while 
LZ4 uses the {{Payload}} constructor that doesn't take those into account. Is 
this correct?
- {{PostV5ExceptionHandler}} - in {{exceptionCaught()}}, does 
{{payload.release()}} need to be in a {{finally}} block?
- {{CQLMessageHandler}} - should {{processOneContainedMessage()}} release the 
frame in a {{finally}} block?
- It isn't really a correctness issue per se, but {{FrameSet}} would feel safer 
if we used composition rather than inheritance. (Was the motivator not creating 
the additional object?)
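Several of the correctness points above come down to pairing every acquire with a release in a {{finally}} block. A toy sketch of that pattern (hypothetical {{Payload}}/handler names, not the actual transport classes):

```java
// Toy sketch of the release-in-finally pattern raised in the points above.
// Payload is a hypothetical stand-in with a reference count, NOT the real
// org.apache.cassandra.transport classes.
public class ReleaseInFinally {
    static final class Payload {
        private int refCnt = 1;
        void release() {
            if (refCnt <= 0) throw new IllegalStateException("double release");
            refCnt--;
        }
        boolean released() { return refCnt == 0; }
    }

    // Returns true if processing succeeded. The payload is released on every
    // exit path, including when process() throws, and is never released twice.
    static boolean handle(Payload payload, Runnable process) {
        try {
            process.run();
            return true;
        } catch (RuntimeException e) {
            return false; // the exceptionCaught-style path
        } finally {
            payload.release();
        }
    }

    public static void main(String[] args) {
        Payload ok = new Payload();
        Payload bad = new Payload();
        System.out.println(handle(ok, () -> {}));                                 // succeeds
        System.out.println(handle(bad, () -> { throw new RuntimeException(); })); // fails
        System.out.println(ok.released() + " " + bad.released());                // both released
    }
}
```

With this shape, whether {{processOneContainedMessage()}} or {{exceptionCaught()}} throws or returns early, the release still happens exactly once.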

*Testing*

- {{CQLTester}} - It might be a little awkward to follow the semantics of 
"require" -> "prepare" -> "initialize". Maybe we should just inline 
{{prepareNetwork()}}.
- The large frame accumulation state machine might be a good unit test target. 
Lowest level might be a test-only subclass of 
{{AbstractMessageHandler.LargeMessage}} that simulates the cases we're 
interested in. Another would be to go one level up and test the 
{{CQLMessageHandler.LargeMessage}}, but that pulls all of {{CQLMessageHandler}} 
in, as it's not a static class. (We could make it static, create a little 
interface to abstract {{CQLMessageHandler}} away, etc.)
- It might be worth pulling up {{WaitQueue}} (which is already static) as a 
top-level class, then throwing a few unit-level tests around it.
- {{StartupMessage}} - There's enough logic that it might be nice to unit test 
execute(). (Mocking {{Connection}} and checking replays, etc.)
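For the large-frame accumulation idea, the lowest-level test target could look something like this toy sketch (a stand-in accumulator with made-up names, not the real {{AbstractMessageHandler.LargeMessage}}):

```java
// Toy sketch of the lowest-level test target: a stand-in for the large-message
// accumulation state machine, NOT the real AbstractMessageHandler.LargeMessage.
public class LargeMessageAccumulator {
    final int expectedBytes; // total message size announced by the first frame
    int received;

    LargeMessageAccumulator(int expectedBytes) { this.expectedBytes = expectedBytes; }

    // Feed one frame's payload size; returns true when the message is complete.
    boolean onFrame(int frameBytes) {
        received += frameBytes;
        if (received > expectedBytes)
            throw new IllegalStateException("frame overruns the message boundary");
        return received == expectedBytes;
    }

    public static void main(String[] args) {
        LargeMessageAccumulator acc = new LargeMessageAccumulator(100);
        System.out.println(acc.onFrame(40)); // partial: message not yet complete
        System.out.println(acc.onFrame(60)); // final frame: message complete
    }
}
```

A test-only subclass in the real codebase would exercise the same transitions: partial frames, the completing frame, and a frame that overruns the declared message size.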

*Documentation*

- Should describe {{native_transport_receive_queue_capacity_in_bytes}} in 
{{cassandra.yaml}} and mention it in the Jira docs section?
- Is the plan to just merge the {{cql_protocol_V5_framing.asc}} content into 
{{native_protocol_v5.spec}} when things solidify?
- {{cql_protocol_V5_framing.asc}} - worth mentioning somewhere there (or if 
this ends up in the official V5 spec) that non-self-contained frames still only 
hold part of one message and can't contain the end of one large message and the 
start of another one
- {{OutboundMessageQueue}} - The race from CASSANDRA-15958 is fixed? (Noticed 
the comment was removed...)
- {{InboundMessageHandler}} - The class-level JavaDoc is now almost identical 
to {{AbstractMessageHandler}}, so we might want to narrow down to the places 
where the former specializes. (e.g. only it uses {{ProcessMessage}}.) The same 
thing goes for the {{LargeMessage}} abstract class and its two sub-classes.
- {{CQLMessageHandler}} - could use brief class-level JavaDoc on how it extends 
{{AbstractMessageHandler}}
- {{transport.Frame}} - brief class-level JavaDoc might help clarify its 
scope/usage (even if it's renamed)
- Would class-level JavaDoc for {{transport.Dispatcher}} and 
{{transport.Flusher}} be helpful? (e.g. how do they interact with each other?)
- {{Payload#finish()}} - It might be useful to have a bit of detail in the 
method JavaDoc explaining when we're expected to call it in the {{Payload}} 
lifecycle.
- Do we need a deprecation plan for {{native_transport_frame_block_size_in_kb}} 
...or not so much, because it was only added for 4.0 anyway?
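If {{native_transport_receive_queue_capacity_in_bytes}} does get documented, the {{cassandra.yaml}} entry would presumably look something along these lines (the comment wording and default value here are purely illustrative, not taken from the patch):

```yaml
# Upper bound on memory held by inbound CQL messages queued for processing.
# (Illustrative default; check the patch for the real one.)
native_transport_receive_queue_capacity_in_bytes: 1048576
```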

*Zombies*

- {{transport.Frame.Header.BODY_LENGTH_SIZE}} is unused
- {{PasswordAuthenticatorTest}} - {{import java.net.InetSocketAddress}} is 
unused
- {{OutboundConnection}} - {{import java.util.stream.Stream}} is unused
- {{StressSettings}} - {{import com.datastax.driver.core.Metadata}} and 
{{import com.google.common.collect.ImmutableMap}} are unused
- {{FrameCompressor.LZ4Compressor}} - {{compress()}} doesn't need the 
{{throws}} clause, and {{outputOffset + 0}} could be simplified
- {{ProtocolNegotiationTest}} - can remove the dead {{throws}} clauses from a 
few tests
- {{ProtocolErrorTest}} - {{testErrorMessageWithNullString()}} doesn't need its 
{{throws}} clause
- {{CQLMessageHandler}} - {{UNKNOWN_STREAM_ID}} and 

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-09-28 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198465#comment-17198465
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 9/28/20, 11:33 AM:


Sorry it's been a while without any visible movement here, but I've just pushed 
some more commits to address the latest comments from [~ifesdjeen] and 
[~omichallat]. I've added some tests for protocol negotiation, correct handling 
of corrupt messages and resource management.
{quote} * In CQLMessageHandler#processOneContainedMessage, when we can't 
acquire capacity we don't pass the frame further down the line. Should we 
release the frame in this case, since usually we're releasing the source frame 
after flush?{quote}
Done, though we only need to do this when {{throwOnOverload == true}} as 
otherwise we process the inflight request before applying backpressure.
{quote} * ReusableBuffer is unused.{quote}
Ah yes, removed
{quote} * Server has a few unused imports and eventExecutorGroup which is 
unused.{quote}
Cleaned up the imports and removed eventExecutorGroup
{quote} * I'm not sure if we currently handle releasing corrupted frames.{quote}
For self-contained frames, there's nothing to do here as no resources have been 
acquired before the corruption is detected, hence {{CorruptFrame::release}} is 
a no-op. For frames which are part of a large message, there may be some 
resource allocated before we discover corruption. This is ok though, as we 
consume the frame, supplying it to the large message state machine, which 
handles releasing the buffers of the previous frames (if any). I've added a 
test for this scenario which includes a check that everything allocated has 
been freed.
{quote}Should we maybe make FrameSet auto-closeable and make sure we always 
release buffers in finally? I've also made a similar change to processItem 
which would add item to flushed to make sure it's released. That makes flushed 
variable name not quite right though.
{quote}
I've pulled in some of your change to {{processItem}} as it removes some 
duplication around polling the queue. I've removed the condition in the 
{{finally}} of {{ImmediateFlusher}} though, since if we throw from 
{{processQueue}} then {{doneWork}} will be false anyway, but there may have 
been some items processed and waiting to flush. The trade-off is calling 
{{flushWrittenChannels}} even if there's no work to do, but that seems both 
cheap and unlikely, what do you think?
 As far as making {{FrameSet}} auto-closeable, I don't think that's feasible, 
given how they are created and accessed. I've tried to address one of your 
comments re: the memory management here by adding some comments. They're 
probably not yet enough, but let me know if they are helpful at all.
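For what it's worth, the flush-in-finally reasoning can be sketched like this (toy names, not the real {{Flusher}}/{{ImmediateFlusher}} code):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy sketch of the flush-in-finally reasoning: even if an item throws while
// being processed, whatever was already written must still be flushed. Names
// are illustrative, not the real Flusher/ImmediateFlusher code.
public class FlushInFinally {
    static int flushes = 0;

    static void processQueue(Queue<Runnable> queue) {
        try {
            Runnable item;
            while ((item = queue.poll()) != null)
                item.run(); // may throw part-way through the queue
        } finally {
            flushes++; // the flushWrittenChannels() equivalent always runs
        }
    }

    public static void main(String[] args) {
        Queue<Runnable> queue = new ArrayDeque<>();
        queue.add(() -> {});
        queue.add(() -> { throw new RuntimeException("boom"); });
        try { processQueue(queue); } catch (RuntimeException ignored) {}
        System.out.println(flushes); // flushed despite the failure
    }
}
```

The unconditional flush in {{finally}} covers the case where some items were written before the throw, at the cost of an occasional no-op flush.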
{quote}We can (and probably should) open a separate ticket that could aim at 
performance improvements around native protocol.
{quote}
Agreed, I'd like to do some further perf testing, but the results from your 
initial tests make a follow-up ticket seem a reasonable option.
{quote}I've noticed an issue when the client starts protocol negotiation with 
an unsupported version.
{quote}
Fixed, thanks.

[~ifesdjeen], I haven't pulled in your burn test or changes to {{SimpleClient}} 
yet, I'll try to do that next week. I also haven't done any automated renaming 
yet, I'll hold off on that so as not to add to the cognitive burden until we're 
pretty much done with review.
||branch||CI||
|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-07-21 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162062#comment-17162062
 ] 

Alex Petrov edited comment on CASSANDRA-15299 at 7/21/20, 2:29 PM:
---

Thank you for the changes. I'm mostly done with a second pass of the review. 

First, wanted to bring up a couple of important things:

  * Current memory management is rather hard to follow, and maybe we should 
make an effort to document and/or codify it somewhere. I've spent about a day 
following all paths and figuring out ins-and-outs of it. One thing that I've 
found is that in {{FrameSet#finish}}, we're releasing not only the frame that 
was encoded into the sending buffer, but also the one we flush. This is done in 
three places with {{writeAndFlush}}. I believe this was leading to the CRC 
mismatch bug I was catching earlier (see stacktrace [1] below). What might have 
been happening is we were releasing the buffer too early, which would allow it 
to get recycled.
  * There were several places in {{SimpleClient}} where we were not accounting 
for bytes. One is that we were never releasing the {{Frame}} of the response we 
were getting. I've implemented copying, which is not optimal, but should work 
for testing purposes. The other one is that we weren't giving the bytes back to 
the limits. They would be acquired via netty processing, so we need to release 
them eventually. I've added a simple hook to release those. An alternative to 
this would be to not use limits at all, with the downside of losing a bit of 
observability. 
  * [this one was discussed privately and source of the leak suggested by Sam] 
Flusher takes care of calling release for the flush item, which releases the 
source buffer. However, _response_ frame is released only for small buffers in 
{{FrameSet}}
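The early-release hazard in the first bullet boils down to releasing a ref-counted buffer before the write that still reads it has completed. A toy illustration (hypothetical classes, not the actual Netty/Flusher code):

```java
// Toy illustration of the early-release hazard: a ref-counted buffer released
// before the (asynchronous) flush that still reads it. Hypothetical classes,
// not the actual Netty/Flusher code.
public class EarlyRelease {
    static final class RefCountedBuf {
        private int refCnt = 1;
        private byte[] data = {1, 2, 3};
        void release() { if (--refCnt == 0) data = null; } // buffer recycled
        byte[] read() {
            if (refCnt <= 0) throw new IllegalStateException("use after release");
            return data;
        }
    }

    static String flushAfterRelease() {
        RefCountedBuf buf = new RefCountedBuf();
        buf.release(); // released too early, before the flush below runs
        try {
            buf.read(); // the deferred flush now reads recycled memory
            return "ok";
        } catch (IllegalStateException e) {
            return "use-after-release detected";
        }
    }

    public static void main(String[] args) {
        System.out.println(flushAfterRelease());
    }
}
```

In the real pool the recycled buffer is reused rather than failing fast, which is how the premature release surfaces as a CRC mismatch instead of an immediate error.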

Here's a branch with aforementioned changes. However, I haven't run CI on it, 
I'll check if it breaks anything later today: 
https://github.com/ifesdjeen/cassandra/pull/new/15299-alex

{code}
[1] 
org.apache.cassandra.net.Crc$InvalidCrc: Read -854589741, Computed 1432984585
 at 
org.apache.cassandra.transport.CQLMessageHandler.processCorruptFrame(CQLMessageHandler.java:328)
 at 
org.apache.cassandra.net.AbstractMessageHandler.process(AbstractMessageHandler.java:217)
 at org.apache.cassandra.net.FrameDecoder.deliver(FrameDecoder.java:321)
 at org.apache.cassandra.net.FrameDecoder.channelRead(FrameDecoder.java:285)
 at org.apache.cassandra.net.FrameDecoder.channelRead(FrameDecoder.java:269)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at
{code}

Some small nits: 

  * In {{CQLMessageHandler#processOneContainedMessage}}, when we can't acquire 
capacity we don't pass the frame further down the line. Should we release the 
frame in this case, since usually we're releasing the source frame after flush?
  * {{ReusableBuffer}} is unused.
  * {{Server}} has a few unused imports and {{eventExecutorGroup}} which is 
unused.
  * I'm not sure if we currently handle releasing corrupted frames.
  * In {{FrameSet}}, we're currently relying on the fact that we'll be able to 
go through {{finish}} and release everything successfully. However, this might 
not always be the case. I couldn't trigger such error, but it might still be 
possible. Should we maybe make {{FrameSet}} auto-closeable and make sure we 
always release buffers in {{finally}}? I've also made a similar change to 
{{processItem}} which would add item to {{flushed}} to make sure it's released. 
That makes {{flushed}} variable name not quite right though.

And some improvements that were done to simple client:

  * it now supports timeouts
  * supports sending multiple messages in a single frame when using v5
  * simpler pipelines
  * reuses code for frame processing 

There's one more issue with a driver, which is easy to reproduce (with attached 
jar and test), but here's a stack trace:

{code}
DEBUG [cluster1-nio-worker-1] 2020-07-21 11:54:20,474 Connection.java:1396 - 
Connection[/127.0.0.1:64050-2, inFlight=2, closed=false] connection error
java.lang.AssertionError: null
at 
com.datastax.driver.core.BytesToSegmentDecoder.decode(BytesToSegmentDecoder.java:56)
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:332)
at 
io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at 

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-07-10 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155676#comment-17155676
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 7/10/20, 7:12 PM:
---

[~omichallat] I absolutely take on board all of your arguments. Now that the 
changes are getting into somewhat more decent shape, let's take another look and 
see if a large-scale rename is absolutely necessary. Maybe it's just Stockholm 
Syndrome, but I'm finding that the clashes between the {{o.a.c.net}} and 
{{o.a.c.transport}} classes are less and less of a problem. Perhaps we can live 
with things as they are after all.

Unfortunately, I've been a bit short of time to focus on this since the last 
update, but I've made some changes to the server configuration to improve 
testability and added a first meaningful(ish) test. Fleshing out the test 
coverage should now be a bit easier, which I'll try to do next week.

As I mentioned, I've been running end to end tests with the 
[java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the 
java driver and have bundled a build from there for now. 
What would be really useful would be some experimental level of support in the 
python driver, for cqlsh and python dtests. 

|branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]|
|dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]|
|ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|




> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>
>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty 

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-07-10 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139920#comment-17139920
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 7/10/20, 7:11 PM:
---

Pushed an updated branch where the protocol changes are pretty stable. I've 
been testing this with the 
[java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the 
driver and with debug-cql and everything is working pretty much as expected. 
Due to current lack of support in the python driver, I've had to use a modified 
[dtest branch|https://github.com/beobal/cassandra-dtest/tree/15299] and make a 
temporary hack to cqlsh, but the tests that are running are pretty much green. 
Obviously, those are not covering anything from v5 now, but my primary concern 
was to make sure there's no regressions for v4 clients.

Although I think this is ready for some more eyes on it, there's still a 
non-trivial amount of work to be done. Items outstanding include:
 * Comprehensive unit and in-jvm tests - in progress
 * Metrics
 * Python driver support (doesn't have to be fully implemented, but a basic 
level is needed for pytests and cqlsh)
 * Documentation
 * Renaming existing classes. There are a number of slightly confusing 
conflicts in naming now. These should be simple to resolve, just automated 
renaming mostly, but I've held off doing them for now because they'll make the 
patch much bigger and probably harder to read.

The patch is also not quite as massive as it might appear at first. A large 
proportion is the revert of CASSANDRA-13304, and the rest is largely additive. 
I've tried hard not to touch code on the v4 path, even where it could clearly be 
refactored, to minimize the delta & the risk there. So the patch largely 
consists of some v5 specific additions, moving a few existing classes/methods 
around, and changing modifiers on previously private/package-private things.

I'm still actively working on the tests, but I don't think that (nor the 
renaming) need hold up review any more.
|branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]|
|dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]|
|ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|




[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-06-30 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148831#comment-17148831
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 6/30/20, 5:08 PM:
---

Thanks for the comments [~ifesdjeen] & [~omichallat], I've pushed a few commits 
to the [branch|https://github.com/beobal/cassandra/commits/15299-trunk].

{quote} 
There are several things that I wanted to bring to your attention:
{quote}
I've handled most of these in a refactor of Flusher. As you suggested, for 
framed items we now collate the frames and only allocate the payloads when we 
flush to the netty channel. So now, we allocate the payload based on the actual 
number of bytes required for the specific channel.
{quote}{{ExceptionHandlers$PostV5ExceptionHandler#exceptionCaught}}: when 
flushing an exception, we don't call release on the payload.
{quote}
Included in the {{minor cleanups}} commit
{quote}There are several places in {{SimpleClient}} where 
{{largePayload#release}} isn't called.
{quote}
I've refactored the flushing of large messages in {{SimpleClient}} to match 
{{Flusher}}, so this is working properly now.
{quote}Other things...
{quote}
{quote}{{Dispatcher#processRequest}}, we don't need to cast error to 
{{Message.Response}} if we change its type to {{ErrorMessage}}.
{quote}
{quote}In {{CqlMessageHandler#releaseAfterFlush}}, we can call 
{{sourceFrame#release()}} instead of {{sourceFrame.body.release()}} for 
consistency with other calls.
{quote}

Both in {{minor cleanups}}
{quote}{{Server#requestPayloadInFlightPerEndpoint}} can be a non-static 
{{Server}} member.
{quote}
If you don't mind I'd prefer to leave this as it is for now as it's 
pre-existing and changing would require reworking CASSANDRA-15519 (changing 
limits at runtime).
{quote}Should we hide {{flusher.queued.add()}} behind a method to disallow 
accessing queue directly?
{quote}
I've done this, but I'm not 100% convinced of its utility. As the two 
{{Flusher}} subclasses need access to the queue, we have to provide 
package-private methods {{poll}} and {{isEmpty}} as well as one to {{enqueue}}. 
So unless we move {{Flusher}} to its own subpackage, the queue is effectively 
visible to everything else in {{o.a.c.transport}}.
{quote}We can change the code a bit to make {{FlushItemConverter}} instances 
explicit. Right now, we basically have two converters both called 
{{#toFlushItem}} in {{CQLMessageHandler}} and {{LegacyDispatchHandler}}. We 
could have them as inner classes. It's somewhat useful since if you change the 
signature of this method, or stop using it, it'll be hard to find that it is 
actually an implementation of converter.
{quote}
I've left this as it is just for the moment. I'm working on some tests which 
supply a lambda to act as the converter, so I'll come back to this when those 
have solidified a bit more.
{quote}Looks like {{MessageConsumer}} could be generic, since we cast it to 
either request or response.
{quote}
I've parameterised {{MessageConsumer}} & {{CQLMessageHandler}} according to the 
subclass of Message they expect and extended this a bit by moving the logic out 
of {{Message$ProtocolEncoder}} to an abstract {{Message$Decoder}} with concrete subclasses for {{Request}} and {{Response}}.
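The shape of that parameterisation looks roughly like this (heavily simplified signatures; the real interfaces also take channel/payload arguments):

```java
// Minimal sketch of the parameterisation: the consumer is generic in the
// Message subtype it accepts, so no casting is needed inside the handler.
// Simplified signatures; the real interfaces take channel/payload args too.
public class GenericConsumer {
    static class Message {}
    static final class Request extends Message {
        final String query;
        Request(String query) { this.query = query; }
    }

    interface MessageConsumer<M extends Message> {
        void accept(M message);
    }

    // The handler is parameterised the same way, so the compiler enforces that
    // a Request handler only ever sees Requests.
    static <M extends Message> void deliver(M message, MessageConsumer<M> consumer) {
        consumer.accept(message);
    }

    public static void main(String[] args) {
        StringBuilder seen = new StringBuilder();
        MessageConsumer<Request> requests = r -> seen.append(r.query);
        deliver(new Request("SELECT 1"), requests);
        System.out.println(seen);
    }
}
```

Server-side handlers instantiate the {{Request}} flavour and {{SimpleClient}}-style code the {{Response}} flavour, with the casts gone from both.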

{quote}Looks like {{CQLMessageHandler#processCorruptFrame}}, initially had an 
intention of handling recovery, but now just throws a CRC exception regardless. 
This does match description, but usage of {{isRecoverable}} seems to be 
redundant here, unless we change semantics of recovery.
{quote}
It is somewhat redundant here, except that it logs a slightly different message 
to indicate whether the CRC mismatch was found in the frame header or body. 
I'll leave it as it is for now as it's technically possible to recover from a 
corrupt body, but would be problematic for clients just now.

I still have some comments to address, as well as those from [~omichallat] ...
{quote}{{Frame$Decoder}} and other classes that are related to legacy path can 
be extracted to a separate class, since {{Frame}} itself is still useful, but 
classes that facilitate legacy encoding/decoding/etc can be extracted.
{quote}
{quote}{{Frame#encodeHeaderInto}} seems to be duplicating the logic we have in 
{{Frame$Encoder#encodeHeader}}, should we unify the two? Maybe we can have 
encoding/decoding methods shared for both legacy and new paths, for example, as 
static methods?
{quote}
{quote}As you have mentioned, it would be great to rename {{Frame}} to 
something different, like {{Envelope}}, since right now we have 
{{FrameDecoder#Frame}} and {{Frame$Decoder}} and variable names that correspond 
with class names, which makes it all hard to follow.
{quote}



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-06-30 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148831#comment-17148831
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 6/30/20, 5:06 PM:
---

Thanks for the comments [~ifesdjeen] & [~omichallat], I've pushed a few commits 
to the [branch|https://github.com/beobal/cassandra/commits/15299-trunk].

{quote} 
There are several things that I wanted to bring to your attention:
{quote}
I've handled most of these in a refactor of Flusher. As you suggested, for 
framed items we now collate the frames and only allocate the payloads when we 
flush to the netty channel. So now, we allocate the payload based on the actual 
number of bytes required for the specific channel.
{quote}{{ExceptionHandlers$PostV5ExceptionHandler#exceptionCaught}}: when 
flushing an exception, we don't call release on the payload.
{quote}
Included in the {{minor cleanups}} commit
{quote}There are several places in {{SimpleClient}} where 
{{largePayload#release}} isn't called.
{quote}
I've refactored the flushing large messages in {{SimpleClient}} to match 
{{Flusher}}, so this is working properly now.
{quote}Other things...
{quote}
{quote}{{Dispatcher#processRequest}}, we don't need to cast error to 
{{Message.Response}} if we change its type to {{ErrorMessage}}.
{quote}
{quote}In {{CQLMessageHandler#releaseAfterFlush}}, we can call 
{{sourceFrame#release()}} instead of {{sourceFrame.body.release()}} for 
consistency with other calls.
{quote}
Both in {{minor cleanups}}.
{quote}{{Server#requestPayloadInFlightPerEndpoint}} can be a non-static 
{{Server}} member.
{quote}
If you don't mind, I'd prefer to leave this as it is for now: it's 
pre-existing, and changing it would require reworking CASSANDRA-15519 (changing 
limits at runtime).
{quote}Should we hide {{flusher.queued.add()}} behind a method to disallow 
accessing queue directly?
{quote}
I've done this, but I'm not 100% convinced of its utility. As the two 
{{Flusher}} subclasses need access to the queue, we have to provide 
package-private {{poll}} and {{isEmpty}} methods as well as one to {{enqueue}}. 
So unless we move {{Flusher}} to its own subpackage, the queue is effectively 
visible to everything else in {{o.a.c.transport}}.
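For illustration, the encapsulation under discussion amounts to something like this (hypothetical {{FlusherQueue}} name; the real queue holds flush items and lives on the event loop):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of hiding the flush queue behind methods instead of exposing the
// field: producers only see enqueue, while the Flusher subclasses use the
// package-private poll/isEmpty.
class FlusherQueue {
    private final Queue<Object> queued = new ArrayDeque<>();

    void enqueue(Object item) {       // producer-facing
        queued.add(item);
    }

    Object poll() {                   // needed by Flusher subclasses
        return queued.poll();
    }

    boolean isEmpty() {               // needed by Flusher subclasses
        return queued.isEmpty();
    }
}
```

As noted above, unless the class moves to its own subpackage, these package-private accessors leave the queue just as reachable from the rest of the package.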
{quote}We can change the code a bit to make {{FlushItemConverter}} instances 
explicit. Right now, we basically have two converters both called 
{{#toFlushItem}} in {{CQLMessageHandler}} and {{LegacyDispatchHandler}}. We 
could have them as inner classes. It's somewhat useful since if you change the 
signature of this method, or stop using it, it'll be hard to find that it is 
actually an implementation of converter.
{quote}
I've left this as it is just for the moment. I'm working on some tests which 
supply a lambda to act as the converter, so I'll come back to this when those 
have solidified a bit more.
{quote}Looks like {{MessageConsumer}} could be generic, since we cast it to 
either request or response.
{quote}
I've parameterised {{MessageConsumer}} & {{CQLMessageHandler}} according to the 
subclass of {{Message}} they expect, and extended this a bit by moving the logic 
out of {{Message$ProtocolEncoder}} to an abstract {{Message$Decoder}} with 
concrete subclasses for {{Request}} and {{Response}}.

{quote}Looks like {{CQLMessageHandler#processCorruptFrame}} initially had an 
intention of handling recovery, but now just throws a CRC exception regardless. 
This does match the description, but usage of {{isRecoverable}} seems to be 
redundant here, unless we change the semantics of recovery.
{quote}

It is somewhat redundant here, except that it logs a slightly different message 
to indicate whether the CRC mismatch was found in the frame header or body. 
I'll leave it as it is for now: it's technically possible to recover from a 
corrupt body, but doing so would be problematic for clients at the moment.
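To illustrate the header/body distinction being discussed, here is a minimal sketch of validating the two CRCs separately. All names are hypothetical, and the CRC-24 below uses the OpenPGP polynomial purely for demonstration; it is not Cassandra's actual parameterisation:

```java
import java.util.zip.CRC32;

// Illustrative frame validation distinguishing header corruption (fatal: the
// length field itself is untrustworthy) from body corruption (technically
// recoverable, since an intact header says how many bytes to skip).
final class FrameChecks {
    // Bitwise CRC-24 (OpenPGP parameters, for illustration only).
    static int crc24(byte[] bytes, int off, int len) {
        int crc = 0xB704CE;
        for (int i = off; i < off + len; i++) {
            crc ^= (bytes[i] & 0xFF) << 16;
            for (int bit = 0; bit < 8; bit++) {
                crc <<= 1;
                if ((crc & 0x1000000) != 0)
                    crc ^= 0x1864CFB;
            }
        }
        return crc & 0xFFFFFF;
    }

    static boolean headerIntact(byte[] frame, int headerLen, int expectedCrc24) {
        return crc24(frame, 0, headerLen) == expectedCrc24;
    }

    static boolean bodyIntact(byte[] frame, int bodyOff, int bodyLen, long expectedCrc32) {
        CRC32 crc = new CRC32();
        crc.update(frame, bodyOff, bodyLen);
        return crc.getValue() == expectedCrc32;
    }
}
```

A caller could then log (or recover) differently depending on which of the two checks failed.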

I still have some comments to address, as well as those from [~omichallat] ...
{quote}Frame$Decoder and other classes that are related to the legacy path can 
be extracted to a separate class, since Frame itself is still useful, but the 
classes that facilitate legacy encoding/decoding/etc. can be extracted.
{quote}
{quote}Frame#encodeHeaderInto seems to be duplicating the logic we have in 
Frame$Encoder#encodeHeader, should we unify the two? Maybe we can have 
encoding/decoding methods shared for both legacy and new paths, for example, as 
static methods?
{quote}
{quote}As you have mentioned, it would be great to rename Frame to something 
different, like Envelope, since right now we have FrameDecoder#Frame and 
Frame$Decoder and variable names that correspond with class names, which makes 
it all hard to follow.
{quote}


was (Author: beobal):
Thanks for the comments [~ifesdjeen] & [~omichallat].

{quote} 
There are several things that I wanted to bring to your attention:
{quote}
I've handled most of these in a refactor of Flusher. As you 

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-06-18 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139920#comment-17139920
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 6/18/20, 6:58 PM:
---

Pushed an updated branch where the protocol changes are pretty stable. I've 
been testing this with the 
[java2772|https://github.com/datastax/java-driver/tree/java2772] branch of the 
driver and with debug-cql and everything is working pretty much as expected. 
Due to current lack of support in the python driver, I've had to use a modified 
[dtest branch|https://github.com/beobal/cassandra-dtest/tree/15299] and make a 
temporary hack to cqlsh, but the tests that are running are pretty much green. 
Obviously, those tests don't cover anything v5-specific yet, but my primary 
concern was to make sure there are no regressions for v4 clients.

Although I think this is ready for some more eyes on it, there's still a 
non-trivial amount of work to be done. Items outstanding include:
 * Comprehensive unit and in-jvm tests - in progress
 * Metrics
 * Python driver support (doesn't have to be fully implemented, but a basic 
level is needed for pytests and cqlsh)
 * Documentation
 * Renaming existing classes. There are a number of slightly confusing 
conflicts in naming now. These should be simple to resolve, just automated 
renaming mostly, but I've held off doing them for now because they'll make the 
patch much bigger and probably harder to read.

The patch is also not quite as massive as it might appear at first. A large 
proportion is the revert of CASSANDRA-13304, and the rest is largely additive. 
I've tried hard not to touch code on the v4 path, even where it could clearly 
be refactored, to minimize the delta & the risk there. So the patch largely 
consists of some v5-specific additions, moving a few existing classes/methods 
around, and changing modifiers on previously private/package-private things.

I'm still actively working on the tests, but I don't think that (nor the 
renaming) needs to hold up review any longer.
|branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]|
|dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]|
|ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|


was (Author: beobal):
Pushed an updated branch where the protocol changes are pretty stable. I've 
been testing this with the java2772 branch of the driver and with debug-cql and 
everything is working pretty much as expected. Due to current lack of support 
in the python driver, I've had to use a modified [dtest 
branch|https://github.com/beobal/cassandra-dtest/tree/15299] and make a 
temporary hack to cqlsh, but the tests that are running are pretty much green. 
Obviously, those are not covering anything from v5 now, but my primary concern 
was to make sure there's no regressions for v4 clients.

Although I think this is ready for some more eyes on it, there's still a 
non-trivial amount of work to be done. Items outstanding include:

* Comprehensive unit and in-jvm tests - in progress
* Metrics
* Python driver support (doesn't have to be fully implemented, but a basic 
level is needed for pytests and cqlsh)
* Documentation
* Renaming existing classes. There are a number of slightly confusing conflicts 
in naming now. These should be simple to resolve, just automated renaming 
mostly, but I've held off doing them for now because they'll make the patch 
much bigger and probably harder to read.

The patch is also not quite a massive at it might appear at first. A large 
proportion is the revert of CASSANDRA-13304, and the rest is largely additive. 
I've tried hard not touch code on the v4 path, even where it could clearly be 
refactored, to minimize the delta & the risk there. So the patch largely 
consists of some v5 specific additions, moving a few existing classes/methods 
around, and changing modifiers on previously private/package-private things.

I'm still actively working on the tests, but I don't think that (nor the 
renaming) need hold up review any more.

|branch|[15299-trunk|https://github.com/beobal/cassandra/tree/15299-trunk]|
|dtests|[15299|https://github.com/beobal/cassandra-dtest/tree/15299]|
|ci|[circle|https://app.circleci.com/pipelines/github/beobal/cassandra?branch=15299-trunk]|


> CASSANDRA-13304 follow-up: improve checksumming and compression in protocol 
> v5-beta
> ---
>
> Key: CASSANDRA-15299
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15299
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Aleksey Yeschenko
>Assignee: Sam Tunnicliffe
>Priority: Normal
>  Labels: protocolv5
> Fix For: 4.0-alpha
>

[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-05-22 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113865#comment-17113865
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-15299 at 5/22/20, 9:35 AM:
--

bq. That is, the specific lengths are chosen to ensure that the CRC 24/32 
provides adequate protection.

FWIW, depending on your definition of "adequate", CRC32 may well not fit the 
bill for 128 KiB, so with CRC32 for the payload, 128 KiB should very much be an 
upper limit. A protocol change should move us to CRC64 or CRC128 before we 
consider increasing the maximum frame size.

For reference, from Koopman: 
https://users.ece.cmu.edu/~koopman/crc/c32/0x82608edb_len.txt
This means that the minimum number of flipped bits that can corrupt the stream 
undetected is 3 when the payload is smaller than 91,608 bits (~11 KiB), and 
just 2 when it is larger than that.

I personally doubt there will be much value in increasing the frame size soon 
anyway, since the goal is only to amortise the cost of splitting well and 
increase compression efficiency.  We would want to revisit the compression 
algorithm we use before we increase the size, to find one able to exploit 
longer histories (while not requiring too many resources per connection).

Which is to say it's definitely for a future protocol, and also to clarify that 
CRC32 was a poor man's compromise at the time of authoring the internode 
messaging, because there is no accelerated and readily accessible CRC64 or 
CRC128 implementation in our supported JDKs.  I expect they will be available 
in the future though, or we can probably include our own using the modern 
instructions.
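The discipline assumed throughout this discussion — a CRC32 computed over the exact (post-compression) wire bytes and appended in one agreed byte order, verified before decompression — can be sketched as follows. The {{WireCrc}} helper is hypothetical, and big-endian is chosen arbitrarily for the illustration:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

// Sketch of "checksum the wire bytes": the CRC32 covers the bytes exactly as
// written on the wire (i.e. after compression), so the receiver verifies
// before decompressing and a corrupt stream never reaches the decompressor.
final class WireCrc {
    static ByteBuffer append(byte[] wireBytes) {
        CRC32 crc = new CRC32();
        crc.update(wireBytes, 0, wireBytes.length);
        ByteBuffer out = ByteBuffer.allocate(wireBytes.length + 4)
                                   .order(ByteOrder.BIG_ENDIAN);
        out.put(wireBytes);
        out.putInt((int) crc.getValue()); // one fixed byte order on the wire
        out.flip();
        return out;
    }

    static boolean verify(ByteBuffer framed) {
        int bodyLen = framed.remaining() - 4;
        byte[] body = new byte[bodyLen];
        framed.duplicate().get(body);     // leave the caller's position alone
        CRC32 crc = new CRC32();
        crc.update(body, 0, bodyLen);
        return (int) crc.getValue() == framed.getInt(framed.position() + bodyLen);
    }
}
```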


was (Author: benedict):
bq. That is, the specific lengths are chosen to ensure that the CRC 24/32 
provides adequate protection.

FWIW, depending on your definition of "adequate" CRC32 may well not fit the 
bill for 128KiB, so with CRC32 for the payload 128KiB should very much be an 
upper limit.  A protocol change should move us to CRC64 or CRC128 before we 
consider increasing the maximum frame size.  

I personally doubt there will be much value in increasing the frame size soon 
anyway, since the goal is only to amortise the cost of splitting well and 
increase compression efficiency.  We would want to revisit the compression 
algorithm we use before we increase the size, to find one able to exploit 
longer histories (while not requiring too many resources per connection).

Which is to say it's definitely for a future protocol, and also to clarify that 
CRC32 was a poor man's compromise at the time of authoring the internode 
messaging, because there is no accelerated and readily accessible CRC64 or 
CRC128 implementation in our supported JDKs.  I expect they will be available 
in the future though, or we can probably include our own using the modern 
instructions.


[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-05-14 Thread Olivier Michallat (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107811#comment-17107811
 ] 

Olivier Michallat edited comment on CASSANDRA-15299 at 5/15/20, 12:49 AM:
--

As I'm starting to work on the driver changes, I have a few more remarks about 
the binary format:
 * length fields are just big enough to accommodate the 128 KiB limit. Wouldn't 
it be more cautious to keep a bit of margin? I have no issue with the current 
value, or even the fact that it is hard-coded; but it feels a bit restrictive 
to remove any possibility of ever increasing it without a protocol change.
 * along the same lines, maybe the header should reserve more space for flags. 
We only have 1 so far (and space for 5 more), but as experience has shown, new 
needs can arise over time. Legacy messages use either 1 or 4 bytes.
 * why store the uncompressed length in the header? In the legacy frame format, 
it's encoded in the compressed payload (Snappy does that natively, and for LZ4 
we prepend it). We have code that takes a buffer and returns a compressed 
buffer; it would be nice to reuse it without having to change anything. I don't 
see any advantage to having it in the header: if the payload is corrupted we 
have nothing to decompress anyway, so the uncompressed length is of no use.

 Putting all those changes together, we could have a single header format that 
works both with and without compression:
{code:java}
  0   1   2   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  Payload Length   |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | flags | CRC24 of Header   |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
{code}
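A sketch of encoding and validating that proposed 8-byte header (hypothetical {{ProposedHeader}} class; the CRC-24 uses the OpenPGP polynomial purely for illustration, a real protocol revision would pin down its own parameters):

```java
import java.nio.ByteBuffer;

// Proposed layout: 4 bytes payload length, 1 byte flags, 3 bytes CRC24
// computed over the preceding 5 bytes.
final class ProposedHeader {
    // Bitwise CRC-24 (OpenPGP parameters, for illustration only).
    static int crc24(byte[] b, int len) {
        int crc = 0xB704CE;
        for (int i = 0; i < len; i++) {
            crc ^= (b[i] & 0xFF) << 16;
            for (int bit = 0; bit < 8; bit++) {
                crc <<= 1;
                if ((crc & 0x1000000) != 0)
                    crc ^= 0x1864CFB;
            }
        }
        return crc & 0xFFFFFF;
    }

    static byte[] encode(int payloadLength, byte flags) {
        byte[] h = new byte[8];
        ByteBuffer buf = ByteBuffer.wrap(h);
        buf.putInt(payloadLength);
        buf.put(flags);
        int crc = crc24(h, 5);
        h[5] = (byte) (crc >>> 16);
        h[6] = (byte) (crc >>> 8);
        h[7] = (byte) crc;
        return h;
    }

    static boolean verify(byte[] h) {
        int stored = ((h[5] & 0xFF) << 16) | ((h[6] & 0xFF) << 8) | (h[7] & 0xFF);
        return crc24(h, 5) == stored;
    }
}
```

Note how any corruption of the length or flag bits fails the header CRC before the payload length is ever trusted.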


was (Author: omichallat):
As I'm starting to work on the driver changes, I have a few more remarks about 
the binary format:
 * length fields are just big enough to accommodate the 128 KiB limit. Wouldn't 
it be more cautious to keep a bit of margin? I have no issue with the current 
value, or even the fact that it is hard-coded; but it feels a bit restrictive 
to remove any possibility of ever increasing it without a protocol change.
 * along the same lines, maybe the header should reserve more space for flags. 
We only have one so far, but as experience has shown, new needs can arise over 
time. Legacy messages use either 1 or 4 bytes.
 * why store the uncompressed length in the header? In the legacy frame format, 
it's encoded in the compressed payload (Snappy does that natively, and for LZ4 
we prepend it). We have code that takes a buffer and returns a compressed 
buffer, it would be nice to reuse it without having to change anything. I don't 
see any advantage having it in the header, if the payload is corrupted we have 
nothing to decompress anyway, so the uncompressed length is of no use.

 Putting all those changes together, we could have a unique header format that 
works both with and without compression:
{code:java}
  0   1   2   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  Payload Length   |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | flags | CRC24 of Header   |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
{code}



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-05-11 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104465#comment-17104465
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-15299 at 5/11/20, 1:39 PM:
---

bq.  find out that the _uncompressed_ size is above the limit, split it into 
128 KiB chunks, then compress each chunk separately and wrap it into its own 
(uncontained) outer frame?

Yes, exactly.

bq. uncontained frames are never recoverable when compression is enabled

Large messages are never recoverable when a corrupt payload is detected. 
Self-contained frames *could* be recoverable as long as only the payload is 
corrupt but, as we discussed earlier, this is more complicated here than in the 
internode case because the payload may contain multiple stream ids, so we may 
as well close the connection whenever a corrupt frame is encountered.
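The chunking step confirmed above ("Yes, exactly") can be sketched as follows; names are hypothetical, and the per-chunk compression and outer frame headers are omitted:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the large-message path: a message whose uncompressed size exceeds
// the frame limit is split into <=128 KiB chunks; each chunk would then be
// compressed independently and wrapped in its own non-self-contained frame.
final class LargeMessageSplitter {
    static final int MAX_PAYLOAD = 128 * 1024;

    static List<byte[]> split(byte[] message) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < message.length; off += MAX_PAYLOAD) {
            int len = Math.min(MAX_PAYLOAD, message.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(message, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```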


was (Author: beobal):
>  find out that the _uncompressed_ size is above the limit, split it into 128 
>KiB chunks, then compress each chunk separately and wrap it into its own 
>(uncontained) outer frame?

Yes, exactly.

>  uncontained frames are never recoverable when compression is enabled

Large messages are never recoverable when a corrupt payload is detected. 
Self-contained frames *could* be recoverable as long as only the payload is 
corrupt, but like we discussed earlier this is more complicated here than in 
the internode case due to the payload potentially containing multiple stream 
ids, so we may as well close the connection whenever a corrupt frame is 
encountered.

>
> CASSANDRA-13304 made an important improvement to our native protocol: it 
> introduced checksumming/CRC32 to request and response bodies. It’s an 
> important step forward, but it doesn’t cover the entire stream. In 
> particular, the message header is not covered by a checksum or a crc, which 
> poses a correctness issue if, for example, {{streamId}} gets corrupted.
> Additionally, we aren’t quite using CRC32 correctly, in two ways:
> 1. We are calculating the CRC32 of the *decompressed* value instead of 
> computing the CRC32 on the bytes written on the wire - losing the properties 
> of the CRC32. In some cases, due to this sequencing, attempting to decompress 
> a corrupt stream can cause a segfault by LZ4.
> 2. When using CRC32, the CRC32 value is written in the incorrect byte order, 
> also losing some of the protections.
> See https://users.ece.cmu.edu/~koopman/pubs/KoopmanCRCWebinar9May2012.pdf for 
> explanation for the two points above.
> Separately, there are some long-standing issues with the protocol - since 
> *way* before CASSANDRA-13304. Importantly, both checksumming and compression 
> operate on individual message bodies rather than frames of multiple complete 
> messages. In reality, this has several important additional downsides. To 
> name a couple:
> # For compression, we are getting poor compression ratios for smaller 
> messages - when operating on tiny sequences of bytes. In reality, for most 
> small requests and responses we are discarding the compressed value as it’d 
> be smaller than the uncompressed one - incurring both redundant allocations 
> and compressions.
> # For checksumming and CRC32 we pay a high overhead price for small messages. 
> 4 bytes extra is *a lot* for an empty write response, for example.
> To address the correctness issue of {{streamId}} not being covered by the 
> checksum/CRC32 and the inefficiency in compression and checksumming/CRC32, we 
> should switch to a framing protocol with multiple messages in a single frame.
> I suggest we reuse the framing protocol recently implemented for internode 
> messaging in CASSANDRA-15066 to the extent that its logic can be borrowed, 
> and that we do it before native protocol v5 graduates from beta. See 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderCrc.java
>  and 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/FrameDecoderLZ4.java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15299) CASSANDRA-13304 follow-up: improve checksumming and compression in protocol v5-beta

2020-02-28 Thread Jorge Bay (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047432#comment-17047432
 ] 

Jorge Bay edited comment on CASSANDRA-15299 at 2/28/20 10:27 AM:
-

Let me know if I can help in any way with this ticket (early review / tests).


was (Author: jorgebg):
Let me know if I can help in anyway with this ticket (early review / tests).



