[jira] [Commented] (CASSANDRA-13126) native transport protocol corruption when using SSL

Stefan Podkowinski (JIRA) Tue, 14 Feb 2017 02:48:57 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865565#comment-15865565
 ]


Stefan Podkowinski commented on CASSANDRA-13126:
------------------------------------------------

bq. Connections should be closed after a DecoderException

We _could_ do this easily. But are you sure this exception is non-recoverable? 
Will all streams be affected? If we do, we'd have to close the whole 
connection, as we can't signal the error to individual streams without the 
stream_id in the frame that can't be decoded. Wouldn't frequently reconnecting 
clients possibly cause more memory pressure in this case and further escalate 
the issue?
 

> native transport protocol corruption when using SSL
> ---------------------------------------------------
>
>                 Key: CASSANDRA-13126
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13126
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Tom van der Woerdt
>            Priority: Critical
>
> This is a series of conditions that can result in client connections becoming 
> unusable.
> 1) Cassandra GC must be well-tuned, to have short GC pauses every minute or so
> 2) *client* SSL must be enabled and transmitting a significant amount of data
> 3) Cassandra must run with the default library versions
> 4) disableexplicitgc must be set (this is the default in the current 
> cassandra-env.sh)
> This ticket relates to CASSANDRA-13114 which is a possible workaround (but 
> not a fix) for the SSL requirement to trigger this bug.
> * Netty allocates nio.ByteBuffers for every outgoing SSL message.
> * ByteBuffers consist of two parts, the jvm object and the off-heap object. 
> The jvm object is small and goes with regular GC cycles, the off-heap object 
> gets freed only when the small jvm object is freed. To avoid exploding the 
> native memory use, the jvm defaults to limiting its allocation to the max 
> heap size. Allocating beyond that limit triggers a System.gc(), a retry, and 
> potentially an exception.
> * System.gc is a no-op under disableexplicitgc
> * This means ByteBuffers are likely to throw an exception when too many 
> objects are being allocated
> * The netty version shipped in Cassandra is broken when using SSL (see 
> CASSANDRA-13114) and causes significantly too many bytebuffers to be 
> allocated.
> This gets more complicated though.
> When /some/ clients use SSL, and others don't, the clients not using SSL can 
> still be affected by this bug, as bytebuffer starvation caused by ssl will 
> leak to other users.
> ByteBuffers are used very early on in the native protocol as well. Before 
> even being able to decode the network protocol, this error can be thrown :
> {noformat}
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct 
> buffer memory
> {noformat}
> Note that this comes back with stream_id 0, so clients end up waiting for the 
> client timeout before the query is considered failed and retried.
> A few frames later on the same connection, this appears:
> {noformat}
> Provided frame does not appear to be Snappy compressed
> {noformat}
> And after that everything errors out with:
> {noformat}
> Invalid or unsupported protocol version (54); the lowest supported version is 
> 3 and the greatest is 4
> {noformat}
> So this bug ultimately affects the binary protocol and the connection becomes 
> useless if not downright dangerous.
> I think there are several things that need to be done here.
> * CASSANDRA-13114 should be fixed (easy, and probably needs to land in 3.0.11 
> anyway)
> * Connections should be closed after a DecoderException
> * DisableExplicitGC should be removed from the default JVM arguments
> Any of these three would limit the impact to clients.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (CASSANDRA-13126) native transport protocol corruption when using SSL

Reply via email to