[ https://issues.apache.org/jira/browse/CASSANDRA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865565#comment-15865565 ]
Stefan Podkowinski commented on CASSANDRA-13126:
------------------------------------------------
bq. Connections should be closed after a DecoderException
We _could_ do this easily, but are you sure this exception is non-recoverable?
Will all streams be affected? If we do close on this error, we'd have to close
the whole connection, as we can't signal the error to individual streams
without the stream_id from the frame that couldn't be decoded. Wouldn't
frequently reconnecting clients cause more memory pressure in this case and
further escalate the issue?
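
To make the stream_id point concrete, here is a minimal sketch of reading the 9-byte v3/v4 frame header (version, flags, stream, opcode, length). The class and method names are hypothetical and this is not Cassandra's decoder. It also shows one plausible way a version such as 54 can turn up: if the decoder loses frame alignment, it reads a payload byte as the version byte, and 54 is simply ASCII '6'.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical, not taken from Cassandra.
final class FrameHeaderSketch {
    // v3/v4 header layout: version(1) flags(1) stream(2) opcode(1) length(4)
    static final int HEADER_LENGTH = 9;

    static final class Header {
        final int version;
        final boolean response;
        final int streamId;
        final int opcode;
        final int bodyLength;

        Header(int version, boolean response, int streamId, int opcode, int bodyLength) {
            this.version = version;
            this.response = response;
            this.streamId = streamId;
            this.opcode = opcode;
            this.bodyLength = bodyLength;
        }
    }

    // Reads one header at the buffer's current position (big-endian, like the protocol).
    static Header parse(ByteBuffer buf) {
        if (buf.remaining() < HEADER_LENGTH) {
            throw new IllegalArgumentException("need at least " + HEADER_LENGTH + " bytes");
        }
        int first = buf.get() & 0xFF;   // top bit is the direction, low 7 bits the version
        buf.get();                      // flags, not needed for this sketch
        int streamId = buf.getShort();  // only known once the header has been read
        int opcode = buf.get() & 0xFF;
        int bodyLength = buf.getInt();
        return new Header(first & 0x7F, (first & 0x80) != 0, streamId, opcode, bodyLength);
    }

    // Builds a v4 request frame on stream 42. The body is arbitrary bytes starting
    // with ASCII '6' (byte value 54); it is not a valid QUERY body.
    static ByteBuffer sampleFrame() {
        byte[] body = "6789abcdef".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + body.length);
        buf.put((byte) 0x04);           // version 4, request direction
        buf.put((byte) 0x00);           // flags
        buf.putShort((short) 42);       // stream id
        buf.put((byte) 0x07);           // opcode QUERY
        buf.putInt(body.length);
        buf.put(body);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer frame = sampleFrame();
        Header aligned = parse(frame.duplicate());
        System.out.println("aligned:    version=" + aligned.version + " stream=" + aligned.streamId);

        // Simulate lost alignment: start "decoding" at the body instead of the header.
        ByteBuffer shifted = frame.duplicate();
        shifted.position(HEADER_LENGTH);
        Header misaligned = parse(shifted);
        System.out.println("misaligned: version=" + misaligned.version + " stream=" + misaligned.streamId);
    }
}
```

In the misaligned case the "stream" value is just more payload bytes, so an error response cannot be routed to the request that actually failed. That is the reason closing would have to happen at the connection level.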
> native transport protocol corruption when using SSL
> ---------------------------------------------------
>
> Key: CASSANDRA-13126
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13126
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Tom van der Woerdt
> Priority: Critical
>
> This is a series of conditions that can result in client connections becoming
> unusable.
> 1) Cassandra GC must be well-tuned, to have short GC pauses every minute or so
> 2) *client* SSL must be enabled and transmitting a significant amount of data
> 3) Cassandra must run with the default library versions
> 4) -XX:+DisableExplicitGC must be set (this is the default in the current
> cassandra-env.sh)
> This ticket relates to CASSANDRA-13114, which is a possible workaround (but
> not a fix) for the SSL condition required to trigger this bug.
> * Netty allocates nio.ByteBuffers for every outgoing SSL message.
> * These ByteBuffers consist of two parts: a small JVM object and an off-heap
> allocation. The JVM object is collected in regular GC cycles, and the
> off-heap allocation is freed only when that JVM object is collected. To
> avoid unbounded native memory use, the JVM by default limits off-heap buffer
> allocation to the max heap size. Allocating beyond that limit triggers a
> System.gc(), a retry, and potentially an exception.
> * System.gc() is a no-op under DisableExplicitGC.
> * This means ByteBuffer allocation is likely to throw an exception when too
> many buffers are being allocated.
> * The Netty version shipped in Cassandra is broken when using SSL (see
> CASSANDRA-13114) and allocates far too many ByteBuffers.
> This gets more complicated though.
> When /some/ clients use SSL and others don't, the clients not using SSL can
> still be affected by this bug, as ByteBuffer starvation caused by SSL will
> leak to other users.
> ByteBuffers are used very early on in the native protocol as well. Before
> even being able to decode the network protocol, this error can be thrown:
> {noformat}
> io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Direct
> buffer memory
> {noformat}
> Note that this comes back with stream_id 0, so clients end up waiting for the
> client timeout before the query is considered failed and retried.
> A few frames later on the same connection, this appears:
> {noformat}
> Provided frame does not appear to be Snappy compressed
> {noformat}
> And after that everything errors out with:
> {noformat}
> Invalid or unsupported protocol version (54); the lowest supported version is
> 3 and the greatest is 4
> {noformat}
> So this bug ultimately affects the binary protocol and the connection becomes
> useless if not downright dangerous.
> I think there are several things that need to be done here.
> * CASSANDRA-13114 should be fixed (easy, and probably needs to land in 3.0.11
> anyway)
> * Connections should be closed after a DecoderException
> * DisableExplicitGC should be removed from the default JVM arguments
> Any of these three would limit the impact to clients.
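
The allocation behaviour described in the quoted report (limit reached, System.gc(), retry, exception) can be sketched as a toy model. This is not the JDK's implementation, and every name in it is hypothetical. It only mirrors the sequence from the description, to show why disabling explicit GC turns a recoverable situation into an OutOfMemoryError until the next regular GC cycle.

```java
// Toy model of the direct-memory budget; not the JDK's code. All names are hypothetical.
final class DirectMemoryBudgetSketch {
    private final long maxBytes;              // stands in for the direct-memory limit
    private final boolean explicitGcDisabled; // stands in for DisableExplicitGC
    private long reserved;                    // bytes currently counted against the limit
    private long unreachable;                 // bytes whose small JVM objects are garbage, not yet collected

    DirectMemoryBudgetSketch(long maxBytes, boolean explicitGcDisabled) {
        this.maxBytes = maxBytes;
        this.explicitGcDisabled = explicitGcDisabled;
    }

    // Reserve off-heap bytes: try, request a GC, retry once, then fail.
    void reserve(long size) {
        if (tryReserve(size)) return;
        explicitGc();
        if (tryReserve(size)) return;
        throw new OutOfMemoryError("Direct buffer memory");
    }

    // The application dropped a buffer; its off-heap part stays reserved until a GC runs.
    void drop(long size) {
        unreachable = Math.min(reserved, unreachable + size);
    }

    // A regular GC cycle collects the small JVM objects and frees their off-heap parts.
    void regularGcCycle() {
        reserved -= unreachable;
        unreachable = 0;
    }

    long reserved() {
        return reserved;
    }

    // Models System.gc(): a no-op when explicit GC is disabled.
    private void explicitGc() {
        if (!explicitGcDisabled) regularGcCycle();
    }

    private boolean tryReserve(long size) {
        if (size < 0 || reserved + size > maxBytes) return false;
        reserved += size;
        return true;
    }
}
```

With a limit of 100, reserving 60, dropping it, and reserving 60 again succeeds when explicit GC is allowed, and throws when it is disabled, until regularGcCycle() runs. That matches the report's condition that GC pauses are short and infrequent.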