Jon Meredith created CASSANDRA-16616:
----------------------------------------

             Summary: Harden internode message resource limit accounting 
against serialization failures
                 Key: CASSANDRA-16616
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16616
             Project: Cassandra
          Issue Type: Bug
          Components: Messaging/Internode
            Reporter: Jon Meredith
            Assignee: Jon Meredith


If the internode messaging exception recovery code fails and cannot correctly 
adjust the resource limits for an OutboundConnection, the error is not confined 
to that connection. It also affects the other connection types that share the 
same OutboundConnections, and any of those connections could then hit 
{{assert using >= 0;}} in
{{org.apache.cassandra.net.ResourceLimits.Concurrent#release}}.
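
To illustrate the failure mode, here is a minimal sketch of a shared capacity 
counter. It is not the Cassandra implementation: {{SketchLimit}} and its 
methods are hypothetical names, and it throws an explicit exception where the 
real code uses an assert, so the behaviour does not depend on the -ea flag.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a capacity limit shared by several connections.
final class SketchLimit {
    private final long limit;
    private final AtomicLong using = new AtomicLong();

    SketchLimit(long limit) {
        this.limit = limit;
    }

    // Reserve capacity; returns false if the limit would be exceeded.
    boolean tryAllocate(long amount) {
        while (true) {
            long current = using.get();
            long next = current + amount;
            if (next > limit)
                return false;
            if (using.compareAndSet(current, next))
                return true;
        }
    }

    // Return capacity; a negative balance means the accounting is corrupt.
    void release(long amount) {
        long remaining = using.addAndGet(-amount);
        if (remaining < 0)
            throw new IllegalStateException("released more than allocated: using=" + remaining);
    }

    long using() {
        return using.get();
    }
}
```

Because the counter is shared, one connection releasing an amount it never 
reserved (or releasing it twice) can leave the balance too low, so the failure 
may surface later when a different connection calls release.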

While it is possible to modify all of the outbound connection code to 
re-initialize all of the connections with a correct limit, the effort to test 
and maintain the recovery code seems too high for something that should "never 
happen" (except it did once, which is why it needs hardening).  The safer 
option is to kill the JVM and have whatever external monitoring is in place 
restart the instance in a known good state.
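
A rough sketch of the fail-fast approach follows. {{AccountingGuard}} is a 
hypothetical name, not an existing Cassandra class; the termination action is 
injected so it can be replaced in tests, and the default shown here assumes 
{{Runtime.halt}} is an acceptable way to stop the process.

```java
// Hypothetical wrapper: if recovery itself fails, stop the process instead of
// continuing with resource limits in an unknown state.
final class AccountingGuard {
    private final Runnable killer;

    AccountingGuard(Runnable killer) {
        this.killer = killer;
    }

    // Assumed default: halt immediately so external monitoring can restart
    // the instance in a known good state.
    static AccountingGuard haltingJvm() {
        return new AccountingGuard(() -> Runtime.getRuntime().halt(1));
    }

    void run(Runnable recovery) {
        try {
            recovery.run();
        } catch (Throwable t) {
            // Report before terminating so the cause is not lost.
            System.err.println("Resource limit recovery failed, terminating: " + t);
            killer.run();
        }
    }
}
```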

Additionally, the logging for dropped outbound messages (those that have 
expired or cannot be serialized) takes place after the recovery handling logic. 
If the recovery logic itself throws an exception, the dropped message is never 
logged for future diagnosis. Logging should take place first, followed by 
releasing capacity and handling the expiration or serialization failure.
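
The proposed ordering can be sketched as below. {{DropHandler}} and 
{{onDropped}} are hypothetical names used only for illustration, and an 
in-memory list stands in for the real logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical handler showing log-first ordering for dropped messages.
final class DropHandler {
    // Stand-in for a real logger so the ordering can be observed.
    final List<String> log = new ArrayList<>();

    void onDropped(String message, String reason, Runnable releaseCapacity) {
        // 1. Record the drop first, so it survives a failure in step 2.
        log.add("Dropping " + message + ": " + reason);
        // 2. Only then adjust the resource accounting, which may throw.
        releaseCapacity.run();
    }
}
```

With this ordering, an exception thrown while releasing capacity still leaves a 
log entry describing the message that was dropped.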

Discovered on a branch modified for testing that threw an exception in the 
Verb.serializeSize method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)