[
https://issues.apache.org/jira/browse/CASSANDRA-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333301#comment-17333301
]
Sam Tunnicliffe commented on CASSANDRA-16581:
---------------------------------------------
Currently, catching a {{ProtocolException}} in the pipeline's exception
handler is supposed to close the channel, forcing the client to reconnect. (For
v5, this is {{o.a.c.t.ExceptionHandlers.ExceptionHandler}}; for v4 and
lower, it is {{o.a.c.t.PreV5Handlers.ExceptionHandler}} in trunk and
{{o.a.c.t.Message.ExceptionHandler}} in prior versions.)
{code}
// On protocol exception, close the channel as soon as the message have been sent
if (cause instanceof ProtocolException)
    future.addListener((ChannelFutureListener) f -> ctx.close());
{code}
However, many if not most instances of {{ProtocolException}} are actually
wrapped in a {{WrappedException}} at this point, so few actually trigger
this condition. David's patches change this, but we spoke offline and
agreed that the change should be reverted for v4 and lower in 3.0, 3.11 and
trunk, as reconnections can be expensive, especially on the server side when
auth is enabled.
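To make the wrapping issue concrete, here is a minimal sketch. The exception classes are simplified stand-ins, not the real Cassandra types, and {{CauseCheck}} with its two methods is a hypothetical helper invented for illustration. It only shows why a direct {{instanceof}} test misses wrapped instances, and what walking the cause chain would look like; it is not a description of what any patch actually does.

```java
// Simplified stand-ins for the real exception types (assumptions for illustration).
class SketchProtocolException extends RuntimeException {
    SketchProtocolException(String msg) { super(msg); }
}

class SketchWrappedException extends RuntimeException {
    SketchWrappedException(Throwable cause) { super(cause); }
}

class CauseCheck {
    // Mirrors the existing check: matches only if the exception itself is a protocol error.
    static boolean directCheck(Throwable t) {
        return t instanceof SketchProtocolException;
    }

    // Hypothetical alternative: walk the cause chain so wrapped instances also match.
    static boolean unwrappedCheck(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SketchProtocolException)
                return true;
        }
        return false;
    }
}
```

With a {{SketchProtocolException}} wrapped in a {{SketchWrappedException}}, {{directCheck}} returns false and {{unwrappedCheck}} returns true, which is the difference that decides whether the channel gets closed.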
As David mentioned, this is also slightly more tricky in v5, as a frame can
contain envelopes for multiple streams. In the case of a fatal error (one which
renders the entire frame unusable), the server is not able to notify the client
of the stream ids present in the frame. To avoid causing a wave of client side
timeouts, we decided to fail fast and close the client connection if any
protocol error is detected.
{quote}
From a client point of view, a dropped frame will result in request timeouts.
We have no way of providing a better error, since the stream ids of the failed
requests are in the corrupt payload. I'm wondering if it might not be better
to drop the connection all the time: at least the client gets immediate
feedback (we could try to propagate a cause), instead of a bunch of requests
timing out for no apparent reason.
[1|https://issues.apache.org/jira/browse/CASSANDRA-15299?focusedCommentId=17099447&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17099447].
{quote}
However, many protocol errors are not fatal; examples include invalid
consistency levels, illegal statements in a BATCH, incorrect opcode. In these
cases, the server _may_ be able to respond appropriately to the invalid
envelope and continue processing the remainder of the frame.
I've pushed a couple of v5 specific commits
[here|https://github.com/beobal/cassandra/tree/16581-trunk-v5]. The general
idea is to attempt to differentiate between so-called protocol errors from
which the server can recover and those from which it can't. With this in mind,
the message processor will return an error response the first time it
encounters a protocol exception, but only terminate the connection if it
immediately encounters a second error on the very next envelope in the frame.
The reason for failing only on consecutive errors is that any individual error
may be recoverable. For instance, a client could send a frame with 100
envelopes and every other one might have some recoverable corruption. A run of
consecutive errors in the same frame is a heuristic for identifying
non-recoverable corruption, and while not perfect, it seems fairly reasonable
to me.
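As a rough illustration of the consecutive-error heuristic (not the code in the linked branch), the sketch below models each envelope in a frame as a boolean: true if it decodes cleanly, false if it raises a protocol error. The class {{FrameProcessorSketch}}, the {{Outcome}} enum and the {{process}} method are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class FrameProcessorSketch {
    enum Outcome { PROCESSED, ERROR_RESPONSE, CLOSE_CONNECTION }

    // Each element stands for one envelope in a single frame:
    // true = decodes cleanly, false = raises a protocol error.
    static List<Outcome> process(boolean[] envelopeOk) {
        List<Outcome> outcomes = new ArrayList<>();
        boolean previousFailed = false;
        for (boolean ok : envelopeOk) {
            if (ok) {
                outcomes.add(Outcome.PROCESSED);
                previousFailed = false; // a good envelope resets the run of errors
            } else if (previousFailed) {
                // second error in a row: treat the frame as unrecoverable and stop
                outcomes.add(Outcome.CLOSE_CONNECTION);
                break;
            } else {
                // first error: answer with an error response and keep going
                outcomes.add(Outcome.ERROR_RESPONSE);
                previousFailed = true;
            }
        }
        return outcomes;
    }
}
```

Under this model, a frame in which every other envelope is bad is processed to the end with error responses for the bad ones, while two bad envelopes in a row end processing and close the connection.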
An exception to this rule is if the body length advertised in the envelope
header is invalid (i.e. < 0). In this case, the message processor is unable to
even attempt to skip over the message, so it throws and closes the connection
immediately.
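The reason a bad body length is fatal can be sketched as follows. This assumes the 9-byte envelope header layout from the native protocol spec (version, flags, 2-byte stream id, opcode, 4-byte body length); {{EnvelopeSkipSketch}} and {{nextEnvelopeStart}} are hypothetical names, and {{IllegalStateException}} stands in for whatever the real processor throws.

```java
import java.nio.ByteBuffer;

class EnvelopeSkipSketch {
    // Assumed header layout: version(1) + flags(1) + stream(2) + opcode(1) + body length(4).
    static final int HEADER_LENGTH = 9;
    static final int BODY_LENGTH_OFFSET = 5;

    // Given an envelope starting at `start`, returns the position of the next envelope.
    // Skipping a bad envelope depends entirely on trusting its advertised body length.
    static int nextEnvelopeStart(ByteBuffer frame, int start) {
        int bodyLength = frame.getInt(start + BODY_LENGTH_OFFSET); // absolute read, big-endian by default
        if (bodyLength < 0)
            throw new IllegalStateException("Invalid body length " + bodyLength);
        return start + HEADER_LENGTH + bodyLength;
    }
}
```

If the length is negative there is no meaningful position to skip to, so the only safe option is to give up on the rest of the frame.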
> Failure to execute queries should emit a KPI other than read
> timeout/unavailable so it can be alerted/tracked
> -------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16581
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16581
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client, Observability/Metrics
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-rc
>
>
> When we are unable to parse a message we do not have a way to detect this
> from a monitoring point of view so can get into situations where we believe
> the database is fine but the clients are on-fire. This case popped up in the
> 2.1 to 3.0 upgrade as paging state wasn’t mixed-mode safe.