Michaël Figuière created CASSANDRA-15049:
--------------------------------------------
Summary: Requests blocked at NTR stage should be rejected
Key: CASSANDRA-15049
URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
Project: Cassandra
Issue Type: Bug
Reporter: Michaël Figuière
CASSANDRA-11363 showed that when the thread pool and queue of the NTR
(Native-Transport-Requests) stage are full, the Netty Event Loops may block
waiting on the NTR queue. The fix adopted in CASSANDRA-11363 was to increase
the default queue size from 128 to 1024. This significantly reduced the number
of blocked requests observed but did not eliminate the problem. Whenever a
Netty Event Loop is blocked, Cassandra's responsiveness suffers significantly,
so it seems inappropriate to rely solely on increasing this queue size until
everything looks fine... at the time the tuning was done.
In fact, this situation looks exactly like the definition of the {{Overloaded}}
error of the CQL Protocol:
{noformat}
0x1001 Overloaded: the request cannot be processed because the
coordinator node is overloaded
{noformat}
Therefore, whenever a request can't be enqueued on the NTR stage, it should be
rejected and an {{Overloaded}} error returned to the client. This can be done at
low cost because we're already on the Netty Event Loop that owns the channel to
that client. It would then be the client's responsibility to retry with another
coordinator, which is likely to yield better P99 latency than blocking on a
queue that is already too long.
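The proposed behaviour can be illustrated with a minimal sketch. It does not use Cassandra's or Netty's actual classes: a plain {{java.util.concurrent.ThreadPoolExecutor}} with a bounded queue stands in for the NTR stage, and the class name {{OverloadRejectionSketch}}, the {{dispatch}} method and the {{OK}} constant are hypothetical names chosen for illustration. Only the {{0x1001}} error code comes from the protocol definition quoted above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the NTR stage; not Cassandra's real implementation.
final class OverloadRejectionSketch {
    static final int OK = 0;               // illustrative "accepted" marker
    static final int OVERLOADED = 0x1001;  // CQL protocol Overloaded error code

    private final ThreadPoolExecutor stage;

    OverloadRejectionSketch(int threads, int queueSize) {
        // AbortPolicy makes execute() throw instead of blocking the caller
        // when both the worker threads and the bounded queue are full.
        stage = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Meant to be called from the event loop thread: it never waits on the queue.
    int dispatch(Runnable request) {
        try {
            stage.execute(request);
            return OK;
        } catch (RejectedExecutionException e) {
            // Fail fast so the caller can answer the client with Overloaded.
            return OVERLOADED;
        }
    }

    void shutdown() {
        stage.shutdownNow();
    }

    public static void main(String[] args) {
        OverloadRejectionSketch sketch = new OverloadRejectionSketch(1, 1);
        CountDownLatch gate = new CountDownLatch(1);
        // A request that stays in flight until the gate is opened.
        Runnable slow = () -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        int first = sketch.dispatch(slow);   // handed to the only worker thread
        int second = sketch.dispatch(slow);  // occupies the single queue slot
        int third = sketch.dispatch(slow);   // no capacity left: rejected

        System.out.println("first accepted:  " + (first == OK));
        System.out.println("second accepted: " + (second == OK));
        System.out.println("third rejected:  " + (third == OVERLOADED));

        gate.countDown();
        sketch.shutdown();
    }
}
```

The point of the design is that the dispatching thread either enqueues the request or learns immediately that it cannot, so it stays free to serve the other channels it owns. How the error would actually be written back on the channel is left out here.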