Michaël Figuière created CASSANDRA-15049:
--------------------------------------------

             Summary: Requests blocked at NTR stage should be rejected
                 Key: CASSANDRA-15049
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
             Project: Cassandra
          Issue Type: Bug
            Reporter: Michaël Figuière


CASSANDRA-11363 showed that if the NTR stage's thread pool and queue 
are full, the Netty Event Loops may block waiting on the NTR queue. The 
fix adopted in CASSANDRA-11363 was to increase the default queue 
size from 128 to 1024. This significantly reduced the number of blocked 
requests observed but did not eliminate the problem. Whenever a Netty 
Event Loop is blocked, the responsiveness of Cassandra is significantly 
impacted, so it seems inappropriate to rely solely on raising this queue size 
until everything looks fine, which only holds for the load seen at the time 
the tuning was done.

In fact, this situation looks exactly like the definition of the {{Overloaded}} 
error of the CQL Protocol:
{code:java}
0x1001 Overloaded: the request cannot be processed because the
        coordinator node is overloaded{code}
Therefore, whenever a request can't make it to the NTR stage, it should be 
rejected and an {{Overloaded}} error returned to the client. This can be done 
at low cost, as we're already in the Netty Event Loop owning the channel to 
that client.

It would then be the client's responsibility to retry with another coordinator, 
which is likely to lead to a better P99 latency than blocking on a queue that 
is already too long.
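
To illustrate the idea, here is a minimal sketch using only the JDK's 
{{ThreadPoolExecutor}}. It is not Cassandra's actual code: the class name 
{{NtrAdmissionSketch}}, the {{submit}} method and the {{ACCEPTED}} constant are 
hypothetical, and only the 0x1001 error code comes from the protocol text 
quoted above. It assumes a bounded stage whose rejection policy throws rather 
than blocks, so the submitting thread (standing in for the Netty Event Loop) 
always returns immediately.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Cassandra code: a bounded stage that rejects
// work instead of blocking the submitting (event loop) thread.
final class NtrAdmissionSketch {
    // Error code of the CQL protocol's Overloaded error.
    static final int OVERLOADED = 0x1001;
    // Placeholder meaning "request accepted"; not a protocol value.
    static final int ACCEPTED = 0;

    final ThreadPoolExecutor stage;

    NtrAdmissionSketch(int threads, int queueSize) {
        // AbortPolicy throws RejectedExecutionException when both the pool
        // and the queue are full, so execute() never blocks the caller.
        stage = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize),
            new ThreadPoolExecutor.AbortPolicy());
    }

    // Called from the event loop thread: returns immediately either way.
    int submit(Runnable request) {
        try {
            stage.execute(request);
            return ACCEPTED;
        } catch (RejectedExecutionException e) {
            // A real server would write an Overloaded error frame to the
            // client's channel here instead of returning a code.
            return OVERLOADED;
        }
    }

    public static void main(String[] args) {
        NtrAdmissionSketch ntr = new NtrAdmissionSketch(1, 1);
        CountDownLatch release = new CountDownLatch(1);
        // A request that stays busy until the latch is released.
        Runnable slow = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println(ntr.submit(slow));            // takes the only thread
        System.out.println(ntr.submit(slow));            // fills the queue
        System.out.printf("0x%04X%n", ntr.submit(slow)); // rejected, not blocked
        release.countDown();
        ntr.stage.shutdown();
    }
}
```

With one thread and a queue of size one, the third submission is rejected 
right away rather than stalling the caller, which is the behavior proposed 
here for the Netty Event Loop.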




