[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789978#comment-16789978
 ] 

Sumanth Pasupuleti commented on CASSANDRA-15049:
------------------------------------------------

On a closely related front, I am working on a patch for 
https://issues.apache.org/jira/browse/CASSANDRA-15013 (almost done with the 
patch; writing unit tests now).
It tackles exactly the same issue: preventing event loop threads from blocking 
while they try to enqueue on the NTR queue. The patch adds an option to either 
throw OverloadedException or apply backpressure on the channel. More in 
CASSANDRA-15013.
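
To illustrate the "throw OverloadedException" option, here is a minimal sketch 
of a non-blocking enqueue. It is not the CASSANDRA-15013 patch: the class 
NtrAdmissionSketch, the method submitOrReject and the local OverloadedException 
type are hypothetical names, and a plain ArrayBlockingQueue stands in for the 
NTR queue. The point is only that offer() returns immediately when the queue is 
full, where put() would block the calling (event loop) thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class NtrAdmissionSketch {
    /** Hypothetical stand-in for the CQL protocol's Overloaded (0x1001) error. */
    static class OverloadedException extends RuntimeException {
        OverloadedException(String message) {
            super(message);
        }
    }

    // Bounded queue standing in for the NTR stage's request queue (assumption).
    private final BlockingQueue<Runnable> queue;

    NtrAdmissionSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Enqueues without blocking. offer() returns false right away when the
     * queue is full, so the caller can reject the request instead of waiting.
     */
    void submitOrReject(Runnable request) {
        if (!queue.offer(request)) {
            throw new OverloadedException("request queue is full");
        }
    }

    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        NtrAdmissionSketch stage = new NtrAdmissionSketch(2);
        stage.submitOrReject(() -> {});
        stage.submitOrReject(() -> {});
        try {
            // Third request exceeds the capacity of 2 and is rejected.
            stage.submitOrReject(() -> {});
            System.out.println("accepted");
        } catch (OverloadedException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The backpressure option is not shown here; it would pause reads on the channel 
rather than reject the request.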

> Requests blocked at NTR stage should be rejected
> ------------------------------------------------
>
>                 Key: CASSANDRA-15049
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michaël Figuière
>            Priority: Major
>
> CASSANDRA-11363 has emphasized that if the NTR stage's thread pool and queue 
> are full, the Netty Event Loops may block waiting on the NTR queue. The 
> solution that was brought in CASSANDRA-11363 was to increase the default 
> queue size from 128 to 1024. This significantly reduced the number of blocked 
> requests observed but hasn't removed the problem entirely. Whenever a Netty 
> Event Loop is blocked, the responsiveness of Cassandra is significantly 
> impacted so it seems inappropriate to rely solely on increasing this queue 
> size until everything looks fine... at the time the tuning was done.
> In fact, this situation looks exactly like the definition of the 
> {{Overloaded}} error of the CQL Protocol:
> {code:java}
> 0x1001 Overloaded: the request cannot be processed because the
>       coordinator node is overloaded{code}
> Therefore, whenever a request can't make it to the NTR stage, it should be 
> rejected with an {{Overloaded}} error to the client. This can be done at low 
> cost as we're already in the Netty Event Loop owning the channel to that 
> client.
> It would then be the client's responsibility to retry with another 
> coordinator, which is likely to lead to a better P99 latency than blocking 
> on a queue that is already too long.


