[ 
https://issues.apache.org/jira/browse/CASSANDRA-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15780125#comment-15780125
 ] 

Corentin Chary commented on CASSANDRA-13039:
--------------------------------------------

We did some additional debugging:
Most of the time *seems* to be spent in "signalNotEmpty()" (in 
LinkedBlockingQueue) when it's trying to unpark one of the reader threads. 
Looking at the metrics it seems that the backlog is always empty, and that the 
system is doing a *lot* of context switches to ensure that. A workaround that 
seems to work is to set otc_coalescing_window_us (and otc_coalescing_strategy) 
to make sure that the backlog doesn't stay empty.
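
To illustrate why a coalescing window helps, here is a minimal sketch of a consumer that pauses briefly before draining, so that producers mostly append to a non-empty queue and rarely need to wake a parked reader. This is not the code in CoalescingStrategies; the class name, method name and parameters (CoalescingSketch, coalesce, maxItems, windowMicros) are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Cassandra's implementation.
final class CoalescingSketch {
    /**
     * Collects up to maxItems messages into out. Blocks for the first
     * message if the backlog is empty, then sleeps for windowMicros so that
     * more messages can pile up, and drains whatever arrived meanwhile.
     */
    static <T> int coalesce(BlockingQueue<T> backlog, List<T> out,
                            int maxItems, long windowMicros) {
        try {
            if (backlog.drainTo(out, maxItems) == 0) {
                out.add(backlog.take()); // park until a producer signals
            }
            // While we sleep here we are not waiting on the queue, so
            // producers have no reader to unpark.
            TimeUnit.MICROSECONDS.sleep(windowMicros);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return out.size();
        }
        backlog.drainTo(out, maxItems - out.size());
        return out.size();
    }

    public static void main(String[] args) {
        BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            backlog.add("msg-" + i);
        }
        List<String> batch = new ArrayList<>();
        int n = coalesce(backlog, batch, 128, 200);
        System.out.println(n + " messages handled in one wake-up");
    }
}
```

The trade-off is added latency of up to one window per batch, which is presumably why the window is tunable.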

> Mutation time mostly spent in LinkedBlockingQueue.put()
> -------------------------------------------------------
>
>                 Key: CASSANDRA-13039
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13039
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Corentin Chary
>         Attachments: mutation-linkedlist-block.png, profiler-snapshot.nps
>
>
> On a setup with a sustained write load of 70kQPS per node and a RF of 2 it 
> looks like most of the mutation time is spent in 
> OutboundTcpConnection.enqueue() -> backlog.put().
> backlog is an unbounded LinkedBlockingQueue, which means that .put() can only 
> be blocking if a lock is taken. I strongly suspect that this is caused by the 
> use of drainTo() in CoalescingStrategies which is causing contention for the 
> producers.
> On the other hand, not using drainTo() could lead to starvation of the 
> consumers.
> Possible solutions:
> - Allow multiple connections per size and per host in 
> OutboundTcpConnectionPool
> - Switch from drainTo to multiple take()
> - Switch to ConcurrentLinkedQueue (which is lock-free); this also means we 
> need active polling.
> Maybe a good solution would be something hybrid: a bounded 
> LinkedBlockingQueue and an unbounded ConcurrentLinkedQueue. This way you get 
> low latency when you don't have a lot of messages, and throughput when you do.
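
As a rough illustration of the ConcurrentLinkedQueue option, the sketch below shows producers enqueueing without taking a lock and a consumer that polls actively, backing off briefly when it finds nothing. All names (PollingBacklogSketch, enqueue, pollBatch, idleNanos) are hypothetical and this is not a patch against OutboundTcpConnection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of a lock-free backlog with an actively polling consumer.
final class PollingBacklogSketch<T> {
    private final Queue<T> queue = new ConcurrentLinkedQueue<>();

    /** Producer side: lock-free, never blocks and never signals a consumer. */
    void enqueue(T message) {
        queue.offer(message);
    }

    /**
     * Consumer side: polls up to maxItems messages. If the queue is empty,
     * backs off for about idleNanos and tries once more, then returns the
     * batch, which may be empty.
     */
    List<T> pollBatch(int maxItems, long idleNanos) {
        List<T> batch = new ArrayList<>();
        for (int attempt = 0; attempt < 2 && batch.isEmpty(); attempt++) {
            T message;
            while (batch.size() < maxItems && (message = queue.poll()) != null) {
                batch.add(message);
            }
            if (batch.isEmpty() && attempt == 0) {
                LockSupport.parkNanos(idleNanos); // active polling back-off
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        PollingBacklogSketch<String> backlog = new PollingBacklogSketch<>();
        for (int i = 0; i < 4; i++) {
            backlog.enqueue("msg-" + i);
        }
        System.out.println(backlog.pollBatch(3, 1_000).size() + " messages in first batch");
    }
}
```

The cost is on the consumer side: an idle connection keeps waking up to poll, which is the latency versus CPU trade-off the hybrid proposal tries to balance.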



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
