[ 
https://issues.apache.org/jira/browse/CASSANDRA-13039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782402#comment-15782402
 ] 

Corentin Chary commented on CASSANDRA-13039:
--------------------------------------------

Enabling otc_coalescing_window_us seems to create "backlog of death" scenarios 
where most of the time is spent in ExpireMessages() because the backlog becomes 
huge. The consumer is never able to keep up with the hundreds of producers.

The new backpressure mechanism could be a solution, but it seems too 
aggressive and isn't enabled by default.

Another issue is that several different things run on Stage.MUTATION: 
applying the local mutation and executing Verb.MUTATION (which itself 
schedules its own local mutation on Stage.MUTATION, so there is probably a 
risk of deadlock here).

A solution could be to run only the local mutations on Stage.MUTATION. 
I think this is similar to what is done for counters.
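
To illustrate the idea, here is a minimal sketch that keeps local mutations on 
their own executor and handles the incoming verb on a second one. It does not 
use Cassandra's real stage classes: SplitStages, localMutationStage, 
mutationVerbStage and the method names are all hypothetical stand-ins. Each 
pool has a single thread on purpose: if both tasks shared one single-threaded 
pool, the verb handler would wait forever on a task queued behind itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: two separate executors instead of one shared stage.
public class SplitStages {
    // Stand-in for a stage that only applies local mutations.
    private final ExecutorService localMutationStage = Executors.newFixedThreadPool(1);
    // Stand-in for a separate stage that handles the incoming mutation verb.
    private final ExecutorService mutationVerbStage = Executors.newFixedThreadPool(1);

    // Applies the mutation locally; here it just returns a marker string.
    public Future<String> applyLocally(String key) {
        Callable<String> task = () -> "applied:" + key;
        return localMutationStage.submit(task);
    }

    // Handles the verb by scheduling the local mutation on the other pool
    // and waiting for it. The wait cannot starve the local mutation because
    // the two tasks never compete for the same thread.
    public Future<String> handleMutationVerb(String key) {
        Callable<String> task = () -> applyLocally(key).get();
        return mutationVerbStage.submit(task);
    }

    // Convenience wrapper that blocks for the result.
    public String handleAndWait(String key) {
        try {
            return handleMutationVerb(key).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void shutdown() {
        mutationVerbStage.shutdown();
        localMutationStage.shutdown();
    }
}
```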

> Mutation time mostly spent in LinkedBlockingQueue.put()
> -------------------------------------------------------
>
>                 Key: CASSANDRA-13039
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13039
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Corentin Chary
>         Attachments: mutation-linkedlist-block.png, profiler-snapshot.nps
>
>
> On a setup with a sustained write load of 70k QPS per node and an RF of 2, it 
> looks like most of the mutation time is spent in 
> OutboundTcpConnection.enqueue() -> backlog.put().
> backlog is an unbounded LinkedBlockingQueue, which means that .put() can only 
> block while waiting for a lock. I strongly suspect that this is caused by the 
> use of drainTo() in CoalescingStrategies, which is causing contention for the 
> producers.
> On the other hand, not using drainTo() could lead to starvation of the 
> consumers.
> Possible solutions:
> - Allow multiple connections per size and per host in 
> OutboundTcpConnectionPool
> - Switch from drainTo() to multiple take() calls
> - Switch to ConcurrentLinkedQueue (which is lock-free); this also means we 
> need active polling.
> Maybe a good solution would be something hybrid: a bounded 
> LinkedBlockingQueue and an unbounded ConcurrentLinkedQueue. This way you get 
> low latency when you don't have a lot of messages, and throughput when you do.
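
The hybrid idea above could look roughly like the sketch below. This is only 
an illustration under assumptions, not the Cassandra implementation: the class 
name HybridBacklog, its methods and the capacity parameter are hypothetical. 
Producers try the bounded LinkedBlockingQueue first and fall back to the 
unbounded ConcurrentLinkedQueue when it is full. The consumer blocks on the 
bounded queue with a timeout, so a message stranded in the overflow queue 
waits at most one timeout period. Ordering between the two queues is only 
approximate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a bounded blocking queue backed by a lock-free overflow queue.
public class HybridBacklog<T> {
    private final LinkedBlockingQueue<T> fast;
    private final ConcurrentLinkedQueue<T> overflow = new ConcurrentLinkedQueue<>();

    public HybridBacklog(int fastCapacity) {
        this.fast = new LinkedBlockingQueue<>(fastCapacity);
    }

    // Never blocks: offer() returns false when the bounded queue is full,
    // and the message then goes to the lock-free overflow queue.
    public void enqueue(T message) {
        if (!fast.offer(message)) {
            overflow.add(message);
        }
    }

    // Waits up to the timeout for a message on the bounded queue, then
    // collects everything currently available from both queues.
    public List<T> drain(long timeout, TimeUnit unit) {
        List<T> batch = new ArrayList<>();
        try {
            T first = fast.poll(timeout, unit);
            if (first != null) {
                batch.add(first);
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and return what is available.
            Thread.currentThread().interrupt();
        }
        fast.drainTo(batch);
        T next;
        while ((next = overflow.poll()) != null) {
            batch.add(next);
        }
        return batch;
    }
}
```

With a small capacity, a light load stays entirely on the blocking queue and 
the consumer wakes up as soon as a message arrives; under heavy load most 
messages land in the overflow queue and are picked up in batches.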



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
