Corentin Chary created CASSANDRA-13039:
------------------------------------------

             Summary: Mutation time mostly spent in LinkedBlockingQueue.put()
                 Key: CASSANDRA-13039
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13039
             Project: Cassandra
          Issue Type: Bug
          Components: Coordination
            Reporter: Corentin Chary
         Attachments: mutation-linkedlist-block.png, profiler-snapshot.nps

On a setup with a sustained write load of 70k QPS per node and a replication 
factor (RF) of 2, it looks like most of the mutation time is spent in 
OutboundTcpConnection.enqueue() -> backlog.put().

backlog is an unbounded LinkedBlockingQueue, so put() never waits for free 
capacity; it can only block while waiting for a lock held by another thread. 
I strongly suspect that this is caused by the use of drainTo() in 
CoalescingStrategies, which is causing contention for the producers.
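
To make the access pattern concrete, here is a minimal sketch of several 
producers calling put() on an unbounded LinkedBlockingQueue while a single 
consumer drains it in batches. This is not Cassandra code: the class name 
BacklogDrainSketch, the method run() and its parameters are hypothetical, and 
the sketch only reproduces the shape of the interaction, not the measured 
contention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration; not taken from the Cassandra code base.
final class BacklogDrainSketch {

    /**
     * Starts `producers` threads that each put() `perProducer` messages,
     * while the calling thread consumes them in batches of at most
     * `maxBatch` (must be >= 1). Returns the number of messages consumed.
     */
    static int run(int producers, int perProducer, int maxBatch) {
        // No capacity argument: the queue is unbounded, so put() never
        // waits for free space.
        BlockingQueue<Integer> backlog = new LinkedBlockingQueue<>();
        int total = producers * perProducer;

        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            Thread t = new Thread(() -> {
                try {
                    for (int i = 0; i < perProducer; i++) {
                        backlog.put(i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }

        int consumed = 0;
        List<Integer> batch = new ArrayList<>();
        try {
            while (consumed < total) {
                // Block until at least one message is available...
                batch.add(backlog.take());
                // ...then grab whatever else is queued, up to the batch size.
                backlog.drainTo(batch, maxBatch - 1);
                consumed += batch.size();
                batch.clear();
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed;
    }
}
```

Running this under a profiler with many producers is one way to check whether 
put() shows up as blocked in a setup simpler than a full cluster.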

On the other hand, not using drainTo() could lead to starvation of the 
consumers.

Possible solutions:
- Allow multiple connections per size and per host in OutboundTcpConnectionPool
- Switch from drainTo() to multiple take() calls
- Switch to ConcurrentLinkedQueue (which is lock-free); this also means we 
need active polling.
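
The second and third options can be sketched as follows. This is only an 
illustration under assumptions: DrainAlternatives, drainOneByOne() and 
pollActively() are hypothetical names, and the back-off (parkNanos with a 
cap on consecutive empty polls) is one arbitrary choice among many.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical illustration; not taken from the Cassandra code base.
final class DrainAlternatives {

    /**
     * Instead of one drainTo() call: block for the first message with
     * take(), then pull the rest one at a time with non-blocking poll().
     * Returns the number of messages appended to `out`.
     */
    static <T> int drainOneByOne(BlockingQueue<T> backlog, List<T> out, int maxBatch) {
        if (maxBatch <= 0) {
            return 0;
        }
        int n = 0;
        try {
            out.add(backlog.take());
            n++;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return n;
        }
        T next;
        while (n < maxBatch && (next = backlog.poll()) != null) {
            out.add(next);
            n++;
        }
        return n;
    }

    /**
     * With a non-blocking queue such as ConcurrentLinkedQueue there is no
     * take(): the consumer has to poll, and back off when the queue is empty.
     * Gives up after `maxEmptyPolls` consecutive empty polls.
     */
    static <T> int pollActively(Queue<T> backlog, List<T> out, int maxBatch,
                                int maxEmptyPolls, long parkNanos) {
        int n = 0;
        int emptyPolls = 0;
        while (n < maxBatch && emptyPolls < maxEmptyPolls) {
            T next = backlog.poll();
            if (next == null) {
                emptyPolls++;
                LockSupport.parkNanos(parkNanos); // back off briefly
            } else {
                emptyPolls = 0;
                out.add(next);
                n++;
            }
        }
        return n;
    }
}
```

The polling variant trades the blocking wake-up for either extra latency 
(long park) or extra CPU (short park), which is the cost of active polling 
mentioned above.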

Maybe a good solution would be something hybrid: a bounded LinkedBlockingQueue 
and an unbounded ConcurrentLinkedQueue. This way you get low latency when you 
don't have a lot of messages, and throughput when you do.
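
One possible reading of the hybrid idea, as a rough sketch: producers first 
try a small bounded LinkedBlockingQueue that the consumer blocks on, and fall 
back to an unbounded ConcurrentLinkedQueue when it is full. HybridBacklog, 
enqueue() and drain() are hypothetical names, and this sketch does not 
preserve ordering between the two queues, which a real implementation would 
likely have to address.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not taken from the Cassandra code base.
final class HybridBacklog<T> {
    // Small bounded queue: the consumer blocks on it, so a message arriving
    // on an idle connection wakes the consumer immediately.
    private final BlockingQueue<T> fast;
    // Unbounded lock-free queue used once the bounded queue is full.
    private final Queue<T> overflow = new ConcurrentLinkedQueue<>();

    HybridBacklog(int fastCapacity) {
        this.fast = new LinkedBlockingQueue<>(fastCapacity);
    }

    /**
     * Never waits for capacity. Returns true if the message went to the
     * bounded queue, false if it went to the overflow queue.
     */
    boolean enqueue(T message) {
        if (fast.offer(message)) {
            return true;
        }
        overflow.add(message);
        return false;
    }

    /**
     * Waits up to `timeout` for a message on the bounded queue, then
     * collects what is queued on both queues, up to `maxBatch` messages.
     * The timeout bounds how long overflow messages can sit unnoticed.
     */
    int drain(List<T> out, int maxBatch, long timeout, TimeUnit unit) {
        int n = 0;
        try {
            T first = fast.poll(timeout, unit);
            if (first != null) {
                out.add(first);
                n++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return n;
        }
        n += fast.drainTo(out, maxBatch - n);
        T next;
        while (n < maxBatch && (next = overflow.poll()) != null) {
            out.add(next);
            n++;
        }
        return n;
    }
}
```

For example, with a capacity of 2, the third enqueue() lands in the overflow 
queue, and a single drain() call then returns all three messages.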



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
