Corentin Chary created CASSANDRA-13039: ------------------------------------------
Summary: Mutation time mostly spent in LinkedBlockingQueue.put()
Key: CASSANDRA-13039
URL: https://issues.apache.org/jira/browse/CASSANDRA-13039
Project: Cassandra
Issue Type: Bug
Components: Coordination
Reporter: Corentin Chary
Attachments: mutation-linkedlist-block.png, profiler-snapshot.nps

On a setup with a sustained write load of 70k QPS per node and an RF of 2, it looks like most of the mutation time is spent in OutboundTcpConnection.enqueue() -> backlog.put().

backlog is an unbounded LinkedBlockingQueue, so .put() never waits for free capacity. The only way it can block is while acquiring a lock. I strongly suspect that this is caused by the use of drainTo() in CoalescingStrategies, which causes contention for the producers. On the other hand, not using drainTo() could lead to starvation of the consumers.

Possible solutions:
- Allow multiple connections per size and per host in OutboundTcpConnectionPool
- Switch from drainTo() to multiple take() calls
- Switch to ConcurrentLinkedQueue (which is lock-free), which also means we need active polling

Maybe a good solution would be something hybrid: a bounded LinkedBlockingQueue and an unbounded ConcurrentLinkedQueue. This way you get low latency when you don't have a lot of messages, and throughput when you do.
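A minimal sketch of the hybrid idea, assuming the backlog only needs a non-blocking enqueue and a batch drain. This is not Cassandra code. The class HybridBacklog, its methods enqueue/drainNow/drain and the boundedCapacity parameter are hypothetical names chosen for illustration. Producers first try a non-blocking offer() on the small bounded LinkedBlockingQueue and fall back to the lock-free ConcurrentLinkedQueue when it is full. The consumer blocks only on the bounded queue and uses a timeout, which is the "active polling" the description mentions.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical hybrid backlog (not Cassandra code): a small bounded
// LinkedBlockingQueue for the low-load path, where the consumer can block,
// plus an unbounded lock-free ConcurrentLinkedQueue that absorbs bursts
// once the bounded queue is full. FIFO order across the two queues is
// NOT guaranteed.
final class HybridBacklog<T> {
    private final BlockingQueue<T> bounded;
    private final Queue<T> overflow = new ConcurrentLinkedQueue<>();

    HybridBacklog(int boundedCapacity) {
        this.bounded = new LinkedBlockingQueue<>(boundedCapacity);
    }

    // Producer side: never waits for capacity. offer() returns false
    // instead of blocking when the bounded queue is full, and the item
    // then goes to the CAS-based overflow queue.
    void enqueue(T item) {
        if (!bounded.offer(item)) {
            overflow.add(item);
        }
    }

    // Consumer side, non-blocking: moves up to maxItems into out, bounded
    // queue first, since items only reach overflow while bounded is full.
    int drainNow(List<T> out, int maxItems) {
        int n = bounded.drainTo(out, maxItems);
        T item;
        while (n < maxItems && (item = overflow.poll()) != null) {
            out.add(item);
            n++;
        }
        return n;
    }

    // Consumer side, blocking: waits at most `timeout` for a first item.
    // The timeout covers a race: a producer that found the bounded queue
    // full may add to overflow just after the consumer drained both queues
    // and started waiting on the bounded one, and nothing would wake it.
    int drain(List<T> out, int maxItems, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (maxItems <= 0) {
            return 0;
        }
        int n = drainNow(out, maxItems);
        if (n > 0) {
            return n;
        }
        T first = bounded.poll(timeout, unit);
        if (first == null) {
            return drainNow(out, maxItems); // timed out: re-check overflow
        }
        out.add(first);
        return 1 + drainNow(out, maxItems - 1);
    }
}
```

With a capacity of 2, enqueuing 0 through 4 puts 0 and 1 in the bounded queue and 2, 3, 4 in the overflow queue, and drainNow returns all five. Under heavy load, most producers then go through the lock-free path and never touch the bounded queue's locks. The trade-off is that strict FIFO order is no longer guaranteed once both queues are in use. The issue does not say whether the outbound backlog can tolerate that.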