[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230924#comment-15230924 ]
Paulo Motta commented on CASSANDRA-11363:
-----------------------------------------

I went through 2.1.12 changes and didn't find anything suspicious. On 2.1.13 and 3.0.3, though, we changed the {{ServerConnection}} query state map from a {{NonBlockingHashMap}} to a {{ConcurrentHashMap}} on CASSANDRA-10938, which might be misbehaving for some reason. Is anyone willing to try the revert patch below on 2.1.13 or 3.0.3 and check if that changes anything?

{noformat}
diff --git a/src/java/org/apache/cassandra/transport/ServerConnection.java b/src/java/org/apache/cassandra/transport/ServerConnection.java
index ce4d164..5991b33 100644
--- a/src/java/org/apache/cassandra/transport/ServerConnection.java
+++ b/src/java/org/apache/cassandra/transport/ServerConnection.java
@@ -17,7 +17,6 @@
  */
 package org.apache.cassandra.transport;
 
-import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentMap;
 
 import io.netty.channel.Channel;
@@ -29,6 +28,8 @@ import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.service.ClientState;
 import org.apache.cassandra.service.QueryState;
+import org.cliffc.high_scale_lib.NonBlockingHashMap;
+
 
 public class ServerConnection extends Connection
 {
     private enum State { UNINITIALIZED, AUTHENTICATION, READY }
@@ -37,7 +38,7 @@ public class ServerConnection extends Connection
     private final ClientState clientState;
     private volatile State state;
 
-    private final ConcurrentMap<Integer, QueryState> queryStates = new ConcurrentHashMap<>();
+    private final ConcurrentMap<Integer, QueryState> queryStates = new NonBlockingHashMap<>();
 
     public ServerConnection(Channel channel, int version, Connection.Tracker tracker)
     {
{noformat}

> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Priority: Critical
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the
> machine load increases to very high levels (> 120 on an 8-core machine) and
> native transport requests get blocked in tpstats.
>
> I was able to reproduce this with both CMS and G1GC, as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
>
> The issue seems to coincide with the number of connections OR the number of
> total requests being processed at a given time (as the latter increases with
> the former in our system).
>
> Currently there are between 600 and 800 client connections on each machine, and
> each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a
> viable option cluster-wide.
>
> Here is the output from tpstats:
> {code}
> Pool Name                   Active   Pending   Completed   Blocked   All time blocked
> MutationStage                    0         8     8387821         0                  0
> ReadStage                        0         0      355860         0                  0
> RequestResponseStage             0         7     2532457         0                  0
> ReadRepairStage                  0         0         150         0                  0
> CounterMutationStage            32       104      897560         0                  0
> MiscStage                        0         0           0         0                  0
> HintedHandoff                    0         0          65         0                  0
> GossipStage                      0         0        2338         0                  0
> CacheCleanupExecutor             0         0           0         0                  0
> InternalResponseStage            0         0           0         0                  0
> CommitLogArchiver                0         0           0         0                  0
> CompactionExecutor               2       190         474         0                  0
> ValidationExecutor               0         0           0         0                  0
> MigrationStage                   0         0          10         0                  0
> AntiEntropyStage                 0         0           0         0                  0
> PendingRangeCalculator           0         0         310         0                  0
> Sampler                          0         0           0         0                  0
> MemtableFlushWriter              1        10          94         0                  0
> MemtablePostFlush                1        34         257         0                  0
> MemtableReclaimMemory            0         0          94         0                  0
> Native-Transport-Requests      128       156      387957        16             278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
>
> It is interesting to note that while the flight recording was taking place,
> the load on the machine went back to healthy, and when the flight recording
> finished the load went back to > 100.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
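For context on what the patch above touches: the {{queryStates}} map in {{ServerConnection}} is a per-connection map from native-protocol stream id to query state, created lazily under concurrent access. The sketch below illustrates that lookup pattern with a plain {{ConcurrentHashMap}}; the {{QueryState}} class here is a simplified stand-in, not Cassandra's actual code, and the point is only the race-safe get-or-create that both map implementations must support.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the per-connection query-state lookup pattern: each stream id
// maps to exactly one QueryState, created lazily. On a racing insert, the
// losing thread discards its instance and adopts the winner's, so all
// callers for a given stream id observe the same object.
public class QueryStateSketch
{
    // Stand-in for Cassandra's QueryState; real code carries ClientState etc.
    static final class QueryState
    {
        final int streamId;
        QueryState(int streamId) { this.streamId = streamId; }
    }

    private final ConcurrentMap<Integer, QueryState> queryStates = new ConcurrentHashMap<>();

    QueryState getQueryState(int streamId)
    {
        QueryState qState = queryStates.get(streamId);
        if (qState == null)
        {
            qState = new QueryState(streamId);
            // putIfAbsent returns the previously mapped value if another
            // thread won the race, null if our instance was installed.
            QueryState previous = queryStates.putIfAbsent(streamId, qState);
            if (previous != null)
                qState = previous;
        }
        return qState;
    }

    public static void main(String[] args)
    {
        QueryStateSketch conn = new QueryStateSketch();
        QueryState a = conn.getQueryState(42);
        QueryState b = conn.getQueryState(42);
        System.out.println(a == b); // same instance for the same stream id
    }
}
```

Since both {{ConcurrentHashMap}} and high-scale-lib's {{NonBlockingHashMap}} implement {{ConcurrentMap}}, the revert patch only swaps the field initializer; the calling code is unchanged, which is what makes the A/B test above cheap to try.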