[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230924#comment-15230924 ]
Paulo Motta commented on CASSANDRA-11363:
-----------------------------------------

I went through 2.1.12 changes and didn't find anything suspicious. On 2.1.13 and 3.0.3, though, we changed the {{ServerConnection}} query state map from a {{NonBlockingHashMap}} to a {{ConcurrentHashMap}} on CASSANDRA-10938, which might be misbehaving for some reason. Is anyone willing to try the revert patch below on 2.1.13 or 3.0.3 and check if that changes anything?

{noformat}
diff --git a/src/java/org/apache/cassandra/transport/ServerConnection.java b/src/java/org/apache/cassandra/transport/ServerConnection.java
index ce4d164..5991b33 100644
--- a/src/java/org/apache/cassandra/transport/ServerConnection.java
+++ b/src/java/org/apache/cassandra/transport/ServerConnection.java
@@ -17,7 +17,6 @@
  */
 package org.apache.cassandra.transport;
 
-import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentMap;
 
 import io.netty.channel.Channel;
@@ -29,6 +28,8 @@ import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.service.ClientState;
 import org.apache.cassandra.service.QueryState;
+import org.cliffc.high_scale_lib.NonBlockingHashMap;
+
 
 public class ServerConnection extends Connection
 {
     private enum State { UNINITIALIZED, AUTHENTICATION, READY }
@@ -37,7 +38,7 @@ public class ServerConnection extends Connection
     private final ClientState clientState;
     private volatile State state;
 
-    private final ConcurrentMap<Integer, QueryState> queryStates = new ConcurrentHashMap<>();
+    private final ConcurrentMap<Integer, QueryState> queryStates = new NonBlockingHashMap<>();
 
     public ServerConnection(Channel channel, int version, Connection.Tracker tracker)
     {
{noformat}

> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Priority: Critical
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the
> machine load increases to very high levels (> 120 on an 8-core machine) and
> native transport requests get blocked in tpstats.
>
> I was able to reproduce this with both CMS and G1GC, as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
>
> The issue seems to coincide with the number of connections OR the number of
> total requests being processed at a given time (as the latter increases with
> the former in our system).
>
> Currently there are between 600 and 800 client connections on each machine, and
> each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a
> viable option cluster-wide.
>
> Here is the output from tpstats:
> {code}
> Pool Name                   Active   Pending   Completed   Blocked   All time blocked
> MutationStage                    0         8     8387821         0                  0
> ReadStage                        0         0      355860         0                  0
> RequestResponseStage             0         7     2532457         0                  0
> ReadRepairStage                  0         0         150         0                  0
> CounterMutationStage            32       104      897560         0                  0
> MiscStage                        0         0           0         0                  0
> HintedHandoff                    0         0          65         0                  0
> GossipStage                      0         0        2338         0                  0
> CacheCleanupExecutor             0         0           0         0                  0
> InternalResponseStage            0         0           0         0                  0
> CommitLogArchiver                0         0           0         0                  0
> CompactionExecutor               2       190         474         0                  0
> ValidationExecutor               0         0           0         0                  0
> MigrationStage                   0         0          10         0                  0
> AntiEntropyStage                 0         0           0         0                  0
> PendingRangeCalculator           0         0         310         0                  0
> Sampler                          0         0           0         0                  0
> MemtableFlushWriter              1        10          94         0                  0
> MemtablePostFlush                1        34         257         0                  0
> MemtableReclaimMemory            0         0          94         0                  0
> Native-Transport-Requests      128       156      387957        16             278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
>
> It is interesting to note that while the flight recording was taking place,
> the load on the machine went back to healthy, and when the flight recording
> finished the load went back to > 100.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
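For context on what the patch above touches: the {{queryStates}} map in {{ServerConnection}} is a per-connection map from native-protocol stream id to query state, created lazily under concurrent access. The sketch below illustrates that lookup pattern with a plain {{ConcurrentHashMap}}; the {{QueryState}} class here is a simplified stand-in, not Cassandra's actual code, and the point is only the race-safe get-or-create that both map implementations must support.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the per-connection query-state lookup pattern: each stream id
// maps to exactly one QueryState, created lazily. On a racing insert, the
// losing thread discards its instance and adopts the winner's, so all
// callers for a given stream id observe the same object.
public class QueryStateSketch
{
    // Stand-in for Cassandra's QueryState; real code carries ClientState etc.
    static final class QueryState
    {
        final int streamId;
        QueryState(int streamId) { this.streamId = streamId; }
    }

    private final ConcurrentMap<Integer, QueryState> queryStates = new ConcurrentHashMap<>();

    QueryState getQueryState(int streamId)
    {
        QueryState qState = queryStates.get(streamId);
        if (qState == null)
        {
            qState = new QueryState(streamId);
            // putIfAbsent returns the previously mapped value if another
            // thread won the race, null if our instance was installed.
            QueryState previous = queryStates.putIfAbsent(streamId, qState);
            if (previous != null)
                qState = previous;
        }
        return qState;
    }

    public static void main(String[] args)
    {
        QueryStateSketch conn = new QueryStateSketch();
        QueryState a = conn.getQueryState(42);
        QueryState b = conn.getQueryState(42);
        System.out.println(a == b); // same instance for the same stream id
    }
}
```

Since both {{ConcurrentHashMap}} and high-scale-lib's {{NonBlockingHashMap}} implement {{ConcurrentMap}}, the revert patch only swaps the field initializer; the calling code is unchanged, which is what makes the A/B test above cheap to try.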