Russell Bradberry created CASSANDRA-11363:
---------------------------------------------

             Summary: Blocked NTR When Connecting Causing Excessive Load
                 Key: CASSANDRA-11363
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
             Project: Cassandra
          Issue Type: Bug
          Components: Coordination
            Reporter: Russell Bradberry
         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack

After upgrading from 2.1.9 to 2.1.13, we are seeing an issue where machine load 
climbs to very high levels (> 120 on an 8-core machine) and native transport 
requests show as blocked in tpstats.

I was able to reproduce this with both CMS and G1GC, and on both JVM 7 and 8.

The issue does not seem to affect the nodes running 2.1.9.

The issue appears to correlate with either the number of connections or the 
total number of requests being processed at a given time (the latter increases 
with the former in our system).

Currently there are between 600 and 800 client connections on each machine, and 
each machine is handling roughly 2,000-3,000 client requests per second.

Disabling the binary protocol fixes the issue for this node but isn't a viable 
option cluster-wide.
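For reference, this workaround was applied per node via nodetool; a minimal sketch of the commands involved (run against the affected node):

```shell
# Disable the native transport (binary protocol) on this node only.
# CQL clients can no longer connect to it until the transport is re-enabled.
nodetool disablebinary

# Check the current state of the native transport.
nodetool statusbinary

# Re-enable it once the node has been investigated.
nodetool enablebinary
```

This drops the node out of service for native-protocol clients, which is why it is not a viable option cluster-wide.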

Here is the output from tpstats:

{code}
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         8        8387821         0                 0
ReadStage                         0         0         355860         0                 0
RequestResponseStage              0         7        2532457         0                 0
ReadRepairStage                   0         0            150         0                 0
CounterMutationStage             32       104         897560         0                 0
MiscStage                         0         0              0         0                 0
HintedHandoff                     0         0             65         0                 0
GossipStage                       0         0           2338         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
CompactionExecutor                2       190            474         0                 0
ValidationExecutor                0         0              0         0                 0
MigrationStage                    0         0             10         0                 0
AntiEntropyStage                  0         0              0         0                 0
PendingRangeCalculator            0         0            310         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               1        10             94         0                 0
MemtablePostFlush                 1        34            257         0                 0
MemtableReclaimMemory             0         0             94         0                 0
Native-Transport-Requests       128       156         387957        16            278451

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
{code}

Attached is the jstack output for both CMS and G1GC.

Flight recordings are here:
https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr

Interestingly, while the flight recording was running, the load on the machine 
returned to healthy levels; once the recording finished, load climbed back 
above 100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)