[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230325#comment-15230325 ]

Alain RODRIGUEZ commented on CASSANDRA-11363:
---------------------------------------------

I also observed in C* 2.1.12 that a certain percentage of the 
Native-Transport-Requests are blocked, yet there is no major CPU or resource 
issue on my side, so it might not be related.

For what it is worth, here is something I observed about 
Native-Transport-Requests: increasing the 'native_transport_max_threads' value 
helps mitigate this, as expected, but the number of blocked 
Native-Transport-Requests is still non-zero.
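
(For reference, the change itself is just a cassandra.yaml edit plus a restart on 
each node; here is a minimal sketch, with an illustrative value and file path:)

{noformat}
# Illustrative only: raise native_transport_max_threads on one node, then restart
# and re-check the Native-Transport-Requests line. The path and the 512 value are
# examples, not a recommendation.
grep -n 'native_transport_max_threads' /etc/cassandra/cassandra.yaml
sudo sed -i 's/^#\? *native_transport_max_threads:.*/native_transport_max_threads: 512/' /etc/cassandra/cassandra.yaml
sudo service cassandra restart
nodetool tpstats | grep Native-Transport-Requests
{noformat}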

{noformat}
[alain@bastion-d3-prod ~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
                                          Server |  Blocked ratio:
       ip-172-17-42-105.us-west-2.compute.internal   0.044902%
       ip-172-17-42-107.us-west-2.compute.internal   0.030127%
       ip-172-17-42-114.us-west-2.compute.internal   0.045759%
       ip-172-17-42-116.us-west-2.compute.internal   0.082763%
{noformat}
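
In case the one-liner above is hard to read: it simply divides the 'All time 
blocked' column by the 'Completed' column of the Native-Transport-Requests line 
on each node (the field numbers are shifted by one because knife prefixes each 
line with the hostname). A minimal single-node equivalent would be something like:

{noformat}
# Same ratio on a single node: 'All time blocked' / 'Completed'.
# Field numbers drop by one here since there is no hostname prefix.
nodetool tpstats | awk '/^Native-Transport-Requests/ { printf "blocked ratio: %f%%\n", ($6/$4)*100 }'
{noformat}

For comparison, the Native-Transport-Requests line quoted in the description 
below shows 278451 all time blocked for 387957 completed, which is roughly 72% by 
the same measure, versus well under 0.1% here.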

I waited long enough between the change and the result capture (many days, 
probably a few weeks). As all the nodes are in the same datacenter, under a 
(fairly) balanced load, this comparison is probably relevant.

Here are the results for those nodes, in our use case.

||Server||native_transport_max_threads||Blocked Native-Transport-Requests (all time blocked / completed)||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|

Also, judging from mailing list threads, it looks like it is quite common to have 
some Native-Transport-Requests blocked; that is probably unavoidable depending on 
the network and the use case (spiky workloads?).
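
Side note on the description below: I assume "disabling the binary protocol" 
refers to the standard nodetool toggle, roughly:

{noformat}
# Assumed workaround referenced in the description: stop accepting native
# protocol (CQL) client connections on a node, then re-enable them later.
nodetool disablebinary
nodetool enablebinary
{noformat}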

> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the 
> machine load increases to very high levels (> 120 on an 8 core machine) and 
> native transport requests get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of 
> total requests being processed at a given time (as the latter increases with 
> the former in our system)
> Currently there are between 600 and 800 client connections on each machine, and 
> each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a 
> viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         8        8387821         0                 0
> ReadStage                         0         0         355860         0                 0
> RequestResponseStage              0         7        2532457         0                 0
> ReadRepairStage                   0         0            150         0                 0
> CounterMutationStage             32       104         897560         0                 0
> MiscStage                         0         0              0         0                 0
> HintedHandoff                     0         0             65         0                 0
> GossipStage                       0         0           2338         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> CommitLogArchiver                 0         0              0         0                 0
> CompactionExecutor                2       190            474         0                 0
> ValidationExecutor                0         0              0         0                 0
> MigrationStage                    0         0             10         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> PendingRangeCalculator            0         0            310         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               1        10             94         0                 0
> MemtablePostFlush                 1        34            257         0                 0
> MemtableReclaimMemory             0         0             94         0                 0
> Native-Transport-Requests       128       156         387957        16            278451
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, 
> the load on the machine went back to healthy, and when the flight recording 
> finished the load went back to > 100.


