[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230325#comment-15230325 ]
Alain RODRIGUEZ commented on CASSANDRA-11363:
---------------------------------------------
I also observed in C* 2.1.12 that a certain percentage of the
Native-Transport-Requests are blocked, though with no major CPU or resource
pressure on my side, so it might not be related.
For what it is worth, here is something I observed about
Native-Transport-Requests: increasing the 'native_transport_max_threads' value
helps mitigate this, as expected, but the number of blocked
Native-Transport-Requests remains non-zero.
{noformat}
[alain@bastion-d3-prod ~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
Server | Blocked ratio:
ip-172-17-42-105.us-west-2.compute.internal 0.044902%
ip-172-17-42-107.us-west-2.compute.internal 0.030127%
ip-172-17-42-114.us-west-2.compute.internal 0.045759%
ip-172-17-42-116.us-west-2.compute.internal 0.082763%
{noformat}
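For a single node, the same ratio can be computed locally; a minimal sketch, assuming the 2.1 tpstats column layout (without the knife hostname prefix the fields shift left by one, so 'All time blocked' is $6 and 'Completed' is $4):
{noformat}
# Ratio of all-time-blocked to completed Native-Transport-Requests on this node
nodetool tpstats | awk '/^Native-Transport-Requests/ { printf "blocked ratio: %f%%\n", ($6/$4)*100 }'
{noformat}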
I waited long enough between the change and the result capture (many days,
probably a few weeks). As all the nodes are in the same datacenter, under a
(fairly) balanced load, this comparison is probably relevant.
Here are the results for those nodes, in our use case.
||Server||native_transport_max_threads||Blocked ratio (All time blocked / Completed)||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|
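For reference, here is roughly how the setting can be bumped; a minimal sketch, assuming a package install with cassandra.yaml under /etc/cassandra (paths and the restart command depend on the setup):
{noformat}
# native_transport_max_threads ships commented out (default 128); uncomment and raise it, e.g. to 512
sudo sed -i 's/^# *native_transport_max_threads:.*/native_transport_max_threads: 512/' /etc/cassandra/cassandra.yaml

# The new value only takes effect after restarting the node
sudo service cassandra restart

# Then watch the pool again
nodetool tpstats | grep Native-Transport-Requests
{noformat}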
Also, judging from outputs shared on the mailing list, it looks quite common to
have some Native-Transport-Requests blocked; it is probably unavoidable
depending on the network and the use case (spiky workloads?).
> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
> Key: CASSANDRA-11363
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Reporter: Russell Bradberry
> Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the
> machine load increases to very high levels (> 120 on an 8-core machine) and
> native transport requests get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of
> total requests being processed at a given time (as the latter increases with
> the former in our system).
> Currently there are between 600 and 800 client connections on each machine, and
> each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a
> viable option cluster-wide.
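> (For reference, the binary protocol can be toggled per node at runtime with the
> standard nodetool commands; a minimal sketch:)
> {code}
> # Stop accepting native protocol (CQL) client connections on this node
> nodetool disablebinary
>
> # Check the listener state, then re-enable it later
> nodetool statusbinary
> nodetool enablebinary
> {code}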
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending   Completed   Blocked  All time blocked
> MutationStage                     0         8     8387821         0                 0
> ReadStage                         0         0      355860         0                 0
> RequestResponseStage              0         7     2532457         0                 0
> ReadRepairStage                   0         0         150         0                 0
> CounterMutationStage             32       104      897560         0                 0
> MiscStage                         0         0           0         0                 0
> HintedHandoff                     0         0          65         0                 0
> GossipStage                       0         0        2338         0                 0
> CacheCleanupExecutor              0         0           0         0                 0
> InternalResponseStage             0         0           0         0                 0
> CommitLogArchiver                 0         0           0         0                 0
> CompactionExecutor                2       190         474         0                 0
> ValidationExecutor                0         0           0         0                 0
> MigrationStage                    0         0          10         0                 0
> AntiEntropyStage                  0         0           0         0                 0
> PendingRangeCalculator            0         0         310         0                 0
> Sampler                           0         0           0         0                 0
> MemtableFlushWriter               1        10          94         0                 0
> MemtablePostFlush                 1        34         257         0                 0
> MemtableReclaimMemory             0         0          94         0                 0
> Native-Transport-Requests       128       156      387957        16            278451
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place,
> the load on the machine returned to healthy levels, and when the flight
> recording finished the load climbed back above 100.