[ 
https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937013#comment-15937013
 ] 

Shannon Carey commented on CASSANDRA-6908:
------------------------------------------

It looks like I've run into this issue too: 
http://www.mail-archive.com/user@cassandra.apache.org/msg51510.html

My cluster was not under particularly heavy load, although the read load in the 
local DC was higher than in the remote DC. The load was not high enough for the 
local latency to exceed the remote latency, but the snitch apparently started 
routing my requests to the remote DC anyway (though I cannot verify that via 
the metrics).

> Dynamic endpoint snitch destabilizes cluster under heavy load
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-6908
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Bartłomiej Romański
>            Assignee: Brandon Williams
>         Attachments: as-dynamic-snitch-disabled.png
>
>
> We observe that with dynamic snitch disabled our cluster is much more stable 
> than with dynamic snitch enabled.
> We've got a 15-node cluster with pretty strong machines (2x E5-2620, 64 GB 
> RAM, 2x 480 GB SSD). We mostly do reads (about 300k/s).
> We use Astyanax on the client side with the TOKEN_AWARE option enabled. It 
> automatically directs read queries to one of the nodes responsible for the 
> given token.
> In that case, with the dynamic snitch disabled, Cassandra always handles reads 
> locally. With the dynamic snitch enabled, Cassandra very often decides to proxy 
> the read to some other node. This causes much higher CPU usage and produces 
> much more garbage, which results in more frequent GC pauses (the young 
> generation fills up more quickly). By "much higher" and "much more" I mean 1.5-2x.
> I'm aware that a higher dynamic_snitch_badness_threshold value should solve 
> that issue. The default value is 0.1. I've looked at the scores exposed in JMX, 
> and the problem is that our values seem to be completely random. They are 
> usually between 0.5 and 2.0, but change randomly every time I hit refresh.
> Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something 
> like that, but the result will be similar to simply disabling the dynamic 
> snitch altogether (which is what we did).
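> To make the threshold's effect concrete, here is a hypothetical sketch (my own 
> code, not the actual Cassandra implementation) of how I understand the 
> comparison: the node preferred by the underlying snitch is only bypassed when 
> its score looks worse than the best score by more than the threshold factor.
>
>     // Hypothetical sketch, not Cassandra's code: how I read the
>     // dynamic_snitch_badness_threshold comparison.
>     static boolean bypassPreferredNode(double preferredScore, double bestScore,
>                                        double badnessThreshold)
>     {
>         // With the default 0.1, the preferred node only has to look ~10% worse
>         // than the best one to be bypassed; with 5.0 it would have to look 6x
>         // worse, which in practice never happens, so a large threshold behaves
>         // almost like disabling the dynamic snitch.
>         return preferredScore > bestScore * (1 + badnessThreshold);
>     }
>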
> I've tried to understand the logic behind these scores, and I'm not sure I get 
> the idea...
> It's a sum (without any multipliers) of two components:
> - the ratio of the given node's recent latency to the recent average node latency
> - something called 'severity', which, if I analyzed the code correctly, is the 
> result of BackgroundActivityMonitor.getIOWait() - the ratio of "iowait" CPU 
> time to total CPU time as reported in /proc/stat (the ratio is multiplied 
> by 100)
> In our case the second value is somewhere around 0-2% but varies quite 
> heavily from second to second.
> What's the idea behind simply adding these two values without any multipliers 
> (e.g. the second one is a percentage while the first one is not)? Are we sure 
> this is the best possible way of calculating the final score?
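> To illustrate why this worries me, here is a rough sketch of the calculation as 
> I read it (the method and variable names are mine, not Cassandra's):
>
>     // Hypothetical sketch of the score combination described above.
>     static double score(double recentLatencyMs, double averageLatencyMs,
>                         double iowaitPercent)
>     {
>         double latencyRatio = recentLatencyMs / averageLatencyMs; // usually near 1.0
>         double severity = iowaitPercent;                          // on a 0-100 scale
>         return latencyRatio + severity;                           // no scaling of either term
>     }
>
> Even a 2% iowait reading contributes 2.0 to a latency ratio that normally 
> hovers around 1.0, so the severity term can easily dominate the score and make 
> it jump from one refresh to the next.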
> Is there a way to force Cassandra to use (much) longer samples? In our case 
> we probably need that to get stable values. The 'severity' is recalculated 
> every second. The mean latency is calculated based on some magic, hardcoded 
> values (ALPHA = 0.75, WINDOW_SIZE = 100). 
> Am I right that there's no way to tune that without hacking the code?
> I'm aware that there's a dynamic_snitch_update_interval_in_ms property in the 
> config file, but that only determines how often the scores are recalculated, 
> not over how long a period the samples are taken. Is that correct?
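> As a back-of-the-envelope illustration (my own code, not the snitch's actual 
> reservoir implementation), an exponential weighting as aggressive as 
> ALPHA = 0.75 means a handful of the most recent samples effectively determine 
> the whole estimate, regardless of how large WINDOW_SIZE is:
>
>     // Hypothetical illustration: with alpha = 0.75 each new sample carries 75%
>     // of the weight, so samples more than a few steps old barely matter.
>     static double decayingAverage(double[] samplesOldestFirst, double alpha)
>     {
>         double avg = samplesOldestFirst[0];
>         for (int i = 1; i < samplesOldestFirst.length; i++)
>             avg = alpha * samplesOldestFirst[i] + (1 - alpha) * avg;
>         return avg;
>     }
>
> With that kind of decay, a longer sample window would not help much unless the 
> weighting itself were also configurable.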
> To sum up, it would be really nice to have more control over dynamic snitch 
> behavior, or at least to have the official option to disable it documented in 
> the default config file (it took me some time to discover that we can simply 
> disable it instead of hacking around it with 
> dynamic_snitch_badness_threshold=1000).
> Currently, for some scenarios (like ours: an optimized cluster, a token-aware 
> client, heavy load) it causes more harm than good.
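> For anyone else who has to work around this, these are the cassandra.yaml 
> settings involved as I understand them (the boolean dynamic_snitch flag is the 
> undocumented one; the other values below are the defaults as I remember them, 
> so double-check them against your version):
>
>     # documented knobs (defaults shown):
>     dynamic_snitch_update_interval_in_ms: 100
>     dynamic_snitch_badness_threshold: 0.1
>     # undocumented switch that turns the dynamic snitch off entirely:
>     dynamic_snitch: false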


