[jira] [Comment Edited] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load

Nate McCall (JIRA) Fri, 18 Sep 2015 11:16:33 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875995#comment-14875995 ]


Nate McCall edited comment on CASSANDRA-6908 at 9/18/15 6:08 PM:
-----------------------------------------------------------------

[~brandon.williams] Here is some output from DESMBean.getScores a day or so 
before the change:

{noformat}
$>get Scores
#mbean = org.apache.cassandra.db:type=DynamicEndpointSnitch:
Scores = {
  /10.10.20.31 = 2.244811534881592;
  /10.10.20.30 = 0.5457598567008972;
  /10.10.20.37 = 2.7038445472717285;
  /10.10.20.36 = 3.0801687240600586;
  /10.10.20.39 = 6.231578826904297;
  /10.10.20.38 = 3.1578946113586426;
  /10.10.20.33 = 12.003381729125977;
  /10.10.20.32 = 3.348876476287842;
  /10.10.20.35 = 1.5086206197738647;
  /10.10.20.34 = 4.9235992431640625;
  /10.20.1.46 = 2.621400833129883;
  /10.20.1.47 = 1.0947368144989014;
  /10.20.1.44 = 1.3118916749954224;
  /10.20.1.45 = 1.6884760856628418;
  /10.30.50.62 = 2.4780631123519523;
  /10.30.50.63 = 2.3196894221189543;
  /10.30.50.61 = 1.2922532529365727;
  /10.20.1.50 = 7.66961669921875;
  /10.20.1.48 = 1.686340570449829;
  /10.20.1.49 = 1.2298557758331299;
  /10.30.50.82 = 1.3875394260011067;
  /10.30.50.83 = 1.839278221130371;
  /10.30.50.80 = 1.5599116014271248;
  /10.30.50.81 = 1.002414460952689;
  /10.30.50.84 = 0.9972779314692427;
  /10.30.50.66 = 1.057380530892349;
  /10.30.50.67 = 1.3079022634320143;
  /10.30.50.64 = 1.3103291428670651;
  /10.30.50.65 = 1.8054673729873285;
  /10.30.50.70 = 0.8387989390914034;
  /10.30.50.71 = 1.0193960841109113;
  /10.20.8.80 = 0.936170220375061;
  /10.30.50.68 = 0.9854942156774241;
  /10.20.8.81 = 0.7212558388710022;
  /10.30.50.69 = 0.8825731037593469;
  /10.30.50.74 = 1.3936080859928597;
  /10.30.50.75 = 0.9637283373896669;
  /10.30.50.72 = 0.7774390243902439;
  /10.30.50.73 = 0.8475609756097561;
  /10.30.50.78 = 2.4760895589502847;
  /10.30.50.79 = 1.2857443196017568;
  /10.30.50.76 = 0.7804878048780488;
  /10.30.50.77 = 1.8351790061811122;
  /10.20.1.102 = 2.4894514083862305;
  /10.20.1.103 = 0.5889776945114136;
  /10.20.1.101 = 2.1996614933013916;
  /10.20.1.106 = 0.8040626645088196;
  /10.20.1.104 = 1.4327855110168457;
  /10.20.1.105 = 0.75789475440979;
 };
{noformat}
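
For comparing dumps like the one above over time, a small parser helps. The following is a minimal sketch, assuming the output format shown above and assuming that a lower score means a more preferred endpoint; {{parse_scores}} and {{summarize}} are hypothetical helper names, not part of Cassandra or any JMX tool, and the sample is a four-line excerpt of the dump.

```python
import re

# Excerpt of the dump above (four of the endpoints), in the same format.
SAMPLE = """\
Scores = {
  /10.10.20.30 = 0.5457598567008972;
  /10.10.20.33 = 12.003381729125977;
  /10.20.1.50 = 7.66961669921875;
  /10.30.50.84 = 0.9972779314692427;
 };
"""

# Matches lines such as "  /10.10.20.30 = 0.5457598567008972;"
LINE = re.compile(r"^\s*/(?P<host>[0-9.]+)\s*=\s*(?P<score>[0-9.eE+-]+);\s*$")


def parse_scores(text):
    """Return {endpoint address: score} for every score line in the dump."""
    scores = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            scores[match.group("host")] = float(match.group("score"))
    return scores


def summarize(scores):
    """Return (lowest-scored host, highest-scored host, highest/lowest ratio)."""
    best = min(scores, key=scores.get)
    worst = max(scores, key=scores.get)
    return best, worst, scores[worst] / scores[best]


if __name__ == "__main__":
    best, worst, ratio = summarize(parse_scores(SAMPLE))
    print(f"lowest: {best}, highest: {worst}, ratio: {ratio:.1f}x")
```

Running the same summary on dumps taken a few seconds apart would show how much the ordering moves between refreshes.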


was (Author: zznate):
[~brandon.williams] Same as with [~ayegorov], we ramped up 
{{dynamic_snitch_badness_threshold}} to about {{3.0}} with very little effect 
along the way. 

Here is some output from DESMBean.getScores a day or so before the change:

{noformat}
$>get Scores
#mbean = org.apache.cassandra.db:type=DynamicEndpointSnitch:
Scores = {
  /10.10.20.31 = 2.244811534881592;
  /10.10.20.30 = 0.5457598567008972;
  /10.10.20.37 = 2.7038445472717285;
  /10.10.20.36 = 3.0801687240600586;
  /10.10.20.39 = 6.231578826904297;
  /10.10.20.38 = 3.1578946113586426;
  /10.10.20.33 = 12.003381729125977;
  /10.10.20.32 = 3.348876476287842;
  /10.10.20.35 = 1.5086206197738647;
  /10.10.20.34 = 4.9235992431640625;
  /10.20.1.46 = 2.621400833129883;
  /10.20.1.47 = 1.0947368144989014;
  /10.20.1.44 = 1.3118916749954224;
  /10.20.1.45 = 1.6884760856628418;
  /10.30.50.62 = 2.4780631123519523;
  /10.30.50.63 = 2.3196894221189543;
  /10.30.50.61 = 1.2922532529365727;
  /10.20.1.50 = 7.66961669921875;
  /10.20.1.48 = 1.686340570449829;
  /10.20.1.49 = 1.2298557758331299;
  /10.30.50.82 = 1.3875394260011067;
  /10.30.50.83 = 1.839278221130371;
  /10.30.50.80 = 1.5599116014271248;
  /10.30.50.81 = 1.002414460952689;
  /10.30.50.84 = 0.9972779314692427;
  /10.30.50.66 = 1.057380530892349;
  /10.30.50.67 = 1.3079022634320143;
  /10.30.50.64 = 1.3103291428670651;
  /10.30.50.65 = 1.8054673729873285;
  /10.30.50.70 = 0.8387989390914034;
  /10.30.50.71 = 1.0193960841109113;
  /10.20.8.80 = 0.936170220375061;
  /10.30.50.68 = 0.9854942156774241;
  /10.20.8.81 = 0.7212558388710022;
  /10.30.50.69 = 0.8825731037593469;
  /10.30.50.74 = 1.3936080859928597;
  /10.30.50.75 = 0.9637283373896669;
  /10.30.50.72 = 0.7774390243902439;
  /10.30.50.73 = 0.8475609756097561;
  /10.30.50.78 = 2.4760895589502847;
  /10.30.50.79 = 1.2857443196017568;
  /10.30.50.76 = 0.7804878048780488;
  /10.30.50.77 = 1.8351790061811122;
  /10.20.1.102 = 2.4894514083862305;
  /10.20.1.103 = 0.5889776945114136;
  /10.20.1.101 = 2.1996614933013916;
  /10.20.1.106 = 0.8040626645088196;
  /10.20.1.104 = 1.4327855110168457;
  /10.20.1.105 = 0.75789475440979;
 };
{noformat}

> Dynamic endpoint snitch destabilizes cluster under heavy load
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-6908
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Config, Core
>            Reporter: Bartłomiej Romański
>            Assignee: Brandon Williams
>         Attachments: as-dynamic-snitch-disabled.png
>
>
> We observe that with dynamic snitch disabled our cluster is much more stable 
> than with dynamic snitch enabled.
> We've got a 15-node cluster with pretty strong machines (2xE5-2620, 64 GB 
> RAM, 2x480 GB SSD). We mostly do reads (about 300k/s).
> We use Astyanax on the client side with the TOKEN_AWARE option enabled. It 
> automatically directs read queries to one of the nodes responsible for the 
> given token.
> In that case, with dynamic snitch disabled Cassandra always handles reads 
> locally. With dynamic snitch enabled Cassandra very often decides to proxy 
> the read to some other node. This causes much higher CPU usage and produces 
> much more garbage, which results in more frequent GC pauses (the young 
> generation fills up quicker). By "much higher" and "much more" I mean 1.5-2x.
> I'm aware that a higher dynamic_snitch_badness_threshold value should solve 
> that issue. The default value is 0.1. I've looked at the scores exposed in 
> JMX and the problem is that our values seem to be completely random. They are 
> usually between 0.5 and 2.0, but change randomly every time I hit refresh.
> Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something 
> like that, but the result will be similar to simply disabling the dynamic 
> snitch altogether (that's what we did).
> I've tried to understand what's the logic behind these scores and I'm not 
> sure if I get the idea...
> It's a sum (without any multipliers) of two components:
> - ratio of recent given node latency to recent average node latency
> - something called 'severity', which, if I analyzed the code correctly, is 
> the result of BackgroundActivityMonitor.getIOWait() - it's the ratio of 
> "iowait" CPU time to the whole CPU time as reported in /proc/stat (the ratio 
> is multiplied by 100)
> In our case the second value is something around 0-2% but varies quite 
> heavily every second.
> What's the idea behind simply adding these two values without any multipliers 
> (e.g. the second one is a percentage while the first one is not)? Are we sure 
> this is the best possible way of calculating the final score?
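
The unit mismatch described in the quoted paragraphs can be made concrete. This is a minimal sketch of the score as the report describes it (latency ratio plus iowait percentage, no weights), not a copy of Cassandra's code; the function names and the tick counts are hypothetical and chosen only for illustration.

```python
def latency_ratio(node_latency_ms, mean_latency_ms):
    """Recent latency of one node relative to the recent average latency."""
    return node_latency_ms / mean_latency_ms


def severity(iowait_ticks, total_ticks):
    """Share of CPU time spent in iowait, expressed as a percentage."""
    return 100.0 * iowait_ticks / total_ticks


def score(node_latency_ms, mean_latency_ms, iowait_ticks, total_ticks):
    """Unweighted sum of the two components, as described in the report."""
    return latency_ratio(node_latency_ms, mean_latency_ms) + severity(
        iowait_ticks, total_ticks
    )


if __name__ == "__main__":
    # Node A: average latency, but 2% of CPU time in iowait.
    node_a = score(1.0, 1.0, iowait_ticks=20, total_ticks=1000)
    # Node B: twice the average latency, no iowait at all.
    node_b = score(2.0, 1.0, iowait_ticks=0, total_ticks=1000)
    print(node_a, node_b)
```

Under these assumptions a node with average latency and 2% iowait ends up with a higher score than a node that is twice as slow with no iowait, so per-second swings in iowait of 0-2% would dominate the latency component.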
> Is there a way to force Cassandra to use (much) longer samples? In our case 
> we probably need that to get stable values. The 'severity' is calculated for 
> each second. The mean latency is calculated based on some magic, hardcoded 
> values (ALPHA = 0.75, WINDOW_SIZE = 100). 
> Am I right that there's no way to tune that without hacking the code?
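
Why longer samples would give steadier values can be shown with a plain rolling mean. This is only an illustration of the general effect: it is not the sampling Cassandra uses (the report mentions ALPHA = 0.75 and WINDOW_SIZE = 100 for that), and the input series and helper names are made up.

```python
from collections import deque


def rolling_means(samples, window):
    """Mean of the last `window` samples, emitted once the window is full."""
    recent = deque(maxlen=window)
    means = []
    for sample in samples:
        recent.append(sample)
        if len(recent) == window:
            means.append(sum(recent) / window)
    return means


def spread(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


# Synthetic, strictly periodic series that swings between 0.5 and 2.0.
SAMPLES = [0.5, 2.0, 1.0, 1.5] * 25

if __name__ == "__main__":
    for window in (1, 2, 4):
        print(window, spread(rolling_means(SAMPLES, window)))
```

With this synthetic input the spread of the averaged value shrinks as the window grows, and disappears once the window covers a full cycle of the fluctuation.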
> I'm aware that there's a dynamic_snitch_update_interval_in_ms property in the 
> config file, but that only determines how often the scores are recalculated, 
> not how long the samples are. Is that correct?
> To sum up, it would be really nice to have more control over dynamic snitch 
> behavior, or at least to have the official option to disable it described in 
> the default config file (it took me some time to discover that we can just 
> disable it instead of hacking with dynamic_snitch_badness_threshold=1000).
> Currently for some scenarios (like ours - optimized cluster, token aware 
> client, heavy load) it causes more harm than good.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
