[
https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875995#comment-14875995
]
Nate McCall edited comment on CASSANDRA-6908 at 9/18/15 6:08 PM:
-----------------------------------------------------------------
[~brandon.williams] Here is some output from DESMBean.getScores a day or so
before the change:
{noformat}
$>get Scores
#mbean = org.apache.cassandra.db:type=DynamicEndpointSnitch:
Scores = {
/10.10.20.31 = 2.244811534881592;
/10.10.20.30 = 0.5457598567008972;
/10.10.20.37 = 2.7038445472717285;
/10.10.20.36 = 3.0801687240600586;
/10.10.20.39 = 6.231578826904297;
/10.10.20.38 = 3.1578946113586426;
/10.10.20.33 = 12.003381729125977;
/10.10.20.32 = 3.348876476287842;
/10.10.20.35 = 1.5086206197738647;
/10.10.20.34 = 4.9235992431640625;
/10.20.1.46 = 2.621400833129883;
/10.20.1.47 = 1.0947368144989014;
/10.20.1.44 = 1.3118916749954224;
/10.20.1.45 = 1.6884760856628418;
/10.30.50.62 = 2.4780631123519523;
/10.30.50.63 = 2.3196894221189543;
/10.30.50.61 = 1.2922532529365727;
/10.20.1.50 = 7.66961669921875;
/10.20.1.48 = 1.686340570449829;
/10.20.1.49 = 1.2298557758331299;
/10.30.50.82 = 1.3875394260011067;
/10.30.50.83 = 1.839278221130371;
/10.30.50.80 = 1.5599116014271248;
/10.30.50.81 = 1.002414460952689;
/10.30.50.84 = 0.9972779314692427;
/10.30.50.66 = 1.057380530892349;
/10.30.50.67 = 1.3079022634320143;
/10.30.50.64 = 1.3103291428670651;
/10.30.50.65 = 1.8054673729873285;
/10.30.50.70 = 0.8387989390914034;
/10.30.50.71 = 1.0193960841109113;
/10.20.8.80 = 0.936170220375061;
/10.30.50.68 = 0.9854942156774241;
/10.20.8.81 = 0.7212558388710022;
/10.30.50.69 = 0.8825731037593469;
/10.30.50.74 = 1.3936080859928597;
/10.30.50.75 = 0.9637283373896669;
/10.30.50.72 = 0.7774390243902439;
/10.30.50.73 = 0.8475609756097561;
/10.30.50.78 = 2.4760895589502847;
/10.30.50.79 = 1.2857443196017568;
/10.30.50.76 = 0.7804878048780488;
/10.30.50.77 = 1.8351790061811122;
/10.20.1.102 = 2.4894514083862305;
/10.20.1.103 = 0.5889776945114136;
/10.20.1.101 = 2.1996614933013916;
/10.20.1.106 = 0.8040626645088196;
/10.20.1.104 = 1.4327855110168457;
/10.20.1.105 = 0.75789475440979;
};
{noformat}
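For reading a dump like the one above, here is a minimal Python sketch that parses a few of the score lines and lists the endpoints whose score exceeds the best score by more than a given badness threshold. The helper names (`parse_scores`, `worse_than_best`) are hypothetical, and the comparison `score > best * (1 + threshold)` is a simplified assumption applied to the whole dump; Cassandra applies its check per request to the replicas of a given read, not cluster-wide.

```python
import re

# A few lines copied from the Scores output above.
SAMPLE = """
/10.10.20.31 = 2.244811534881592;
/10.10.20.30 = 0.5457598567008972;
/10.10.20.33 = 12.003381729125977;
/10.20.1.50 = 7.66961669921875;
/10.30.50.84 = 0.9972779314692427;
"""

# Matches lines of the form "/<address> = <score>;".
LINE_RE = re.compile(r"^/(\S+)\s*=\s*([0-9.]+);$")


def parse_scores(text):
    """Return {address: score} for every score line found in text."""
    scores = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


def worse_than_best(scores, threshold):
    """List endpoints scoring above best * (1 + threshold) (simplified assumption)."""
    best = min(scores.values())
    limit = best * (1.0 + threshold)
    return sorted(ep for ep, score in scores.items() if score > limit)


if __name__ == "__main__":
    scores = parse_scores(SAMPLE)
    for threshold in (0.1, 3.0):
        print(threshold, worse_than_best(scores, threshold))
```

With this sample, even a threshold of 3.0 still leaves several endpoints above the limit, which fits the observation that raising the threshold had little effect.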
was (Author: zznate):
[~brandon.williams] Same as with [~ayegorov], we ramped up
{{dynamic_snitch_badness_threshold}} to about {{3.0}} with very little effect
along the way.
> Dynamic endpoint snitch destabilizes cluster under heavy load
> -------------------------------------------------------------
>
> Key: CASSANDRA-6908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
> Project: Cassandra
> Issue Type: Improvement
> Components: Config, Core
> Reporter: Bartłomiej Romański
> Assignee: Brandon Williams
> Attachments: as-dynamic-snitch-disabled.png
>
>
> We observe that with dynamic snitch disabled our cluster is much more stable
> than with dynamic snitch enabled.
> We've got a 15-node cluster with pretty strong machines (2xE5-2620, 64 GB
> RAM, 2x480 GB SSD). We mostly do reads (about 300k/s).
> We use Astyanax on the client side with the TOKEN_AWARE option enabled. It
> automatically directs read queries to one of the nodes responsible for the
> given token.
> In that case, with dynamic snitch disabled, Cassandra always handles reads
> locally. With dynamic snitch enabled, Cassandra very often decides to proxy
> the read to some other node. This causes much higher CPU usage and produces
> much more garbage, which results in more frequent GC pauses (the young
> generation fills up quicker). By "much higher" and "much more" I mean 1.5-2x.
> I'm aware that a higher dynamic_snitch_badness_threshold value should solve
> that issue. The default value is 0.1. I've looked at the scores exposed in JMX
> and the problem is that our values seem to be completely random. They are
> usually between 0.5 and 2.0, but change randomly every time I hit refresh.
> Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something
> like that, but the result will be similar to simply disabling the dynamic
> snitch altogether (that's what we did).
> I've tried to understand the logic behind these scores and I'm not
> sure if I get the idea...
> It's a sum (without any multipliers) of two components:
> - the ratio of the given node's recent latency to the recent average node
> latency
> - something called 'severity', which, if I analyzed the code correctly, is the
> result of BackgroundActivityMonitor.getIOWait() - it's the ratio of "iowait"
> CPU time to the whole CPU time as reported in /proc/stat (the ratio is
> multiplied by 100)
> In our case the second value is something around 0-2% but varies quite
> heavily every second.
> What's the idea behind simply adding these two values without any multipliers
> (e.g. the second one is a percentage while the first one is not)? Are we sure
> this is the best possible way of calculating the final score?
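To make the units concern concrete, here is a small Python sketch of the score as the description above characterizes it: a latency ratio plus an iowait percentage, added without weights. The function name `snitch_score` and the input values are hypothetical; this follows the reporter's reading of the code, not a verified copy of Cassandra's implementation.

```python
def snitch_score(node_latency_ms, avg_latency_ms, iowait_fraction):
    """Score as described in the issue: latency ratio + iowait percentage."""
    # Dimensionless ratio, close to 1.0 for a node with typical latency.
    latency_ratio = node_latency_ms / avg_latency_ms
    # 'severity' is the iowait share of CPU time, multiplied by 100.
    severity = iowait_fraction * 100.0
    return latency_ratio + severity


if __name__ == "__main__":
    # Two hypothetical nodes with identical latency; only iowait differs.
    idle = snitch_score(5.0, 5.0, 0.00)
    busy = snitch_score(5.0, 5.0, 0.02)
    print(idle, busy)
    # With the default threshold of 0.1, a 2% iowait blip alone is enough
    # to make the second node look far "worse" than the first.
    print(busy > idle * (1.0 + 0.1))
```

Under these assumptions, a swing of 0-2% in iowait moves the score by up to 2.0, which is larger than the whole latency-ratio term for a healthy node.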
> Is there a way to force Cassandra to use (much) longer samples? In our case
> we probably need that to get stable values. The 'severity' is calculated for
> each second. The mean latency is calculated based on some magic, hardcoded
> values (ALPHA = 0.75, WINDOW_SIZE = 100).
> Am I right that there's no way to tune that without hacking the code?
> I'm aware that there's a dynamic_snitch_update_interval_in_ms property in the
> config file, but that only determines how often the scores are recalculated,
> not how long samples are taken. Is that correct?
> To sum up, it would be really nice to have more control over dynamic snitch
> behavior or at least have the official option to disable it described in the
> default config file (it took me some time to discover that we can just
> disable it instead of hacking with dynamic_snitch_badness_threshold=1000).
> Currently, for some scenarios (like ours - optimized cluster, token-aware
> client, heavy load), it causes more harm than good.
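For reference, the settings discussed in the description might look like this in cassandra.yaml. This is a sketch, not a recommendation: the description names only the two `dynamic_snitch_*` properties, and the 0.1 threshold is the default it quotes. The `dynamic_snitch: false` switch and the 100 ms interval are assumptions from memory and should be checked against the Config class of the Cassandra version in use.

```yaml
# How often scores are recalculated (not how long samples are kept).
# Assumption: 100 is the shipped default.
dynamic_snitch_update_interval_in_ms: 100

# How much worse the preferred replica's score must be before reads are
# routed elsewhere; 0.1 is the default quoted in the issue.
dynamic_snitch_badness_threshold: 0.1

# Assumption: boolean switch that disables the dynamic snitch altogether,
# as an alternative to setting a very large badness threshold.
dynamic_snitch: false
```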