[ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875865#comment-14875865 ]
Andrey Yegorov commented on CASSANDRA-6908:
-------------------------------------------
[~brandon.williams]
Dynamic snitch settings were:
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 5.0
We started with the default dynamic_snitch_badness_threshold, then increased
it to 0.5, then to 2.5, then to 5.0, until the point when the dynamic snitch
was simply disabled.
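To illustrate why the intermediate threshold values may not have helped, here is a minimal sketch of the badness check. It assumes the simplified rule that the dynamic ordering overrides the static one whenever a replica's score exceeds the score at the same position in the best-possible ordering by more than a factor of (1 + threshold). The function name and the sample scores are hypothetical; the scores are only picked from the 0.5-2.0 range reported in the issue, and this is a model of the behavior, not the actual Cassandra code.

```python
def should_reorder(static_order_scores, badness_threshold):
    """Return True if the dynamic ordering would override the static one.

    static_order_scores: replica scores in the order chosen by the
    underlying (static) snitch; lower means better. Hypothetical model.
    """
    best_order_scores = sorted(static_order_scores)
    for static_score, best_score in zip(static_order_scores, best_order_scores):
        # Reorder as soon as one position is "bad enough" relative to the
        # score it would have in the ideal ordering.
        if static_score > best_score * (1.0 + badness_threshold):
            return True
    return False


# Local replica currently scores 2.0, the other two replicas 0.5 and 1.0.
scores = [2.0, 0.5, 1.0]
for threshold in (0.1, 0.5, 2.5, 5.0):
    print(threshold, should_reorder(scores, threshold))
```

Under this model, scores that fluctuate between 0.5 and 2.0 differ by up to a factor of 4, so any threshold below 3.0 can still trigger reordering, while 5.0 effectively never does.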
> Dynamic endpoint snitch destabilizes cluster under heavy load
> -------------------------------------------------------------
>
> Key: CASSANDRA-6908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
> Project: Cassandra
> Issue Type: Improvement
> Components: Config, Core
> Reporter: Bartłomiej Romański
> Assignee: Brandon Williams
> Attachments: as-dynamic-snitch-disabled.png
>
>
> We observe that with dynamic snitch disabled our cluster is much more stable
> than with dynamic snitch enabled.
> We've got a 15-node cluster with pretty strong machines (2xE5-2620, 64 GB
> RAM, 2x480 GB SSD). We mostly do reads (about 300k/s).
> We use Astyanax on the client side with the TOKEN_AWARE option enabled. It
> automatically directs read queries to one of the nodes responsible for the
> given token.
> In that case, with dynamic snitch disabled, Cassandra always handles reads
> locally. With dynamic snitch enabled, Cassandra very often decides to proxy
> the read to some other node. This causes much higher CPU usage and produces
> much more garbage, which results in more frequent GC pauses (the young
> generation fills up quicker). By "much higher" and "much more" I mean 1.5-2x.
> I'm aware that a higher dynamic_snitch_badness_threshold value should solve
> that issue. The default value is 0.1. I've looked at the scores exposed in
> JMX and the problem is that our values seem to be completely random. They are
> usually between 0.5 and 2.0, but change randomly every time I hit refresh.
> Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something
> like that, but the result will be similar to simply disabling the dynamic
> snitch altogether (that's what we did).
> I've tried to understand what's the logic behind these scores and I'm not
> sure if I get the idea...
> It's a sum (without any multipliers) of two components:
> - the ratio of the given node's recent latency to the recent average node
> latency
> - something called 'severity', which, if I analyzed the code correctly, is
> the result of BackgroundActivityMonitor.getIOWait() - it's the ratio of
> "iowait" CPU time to the whole CPU time as reported in /proc/stat (the ratio
> is multiplied by 100)
> In our case the second value is something around 0-2% but varies quite
> heavily every second.
> What's the idea behind simply adding these two values without any
> multipliers (e.g., the second one is a percentage while the first one is
> not)? Are we sure this is the best possible way of calculating the final
> score?
> Is there a way to force Cassandra to use (much) longer samples? In our case
> we probably need that to get stable values. The 'severity' is calculated for
> each second. The mean latency is calculated based on some magic, hardcoded
> values (ALPHA = 0.75, WINDOW_SIZE = 100).
> Am I right that there's no way to tune that without hacking the code?
> I'm aware that there's a dynamic_snitch_update_interval_in_ms property in
> the config file, but that only determines how often the scores are
> recalculated, not over how long a period samples are taken. Is that correct?
> To sum up, it would be really nice to have more control over dynamic snitch
> behavior or at least have the official option to disable it described in the
> default config file (it took me some time to discover that we can just
> disable it instead of hacking with dynamic_snitch_badness_threshold=1000).
> Currently for some scenarios (like ours - optimized cluster, token-aware
> client, heavy load) it causes more harm than good.
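The scoring described in the quoted issue can be sketched as follows. This is only a model built from the reporter's description (latency ratio plus iowait expressed as a percentage), not code taken from Cassandra; the function name, parameter names, and sample values are hypothetical.

```python
def snitch_score(node_latency_ms, avg_latency_ms, iowait_fraction):
    """Score as described in the issue: latency ratio + severity.

    iowait_fraction is iowait CPU time divided by total CPU time (0.0-1.0).
    Lower scores are treated as better. Hypothetical model.
    """
    latency_ratio = node_latency_ms / avg_latency_ms  # dimensionless, ~1.0
    severity = iowait_fraction * 100.0                # expressed in percent
    return latency_ratio + severity


# A node with average latency but 1% iowait...
busy_disk = snitch_score(node_latency_ms=10.0, avg_latency_ms=10.0,
                         iowait_fraction=0.01)
# ...versus a node that is 50% slower than average with no iowait at all.
slow_node = snitch_score(node_latency_ms=15.0, avg_latency_ms=10.0,
                         iowait_fraction=0.0)

print(busy_disk, slow_node)
```

Because the severity term is in percent while the latency term is a ratio near 1, an iowait swing of a single percentage point shifts the score as much as a doubling of latency would, which is consistent with the score jitter the reporter sees in JMX.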