[
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291175#comment-15291175
]
Robert Stupp commented on CASSANDRA-11738:
------------------------------------------
Just thinking that any measured latency is basically stale by the time it's
computed. And something like a "15 minute load" (as the other extreme) cannot
reflect recent spikes. Also, a measured latency can be skewed by a badly
timed GC (e.g. G1 running with a 500ms pause goal can still have "valid" STW
phases of up to 300-400ms).
Maybe I'm missing the point, but I think all nodes (assuming they have the
same hardware and the cluster is balanced) should have (nearly) equal response
times. Compactions and GCs can kick in at any time anyway.
Just as an idea: a node can request a _ping-response_ from a node it sends a
request to (it could be requested by setting a flag in the verb's payload).
For example, node "A" sends a request to node "B". The request contains the
timestamp at node "A". "B" sends a _ping-response_ including that request
timestamp back to "A" as soon as it deserializes the request. "A" can now
decide whether to use the calculated latency ({{currentTime() -
requestTimestamp}}). It could, for example, ignore that number, which is
legitimate when "A" itself just hit a longer GC (say, >100ms or so). "A"
could also decide that "B" is "slow" because it didn't get the
_ping-response_ within a certain time.
Too complicated?
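A minimal sketch of how "A" could handle this (illustrative only; the class
and method names below are made up, not Cassandra's actual messaging API):
{code:java}
import java.net.InetAddress;

// Illustrative sketch of the ping-response idea described above; all names
// here are hypothetical, not Cassandra's messaging internals.
public class PingLatencyTracker
{
    // Ignore samples taken while this node itself was likely paused (GC).
    private static final long LOCAL_PAUSE_THRESHOLD_MS = 100;
    // Treat the peer as "slow" if no ping-response arrives within this window.
    private static final long PING_TIMEOUT_MS = 500;

    private volatile long lastLocalPauseMs;

    // Node "A": a ping-response from "B" arrived, echoing the timestamp that
    // "A" placed in the original request payload.
    public void onPingResponse(InetAddress peer, long requestTimestampMs)
    {
        long latencyMs = System.currentTimeMillis() - requestTimestampMs;
        if (lastLocalPauseMs > LOCAL_PAUSE_THRESHOLD_MS)
            return; // our own GC pause makes this sample unreliable
        recordSample(peer, latencyMs);
    }

    // Node "A": no ping-response yet; decide whether "B" counts as slow.
    public boolean isSlow(long requestSentAtMs)
    {
        return System.currentTimeMillis() - requestSentAtMs > PING_TIMEOUT_MS;
    }

    private void recordSample(InetAddress peer, long latencyMs)
    {
        // feed the sample into the snitch's score for this peer (omitted)
    }
}
{code}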
> Re-think the use of Severity in the DynamicEndpointSnitch calculation
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-11738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11738
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jeremiah Jordan
> Fix For: 3.x
>
>
> CASSANDRA-11737 was opened to allow completely disabling the use of severity
> in the DynamicEndpointSnitch calculation, but that is a pretty big hammer.
> There is probably something we can do to better use the score.
> The issue seems to be that severity is given equal weight with latency in
> the current code, and that severity is based only on disk IO. If you have a
> node that is CPU bound on something (say, catching up on LCS compactions
> because of bootstrap/repair/replace), the IO wait can be low while the
> latency to the node is high.
> Some ideas I had are:
> 1. Allow a yaml parameter to tune how much impact the severity score has in
> the calculation (a rough sketch of this follows below).
> 2. Take CPU load into account as well as IO wait (this would probably help
> in the cases where I have seen things go sideways).
> 3. Move the -D from CASSANDRA-11737 to being a yaml-level setting.
> 4. Go back to just relying on latency and get rid of severity altogether.
> Now that we have rapid read protection, maybe just using latency is enough,
> as it can help where the predictive nature of IO wait would have been useful.
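> As a rough illustration of idea 1 (a sketch only;
> {{dynamic_snitch_severity_weight}} is a hypothetical yaml option, not an
> existing one):
> {code:java}
> // Sketch only: dynamic_snitch_severity_weight is a hypothetical yaml option
> // controlling how much the gossiped severity influences the score.
> public class WeightedSeverityScore
> {
>     // 0.0 would ignore severity entirely; 1.0 matches today's equal weighting.
>     private final double severityWeight;
>
>     public WeightedSeverityScore(double severityWeight)
>     {
>         this.severityWeight = severityWeight;
>     }
>
>     // Blend the measured latency score with the severity value instead of
>     // always adding them with equal weight.
>     public double score(double latencyScore, double severity)
>     {
>         return latencyScore + severityWeight * severity;
>     }
> }
> {code}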