[
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357635#comment-15357635
]
Jonathan Ellis commented on CASSANDRA-11738:
--------------------------------------------
bq. We prefer to use actual latency, so we only need the estimate when there
is no actual available, i.e., when other coordinators stopped routing requests
to us because the actual was high.
I did some code diving, and it doesn't actually work the way I thought it did.
Here's where Severity gets added in to the dsnitch scores:
{code}
for (Map.Entry<InetAddress, ExponentiallyDecayingReservoir> entry : samples.entrySet())
{
    double score = entry.getValue().getSnapshot().getMedian() / maxLatency;
    // finally, add the severity without any weighting, since hosts scale this
    // relative to their own load and the size of the task causing the severity.
    // "Severity" is basically a measure of compaction activity (CASSANDRA-3722).
    if (USE_SEVERITY)
        score += StorageService.instance.getSeverity(entry.getKey());
    // lowest score (least amount of badness) wins.
    newScores.put(entry.getKey(), score);
}
{code}
... so, it always gets added in on top of the latency score, no matter what.
IMO this is broken, because
# If we already have a latency number, adding severity only distorts things
because severity is a synthetic stand-in for one piece of what the real
observed latency already measures
# Worse than that, the Severity number gets added in AFTER we normalize the
latencies to 0..1, meaning any reported severity at all will completely dwarf
the numbers we *should* be comparing on. In other words, once we pass the
"badness threshold" and start sorting by dsnitch score, what we are basically
doing is sorting by Severity all the time.
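To make the second point concrete, here is a minimal standalone sketch of the same arithmetic. It is not the dsnitch code: the class name, the host names, and the latency and severity values are all made up for illustration, and it assumes a floor of 1 on the maximum latency purely to avoid dividing by zero.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeverityScoreSketch
{
    // Same shape as the loop quoted above: normalize each median latency by the
    // largest one, then optionally add the unweighted severity on top.
    public static Map<String, Double> scores(Map<String, Double> medianLatencyMs,
                                             Map<String, Double> severity,
                                             boolean useSeverity)
    {
        // Floor of 1 is an assumption of this sketch (avoids division by zero).
        double maxLatency = 1;
        for (double latency : medianLatencyMs.values())
            maxLatency = Math.max(maxLatency, latency);

        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : medianLatencyMs.entrySet())
        {
            // Normalized latency always lands in 0..1.
            double score = entry.getValue() / maxLatency;
            if (useSeverity)
                score += severity.getOrDefault(entry.getKey(), 0.0);
            // Lowest score wins.
            result.put(entry.getKey(), score);
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Hypothetical observed median latencies, in milliseconds.
        Map<String, Double> latency = new LinkedHashMap<>();
        latency.put("fast-but-compacting", 2.0);
        latency.put("slow-and-idle", 20.0);

        // Hypothetical severity: only the compacting node reports any.
        Map<String, Double> severity = new LinkedHashMap<>();
        severity.put("fast-but-compacting", 5.0);

        System.out.println("latency only:  " + scores(latency, severity, false));
        System.out.println("with severity: " + scores(latency, severity, true));
    }
}
```

With these made-up numbers, the node that is ten times faster on observed latency wins on latency alone but loses as soon as severity is added, because its normalized latency cannot exceed 1 while its severity is unbounded by that normalization.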
I don't see an easy way to make it work the way I think it should (only use
Severity if we don't have observed latencies), so my vote is to just get rid of
it. As you note, rapid read protection (RRP) does an excellent job of
addressing this situation.
> Re-think the use of Severity in the DynamicEndpointSnitch calculation
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-11738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11738
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Jeremiah Jordan
> Fix For: 3.x
>
>
> CASSANDRA-11737 was opened to allow completely disabling the use of severity
> in the DynamicEndpointSnitch calculation, but that is a pretty big hammer.
> There is probably something we can do to better use the score.
> The issue seems to be that severity is given equal weight with latency in the
> current code, also that severity is only based on disk io. If you have a
> node that is CPU bound on something (say catching up on LCS compactions
> because of bootstrap/repair/replace) the IO wait can be low, but the latency
> to the node is high.
> Some ideas I had are:
> 1. Allowing a yaml parameter to tune how much impact the severity score has
> in the calculation.
> 2. Taking CPU load into account as well as IO Wait (this would probably help
> in the cases I have seen things go sideways)
> 3. Move the -D from CASSANDRA-11737 to being a yaml level setting
> 4. Go back to just relying on Latency and get rid of severity altogether.
> Now that we have rapid read protection, maybe just using latency is enough,
> as it can help where the predictive nature of IO wait would have been useful.