[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-07-21 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388254#comment-15388254
 ] 

Jeremiah Jordan commented on CASSANDRA-11738:
-

Tests look good to me. dtest passes, and the 2 failing testall tests have also 
been failing in the last couple of trunk runs.

> Re-think the use of Severity in the DynamicEndpointSnitch calculation
> -
>
> Key: CASSANDRA-11738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11738
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Jeremiah Jordan
>Assignee: Jonathan Ellis
>Priority: Minor
> Fix For: 3.x
>
> Attachments: 11738.txt
>
>
> CASSANDRA-11737 was opened to allow completely disabling the use of severity 
> in the DynamicEndpointSnitch calculation, but that is a pretty big hammer.  
> There is probably something we can do to better use the score.
> The issue seems to be that severity is given equal weight with latency in the 
> current code, also that severity is only based on disk io.  If you have a 
> node that is CPU bound on something (say catching up on LCS compactions 
> because of bootstrap/repair/replace) the IO wait can be low, but the latency 
> to the node is high.
> Some ideas I had are:
> 1. Allowing a yaml parameter to tune how much impact the severity score has 
> in the calculation.
> 2. Taking CPU load into account as well as IO Wait (this would probably help 
> in the cases I have seen things go sideways)
> 3. Move the -D from CASSANDRA-11737 to being a yaml level setting
> 4. Go back to just relying on Latency and get rid of severity altogether.  
> Now that we have rapid read protection, maybe just using latency is enough, 
> as it can help where the predictive nature of IO wait would have been useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-07-21 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387873#comment-15387873
 ] 

Jeremiah Jordan commented on CASSANDRA-11738:
-

One nit.  The code in getSeverity was just relocated from 
BackgroundActivityMonitor, but I would prefer to keep the assignment ("=") out 
of the if check in DES.

{code}
VersionedValue event;
if (state != null && (event = state.getApplicationState(ApplicationState.SEVERITY)) != null)
{code}

to

{code}
if (state != null)
{
    VersionedValue event = state.getApplicationState(ApplicationState.SEVERITY);
    if (event != null)
    {
{code}

Don't care too strongly about it, but I think it makes the code more readable 
to pull it out.

I started cassci runs here:
http://cassci.datastax.com/view/Dev/view/zanson/job/JeremiahDJordan-11738-dtest/
http://cassci.datastax.com/view/Dev/view/zanson/job/JeremiahDJordan-11738-testall/





[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-07-05 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362667#comment-15362667
 ] 

Jeremiah Jordan commented on CASSANDRA-11738:
-

But you're right, the knob can be something different.  It doesn't need to be 
"severity".  I do think we should probably just remove severity (aka disk 
utilization) from the calculation altogether at this point.



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-06-30 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357806#comment-15357806
 ] 

Jeremiah Jordan commented on CASSANDRA-11738:
-

bq. Won't using correctly calculated latencies tell a node enough to avoid a 
given peer?

Yes, but the times I have used this are things like repairing after removing a 
corrupt SSTable or something, where latency may not have been high but I 
didn't want the node to be picked for reads done at ONE unless the other nodes 
were down.



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-06-30 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357802#comment-15357802
 ] 

Jason Brown commented on CASSANDRA-11738:
-

The assumption you are making is that the SEVERITY is somehow instantaneously 
known throughout the cluster, and will be promptly applied uniformly. In a 
large cluster, this will take a while to propagate via the existing gossip 
mechanism.

bq.  "don't read from this node unless you have to"

Using SEVERITY to indicate this state seems like the wrong mechanism. At a 
minimum it could be a different state in the gossip metadata. Won't 
using correctly calculated latencies tell a node enough to avoid a given peer? 
If you really need a node to not be bothered by any peers, why not just disable 
gossip? Peers will mark it down via the {{FailureDetector}}.



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-06-30 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357772#comment-15357772
 ] 

Jeremiah Jordan commented on CASSANDRA-11738:
-

I think it is worth preserving the ability to allow "don't read from this node 
unless you have to" behavior.  It's useful to be able to keep a node in the ring 
just in case while it is rebuilding or something.



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-06-30 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357657#comment-15357657
 ] 

Jonathan Ellis commented on CASSANDRA-11738:


What about using Severity to manually inject a "don't read from this node until 
further notice" command?  Is that worth preserving?



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-06-30 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357644#comment-15357644
 ] 

Jason Brown commented on CASSANDRA-11738:
-

+1 to killing severity. I've long considered it questionable at best, if not 
just flat out broken in a distributed system.



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-06-30 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357635#comment-15357635
 ] 

Jonathan Ellis commented on CASSANDRA-11738:


bq.  We prefer to use actual latency, so we only need the estimate when there 
is no actual available, i.e., when other coordinators stopped routing requests 
to us because the actual was high.

I did some code diving, and it doesn't actually work the way I thought it did.  
Here's where Severity gets added in to the dsnitch scores:

{code}
for (Map.Entry entry: samples.entrySet())
{
    double score = entry.getValue().getSnapshot().getMedian() / maxLatency;
    // finally, add the severity without any weighting, since hosts scale this
    // relative to their own load and the size of the task causing the severity.
    // "Severity" is basically a measure of compaction activity (CASSANDRA-3722).
    if (USE_SEVERITY)
        score += StorageService.instance.getSeverity(entry.getKey());
    // lowest score (least amount of badness) wins.
    newScores.put(entry.getKey(), score);
}
{code}

... so, it always gets added in on top of the latency score, no matter what.  
IMO this is broken, because

# If we already have a latency number, adding severity only distorts things 
because it's representing a synthetic piece of the real observed latency
# Worse than that, the Severity number gets added in AFTER we normalize the 
latencies to 0..1, meaning any reported severity at all will completely dwarf 
the numbers we *should* be comparing on.  In other words, once we pass the 
"badness threshold" and start sorting by dsnitch score, what we are basically 
doing is sorting by Severity all the time.

I don't see an easy way to make it work the way I think it should (only use 
Severity if we don't have observed latencies) so my vote is to just get rid of 
it.  As you note, RRP does an excellent job of addressing this situation.
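
To make the second point concrete, here is a minimal sketch of the arithmetic. It is not the DynamicEndpointSnitch code: the class name, method names, endpoint labels, and sample numbers are all made up for illustration. Only the shape of the formula (median latency divided by the maximum, plus unweighted severity) is taken from the loop quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Cassandra.
final class SeverityDwarfsLatency
{
    // Same shape as the quoted loop: latency normalized to 0..1, plus raw severity.
    static Map<String, Double> scores(Map<String, Double> medianLatencyMs,
                                      Map<String, Double> severity,
                                      boolean useSeverity)
    {
        double maxLatency = 1;
        for (double latency : medianLatencyMs.values())
            maxLatency = Math.max(maxLatency, latency);

        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : medianLatencyMs.entrySet())
        {
            double score = entry.getValue() / maxLatency;
            if (useSeverity)
                score += severity.getOrDefault(entry.getKey(), 0.0);
            result.put(entry.getKey(), score);
        }
        return result;
    }

    // Lowest score (least amount of badness) wins.
    static String best(Map<String, Double> scores)
    {
        String best = null;
        for (Map.Entry<String, Double> entry : scores.entrySet())
            if (best == null || entry.getValue() < scores.get(best))
                best = entry.getKey();
        return best;
    }
}
```

With assumed median latencies of 2 ms and 20 ms, the faster node scores 0.1 against 1.0 and wins. If the faster node reports a severity of 3, its score becomes 3.1 and the node that is ten times slower is preferred instead.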



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-05-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306185#comment-15306185
 ] 

Jonathan Ellis commented on CASSANDRA-11738:


bq. a measured latency can be influenced by a badly timed GC (e.g. G1 running 
with a 500ms goal that sometimes has "valid" STW phases of up to 300/400ms).

True enough, but that's actually okay for our use case here.  We prefer to use 
*actual* latency, so we only need the *estimate* when there is no actual 
available, i.e., when other coordinators stopped routing requests to us because 
the actual was high.  The job of the estimate is to let the other coordinators 
know (when it gets low again) that they can resume sending us requests.

bq. Compactions and GCs can kick in every time anyway.

Right, but I see these as two different categories.  GC STW lasts for fractions 
of a second, while compaction can last minutes or even hours for a large STCS 
job.  So trying to route around GC is futile, but routing around compaction is 
not.

bq. Just as an idea: a node can request a ping-response from a node it sends a 
request to 

If possible, I'd prefer to make this follow the existing "push" paradigm, via 
gossip, for simplicity.

I had two ideas along those lines:

# Give up on computing a latency number in favor of other "load" metrics.  The 
coordinator can then extrapolate latency by comparing that number to other 
nodes with similar load.
# Just brute force it: run SELECT * LIMIT 1 every 10s and report the latency 
averaged across a sample of user tables.
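
As a rough sketch of the second idea, the averaging step could look like the following. Everything here is hypothetical: the class and method names are invented, and each probe stands in for timing a SELECT * LIMIT 1 against one sampled user table. Scheduling the run every 10s and publishing the result via gossip are left out.

```java
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical sketch; not an existing Cassandra class.
final class ProbeLatency
{
    // Average the latencies reported by the probes; NaN when there is nothing to sample.
    static double averageMicros(List<LongSupplier> probes)
    {
        if (probes.isEmpty())
            return Double.NaN;
        long total = 0;
        for (LongSupplier probe : probes)
            total += probe.getAsLong();
        return (double) total / probes.size();
    }

    // Wrap any action so that it reports its own elapsed time in microseconds.
    static LongSupplier timed(Runnable action)
    {
        return () -> {
            long start = System.nanoTime();
            action.run();
            return (System.nanoTime() - start) / 1_000;
        };
    }
}
```

A plain mean is used only to keep the sketch short; a median or a decaying reservoir would be less sensitive to a single slow probe.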



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-05-19 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291175#comment-15291175
 ] 

Robert Stupp commented on CASSANDRA-11738:
--

Just thinking that any measured latency is basically aged out when it's 
computed. And something like a "15 minute load" (as the other extreme) cannot 
reflect recent spikes. Also, a measured latency can be influenced by a badly 
timed GC (e.g. G1 running with a 500ms goal that sometimes has "valid" STW 
phases of up to 300/400ms).
Maybe I don't see the point, but I think all nodes (assuming they have the same 
hardware and the cluster is balanced) should have (nearly) equal response 
times. Compactions and GCs can kick in every time anyway.

Just as an idea: a node can request a _ping-response_ from a node it sends a 
request to (could be requested by setting a flag in the verbs' payload).
For example, node "A" sends a request to node "B". The request contains the 
timestamp at node "A". "B" sends a _ping-response_ including the request 
timestamp back to "A" as soon as it deserializes the request. "A" can now 
decide whether to use the calculated latency ({{currentTime() - requestTimestamp}}). 
It could, for example, ignore that number, which is legitimate when "A" itself 
hit a longer GC (say, >100ms or so). "A" could also decide that "B" is "slow" 
because it didn't get the _ping-response_ within a certain time. 
Too complicated?
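
The decision logic on node "A" might be sketched as below. This is only an illustration of the idea as described: the class name, method names, the 100ms pause limit, and the way a local pause is measured are all assumptions, and no such verb or flag exists in the messaging code.

```java
// Hypothetical sketch of the ping-response idea; names and thresholds are assumptions.
final class PingLatency
{
    static final long LOCAL_PAUSE_LIMIT_MS = 100; // "say, >100ms or so"

    // Returns the latency sample "A" should record for "B", or -1 if the sample
    // should be ignored because "A" itself paused (e.g. for GC) while waiting.
    static long sample(long requestTimestampMs, long pingReceivedMs, long localPauseMs)
    {
        if (localPauseMs > LOCAL_PAUSE_LIMIT_MS)
            return -1;
        return pingReceivedMs - requestTimestampMs;
    }

    // "A" treats "B" as slow if no ping-response arrived within the timeout.
    static boolean isSlow(long requestTimestampMs, long nowMs, boolean pingReceived, long timeoutMs)
    {
        return !pingReceived && nowMs - requestTimestampMs > timeoutMs;
    }
}
```

Both timestamps are taken on "A", so the sketch does not depend on clock synchronization between the two nodes.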



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-05-18 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290389#comment-15290389
 ] 

Jonathan Ellis commented on CASSANDRA-11738:


It could, but how would you decide when to use "load" and when to use directly 
measured latency?



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-05-18 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289722#comment-15289722
 ] 

Robert Stupp commented on CASSANDRA-11738:
--

Could the sum of all pending+active requests be a good "load" for dsnitch?



[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation

2016-05-10 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278065#comment-15278065
 ] 

Jonathan Ellis commented on CASSANDRA-11738:


The attractive thing about using iowait was that it's a latency metric, so 
adding it in to the dsnitch measurements sort of makes sense.  But only sort 
of, because if dsnitch has a direct latency number then iowait is getting 
double-counted.

It seems to me that the goal for "severity" ought to be deriving a synthetic 
latency number, so when we route traffic away from a node and thus don't have 
any real latency measurements available, we have a reasonable guess at what 
latency WOULD be so we don't route traffic back to it as soon as the old 
numbers age out.

Is there a way we can turn CPU load information into a pseudo-latency number?  
If not, maybe we can add a scaling factor with cpu util.

Other improvements include:

# Use either actual latency measurements or synthetic ones ("severity"); adding 
both together isn't really valid.  We could either stick the synthetic numbers 
directly in the windowing and let them age out like the others, or add a cutoff 
where we switch to synthetic if we don't have enough real ones.
# We can probably improve our latency guess for io-bound workloads by 
multiplying the iowait number by sstables-per-read.
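
The cutoff variant of the first improvement could look roughly like this. It is a sketch under stated assumptions: the class name, the method name, and the cutoff of 10 samples are invented for illustration, and the synthetic value is simply passed in rather than derived from iowait or CPU load.

```java
import java.util.Arrays;

// Hypothetical sketch; the names and the cutoff value are assumptions, not Cassandra code.
final class LatencyEstimate
{
    static final int MIN_REAL_SAMPLES = 10; // assumed cutoff, would need tuning

    // Use the median of observed latencies when there are enough of them,
    // otherwise fall back to the synthetic ("severity"-derived) estimate.
    // The two are never added together.
    static double effectiveLatencyMs(double[] observedMs, double syntheticMs)
    {
        if (observedMs.length < MIN_REAL_SAMPLES)
            return syntheticMs;

        double[] sorted = observedMs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }
}
```

The alternative mentioned above, feeding synthetic numbers into the same window as real ones, would avoid a hard switch between the two sources, since the window would blend them as the real samples age out.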

> Re-think the use of Severity in the DynamicEndpointSnitch calculation
> -
>
> Key: CASSANDRA-11738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11738
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Jeremiah Jordan
> Fix For: 3.x