[jira] [Commented] (CASSANDRA-14252) Use zero as default score in DynamicEndpointSnitch

Simon Zhou (JIRA) Mon, 26 Feb 2018 13:00:13 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377571#comment-16377571
 ]


Simon Zhou commented on CASSANDRA-14252:
----------------------------------------

This is an interesting change but I'm not sure it fixes all problems.

The code that you changed was introduced in CASSANDRA-13074, which also claims 
to fix the slow node issue by completely ignoring nodes that we don't have a 
score for, regardless of whether the node is in the local or a remote data 
center. With your fix, we still give these (remote) nodes a try by assigning 
them an artificially low score. However, isn't 0 the lowest possible score, 
which could result in these slow/unresponsive remote nodes being picked before 
other remote nodes that have normal scores (such as 1.0)?
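
To make the question concrete, here is a rough sketch of the comparison as I 
understand it. It is not the DynamicEndpointSnitch source: the class and method 
names are made up for illustration, and it assumes the fallback is triggered 
when a score in proximity order exceeds the score at the same position in the 
sorted list by more than badness_threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified illustration only; hypothetical names, not the Cassandra source.
final class BadnessSketch
{
    /**
     * Collects scores in proximity order. An endpoint with no recorded score is
     * skipped when defaultScore is null; otherwise it is given defaultScore.
     */
    static List<Double> scoresInProximityOrder(List<String> endpoints,
                                               Map<String, Double> scores,
                                               Double defaultScore)
    {
        List<Double> ordered = new ArrayList<>();
        for (String endpoint : endpoints)
        {
            Double score = scores.get(endpoint);
            if (score == null)
                score = defaultScore;
            if (score != null)
                ordered.add(score);
        }
        return ordered;
    }

    /**
     * Returns true when some score in proximity order exceeds the score at the
     * same position of the ascending-sorted list by more than the threshold.
     */
    static boolean shouldSortByScore(List<Double> proximityOrdered, double badnessThreshold)
    {
        List<Double> sorted = new ArrayList<>(proximityOrdered);
        Collections.sort(sorted);
        for (int i = 0; i < proximityOrdered.size(); i++)
        {
            if (proximityOrdered.get(i) > sorted.get(i) * (1.0 + badnessThreshold))
                return true;
        }
        return false;
    }

    public static void main(String[] args)
    {
        // ip1 is the slow local node; ip2 is a remote node with no score yet.
        List<String> endpoints = Arrays.asList("ip1", "ip2");
        Map<String, Double> scores = Collections.singletonMap("ip1", 1.0);

        List<Double> skipped = scoresInProximityOrder(endpoints, scores, null);
        List<Double> zeroed = scoresInProximityOrder(endpoints, scores, 0.0);

        System.out.println(skipped + " -> " + shouldSortByScore(skipped, 0.1)); // [1.0] -> false
        System.out.println(zeroed + " -> " + shouldSortByScore(zeroed, 0.1));   // [1.0, 0.0] -> true
    }
}
```

Under these assumptions, skipping the unscored node never triggers the 
fallback, while defaulting it to 0 does. The same default also places that node 
ahead of any node with a positive score once the list is sorted by score, which 
is the case I am asking about.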

Btw, badness_threshold=0.1 may be too conservative. We also disabled the IO 
factor in the score calculation by setting 
-Dcassandra.ignore_dynamic_snitch_severity=true. See CASSANDRA-11738 for 
details.

> Use zero as default score in DynamicEndpointSnitch
> --------------------------------------------------
>
>                 Key: CASSANDRA-14252
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14252
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Dikang Gu
>            Assignee: Dikang Gu
>            Priority: Major
>             Fix For: 4.0, 3.0.17, 3.11.3
>
>
> The problem I want to solve is that, in our deployment, one slow but alive 
> data node can slow down the whole cluster and even cause our requests to 
> time out. 
> We are using DynamicEndpointSnitch with badness_threshold 0.1. I expect the 
> DynamicEndpointSnitch to switch to sortByProximityWithScore if the local 
> data node's latency is too high.
> I added some debug logging and found that, in a lot of cases, the score 
> for the remote data node was not populated, so the fallback to 
> sortByProximityWithScore never happened. That's why a single slow data node 
> can cause huge problems for the whole cluster.
> In this jira, I'd like to use zero as the default score, so that we get a 
> chance to try a remote data node if the local one is slow. 
> I tested it in our test cluster, and it significantly improved client 
> latency in the single slow data node case.  
> I flag this as a Bug because it has caused problems for our use cases 
> multiple times.
>  ==== logs ===
> _2018-02-21_23:08:57.54145 WARN 23:08:57 [RPC-Thread:978]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [1.0]_
>  _2018-02-21_23:08:57.54319 WARN 23:08:57 [RPC-Thread:967]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [0.0]_
>  _2018-02-21_23:08:57.55111 WARN 23:08:57 [RPC-Thread:453]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [1.0]_
>  _2018-02-21_23:08:57.55687 WARN 23:08:57 [RPC-Thread:753]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [1.0]_
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

