[
https://issues.apache.org/jira/browse/CASSANDRA-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405231#comment-16405231
]
Dikang Gu commented on CASSANDRA-14252:
---------------------------------------
[~jay.zhuang], the scenario is:
# The coordinator and node A are in DC1; node B, another replica for the same
data as node A, is in DC2. We use DynamicEndpointSnitch and
NetworkTopologyStrategy.
# In the normal situation, the coordinator only talks to node A, so it has a
correct score for node A. It never talks to node B, so it has no score for
node B.
# Then node A degrades: it is slow but still alive. The coordinator sets the
score of node A very high, e.g. 1.
# The coordinator still has no score for node B, so it never gets the chance
to talk to node B, which is a healthy replica in a different region.
My patch provides a default score for node B, so the coordinator gets a chance
to talk to node B at least once, obtains the correct latency between itself
and node B, and can use it to decide whether to switch from node A to node B
if necessary.
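The scenario above can be illustrated with a small, simplified model. This is
only a sketch and not the actual DynamicEndpointSnitch code: the function
order_replicas and its default_score parameter are hypothetical names, and the
logic assumes that a missing score makes the snitch keep the static proximity
order, as described in this ticket. The threshold 0.1 is the
badness_threshold quoted in the issue description.

```python
BADNESS_THRESHOLD = 0.1  # value quoted in the issue description


def order_replicas(proximity_order, scores, default_score=None):
    """Return replicas in the order a coordinator would try them.

    proximity_order: replicas as sorted by the static (topology) snitch.
    scores: latency-derived score per replica; higher means slower.
    default_score: score assumed for replicas that were never measured.
        None models the problematic behaviour: no score, no re-sorting.
    """
    observed = []
    for replica in proximity_order:
        score = scores.get(replica, default_score)
        if score is None:
            # No data for this replica: keep the static proximity order.
            return list(proximity_order)
        observed.append(score)

    # Compare the proximity order against the best possible (sorted) scores.
    best_case = sorted(observed)
    for actual, best in zip(observed, best_case):
        if actual > best * (1.0 + BADNESS_THRESHOLD):
            # Proximity order is too much worse: fall back to score order.
            return sorted(proximity_order,
                          key=lambda r: scores.get(r, default_score))
    return list(proximity_order)


# Node A (local, degraded) has score 1.0; node B (remote) was never measured.
scores = {"A": 1.0}
without_default = order_replicas(["A", "B"], scores)
with_zero_default = order_replicas(["A", "B"], scores, default_score=0.0)
```

Under these assumptions, without_default stays at A then B, because the
missing score for B prevents any re-sorting, while with_zero_default puts B
first, which gives the coordinator a chance to measure B's real latency.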
> Use zero as default score in DynamicEndpointSnitch
> --------------------------------------------------
>
> Key: CASSANDRA-14252
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14252
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Reporter: Dikang Gu
> Assignee: Dikang Gu
> Priority: Major
> Fix For: 4.0, 3.0.17, 3.11.3
>
> Attachments: IMG_3180.jpg
>
>
> The problem I want to solve is that, in our deployment, one slow but alive
> data node can slow down the whole cluster and even cause our requests to
> time out.
> We are using DynamicEndpointSnitch with badness_threshold 0.1. I expect
> DynamicEndpointSnitch to switch to sortByProximityWithScore if the local
> data node's latency is too high.
> I added some debug logging and found that in a lot of cases the score
> for the remote data node was not populated, so the fallback to
> sortByProximityWithScore never happened. That is why a single slow data node
> can cause huge problems for the whole cluster.
> In this jira, I'd like to use zero as the default score, so that we get a
> chance to try the remote data node if the local one is slow.
> I tested it in our test cluster, and it significantly improved client
> latency in the single slow data node case.
> I flagged this as a Bug because it has caused problems for our use cases
> multiple times.
> ==== logs ===
> _2018-02-21_23:08:57.54145 WARN 23:08:57 [RPC-Thread:978]:
> sortByProximityWithBadness: after sorting by proximity, addresses order
> change to [ip1, ip2], with scores [1.0]_
> _2018-02-21_23:08:57.54319 WARN 23:08:57 [RPC-Thread:967]:
> sortByProximityWithBadness: after sorting by proximity, addresses order
> change to [ip1, ip2], with scores [0.0]_
> _2018-02-21_23:08:57.55111 WARN 23:08:57 [RPC-Thread:453]:
> sortByProximityWithBadness: after sorting by proximity, addresses order
> change to [ip1, ip2], with scores [1.0]_
> _2018-02-21_23:08:57.55687 WARN 23:08:57 [RPC-Thread:753]:
> sortByProximityWithBadness: after sorting by proximity, addresses order
> change to [ip1, ip2], with scores [1.0]_
>
>
>