[jira] [Comment Edited] (CASSANDRA-19633) Replaced node is stuck in a loop calculating ranges

Marcus Eriksson (Jira) Wed, 15 May 2024 05:55:51 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846595#comment-17846595 ]


Marcus Eriksson edited comment on CASSANDRA-19633 at 5/15/24 12:54 PM:
-----------------------------------------------------------------------

I now have this reproducing locally using tokens/datacenters provided by 
[~jaid], thanks!

The problem is indeed that we only provide one source for each range post 
CASSANDRA-14405.

The 4650-optimisation is done by setting up a graph where each source -> 
destination edge has a capacity of one, since ideally we would stream each 
range from a unique source. If we can't find a flow with a capacity of one, we 
bump the capacity on all edges to two and calculate the flow again, and we 
repeat this bumping until we have found a flow where we can [stream all 
ranges|https://github.com/apache/cassandra/blob/75794540573b6f0c39094b5448fe73326e14e058/src/java/org/apache/cassandra/dht/RangeFetchMapCalculator.java#L134-L148].
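
To make the bumping loop concrete, here is a minimal sketch of it. This is not the Cassandra code: {{CapacityBumpSketch}}, {{flow}} and {{capacityNeeded}} are hypothetical names, the flow is recomputed from scratch on every round with a plain augmenting-path search instead of the graph library used by RangeFetchMapCalculator, and the capacity is assumed to start at one as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Cassandra.
final class CapacityBumpSketch {

    // Try to place `range` on one of its candidate sources, moving already
    // placed ranges to other sources when a candidate is full.
    private static boolean augment(int range, List<List<Integer>> candidates,
                                   List<List<Integer>> assigned, int capacity,
                                   boolean[] visited) {
        for (int source : candidates.get(range)) {
            if (visited[source]) continue;
            visited[source] = true;
            List<Integer> held = assigned.get(source);
            if (held.size() < capacity) {
                held.add(range);
                return true;
            }
            // Source is full: see whether one of its ranges can go elsewhere.
            for (int i = 0; i < held.size(); i++) {
                if (augment(held.get(i), candidates, assigned, capacity, visited)) {
                    held.set(i, range);
                    return true;
                }
            }
        }
        return false;
    }

    // Number of ranges that can be streamed when every source serves at most
    // `capacity` ranges. candidates.get(r) lists the sources able to serve range r.
    static int flow(List<List<Integer>> candidates, int sourceCount, int capacity) {
        List<List<Integer>> assigned = new ArrayList<>();
        for (int s = 0; s < sourceCount; s++) assigned.add(new ArrayList<>());
        int flow = 0;
        for (int range = 0; range < candidates.size(); range++) {
            if (augment(range, candidates, assigned, capacity, new boolean[sourceCount])) flow++;
        }
        return flow;
    }

    // Bump the per-source capacity by one until every range can be streamed.
    static int capacityNeeded(List<List<Integer>> candidates, int sourceCount) {
        for (List<Integer> c : candidates) {
            if (c.isEmpty()) throw new IllegalArgumentException("range without a source");
        }
        int capacity = 1;
        while (flow(candidates, sourceCount, capacity) < candidates.size()) capacity++;
        return capacity;
    }

    public static void main(String[] args) {
        // Three ranges, each with two candidate sources out of three.
        System.out.println(capacityNeeded(List.of(List.of(0, 1), List.of(1, 2), List.of(0, 2)), 3));
        // Three ranges that can only come from source 0.
        System.out.println(capacityNeeded(List.of(List.of(0), List.of(0), List.of(0)), 1));
    }
}
```

In this sketch, when each range has several candidate sources the loop can stop at a low capacity, while ranges that share a single candidate force one extra round per range on that source.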

The problem, though, is that with only a single source per range we are very 
far from being able to find a flow with edge capacities of 1, so in this 
cluster the calculation is repeated 1000+ times, and each calculation takes 
several minutes (per keyspace). The result is terrible anyway, because we end 
up streaming from only two sources.
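
A rough model of why the round count gets so high, assuming every range has exactly one candidate source and the capacity starts at one and grows by one per round: the flow is then forced, so the loop cannot finish until the capacity reaches the load of the busiest source. {{SingleSourceRounds}} and {{flowCalculations}} are hypothetical names, not Cassandra code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Cassandra.
final class SingleSourceRounds {

    // sourceOfRange.get(i) is the only source able to serve range i.
    // Returns how many flow calculations a bump-by-one loop performs under
    // the assumptions above: one per capacity value up to the busiest
    // source's load.
    static int flowCalculations(List<String> sourceOfRange) {
        Map<String, Integer> load = new HashMap<>();
        int maxLoad = 0;
        for (String source : sourceOfRange) {
            int n = load.merge(source, 1, Integer::sum);
            maxLoad = Math.max(maxLoad, n);
        }
        return maxLoad;
    }
}
```

Under this model, the number of rounds scales with the number of ranges mapped to the most heavily used source, not with the number of sources.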

Removing 
[this|https://github.com/apache/cassandra/blob/6bae4f76fb043b4c3a3886178b5650b280e9a50b/src/java/org/apache/cassandra/dht/RangeStreamer.java#L531]
 line allows us to do the calculation only once, and we stream from 191 
sources. It still takes several minutes to do the calculation, but we most 
likely save more time due to quicker streaming later. But as the comment says, 
some downstream uses of {{sources}} require it to contain only a single node, 
so I'll need to fix those places before submitting a patch.


> Replaced node is stuck in a loop calculating ranges
> ---------------------------------------------------
>
>                 Key: CASSANDRA-19633
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19633
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jai Bheemsen Rao Dhanwada
>            Assignee: Marcus Eriksson
>            Priority: Normal
>              Labels: Bootstrap
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: result1.html
>
>
> Hello,
>  
> I am running into an issue wherein a node that is replacing a dead 
> (non-seed) node is stuck calculating ranges, seemingly forever. It eventually 
> succeeds; however, the time taken to calculate the ranges is not constant. 
> I sometimes see it take 24 hours to calculate ranges for each 
> keyspace. I have attached the flame graph of the Cassandra process during 
> this time, which points to the code below. 
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> getRangeFetchMapForNonTrivialRanges()
> {
>     //Get the graph with edges between ranges and their source endpoints
>     MutableCapacityGraph<Vertex, Integer> graph = getGraph();
>     //Add source and destination vertex and edges
>     addSourceAndDestination(graph, getDestinationLinkCapacity(graph));
>     int flow = 0;
>     MaximumFlowAlgorithmResult<Integer, CapacityEdge<Vertex, Integer>> result = null;
>     //We might not be working on all ranges
>     while (flow < getTotalRangeVertices(graph))
>     {
>         if (flow > 0)
>         {
>             //We could not find a path with previous graph. Bump the capacity b/w endpoint vertices and destination by 1
>             incrementCapacity(graph, 1);
>         }
>         MaximumFlowAlgorithm fordFulkerson = FordFulkersonAlgorithm.getInstance(DFSPathFinder.getInstance());
>         result = fordFulkerson.calc(graph, sourceVertex, destinationVertex, IntegerNumberSystem.getInstance());
>         int newFlow = result.calcTotalFlow();
>         assert newFlow > flow; //We are not making progress which should not happen
>         flow = newFlow;
>     }
>     return getRangeFetchMapFromGraphResult(graph, result);
> }
> {code}
> Digging through the logs, I see the below log line for a given keyspace 
> `system_auth`
> {code:java}
> INFO [main] 2024-05-10 17:35:02,489 RangeStreamer.java:330 - Bootstrap: range 
> Full(/10.135.56.214:7000,(5080189126057290696,5081324396311791613]) exists on 
> Full(/10.135.56.157:7000,(5080189126057290696,5081324396311791613]) for 
> keyspace system_auth{code}
> Corresponding code:
> {code:java}
> for (Map.Entry<Replica, Replica> entry : fetchMap.flattenEntries())
>     logger.info("{}: range {} exists on {} for keyspace {}", description, entry.getKey(), entry.getValue(), keyspaceName);{code}
> But I do not see the line below for the corresponding keyspace:
> {code:java}
> RangeStreamer.java:606 - Output from RangeFetchMapCalculator for 
> keyspace{code}
> This means the code is stuck in `getRangeFetchMap()`:
> {code:java}
> Multimap<InetAddressAndPort, Range<Token>> rangeFetchMapMap = calculator.getRangeFetchMap();
> logger.info("Output from RangeFetchMapCalculator for keyspace {}", keyspace);{code}
> Here is the cluster topology:
>  * Cassandra version: 4.0.12
>  * # of nodes: 190
>  * Tokens (vnodes): 128
> My initial hypothesis was that the graph calculation was taking longer due to 
> the combination of nodes + tokens + tables, but in the same cluster I saw one 
> of the nodes join without any issues. 
> I am wondering whether I am hitting a bug that lets it work sometimes but 
> sends it into an infinite loop at other times.
> Please let me know if you need any other details. I would appreciate any 
> pointers to debug this further.


